FluentPS: A Parameter Server Design with Low-frequency Synchronization for Distributed Deep Learning

Published in IEEE Cluster, 2019

Recommended citation: https://i.cs.hku.hk/~clwang/papers/2019-Cluster-FluentPS_cluster_v3.0.pdf

Xin Yao, Xueyu Wu, Cho-Li Wang, "FluentPS: A Parameter Server Design with Low-frequency Synchronization for Distributed Deep Learning," IEEE Cluster, 2019.

Download paper here

Abstract: In pursuit of high accuracy on large datasets, current research favors complex neural networks, which require maximizing data parallelism to keep training time short. Many distributed deep learning systems, such as MXNet and Petuum, adopt the parameter server framework with relaxed synchronization models. Although these models lower the cost of each synchronization, synchronization among many workers remains frequent, e.g., due to the soft barrier introduced by the Stale Synchronous Parallel (SSP) model. In this paper, we introduce our parameter server design, FluentPS, which reduces frequent synchronization and optimizes communication overhead in a large-scale cluster. Unlike some previous designs that use a single scheduler to manage the synchronization of all parameters, our system allows each server to independently adjust the synchronization scheme for its parameter shard and overlaps the push and pull processes of different servers. We also explore two methods to improve the SSP model: (1) lazy execution of buffered pull requests to reduce the synchronization frequency and (2) a probability-based strategy that pauses a fast worker only with some probability when the SSP condition is triggered, avoiding unnecessary waiting by fast workers. We evaluate ResNet-56 with the same large batch size at different cluster scales. While guaranteeing robust convergence, FluentPS achieves up to 6x speedup and reduces communication costs by 93.7% compared with PS-Lite. The raw SSP model incurs up to 131x more delayed pull requests than our improved synchronization model, which provides fine-tuned staleness control and achieves higher accuracy.
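
To make the two SSP improvements concrete, below is a minimal Python sketch of the per-shard bookkeeping the abstract describes: each server shard tracks worker clocks on its own, buffers pull requests from workers that have run too far ahead (lazy execution), and pauses a fast worker only with some probability instead of always blocking it. All class, method, and parameter names here (e.g., `SSPShardScheduler`, `on_pull`, `pause_prob`) are hypothetical illustrations, not the paper's or PS-Lite's actual API.

```python
import random

class SSPShardScheduler:
    """Illustrative per-shard SSP bookkeeping (names are hypothetical).

    A worker at clock c is "within bound" if c minus the slowest worker's
    clock does not exceed the staleness s. Out-of-bound pulls are either
    buffered for lazy execution or, with probability (1 - pause_prob),
    allowed to proceed anyway.
    """

    def __init__(self, num_workers, staleness=3, pause_prob=0.5):
        self.clocks = {}                 # worker id -> last reported clock
        self.num_workers = num_workers
        self.staleness = staleness       # SSP staleness bound s
        self.pause_prob = pause_prob     # chance of actually pausing a fast worker
        self.buffered_pulls = []         # (worker, clock) pulls deferred for lazy execution

    def slowest_clock(self):
        # Workers that have not reported yet count as clock 0.
        return min(self.clocks.get(w, 0) for w in range(self.num_workers))

    def on_pull(self, worker, clock):
        """Decide how to handle a pull request: 'serve' now or 'buffer' it."""
        self.clocks[worker] = clock
        if clock - self.slowest_clock() <= self.staleness:
            return "serve"               # within the staleness bound: answer immediately
        # SSP would normally block here; pause only with probability pause_prob.
        if random.random() < self.pause_prob:
            self.buffered_pulls.append((worker, clock))
            return "buffer"              # executed lazily once stragglers catch up
        return "serve"                   # let the fast worker run ahead this time

    def on_push(self, worker, clock):
        """After a push advances a straggler, release buffered pulls that are now valid."""
        self.clocks[worker] = clock
        slowest = self.slowest_clock()
        ready = [(w, c) for (w, c) in self.buffered_pulls
                 if c - slowest <= self.staleness]
        self.buffered_pulls = [p for p in self.buffered_pulls if p not in ready]
        return ready                     # caller answers these pulls in one batch
```

Under these assumptions, releasing buffered pulls in batches when a straggler's push arrives is what lets a shard answer several delayed requests with a single, fresher copy of its parameters, which is one plausible reading of how lazy execution reduces synchronization frequency.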