Rather than constructing a static communication logic before the training job starts, TSEngine optimizes the communication logic adaptively through dynamic scheduling.
As shown in Figure 1, a global coordinator in TSEngine schedules each model transmission between the workers and the parameter server online, based on its real-time perception of the network.
Figure 1 Overview of TSEngine
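To make the idea of online scheduling concrete, the following is a minimal, illustrative sketch and not TSEngine's actual implementation. It assumes a coordinator that keeps a smoothed throughput estimate per link and, with some probability (in the spirit of the `MAX_GREED_RATE` setting described later in this README), picks the fastest known receiver, otherwise a random one. The class `Coordinator`, its methods, and the smoothing rule are all hypothetical names and choices made for illustration.

```python
import random


class Coordinator:
    """Toy global coordinator: picks a receiver for each model transmission.

    Hypothetical sketch; the real TSEngine scheduler may work differently.
    """

    def __init__(self, greed_rate, smoothing=0.5, seed=None):
        if not 0.0 <= greed_rate <= 1.0:
            raise ValueError("greed_rate must be in [0, 1]")
        self.greed_rate = greed_rate  # probability of taking the greedy choice
        self.smoothing = smoothing    # weight given to the newest measurement
        self.throughput = {}          # (sender, receiver) -> estimated throughput
        self.rng = random.Random(seed)

    def report(self, sender, receiver, measured):
        """Fold a fresh throughput measurement into the running estimate."""
        key = (sender, receiver)
        old = self.throughput.get(key, measured)
        self.throughput[key] = (1 - self.smoothing) * old + self.smoothing * measured

    def schedule(self, sender, idle_receivers):
        """Return the receiver that `sender` should transmit to next, or None."""
        candidates = [r for r in idle_receivers if r != sender]
        if not candidates:
            return None
        if self.rng.random() < self.greed_rate:
            # Greedy: highest estimated throughput; unmeasured links count as 0.
            return max(candidates, key=lambda r: self.throughput.get((sender, r), 0.0))
        # Explore: a random idle receiver, so that stale estimates get refreshed.
        return self.rng.choice(candidates)
```

In this sketch, nodes would call `report` after every transfer and `schedule` before the next one, so the choice of receiver follows the measured network state rather than a plan fixed before training.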
MXNET is a popular distributed machine learning framework. The PS-LITE library implements a parameter server system for MXNET.
We implement TSEngine as an independent communication scheduling layer in PS-LITE, located between the KV-APP layer and the VAN layer, as shown in Figure 2.
Figure 2
- Replace the source code under "3rdparty/ps-lite" and "src/kvstore" in MXNET (version 1.2.1) with the modified code in our repository.
- Deploy in the same way as described at https://mxnet.apache.org/api/faq/distributed_training.html.
- Add two TSEngine variables to the startup script of the Scheduler, Server, and Worker, as follows:
   a. ENABLE_TS=1; // Enables TSEngine
   b. MAX_GREED_RATE=<value in [0, 1]>; // Maximum probability of greedy selection, a hyperparameter set by the user
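As one way to apply the two settings above, here is a minimal sketch of a Python launcher helper. It assumes the variables are passed as process environment variables, in the same way as MXNET's standard `DMLC_ROLE` variable. The helper name `tsengine_env` and the default rate of 0.9 are hypothetical, and when TSEngine is disabled the sketch simply leaves `ENABLE_TS` unset, since this README only documents the value 1.

```python
import os


def tsengine_env(role, enable_ts=True, max_greed_rate=0.9, base_env=None):
    """Return a copy of the environment with the TSEngine variables set.

    Hypothetical helper; 0.9 is an arbitrary example value, not a recommendation.
    """
    if role not in ("scheduler", "server", "worker"):
        raise ValueError("unknown role: %r" % (role,))
    if not 0.0 <= max_greed_rate <= 1.0:
        raise ValueError("MAX_GREED_RATE must lie in [0, 1]")
    env = dict(os.environ if base_env is None else base_env)
    env["DMLC_ROLE"] = role  # standard MXNET distributed-training variable
    if enable_ts:
        env["ENABLE_TS"] = "1"  # the only value documented in this README
    else:
        env.pop("ENABLE_TS", None)  # assumption: unset means disabled
    env["MAX_GREED_RATE"] = str(max_greed_rate)
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.Popen` when starting each Scheduler, Server, and Worker process, alongside the other `DMLC_*` variables that the MXNET distributed training guide requires.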
Authors: Huaman Zhou (hmzhou@std.uestc.edu.cn), Weibo Cai (17719609705@163.com)