Trainer ParameterServer Design Doc related discussion #2106

dzhwinter · 2017-05-11T20:59:04Z

Opening an issue may be the best way for us to agree on some of the complicated points, so I am listing these interfaces in detail.
Interfaces between the Trainer and the ParameterServer
Option 1:

    /* get parameters */
    int32_t PullParameters(std::map<block_id /* pname */, Tensor*>* params);

    /* set parameters */
    int32_t UpdateParameters(const std::map<block_id /* pname */, Tensor*>& params, float learning_rate);

@helin

Option 2:

    /* get parameters */
    int32_t PullParameters(std::map<std::string /* pname */, Tensor*>* params);

    /* set parameters */
    int32_t UpdateParameters(const std::map<std::string /* pname */, Tensor*>& params);

    /* set up the update method; the learning-rate policy is calculated in the optimizer/updater module */
    void setUpdater(UpdaterBase);

Other deep learning frameworks use a similar implementation, for example mxnet:

    /* Given a key, assume x (the left Tensor) is the received (pushed) value and y (the right Tensor) is the value stored on the store node. The store updates y by `h(x, &y)`. The default h is ASSIGN, namely `*y = x`. */
    typedef std::function<void(Tensor&, Tensor*)> UpdaterFunction;
    void setUpdater(UpdaterFunction);
    // e.g. an updater that subtracts the pushed value from the stored value
    auto updater = [](Tensor& x, Tensor* y) {
      *y -= x;  // y is a pointer, so dereference it before updating
    };

@dzhwinter
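ParameterClient_CAPI: the interface functions through which C++ calls the Go ParameterClient library

Choose option 1 or 2 for the update method.
Should support for sparse updates, regularization, and similar updates be implemented on the ParameterServer side or on the Trainer side? (*more important)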

helinwang · 2017-05-11T21:53:24Z

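@dzhwinter Paddle does not currently support adjusting the learning rate dynamically, so, as you suggested, let's not send the learning rate with UpdateParameters (I think the name SendGradient is more direct).
Part of the interface: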

    func InitUpdater(protoBuf []byte) error
    func SendGradient(grads []Gradient) error
    func GetParameter(names []string) ([]Parameter, error)
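
To make the shape of these calls concrete, here is a minimal in-memory sketch in Go of how SendGradient and GetParameter could behave. It is only an illustration under assumptions: the fields of `Gradient` and `Parameter`, the `memServer` type, the fixed learning rate, and the plain SGD rule are all hypothetical, since the thread does not define them. InitUpdater is left out because its protobuf payload is not specified here.

```go
package main

import (
	"errors"
	"fmt"
)

// Gradient and Parameter use hypothetical layouts; the thread does not define them.
type Gradient struct {
	Name   string
	Values []float32
}

type Parameter struct {
	Name   string
	Values []float32
}

// memServer is a hypothetical in-memory stand-in for a parameter server.
type memServer struct {
	lr     float32
	params map[string][]float32
}

// SendGradient applies plain SGD (w -= lr * g) to each named parameter.
// It stops at the first invalid gradient; earlier ones stay applied.
func (s *memServer) SendGradient(grads []Gradient) error {
	for _, g := range grads {
		w, ok := s.params[g.Name]
		if !ok {
			return errors.New("unknown parameter: " + g.Name)
		}
		if len(w) != len(g.Values) {
			return errors.New("size mismatch for parameter: " + g.Name)
		}
		for i, v := range g.Values {
			w[i] -= s.lr * v
		}
	}
	return nil
}

// GetParameter returns copies of the requested parameters.
func (s *memServer) GetParameter(names []string) ([]Parameter, error) {
	out := make([]Parameter, 0, len(names))
	for _, n := range names {
		w, ok := s.params[n]
		if !ok {
			return nil, errors.New("unknown parameter: " + n)
		}
		out = append(out, Parameter{Name: n, Values: append([]float32(nil), w...)})
	}
	return out, nil
}

func main() {
	s := &memServer{lr: 0.5, params: map[string][]float32{"w": {1, 2}}}
	if err := s.SendGradient([]Gradient{{Name: "w", Values: []float32{0.5, -1}}}); err != nil {
		fmt.Println("send failed:", err)
		return
	}
	ps, err := s.GetParameter([]string{"w"})
	if err != nil {
		fmt.Println("get failed:", err)
		return
	}
	fmt.Println(ps[0].Name, ps[0].Values)
}
```

Returning copies from GetParameter keeps a caller from mutating the server's state by accident; a real client would receive serialized data over the network instead.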

helinwang · 2017-05-11T23:33:25Z

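I just asked Xu (@emailweixu) for advice:

Paddle has two ways of updating with the PServer:

1. Run the optimization algorithm locally several times (for example Adam plus L1 regularization), then send the model difference (diff) to the PServer. After receiving the trainer's diff, the PServer multiplies it by the step size and updates its own model.
2. On every local step, compute the gradient and send it to the PServer. After receiving the gradient, the PServer runs the optimization itself.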
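
The two modes above can be contrasted with a small Go sketch. The names `applyDiff` and `momentumState`, the slice-based model, and the choice of momentum SGD as the server-side optimizer are assumptions for illustration only; the thread does not prescribe them.

```go
package main

import "fmt"

// applyDiff sketches mode 1: the trainer has already run its optimizer
// locally and sends a model diff; the server only scales the diff by a
// step size and adds it. model and diff are assumed to have equal length.
func applyDiff(model, diff []float32, step float32) {
	for i := range model {
		model[i] += step * diff[i]
	}
}

// momentumState is hypothetical server-side optimizer state for mode 2.
type momentumState struct {
	lr, mu   float32
	velocity []float32
}

// applyGradient sketches mode 2: the trainer sends a raw gradient and the
// server runs the optimizer (here momentum SGD) on its own state.
func (s *momentumState) applyGradient(model, grad []float32) {
	for i := range model {
		s.velocity[i] = s.mu*s.velocity[i] - s.lr*grad[i]
		model[i] += s.velocity[i]
	}
}

func main() {
	m1 := []float32{1, 1}
	applyDiff(m1, []float32{-0.5, 0.25}, 1)
	fmt.Println("mode 1:", m1)

	m2 := []float32{1, 1}
	opt := &momentumState{lr: 0.5, mu: 0.5, velocity: make([]float32, 2)}
	opt.applyGradient(m2, []float32{1, -1})
	opt.applyGradient(m2, []float32{1, -1})
	fmt.Println("mode 2:", m2)
}
```

In mode 1 the server keeps no optimizer state at all, which is why the same server code works regardless of which optimizer the trainer runs locally. In mode 2 the state (here the velocity) has to live on the server.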

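When the optimization runs on the PServer, a sparse parameter update needs one extra step:
Each time the PServer receives a gradient, it updates the corresponding part of the sparse parameter. During that update (and on reads) it "compensates" for the corresponding L1/L2 regularization.
Why the compensation is needed:
On every model update, L1/L2 should be applied to all parameters (including those whose gradient is 0). But the full model is very large, so it is not practical to update the whole model every time a sparse parameter is updated. The corresponding update is therefore delayed until that sparse parameter is next read or written.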

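To summarize, my understanding is that both update methods will eventually need to be supported. The first is better suited to asynchronous SGD, and its PServer implementation is simpler. I think we can support only the first method in the first phase, when we do asynchronous training.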

wangkuiyi · 2017-05-12T01:25:40Z

I agree that we implement Algorithm 1 as our first stage.

typhoonzero · 2017-05-12T05:46:10Z

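@helinwang What I don't understand about method 1 is:

When localUpdater_ is not null, sendType = PARAMETER_DELTA, which is the first method. In this method, finishBatch does not fetch the updated parameters: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/trainer/RemoteParameterUpdater.cpp#L213 So when is the updated model sent down to the trainer?

Also, judging from the current implementation, the v2 Python API uses the second method directly.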

helinwang · 2017-05-12T06:19:06Z

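@typhoonzero Nice attention to detail! My understanding is that a send and receive only happens at this point: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/trainer/RemoteParameterUpdater.h#L220 . I'll look for the actual send/receive code tomorrow.

The plan is for the first version to do SGD only, with no other optimization algorithms, so the implementation needed on the PServer is the same for methods 1 and 2 (the diff or the gradient is multiplied by the step size and added to the model).
The advantage of method 1 is that, with the same pserver implementation, we can support other optimization algorithms (which already exist locally). That is a bit more powerful than the original plan of "the second method of the v2 Python API, supporting only the most basic SGD, with no L1 / L2 loss". Of course, if we keep the existing Python v2 and support only the most basic SGD with no L1 / L2 loss, the implementation is easier (only the PServer has to be implemented, with no Python changes). Because the implementation needed on the PServer is the same, the two do not conflict.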

helinwang · 2017-05-12T21:26:33Z

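@typhoonzero I looked again. Regarding "when is the updated model sent down to the trainer", it should be here: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/trainer/RemoteParameterUpdater.cpp#L558 . The code returns if no remote update is needed and sends otherwise. My understanding is that whenever something is sent, the updated parameters are received as well (https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/trainer/RemoteParameterUpdater.cpp#L537 ).

All of the code above belongs to ConcurrentRemoteParameterUpdater, while what you pasted earlier belongs to RemoteParameterUpdater. It looks like the most recent one is ConcurrentRemoteParameterUpdater: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/trainer/TrainerInternal.cpp#L256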

typhoonzero · 2017-05-13T00:57:05Z

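RemoteParameterUpdater downloads parameters at https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/trainer/RemoteParameterUpdater.cpp#L332

ConcurrentRemoteParameterUpdater is the default updater when distributed training is launched the v1 way. The Python v2 API, however, still uses RemoteParameterUpdater by default, and it does not create localUpdater_. So the v2 Python API can currently probably only implement the momentum SGD optimization algorithm. Other optimization algorithms can be configured and run, but because grads are uploaded rather than deltas, there may be problems?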

helinwang mentioned this issue May 12, 2017

Design Doc: The Client Library of Parameter Server #2075

Merged

typhoonzero mentioned this issue May 13, 2017

Use ConcurrentRemoteParameterUpdater, and use localUpdater_ for other optimization algorithms #2121

Closed

qingqing01 closed this as completed Aug 18, 2017
