Update 05_ddp.md #525
base: master
Conversation
In the sbp example code, wrap the dataloader with `DistributedSampler` so that it performs distributed data partitioning.
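For context, a minimal sketch of what such a wrapping could look like, assuming OneFlow mirrors the PyTorch-style `utils.data` API (`flow.utils.data.DataLoader`, `flow.utils.data.distributed.DistributedSampler`); `ToyDataset` is a hypothetical stand-in for the dataset used in the doc's example:

```python
import oneflow as flow
from oneflow.utils.data import DataLoader, Dataset
from oneflow.utils.data.distributed import DistributedSampler


class ToyDataset(Dataset):
    """Hypothetical dataset: 128 samples, sample i is a (2,) tensor filled with i."""

    def __len__(self):
        return 128

    def __getitem__(self, idx):
        return flow.full((2,), float(idx))


dataset = ToyDataset()
# The sampler shards the dataset across ranks, so each process iterates
# over a disjoint subset of the data instead of the full dataset.
sampler = DistributedSampler(
    dataset,
    num_replicas=flow.env.get_world_size(),
    rank=flow.env.get_rank(),
    shuffle=True,
)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # keep shuffling consistent across ranks each epoch
    for batch in loader:
        pass  # training step goes here
```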
Once the code is settled, remember the English version as well.
I found that the original doc already points out the sampler issue at https://docs.oneflow.org/master/parallelism/05_ddp.html#distributedsampler, so let's keep the initial example unchanged and make the change in the 05_ddp.html#distributedsampler section instead: add an example that uses DistributedSampler, so that single-machine single-GPU and distributed training give consistent results.
It looks like this requested change has not been made yet.
There still seem to be some review comments that have not been handled; please go through them, and reply to each one whether or not you make the change.
Also, if this is in a ready-for-review state, please provide an online preview or a screenshot of the built docs.
@@ -88,6 +91,8 @@
y = y.to_global(placement=PLACEMENT, sbp=S0)
```

- 需要注意的是,在进行分布式并行训练时,代码中规定的`BATCH_SIZE`为每一台机器的本地值而非`GLOBAL_BATCH_SIZE`,故上述代码单机双卡`BATCH_SIZE=64`的训练效果与单机单卡`BATCH_SIZE=128`一致。
- 需要注意的是,在进行分布式并行训练时,代码中规定的`BATCH_SIZE`为每一台机器的本地值而非`GLOBAL_BATCH_SIZE`,故上述代码单机双卡`BATCH_SIZE=64`的训练效果与单机单卡`BATCH_SIZE=128`一致。
- 需要注意的是,在进行分布式并行训练时,代码中规定的 `BATCH_SIZE` 为每一台机器的本地值而非`GLOBAL_BATCH_SIZE`,故上述代码单机双卡 `BATCH_SIZE=64` 的训练效果与单机单卡 `BATCH_SIZE=128` 一致。
There should be a space between Chinese and English text, and between Chinese characters and numbers.
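To illustrate the sentence under discussion, a small sketch assuming the example's `PLACEMENT` spans two GPUs (ranks 0 and 1), `S0` is `flow.sbp.split(0)`, and the script is launched with two processes (e.g. via `python3 -m oneflow.distributed.launch --nproc_per_node 2 ...`): each rank builds a local batch of 64, and `to_global` stitches them into a global batch of 128 along dim 0.

```python
import oneflow as flow

BATCH_SIZE = 64  # local batch size, per machine/rank (not GLOBAL_BATCH_SIZE)
PLACEMENT = flow.placement("cuda", ranks=[0, 1])
S0 = flow.sbp.split(0)

x = flow.randn(BATCH_SIZE, 10)  # each rank builds its own local batch
x_global = x.to_global(placement=PLACEMENT, sbp=S0)

print(x.shape)         # (64, 10)  -> local view on this rank
print(x_global.shape)  # (128, 10) -> global batch = 64 * world_size
```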
Actually, I don't think this sentence needs to be added here: a reader who understands global tensors should already understand this point on their own.
If it really needs explaining, wouldn't it be better to expand the global tensor article instead, explaining the shape of the global tensor after to_global under the various sbp settings?
OK, the global tensor doc already has the corresponding explanation and examples of how tensor shapes change. It's just that a customer asked in the WeChat chat whether this batch_size=64 is local or global, so I wanted to explain it once more here.
My understanding is that the sbp example above is problematic: because it does not use DistributedSampler, the two GPUs get the same data during training, so the global tensor does not actually take effect, and the customer got confused after running it.
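To make the concern concrete, a sketch (assuming a 2-GPU placement and a two-process launch) of how the same local tensor turns into different global shapes under `split(0)` versus `broadcast`, and why identical per-rank data under `split(0)` simply yields a global batch of duplicated samples:

```python
import oneflow as flow

PLACEMENT = flow.placement("cuda", ranks=[0, 1])

# Without DistributedSampler both ranks would load the same samples;
# we imitate that here by building an identical local tensor on each rank.
local = flow.ones(64, 10)

g_split = local.to_global(placement=PLACEMENT, sbp=flow.sbp.split(0))
g_bcast = local.to_global(placement=PLACEMENT, sbp=flow.sbp.broadcast)

print(g_split.shape)  # (128, 10): dim 0 is concatenated across ranks, so
                      # duplicated local data just appears twice globally
print(g_bcast.shape)  # (64, 10): broadcast keeps the global shape equal
                      # to each rank's local shape
```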