2018 milestones #9108
Comments
Should we specify a TensorFlow version (e.g., the latest release, v1.6.0) for performance comparison? Otherwise, if they release a new well-optimized version right after we surpass TensorFlow's performance, we would be in an awkward position. I am not saying we should not aim to be better than the latest TensorFlow; my point is that maybe we should focus on a fixed target first.
Is there a need for model parallelism other than large embedding lookups? If so, we may want to change the wording to "Support large embedding lookups as well as data parallelism".
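For context on why large embedding lookups push toward model parallelism: the table may not fit on one device, so its rows are partitioned across workers and each lookup is routed to the shard that owns the row. Below is a minimal NumPy sketch of that idea; the sizes, shard count, and modulo-based routing are illustrative assumptions, not PaddlePaddle's implementation.

```python
import numpy as np

# Illustrative sizes only: a table too large for one device, split row-wise.
VOCAB, DIM, NUM_SHARDS = 100_000, 64, 4          # VOCAB divisible by NUM_SHARDS
rng = np.random.default_rng(0)

# Each "worker" owns the rows whose id % NUM_SHARDS == shard_id,
# stored locally at index id // NUM_SHARDS.
shards = [rng.standard_normal((VOCAB // NUM_SHARDS, DIM)).astype(np.float32)
          for _ in range(NUM_SHARDS)]

def sharded_lookup(ids):
    """Route each id to its owning shard and gather the embedding rows."""
    ids = np.asarray(ids)
    out = np.empty((len(ids), DIM), dtype=np.float32)
    for shard_id, table in enumerate(shards):
        mask = ids % NUM_SHARDS == shard_id          # ids this worker owns
        out[mask] = table[ids[mask] // NUM_SHARDS]   # local row indices
    return out

print(sharded_lookup([3, 7, 99_999]).shape)          # (3, 64)
```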
Currently we use the parameter server architecture (via send/recv operators) for parameter updates; it is a completely different architecture from all-reduce. From my understanding, their theoretical network throughput consumption and per-step time are similar. Since we already support the parameter server architecture, what is the reason to support another approach with similar performance?
If we use CSP for cluster training, it looks more like the parameter server architecture than the all-reduce architecture.
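To make the comparison concrete, here is a minimal NumPy sketch of the two update patterns being discussed. It illustrates only the communication structure (the worker count, gradients, and averaging step are made up) and is not PaddlePaddle's send/recv or NCCL code path.

```python
import numpy as np

grads = [np.full(4, w, dtype=np.float32) for w in range(1, 4)]  # fake per-worker grads

# Parameter-server style: every worker "sends" its gradient to a central server,
# the server averages and updates the parameter, then workers "recv" it back.
def ps_update(param, worker_grads, lr=0.1):
    avg = np.mean(worker_grads, axis=0)       # aggregation happens on the server
    return param - lr * avg                   # result is sent back to workers

# All-reduce style: workers average gradients among themselves (no central server),
# then every worker applies the same update locally.
def allreduce_update(param, worker_grads, lr=0.1):
    avg = np.add.reduce(worker_grads) / len(worker_grads)  # stands in for all-reduce
    return param - lr * avg                                 # identical on every worker

p = np.zeros(4, dtype=np.float32)
assert np.allclose(ps_update(p, grads), allreduce_update(p, grads))
```

Both paths produce the same update; the difference is where the aggregation happens and how the network traffic is shaped.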
I agree with @helinwang that MPI AllReduce is NOT part of PaddlePaddle's milestones. I am open to someone outside the PaddlePaddle team trying that approach, but it makes sense only if it works when PaddlePaddle jobs run in containers.
Thanks to @reyoung, @helinwang, and others for this list. I tried to summarize the milestones as follows; @PaddlePaddle/paddle team, please comment. A note from @panyx0718 about engineering time: 6 full-time months could mean 3 people spending 2 months full-time on something.
This is a solid list, thanks to everyone who worked on it. (More comments coming soon.)
@helinwang @wangkuiyi The reason for supporting MPI all-reduce is that the latest OpenMPI implementation supports GPUDirect if the hardware does. This is the fastest way to implement very high-performance distributed GPU training. Alternatively, we can still try the time-consuming way: implement GPUDirect using CUDA libraries directly. What do you think?
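To make concrete what a GPUDirect-capable all-reduce buys, here is a minimal sketch (not PaddlePaddle's integration) using mpi4py with CuPy device buffers. It assumes a CUDA-aware MPI build (e.g., OpenMPI built with GPUDirect support) and mpi4py >= 3.1, and would be launched with something like `mpirun -np 2 python allreduce_sketch.py`.

```python
from mpi4py import MPI
import cupy as cp   # assumption: CuPy provides the GPU buffers in this sketch

comm = MPI.COMM_WORLD
grad = cp.full(1024, comm.Get_rank() + 1, dtype=cp.float32)  # fake per-rank gradient
summed = cp.empty_like(grad)

# With a CUDA-aware MPI, this all-reduce operates directly on the GPU buffers,
# so gradients never detour through host memory; without it, the same call
# would require copying to host NumPy arrays first.
comm.Allreduce(grad, summed, op=MPI.SUM)
summed /= comm.Get_size()                 # averaged gradient, identical on every rank
print(comm.Get_rank(), float(summed[0]))
```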
The principle is that we need to make sure the distributed training system is reasonably easy to use. Given that most AI systems depend not only on PaddlePaddle but also on application-specific third-party software, e.g., OpenCV for vision, we'd prefer to run AI applications inside containers.
Another invariant is that we don't want to lose the capability of fault recovery.
Given the above two rules, I would like to see an application of MPI with PaddlePaddle Fluid. I think the capability of starting a distributed job using MPI is more urgent than the speed-up, because it seems that many teams inside Baidu are using MPI.
@wangkuiyi Thanks for the milestone summary. I suggest we add two more columns: 1) the number of full-time engineers and 2) the number of months spent developing them. For example:
Good point! @panyx0718 Please feel free to add these columns.
@typhoonzero @wangkuiyi @PaddleCI Thanks for the comments! Good to know about the MPI need from Baidu, as well as the GPUDirect support in OpenMPI. Given that we already have NCCL all-reduce, the development time for integrating OpenMPI (or NCCL2) may not be that high, and we would get the additional benefit of already-tuned communication and GPUDirect support. That could save us a lot of effort. Fault recovery can be added by checkpointing; fault tolerance on MPI could perhaps be added with some peer-aware logic that creates a new MPI communicator when a node leaves or joins.
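A minimal sketch of the checkpoint-based recovery idea mentioned above. The file path, checkpoint format, and training loop are illustrative assumptions, not Fluid's checkpoint API.

```python
import os
import pickle

CKPT = "checkpoint.pkl"   # hypothetical path; a real job would use shared storage

def save_checkpoint(step, params):
    # Write atomically so a crash mid-write never corrupts the previous checkpoint.
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "params": params}, f)
    os.replace(tmp, CKPT)

def load_checkpoint():
    if not os.path.exists(CKPT):
        return 0, {"w": 0.0}                 # fresh start
    with open(CKPT, "rb") as f:
        state = pickle.load(f)
    return state["step"], state["params"]

# On (re)start, every trainer resumes from the last completed checkpoint,
# so a killed node only loses the work done since that checkpoint.
step, params = load_checkpoint()
for step in range(step, step + 100):
    params["w"] += 0.01                      # stand-in for one training step
    if step % 10 == 0:
        save_checkpoint(step + 1, params)
```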
@wangkuiyi Done
Hello, this issue has had no updates for nearly a month, so we will close it later today. If you still need to follow up after it is closed, you can reopen it and we will reply within 24 hours. We apologize for any inconvenience caused by the closure. Thank you for your support of PaddlePaddle!
Fluid supports multi-GPUs and cluster, and high usability
Deadline:
KPI:

Fluid distributed computing
Deadline:
KPI:

Compatible with ONNX
Deadline:
KPI:
- `ProgramDesc` can be converted to ONNX model files.
- ONNX model files can be converted to `ProgramDesc`, making Fluid able to train ONNX models.

Support CSP program model and imperative programming
Deadline:
KPI:
- The user's program generates a `ProgramDesc`, and an interpreter will execute the `ProgramDesc`.
- `ProgramDesc` includes the `IfElse` operator and `While`, and supports `auto diff`.
- Visualize the `ProgramDesc`. Deeply integrate with `VisualDL` to give a GUI.
- Support CSP concepts (`coroutines`, `channel`, `select`) in `ProgramDesc`. Use CSP to implement multi-GPU and cluster training (a generic CSP illustration follows below).
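For readers unfamiliar with the CSP concepts named in the last KPI, here is a tiny Python illustration of coroutines communicating over channels with a select-like choice, using only the standard library. It is a generic illustration of the programming model, not Fluid's actual `ProgramDesc` operators or API.

```python
import queue
import threading

# Two "coroutines" (threads here) each produce into their own channel (a Queue).
def producer(name, channel, n):
    for i in range(n):
        channel.put(f"{name}:{i}")
    channel.put(None)                       # sentinel: this channel is closed

ch_a, ch_b = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
threading.Thread(target=producer, args=("gpu0", ch_a, 3), daemon=True).start()
threading.Thread(target=producer, args=("gpu1", ch_b, 3), daemon=True).start()

# A crude "select": poll whichever channel has a message ready, until both close.
open_channels = {id(ch_a): ch_a, id(ch_b): ch_b}
while open_channels:
    for key, ch in list(open_channels.items()):
        try:
            msg = ch.get(timeout=0.01)
        except queue.Empty:
            continue
        if msg is None:
            del open_channels[key]          # channel closed
        else:
            print("received", msg)          # e.g. a gradient from one device
```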