Add interface to launch parallel dygraph by multiprocessing #26044

Merged: chenwhql merged 38 commits into PaddlePaddle:develop from chenwhql:dygraph/add_multiprocess_run_interface on Aug 28, 2020

Commits (38)
All 38 commits are by chenwhql:

97b8bdc  add dygraph parallel run interface
00b56d5  polish implement & unified env property name
17f7fe9  add print config arg
07c86aa  refactor init_parallel_env function
4c955a1  Compatible with multiprocessing and launch modes
523e007  set default trainer start port
8101b03  support run in python 2
d3b9a06  polish python2 support code
48c46ff  remove python2 support
b06d400  refine launch import
e1df353  Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
2c7b3fd  polish dome design details
39fddff  Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
d26f495  refactor api implemention & path
bf985cc  use new method _set_expected_place
7939384  add spawn unittest framework & mnist test
95c0367  Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
04580d8  add more unittests & doc
131afd4  fix unittest failed
e170f10  polish english doc
0ef215d  self review and polish details
b27cfee  refactor code by reviewer's comments
f50f343  fix unittest failed
11221a8  fix parallel_env unittest
0980c23  fix several typos
af50518  fix error introduced when fixing typos
a378140  add unpublic note for start_processes
cca82b6  polish details by xiaoguang's comment
82223a6  Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
d39331c  verify correctly when spawn nprocs=-1
10df04c  resolve collective api conflict
3a2d7e8  refactor spawn & init_parallel_env design
0582c4b  polish doc details
9ceaeff  open spawn unittests
4b7d810  try to fix doc compile error
4261e22  try to fix unknown doc format error
cad6872  add skip unittest when not gpu
377c919  resolve develop conflict
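To make the commit messages above concrete: the PR's core deliverable is a spawn-style entry point that coexists with the pre-existing launch mode ("Compatible with multiprocessing and launch modes"). Below is a minimal sketch of the spawn side; the `train` body is a placeholder, not code from this PR, and the launch-mode command is shown only roughly since the exact 2020-era launcher flags are not captured here.

    # Minimal spawn-mode sketch (placeholder `train`; assumes 2 visible GPUs).
    import paddle.distributed as dist

    def train():
        dist.init_parallel_env()  # reads the PADDLE_* env vars set up by spawn
        # ... build a model, wrap it in paddle.DataParallel, run the loop ...

    if __name__ == '__main__':
        # Launch mode starts workers from the shell instead, roughly:
        #   python -m paddle.distributed.launch train_script.py
        dist.spawn(train, nprocs=2)

Per the "verify correctly when spawn nprocs=-1" commit, nprocs=-1 appears to mean "use all visible devices".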
@@ -0,0 +1,184 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import six
import warnings

from paddle import compat as cpt

# deprecated module import
from paddle.fluid import core
from paddle.fluid.framework import _set_expected_place
from paddle.fluid.dygraph import parallel_helper
from paddle.fluid.dygraph.parallel import ParallelEnv

__all__ = ["init_parallel_env"]

ParallelStrategy = core.ParallelStrategy


def init_parallel_env(backend='nccl'):
    """
    Initialize the parallel training environment in dynamic graph mode.

    Args:
        backend(str, optional): The backend used for communication between
            multiple devices. Currently only ``nccl`` is supported. Default
            value is ``nccl`` .

    Returns:
        None

    Examples:
        .. code-block:: python

            import paddle
            import paddle.nn as nn
            import paddle.optimizer as opt
            import paddle.distributed as dist

            class LinearNet(nn.Layer):
                def __init__(self):
                    super(LinearNet, self).__init__()
                    self._linear1 = nn.Linear(10, 10)
                    self._linear2 = nn.Linear(10, 1)

                def forward(self, x):
                    return self._linear2(self._linear1(x))

            def train():
                # 1. enable dynamic mode
                paddle.disable_static()

                # 2. initialize parallel environment
                dist.init_parallel_env()

                # 3. create data parallel layer & optimizer
                layer = LinearNet()
                dp_layer = paddle.DataParallel(layer)

                loss_fn = nn.MSELoss()
                adam = opt.Adam(
                    learning_rate=0.001, parameters=dp_layer.parameters())

                # 4. run layer
                inputs = paddle.randn([10, 10], 'float32')
                outputs = dp_layer(inputs)
                labels = paddle.randn([10, 1], 'float32')
                loss = loss_fn(outputs, labels)

                loss = dp_layer.scale_loss(loss)
                loss.backward()
                dp_layer.apply_collective_grads()

                adam.step()
                adam.clear_grad()

            if __name__ == '__main__':
                dist.spawn(train)
    """

    # 1. input check
    if not isinstance(backend, six.string_types):
        raise TypeError("input `backend` type error, expected type is str, "
                        "but received type is %s." % type(backend))
    if cpt.to_text(backend) != 'nccl':
        raise ValueError(
            "backend `%s` is not supported, now only supports `nccl` backend."
            % backend)

    # 2. check env
    def _check_var_exists(var_name):
        var = os.environ.get(var_name, None)
        if var is None:
            raise ValueError("paddle.distributed initialize error, "
                             "environment variable %s is needed, but not set."
                             % var_name)

    _check_var_exists("FLAGS_selected_gpus")
    _check_var_exists("PADDLE_TRAINER_ID")
    _check_var_exists("PADDLE_CURRENT_ENDPOINT")
    _check_var_exists("PADDLE_TRAINERS_NUM")
    _check_var_exists("PADDLE_TRAINER_ENDPOINTS")

    # 3. init ParallelStrategy
    strategy = ParallelStrategy()
    if cpt.to_text(backend) == 'nccl':
        if parallel_helper._is_parallel_ctx_initialized():
            warnings.warn("The parallel environment has been initialized.")
        strategy.nranks = ParallelEnv().world_size
        strategy.local_rank = ParallelEnv().rank
        strategy.trainer_endpoints = ParallelEnv().trainer_endpoints
        strategy.current_endpoint = ParallelEnv().current_endpoint
        if strategy.nranks < 2:
            return
        # NOTE(chenweihang): [ why config global place here? ]
        # dygraph mode will be set as the default mode, and users will not
        # call `dygraph.guard` or `enable_dygraph` directly; if they want to
        # switch the default place, they need to call a function to change it,
        # so we just set the correct place for users here
        place = core.CUDAPlace(ParallelEnv().device_id)
        _set_expected_place(place)

        # init nccl context
        parallel_helper._set_parallel_ctx(
            core.NCCLParallelContext(strategy, place))
        parallel_helper._init_parallel_ctx()


def get_rank():
    """
    Returns the rank of the current trainer.

    Its value is equal to the value of the environment variable
    ``PADDLE_TRAINER_ID`` . The default value is 0.

    Returns:
        (int) The rank of the current trainer.

    Examples:
        .. code-block:: python

            import paddle
            import paddle.distributed as dist

            # execute this command in terminal: export PADDLE_TRAINER_ID=0
            print("The rank is %d" % dist.get_rank())
            # The rank is 0
    """
    return ParallelEnv().rank


def get_world_size():
    """
    Returns the number of trainers, i.e. the number of processes
    participating in the current job.

    Its value is equal to the value of the environment variable
    ``PADDLE_TRAINERS_NUM`` . The default value is 1.

    Returns:
        (int) The number of trainers.

    Examples:
        .. code-block:: python

            import paddle
            import paddle.distributed as dist

            # execute this command in terminal: export PADDLE_TRAINERS_NUM=4
            print("The world_size is %d" % dist.get_world_size())
            # The world_size is 4
    """
    return ParallelEnv().world_size
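Outside the diff, a quick hypothetical sketch of how the two helpers above are typically combined inside a worker; having rank 0 do the one-time work is a convention, not something this file enforces:

    # Hypothetical worker-side snippet (not part of the diff).
    import paddle.distributed as dist

    def log_progress(step, loss_value):
        rank = dist.get_rank()              # backed by PADDLE_TRAINER_ID
        world_size = dist.get_world_size()  # backed by PADDLE_TRAINERS_NUM
        if rank == 0:
            # only one process writes logs/checkpoints, to avoid clobbering
            print("step %d: loss=%.4f (on %d trainers)"
                  % (step, loss_value, world_size))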
NCCL is an underlying communication library; I don't think it's necessary to let users know we have different backends here. If we want to support operating systems such as Windows that don't support NCCL, it's better to detect the operating system inside the init function and use another communication library, such as gloo. I highly recommend removing the backend argument for now, for simplicity of usage.
Thanks, I think it is okay to remove it; we can discuss removing this argument via a cherry-pick.
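For illustration of the detection idea discussed above, a hypothetical sketch (not code from this PR, and not what Paddle shipped here) of choosing a backend from the platform instead of exposing an argument:

    # Hypothetical backend auto-detection per the review discussion.
    import sys

    def _detect_backend():
        # NCCL requires Linux + GPUs; elsewhere (e.g. Windows), fall back to gloo.
        if sys.platform.startswith('linux'):
            return 'nccl'
        return 'gloo'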