From 159967f678ec26673843e848b65b0067104629b9 Mon Sep 17 00:00:00 2001
From: Kavya Srinet
Date: Tue, 7 Nov 2017 11:03:49 -0800
Subject: [PATCH 1/8] Adding operator assignment

---
 doc/design/multi_device_training.md | 72 +++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)
 create mode 100644 doc/design/multi_device_training.md

diff --git a/doc/design/multi_device_training.md b/doc/design/multi_device_training.md
new file mode 100644
index 0000000000000..150400c0a4eda
--- /dev/null
+++ b/doc/design/multi_device_training.md
@@ -0,0 +1,72 @@
+# Multi-device training in PaddlePaddle
+
+## Why Multi-device
+A typical machine has multiple computing devices. In PaddlePaddle, the device types supported as of now are CPU and GPU (we will add FPGA support in the future).
+
+If a PaddlePaddle operation has both CPU and GPU implementations, we decide which kernel to execute based on the device type.
+Training deep learning models can be resource intensive. Even with a very powerful GPU, some models take a very long time to train. This is especially true of recurrent models, where the execution of each step depends on the execution and output of the previous step.
+
+We also need to support multiple devices during inference. When using an FPGA for inference, we might want to use the CPU for some operator computations, the GPU for some, and the FPGA for the others, since the FPGA does not support all operators at the moment. Multi-device support is needed to facilitate this scenario as well.
+
+We will have various combinations of devices in our PaddlePaddle usage setups. For example:
+
+1. DGX-1: x64 CPU + CUDA
+2. Intel Nervana system: x64 CPU + Xeon Phi
+3. PX2/3: ARM CPU + CUDA
+4. Some servers: x64 CPU + FPGA
+
+Hence, multi-device support will also let us execute on the device combinations discussed above. If we can optimize the usage of the multiple heterogeneous devices available, we can achieve significant speedups during training as well as inference.
+
+There are two ways we could achieve this:
+1. Data Parallelism
+2. Model Parallelism
+
+### Data Parallelism
+Data parallelism works by partitioning the training data over all the devices, thereby distributing the workload across them. Each device has a copy of the complete model and only has access to 1/d of the total training data (if there are d devices in total). The updates from each device (gradients, parameter updates, etc.) are communicated to all the other devices as soon as that device has an update.
+
+### Model Parallelism
+Model parallelism, on the other hand, works by keeping a part of the model on each available device. This is useful when the model is too large to keep on one device, or when there are parts of the model that can be executed independently (in parallel). In this setup, each device trains a part of the model and passes its updates on to the next device.
+
+Here, we want to explore the model parallelism setup, where different parts of the same model reside on different devices and communicate with each other by sending updates.
+
+### Components of Model Parallelism
+Let us look at a very simple example of model parallelism in the figure below:
+
+*Figure: a simple model-parallel setup, with one operator per GPU (image not reproduced here).*
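To make the pipeline in the figure concrete, here is a small, runnable Python sketch of the same setup. It is illustrative only: `run_on_device` and the four toy operators are hypothetical stand-ins, not PaddlePaddle APIs; a real implementation would dispatch each operator's kernel to the named device.

```python
def run_on_device(device, op, x):
    # A real implementation would launch op's kernel on `device`;
    # here we only trace the placement decision.
    print("running %s on %s" % (op.__name__, device))
    return op(x)

def op1(x): return x + 1.0   # placed on GPU: 0
def op2(x): return x * 2.0   # placed on GPU: 1
def op3(x): return x - 3.0   # placed on GPU: 2
def op4(x): return x / 4.0   # placed on GPU: 3

out = 8.0
for device, op in zip(["GPU: 0", "GPU: 1", "GPU: 2", "GPU: 3"],
                      [op1, op2, op3, op4]):
    # The output of each step is the input to the next device's operator.
    out = run_on_device(device, op, out)
```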
+Here we have four GPUs (say GPU: 0, GPU: 1, GPU: 2, and GPU: 3), and each GPU executes a separate operator, viz. Op1, Op2, Op3, and Op4. All the operators together are a part of the same model.
+
+### Operator placement policy
+Apart from distributing data and model components across different computing devices, PaddlePaddle should let users explicitly decide which operator (operation: MatMul, etc.) should run on which device (computation device: CPU, GPU, FPGA, etc.). This can be modeled once we design the Python API.
+
+There are various ways of addressing this setup:
+1. Pick CPU by default.
+2. Pick GPU by default, if the machine has a GPU.
+3. Pick the first GPU by default, if the machine has multiple GPUs.
+4. Provide the functionality to support explicit assignment of devices to operations, using configuration options when setting up the devices. TensorFlow supports this very elegantly, as described [here](https://www.tensorflow.org/tutorials/using_gpu#manual_device_placement).
+
+We can discuss this in more detail when designing the Python API.
+
+### Copy operator
+Now, to pass updates from GPU: 0 to GPU: 1, we need to somehow copy the updates made by GPU: 0 and move them to GPU: 1. This can be done in two ways:
+1. Copy updates from GPU: 0 to the CPU, then copy them from the CPU to GPU: 1. This is shown as follows:
+
+*Figure: copying updates from GPU: 0 to GPU: 1 via the CPU (image not reproduced here).*
+2. Copy updates directly from GPU: 0 to GPU: 1, shown as follows:
+
+*Figure: copying updates directly from GPU: 0 to GPU: 1 (image not reproduced here).*
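The two routes can be contrasted with a small, illustrative Python sketch. Here `memcpy` is a hypothetical stand-in for the CUDA memcpy wrappers in `paddle/memory/memcpy.h`, not a PaddlePaddle Python API; the point is only that route 1 issues two memcpy operations while route 2 issues one.

```python
def memcpy(dst_place, src_place, data, trace):
    # Stand-in for the low-level memcpy wrappers; a real copy would
    # move `data` between the two places instead of just tracing.
    trace.append("memcpy %s -> %s" % (src_place, dst_place))
    return list(data)

def copy_via_cpu(updates, trace):
    """Route 1: stage the updates on the CPU (two memcpy operations)."""
    staged = memcpy("CPU", "GPU: 0", updates, trace)
    return memcpy("GPU: 1", "CPU", staged, trace)

def copy_direct(updates, trace):
    """Route 2: copy peer-to-peer between the GPUs (one memcpy operation)."""
    return memcpy("GPU: 1", "GPU: 0", updates, trace)

grads, trace1, trace2 = [0.1, -0.2, 0.05], [], []
assert copy_via_cpu(grads, trace1) == copy_direct(grads, trace2)
assert len(trace1) == 2 and len(trace2) == 1
```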
+The first approach above requires two memcpy operations: one from GPU to CPU and another from CPU to GPU. The second approach, however, requires just one memcpy operation (from one GPU to another).
+
+We have some low-level implementations of CUDA memcpy [here](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/memory/memcpy.h); the next step would be to write C++ operators that expose these functions.
+
+### Python API
+To enable users to use the setup discussed above, we need to provide a Python API. We specifically need two components here:
+1. The Python API: the interface that PaddlePaddle users will use to set up model/data parallelism.
+2. ProgramDesc: behind the scenes, we need a module that can convert the Python API calls to a ProgramDesc (which is a collection of repeated OpDescs). The ProgramDesc will then be sent to the Executor, which creates the Ops and eventually runs them.
+
+We need to design the above two components, and also propose how the Python API will be parsed into a ProgramDesc.
+These components will be addressed in follow-up design documents, one for each component.
+
+### C API
+We also need to define a C API to support the functionality discussed in this design document. Since the C API is mostly used during inference, we may also need a C-API document covering multi-device support.
+We will address the design of this API in a follow-up design document.

From c7ce315f2e7cb87640662850e5544994a2ade967 Mon Sep 17 00:00:00 2001
From: Kavya Srinet
Date: Wed, 8 Nov 2017 13:30:37 -0800
Subject: [PATCH 2/8] Adding a prototype for documentation in layers

---
 python/paddle/v2/framework/layers.py | 30 +++++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/python/paddle/v2/framework/layers.py b/python/paddle/v2/framework/layers.py
index 22540b2b9731e..6322960de20ea 100644
--- a/python/paddle/v2/framework/layers.py
+++ b/python/paddle/v2/framework/layers.py
@@ -22,12 +22,36 @@ def fc(input,
        num_flatten_dims=1,
        main_program=None,
        startup_program=None):
-    # create helper
+    """
+    Fully Connected Layer.
+
+    Args:
+        input: The input tensor(s) to the function
+        size: The size of the layer
+        param_attr: The parameters/weights to the FC Layer
+        bias_attr: The bias parameter for the FC layer
+        name: Name/alias of the function
+        act: Activation to be applied to the output of the FC layer
+        num_flatten_dims: The number of leading dimensions of the input that
+                          are flattened into the row dimension of the matrix
+                          multiplication
+        main_program: The main program that calls this
+        startup_program: The startup program
+
+    This function can take in multiple inputs and performs the Fully Connected
+    function (linear transformation) on top of each of them.
+    So for an input x, the output will be: Wx + b, where W is the parameter,
+    b the bias, and x the input.
+
+    The function also applies an activation (non-linearity) on top of the
+    output, if an activation is passed in.
+
+    All the input variables are passed in as local variables to the
+    LayerHelper function.
+
+    """
     helper = LayerHelper('fc', **locals())
     dtype = helper.input_dtype()
 
-    # mul
     mul_results = []
     for input_var, param_attr in helper.iter_inputs_and_params():
         input_shape = input_var.shape
@@ -531,7 +555,7 @@ def batch_norm(input,
 
 class BlockGuard(object):
     """
-    BlockGuard used to create sub-block in program by using Python `with`
+    BlockGuard used to create sub-block in program by using Python `with`
     keyword.
""" From e7cd591ca301e4d23f2ed73d4c8cf44ca5c4d1fe Mon Sep 17 00:00:00 2001 From: Kavya Srinet Date: Wed, 8 Nov 2017 13:54:21 -0800 Subject: [PATCH 3/8] small update to re-run Travis --- python/paddle/v2/framework/layers.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/python/paddle/v2/framework/layers.py b/python/paddle/v2/framework/layers.py index 6322960de20ea..cbec7c66bdd45 100644 --- a/python/paddle/v2/framework/layers.py +++ b/python/paddle/v2/framework/layers.py @@ -44,8 +44,8 @@ def fc(input, The function also applies an activation (non-linearity) on top of the output, if activation is passed in the input. - All the input variables are passed in as local variables to the - LayerHelper function. + All the input variables pf this function are passed in as local variables + to the LayerHelper constructor. """ helper = LayerHelper('fc', **locals()) From 8306bebfadb2afecd7762124a0e8eead2cdbbf7f Mon Sep 17 00:00:00 2001 From: Kavya Srinet Date: Wed, 8 Nov 2017 13:56:46 -0800 Subject: [PATCH 4/8] Removing file from another PR --- doc/design/multi_device_training.md | 72 ----------------------------- 1 file changed, 72 deletions(-) delete mode 100644 doc/design/multi_device_training.md diff --git a/doc/design/multi_device_training.md b/doc/design/multi_device_training.md deleted file mode 100644 index 150400c0a4eda..0000000000000 --- a/doc/design/multi_device_training.md +++ /dev/null @@ -1,72 +0,0 @@ -# Multi-device training in PaddlePaddle - -## Why Multi-device -On a typical device or system, there are multiple computing devices. In PaddlePaddle, the supported device types as of now, are CPU and GPU (we will be adding FPGA in the future too). - -If a PaddlePaddle operation has both CPU and GPU implementations, we decide which kernel to execute based on the device type. -Training deep learning models can be resource intensive. Even with a very powerful GPU, some models can take really long to train. This is obvious with deep learning models, especially recurrent models where the execution of each step depends on the execution and output of previous step. - -We also need to support multi-device during inference. When using FPGA for inference, we might want to use the CPU for some operator computations, GPU for some and FPGA for the others, since FPGA does not support all operators at the moment. In this setup, we also need multi-device support to facilitate the above scenario. - -We will have various combinations of devices in our PaddlePaddle usage setup. For example: - -1. DGX-1 : x64 CPU + CUDA -2. Intel Nervana system: x64 CPU + Xeon Phi -3. PX2/3 : ARM CPU + CUDA -4. Some servers : x64 CPU + FPGA - -Hence multi-device support will help us facilitate execution in the discussed device combinations as well. If we could come up with a way to optimize the usage of multiple heterogeneous devices that are available, we can achieve significant speedups during training as well as inference. - -There are two ways we could achieve this: -1. Data Parallelism -2. Model Parallelism - -### Data Parallelism -Data parallelism works by partitioning the training data over all the devices and hence distributes the workload over multiple devices. Each device has a copy of the complete model and only has access to 1/d of the total training data (if there are d devices in total). The updates from each device (gradients, parameter updates etc.) are communicated across all the devices once the device has an update. 
- -### Model Parallelism -Model parallelism on the other hand, works by keeping a part of the model on each available device. This is useful when the model is too large to keep on one device or when there are parts of the model that can be executed independently ( in parallel). In this setup, each device will train a part of the model and pass on its updates to the next device. - -Here, we want to explore the model parallelism setup, where different parts of the same model reside on different devices and communicate to each other by sending updates. - -### Components of Model Parallelism -Let us look at a very simple example of model parallelism in the figure below: -
- -Here we have four GPUs, (say GPU: 0, GPU: 1, GPU: 2 and GPU: 3) and each GPU is executing a separate operator viz. Op1, Op2, Op3 and Op4. All the operators together are a part of the same model. - -### Operator placement policy -Apart from distributing data and model components on different computing devices, PaddlePaddle should support a feature of letting the users explicitly decide which operator(operation: MatMul etc) should run on which device (computation device: CPU, GPU, FPGA etc.). This can be modeled once we design the python API. - -There are various ways of addressing this setup: -1. Pick CPU by default. -2. Pick GPU by default, if the device has a GPU. -3. Pick the first GPU by default, if the device has multiple GPUs. -4. Provide the functionality to support explicit assignment of device for operations, using some configuration options when setting up the devices. TensorFlow supports this very elegantly as mentioned [here](https://www.tensorflow.org/tutorials/using_gpu#manual_device_placement) - -We can discuss this in more detail when designing the Python API. - -### Copy operator -Now to pass on the updates from GPU: 0 to GPU: 1, we need to somehow copy the updates made by GPU: 0 and move them to GPU: 1 . This can be done in two ways: -1. Copy updates from GPU: 0 to CPU. Then copy updates from CPU to GPU: 1. This is shown as follows: -
- -2. Copy updates directly from GPU: 0 to GPU: 1, shown as follows: -
-The first approach above requires two memcpy operations: one from GPU to CPU and another from CPU to GPU. The second approach, however, requires just one memcpy operation (from one GPU to another).
-
-We have some low-level implementations of CUDA memcpy [here](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/memory/memcpy.h); the next step would be to write C++ operators that expose these functions.
-
-### Python API
-To enable users to use the setup discussed above, we need to provide a Python API. We specifically need two components here:
-1. The Python API: the interface that PaddlePaddle users will use to set up model/data parallelism.
-2. ProgramDesc: behind the scenes, we need a module that can convert the Python API calls to a ProgramDesc (which is a collection of repeated OpDescs). The ProgramDesc will then be sent to the Executor, which creates the Ops and eventually runs them.
-
-We need to design the above two components, and also propose how the Python API will be parsed into a ProgramDesc.
-These components will be addressed in follow-up design documents, one for each component.
-
-### C API
-We also need to define a C API to support the functionality discussed in this design document. Since the C API is mostly used during inference, we may also need a C-API document covering multi-device support.
-We will address the design of this API in a follow-up design document.

From bdb5a406100ef3197752ac9d580fb4afdd055342 Mon Sep 17 00:00:00 2001
From: Kavya Srinet
Date: Wed, 8 Nov 2017 14:50:11 -0800
Subject: [PATCH 5/8] Small typo

---
 python/paddle/v2/framework/layers.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/paddle/v2/framework/layers.py b/python/paddle/v2/framework/layers.py
index 98e497169d429..b17bc9bb069ff 100644
--- a/python/paddle/v2/framework/layers.py
+++ b/python/paddle/v2/framework/layers.py
@@ -44,7 +44,7 @@ def fc(input,
     The function also applies an activation (non-linearity) on top of the
     output, if an activation is passed in.
 
-    All the input variables pf this function are passed in as local variables
+    All the input variables of this function are passed in as local variables
     to the LayerHelper constructor.

From 934d3be74cb5be5fadb72d5c93669757069f4967 Mon Sep 17 00:00:00 2001
From: Kavya Srinet
Date: Wed, 20 Dec 2017 16:32:39 -0800
Subject: [PATCH 6/8] Adding documentation for the operators:
 lod_tensor_to_array, array_to_lod_tensor, create_array, increment

---
 python/paddle/v2/fluid/layers/control_flow.py | 78 ++++++++++++++++---
 1 file changed, 68 insertions(+), 10 deletions(-)

diff --git a/python/paddle/v2/fluid/layers/control_flow.py b/python/paddle/v2/fluid/layers/control_flow.py
index dc6c0e7f518ee..655581cd14418 100644
--- a/python/paddle/v2/fluid/layers/control_flow.py
+++ b/python/paddle/v2/fluid/layers/control_flow.py
@@ -440,9 +440,23 @@ def topk(input, k):
 
 def lod_tensor_to_array(x, table):
-    """
-    This function creates an operator to convert an LOD_Tensor to
-    an array.
+    """This function performs the operation that converts an LOD_Tensor to
+    an array.
+
+    Args:
+        x (LoDTensor): The tensor that needs to be converted to an array.
+        table (LoDRankTable): The RankTable that provides the coarse lod
+                              information to build the output LoDTensorArray.
+
+    Returns:
+        Variable: The tensor variable of type LOD_TENSOR.
+
+    Examples:
+        .. code-block:: python
+
+                x = fluid.layers.data(name='x', shape=[10])
+                table = fluid.layers.lod_rank_table(x, level=0)
+                array = fluid.layers.lod_tensor_to_array(x, table)
     """
     helper = LayerHelper("lod_tensor_to_array", **locals())
     array = helper.create_variable(
@@ -458,9 +472,24 @@ def lod_tensor_to_array(x, table):
 
 def array_to_lod_tensor(x, table):
-    """
-    This function creates an operator to convert an array to a
-    LOD_Tensor.
+    """This function performs the operation that converts an array to
+    an LOD_Tensor.
+
+    Args:
+        x (LoDTensorArray): The array that needs to be converted to LOD_Tensor.
+        table (LoDRankTable): The RankTable that provides the coarse lod
+                              information to build the output LoDTensor.
+
+    Returns:
+        Variable: The array variable of type LOD_TENSOR_ARRAY.
+
+    Examples:
+        .. code-block:: python
+
+                x = fluid.layers.data(name='x', shape=[10])
+                table = fluid.layers.lod_rank_table(x, level=0)
+                array = fluid.layers.lod_tensor_to_array(x, table)
+                lod_tensor = fluid.layers.array_to_lod_tensor(array, table)
     """
     helper = LayerHelper("array_to_lod_tensor", **locals())
     tmp = helper.create_tmp_variable(dtype=x.dtype)
@@ -473,10 +502,24 @@ def array_to_lod_tensor(x, table):
 
 def increment(x, value=1.0, in_place=True):
-    """
-    This function creates an operator to increment each value in the input
-    `x` by an amount: `value` as mentioned in the input parameter. This
-    operation is performed in-place by default.
+    """This function performs an operation that increments each value in the
+    input :math:`x` by an amount: :math:`value` as mentioned in the input
+    parameter. This operation is performed in-place by default.
+
+    Args:
+        x (LoDTensor): The tensor that has the input values.
+        value (float): The amount by which the values should be incremented.
+        in_place (bool): If the increment should be performed in-place.
+
+    Returns:
+        Variable: The tensor variable storing the transformation of
+        element-wise increment of each value in the input.
+
+    Examples:
+        .. code-block:: python
+
+                data = fluid.layers.data(name='data', shape=[32, 32], dtype='float32')
+                data = fluid.layers.increment(x=data, value=3.0, in_place=True)
     """
     helper = LayerHelper("increment", **locals())
     if not in_place:
@@ -511,6 +554,21 @@ def array_write(x, i, array=None):
 
 def create_array(dtype):
+    """This function creates an array of type :math:`LOD_TENSOR_ARRAY` using the
+    LayerHelper.
+
+    Args:
+        dtype (data type): The data type of the elements in the array.
+
+    Returns:
+        Variable: The tensor variable storing the elements of data type.
+
+    Examples:
+        .. code-block:: python
+
+                data = fluid.layers.create_array(dtype='float32')
+
+    """
     helper = LayerHelper("array", **locals())
     return helper.create_variable(
         name="{0}.out".format(helper.name),

From 395ebf217bd1ab360715fb78e372a087c3d6382b Mon Sep 17 00:00:00 2001
From: Kavya Srinet
Date: Wed, 20 Dec 2017 17:04:20 -0800
Subject: [PATCH 7/8] Fixing indentation issue

---
 python/paddle/v2/fluid/layers/control_flow.py | 20 +++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/python/paddle/v2/fluid/layers/control_flow.py b/python/paddle/v2/fluid/layers/control_flow.py
index 655581cd14418..f7d3f6e280d3b 100644
--- a/python/paddle/v2/fluid/layers/control_flow.py
+++ b/python/paddle/v2/fluid/layers/control_flow.py
@@ -454,9 +454,9 @@ def lod_tensor_to_array(x, table):
     Examples:
         .. code-block:: python
 
-                x = fluid.layers.data(name='x', shape=[10])
-                table = fluid.layers.lod_rank_table(x, level=0)
-                array = fluid.layers.lod_tensor_to_array(x, table)
+            x = fluid.layers.data(name='x', shape=[10])
+            table = fluid.layers.lod_rank_table(x, level=0)
+            array = fluid.layers.lod_tensor_to_array(x, table)
     """
     helper = LayerHelper("lod_tensor_to_array", **locals())
     array = helper.create_variable(
@@ -486,10 +486,10 @@ def array_to_lod_tensor(x, table):
     Examples:
         .. code-block:: python
 
-                x = fluid.layers.data(name='x', shape=[10])
-                table = fluid.layers.lod_rank_table(x, level=0)
-                array = fluid.layers.lod_tensor_to_array(x, table)
-                lod_tensor = fluid.layers.array_to_lod_tensor(array, table)
+            x = fluid.layers.data(name='x', shape=[10])
+            table = fluid.layers.lod_rank_table(x, level=0)
+            array = fluid.layers.lod_tensor_to_array(x, table)
+            lod_tensor = fluid.layers.array_to_lod_tensor(array, table)
     """
     helper = LayerHelper("array_to_lod_tensor", **locals())
     tmp = helper.create_tmp_variable(dtype=x.dtype)
@@ -518,8 +518,8 @@ def increment(x, value=1.0, in_place=True):
     Examples:
         .. code-block:: python
 
-                data = fluid.layers.data(name='data', shape=[32, 32], dtype='float32')
-                data = fluid.layers.increment(x=data, value=3.0, in_place=True)
+            data = fluid.layers.data(name='data', shape=[32, 32], dtype='float32')
+            data = fluid.layers.increment(x=data, value=3.0, in_place=True)
     """
     helper = LayerHelper("increment", **locals())
     if not in_place:
@@ -566,7 +566,7 @@ def create_array(dtype):
     Examples:
         .. code-block:: python
 
-                data = fluid.layers.create_array(dtype='float32')
+            data = fluid.layers.create_array(dtype='float32')
 
     """
     helper = LayerHelper("array", **locals())

From 6ffa2e51929e2b78a4a2c2d79daf1a71814cbe96 Mon Sep 17 00:00:00 2001
From: Kavya Srinet
Date: Wed, 20 Dec 2017 21:30:23 -0800
Subject: [PATCH 8/8] Fixed datatype of input variables

---
 python/paddle/v2/fluid/layers/control_flow.py | 26 +++++++++++--------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/python/paddle/v2/fluid/layers/control_flow.py b/python/paddle/v2/fluid/layers/control_flow.py
index f7d3f6e280d3b..73dafa46764cf 100644
--- a/python/paddle/v2/fluid/layers/control_flow.py
+++ b/python/paddle/v2/fluid/layers/control_flow.py
@@ -444,12 +444,14 @@ def lod_tensor_to_array(x, table):
     an array.
 
     Args:
-        x (LoDTensor): The tensor that needs to be converted to an array.
-        table (LoDRankTable): The RankTable that provides the coarse lod
-                              information to build the output LoDTensorArray.
+        x (Variable|list): The tensor that needs to be converted to an array.
+        table (ParamAttr|list): The variable that stores the level of lod
+                                which is ordered by sequence length in
+                                descending order.
 
     Returns:
-        Variable: The tensor variable of type LOD_TENSOR.
+        Variable: The variable of type array that has been converted from a
+                  tensor.
 
     Examples:
         .. code-block:: python
 
            x = fluid.layers.data(name='x', shape=[10])
            table = fluid.layers.lod_rank_table(x, level=0)
            array = fluid.layers.lod_tensor_to_array(x, table)
     """
     helper = LayerHelper("lod_tensor_to_array", **locals())
     array = helper.create_variable(
@@ -476,12 +478,14 @@ def array_to_lod_tensor(x, table):
     """This function performs the operation that converts an array to
     an LOD_Tensor.
 
     Args:
-        x (LoDTensorArray): The array that needs to be converted to LOD_Tensor.
-        table (LoDRankTable): The RankTable that provides the coarse lod
-                              information to build the output LoDTensor.
+        x (Variable|list): The array that needs to be converted to a tensor.
+        table (ParamAttr|list): The variable that stores the level of lod
+                                which is ordered by sequence length in
+                                descending order.
 
     Returns:
-        Variable: The array variable of type LOD_TENSOR_ARRAY.
+        Variable: The variable of type tensor that has been converted
+                  from an array.
 
     Examples:
         .. code-block:: python
 
            x = fluid.layers.data(name='x', shape=[10])
            table = fluid.layers.lod_rank_table(x, level=0)
            array = fluid.layers.lod_tensor_to_array(x, table)
            lod_tensor = fluid.layers.array_to_lod_tensor(array, table)
     """
     helper = LayerHelper("array_to_lod_tensor", **locals())
     tmp = helper.create_tmp_variable(dtype=x.dtype)
@@ -507,13 +511,13 @@ def increment(x, value=1.0, in_place=True):
     parameter. This operation is performed in-place by default.
 
     Args:
-        x (LoDTensor): The tensor that has the input values.
+        x (Variable|list): The tensor that has the input values.
         value (float): The amount by which the values should be incremented.
         in_place (bool): If the increment should be performed in-place.
 
     Returns:
         Variable: The tensor variable storing the transformation of
-        element-wise increment of each value in the input.
+                  element-wise increment of each value in the input.
 
     Examples:
         .. code-block:: python
 
            data = fluid.layers.data(name='data', shape=[32, 32], dtype='float32')
            data = fluid.layers.increment(x=data, value=3.0, in_place=True)
     """
     helper = LayerHelper("increment", **locals())
     if not in_place:
@@ -558,7 +562,7 @@ def create_array(dtype):
     LayerHelper.
 
     Args:
-        dtype (data type): The data type of the elements in the array.
+        dtype (int|float): The data type of the elements in the array.
 
     Returns:
         Variable: The tensor variable storing the elements of data type.
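Taken together, the docstring examples above compose into a single pipeline. The sketch below chains them, assuming the `paddle.v2.fluid` import path used by the files in this PR; it is illustrative and has not been run against this exact revision.

```python
import paddle.v2.fluid as fluid

# Round-trip a LoDTensor through a tensor array, then increment the result.
x = fluid.layers.data(name='x', shape=[10])
table = fluid.layers.lod_rank_table(x, level=0)

array = fluid.layers.lod_tensor_to_array(x, table)           # LoDTensor -> array
lod_tensor = fluid.layers.array_to_lod_tensor(array, table)  # array -> LoDTensor
out = fluid.layers.increment(x=lod_tensor, value=3.0, in_place=True)
```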