
start from 1.12, torch_ccl is renamed as oneccl_bindings_for_pytorch … #18229

Merged: 3 commits merged into huggingface:main on Jul 27, 2022

Conversation

sywangyi (Contributor)

…and should import it before use

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

What does this PR do?

When running Transformers with torch 1.12, we should also pip install oneCCL (version 1.12) to enable DDP fine-tuning on CPU:
python -m pip install oneccl_bind_pt==1.12.0 -f https://developer.intel.com/ipex-whl-stable
Starting from 1.12.0, the module name changes to oneccl_bindings_for_pytorch, and it must be imported before use; otherwise an error occurs (see the sketch below).
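
For illustration, a minimal sketch of the version-guarded import this PR describes (the `packaging`-based comparison is an assumption; the actual check in `src/transformers/utils/import_utils.py` may differ):

```python
# Sketch only: choose the oneCCL bindings module based on the installed PyTorch
# version, as described above. The real guard in import_utils.py may differ.
import torch
from packaging import version

if version.parse(torch.__version__) >= version.parse("1.12.0"):
    # New module name starting with the 1.12.0 bindings.
    import oneccl_bindings_for_pytorch  # noqa: F401
else:
    # Legacy module name used by older releases.
    import torch_ccl  # noqa: F401
```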

Fixes # (issue)
as described above.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Library:

…and should import it before use

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
@sywangyi (Contributor, Author)

@yao-matrix @liangan1 please review

@HuggingFaceDocBuilderDev commented Jul 21, 2022

The documentation is not available anymore as the PR was closed or merged.

@sgugger (Collaborator) left a comment


Thanks for your PR. Note that it needs to be documented if you want users to be able to use this integration properly.

Inline comment on src/transformers/utils/import_utils.py (resolved)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
@sywangyi (Contributor, Author)

@sgugger the documentation has been uploaded

@sywangyi (Contributor, Author)

Hi @sgugger, this fix is aligned with what we do in the Accelerate PR; without the correct module import, DDP cannot work with the CCL backend.

@sgugger (Collaborator) left a comment


Thanks for drafting some documentation. I know the code is the same as for Accelerate but I have a couple of comments on the doc before we can merge this.

Comment on lines 28 to 42
For PyTorch-1.10:

```
pip install oneccl_bind_pt==1.10.0 -f https://software.intel.com/ipex-whl-stable
```
For PyTorch-1.11:

```
pip install oneccl_bind_pt==1.11.0 -f https://software.intel.com/ipex-whl-stable
```
For PyTorch-1.12:

```
pip install oneccl_bind_pt==1.12.0 -f https://software.intel.com/ipex-whl-stable
```

It doesn't seem likely that we will remember to add each new PyTorch version, so maybe just say

Suggested change
For PyTorch-1.10:
```
pip install oneccl_bind_pt==1.10.0 -f https://software.intel.com/ipex-whl-stable
```
For PyTorch-1.11:
```
pip install oneccl_bind_pt==1.11.0 -f https://software.intel.com/ipex-whl-stable
```
For PyTorch-1.12:
```
pip install oneccl_bind_pt==1.12.0 -f https://software.intel.com/ipex-whl-stable
```
```bash
pip install oneccl_bind_pt=={pytorch_version} -f https://software.intel.com/ipex-whl-stable
```
where `{pytorch_version}` should be your PyTorch version, for instance 1.12.0

and add a comment about whether the micro version should always stay at 0, and/or a link to the list of supported versions you have.


# Efficient Training on Multiple CPUs

When training on a single CPU is too slow, we will use multiple CPUs, This guide focuses on PyTorch-based DDP enabling and how to do it efficiently.

Suggested change
When training on a single CPU is too slow, we will use multiple CPUs, This guide focuses on PyTorch-based DDP enabling and how to do it efficiently.
When training on a single CPU is too slow, we can use multiple CPUs. This guide focuses on PyTorch-based DDP enabling distributed CPU training efficiently.


## Intel® oneCCL Bindings for PyTorch

Intel® oneCCL (collective communications library) is a library for efficient distributed deep learning training implementing such collectives like allreduce, allgather, alltoall. For more information on oneCCL, please refer to the oneCCL documentation and oneCCL specification.

You should add links for "oneCCL documentation" and "oneCCL specification" here.



The oneccl_bindings_for_pytorch module implements the PyTorch C10D ProcessGroup API and can be dynamically loaded as an external ProcessGroup; it currently only works on the Linux platform.

Is it oneccl_bind_pt or oneccl_bindings_for_pytorch? Also what is a "ProcessGroup API"? Not sure this sentence adds anything to the doc.
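
For context (not part of the PR), a minimal sketch of what implementing the ProcessGroup API means in practice: importing the bindings registers a `ccl` backend that `torch.distributed` can use. The single-process environment variables below are assumptions for illustration; a real run would get them from mpirun or the launcher.

```python
# Sketch: after importing the bindings, "ccl" is available as a
# torch.distributed backend (an external ProcessGroup implementation).
import os

import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  # registers the "ccl" backend

# Rendezvous variables for a single-process illustration; normally provided by
# mpirun / the distributed launcher.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

dist.init_process_group(backend="ccl")
print(f"backend={dist.get_backend()}, world_size={dist.get_world_size()}")
dist.destroy_process_group()
```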

Check more approaches for [oneccl_bind_pt installation](https://github.com/intel/torch-ccl).

### Usage in Trainer
To enable DDP in Trainer with ccl backend, users should add **`--xpu_backend ccl`** in training command arguments.

Suggested change
To enable DDP in Trainer with ccl backend, users should add **`--xpu_backend ccl`** in training command arguments.
To enable multi CPU distributed training in the Trainer with the ccl backend, users should add **`--xpu_backend ccl`** in the command arguments.
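
For readers who configure the Trainer from Python rather than through the example scripts, a minimal sketch of the equivalent setting (assuming the `xpu_backend` field on `TrainingArguments` that backs the `--xpu_backend` command-line flag):

```python
from transformers import TrainingArguments

# Sketch: the --xpu_backend ccl command-line flag maps to this field when the
# example scripts parse arguments into TrainingArguments.
training_args = TrainingArguments(
    output_dir="/tmp/test-ccl",  # hypothetical output directory
    no_cuda=True,                # stay on CPU
    xpu_backend="ccl",           # use oneCCL for distributed CPU training
)
```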


Take an example of the use cases on [Transformers question-answering](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering)

following command enables **2DDP** in one Xeon node, with one process running per one socket, OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.

Suggested change
following command enables **2DDP** in one Xeon node, with one process running per one socket, OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.
The following command enables training with 2 processes on one Xeon node, with one process running per one socket. The variables OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.

2DDP won't mean anything to the user.

```
--no_cuda \
--xpu_backend ccl
```
following command enables **4DDP** in two Xeons (node0 and node1, taking node0 as the master), ppn(processes per node) is set to 2, with one process running per one socket, OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.

Suggested change
following command enables **4DDP** in two Xeons (node0 and node1, taking node0 as the master), ppn(processes per node) is set to 2, with one process running per one socket, OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.
The following command enables training with a total of four processes on two Xeons (node0 and node1, taking node0 as the main process), ppn (processes per node) is set to 2, with one process running per one socket. The variables OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.


in node0, you need to create a config file which contains ip of each node(for ex: hostfile) and pass to mpirun as a argument

Suggested change
in node0, you need to create a config file which contains ip of each node(for ex: hostfile) and pass to mpirun as a argument
In node0, you need to create a configuration file which contains the IP addresses of each node (for example hostfile) and pass that configuration file path as an argument.

```
xxx.xxx.xxx.xxx #node0 ip
xxx.xxx.xxx.xxx #node1 ip
```
run the following command in node0 and **4DDP** will be enabled in node0 and node1

Suggested change
run the following command in node0 and **4DDP** will be enabled in node0 and node1
Now, run the following command in node0 and **4DDP** will be enabled in node0 and node1:


### Intel® oneCCL Bindings for PyTorch installation:

Wheel files are avaiable for the following Python versions:

Suggested change
Wheel files are avaiable for the following Python versions:
Wheel files are available for the following Python versions:

@sywangyi (Contributor, Author)

@sgugger thanks for the careful review. The doc has been updated based on your comments.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
@sgugger (Collaborator) left a comment

Thanks for iterating on this!

@sgugger sgugger merged commit 2b81f72 into huggingface:main Jul 27, 2022
oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request Sep 26, 2022
start from 1.12, torch_ccl is renamed as oneccl_bindings_for_pytorch and should import it before use (huggingface#18229)

* start from 1.12, torch_ccl is renamed as oneccl_bindings_for_pytorch and should import it before use

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add doc for perf_train_cpu_many

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* update doc

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
@sywangyi sywangyi deleted the ccl_1.12 branch October 21, 2022 12:18