
start from 1.12, torch_ccl is renamed as oneccl_bindings_for_pytorch … #18229

Merged: 3 commits merged into huggingface:main on Jul 27, 2022

Conversation

sywangyi (Contributor)

…and should import it before use

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

What does this PR do?

When running Transformers with torch 1.12, we should also pip install oneCCL (version 1.12) to enable DDP fine-tuning on CPU:
python -m pip install oneccl_bind_pt==1.12.0 -f https://developer.intel.com/ipex-whl-stable
Starting from 1.12.0, the module name changes to oneccl_bindings_for_pytorch, and it must be imported before use; otherwise an error occurs (see the sketch below).
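
For illustration, a minimal sketch of the version-guarded import this PR describes (the `packaging`-based comparison is an assumption; the actual check in `src/transformers/utils/import_utils.py` may differ):

```python
# Sketch only: choose the oneCCL bindings module based on the installed PyTorch
# version, as described above. The real guard in import_utils.py may differ.
import torch
from packaging import version

if version.parse(torch.__version__) >= version.parse("1.12.0"):
    # New module name starting with the 1.12.0 bindings.
    import oneccl_bindings_for_pytorch  # noqa: F401
else:
    # Legacy module name used by older releases.
    import torch_ccl  # noqa: F401
```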

Fixes # (issue)
as described above.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Library:

…and should import it before use

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
@sywangyi (Contributor, Author)

@yao-matrix @liangan1 please review

@HuggingFaceDocBuilderDev commented Jul 21, 2022

The documentation is not available anymore as the PR was closed or merged.

@sgugger (Collaborator) left a comment


Thanks for your PR. Note that it needs to be documented if you want users to be able to use this integration properly.

Inline comment on src/transformers/utils/import_utils.py (resolved)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
@sywangyi (Contributor, Author)

@sgugger the documentation has been uploaded

@sywangyi (Contributor, Author)

Hi @sgugger, this fix is aligned with what we do in the Accelerate PR; without the correct module import, DDP cannot work with the CCL backend.

@sgugger (Collaborator) left a comment


Thanks for drafting some documentation. I know the code is the same as for Accelerate but I have a couple of comments on the doc before we can merge this.

Comment on lines 28 to 42
For PyTorch-1.10:

```
pip install oneccl_bind_pt==1.10.0 -f https://software.intel.com/ipex-whl-stable
```
For PyTorch-1.11:

```
pip install oneccl_bind_pt==1.11.0 -f https://software.intel.com/ipex-whl-stable
```
For PyTorch-1.12:

```
pip install oneccl_bind_pt==1.12.0 -f https://software.intel.com/ipex-whl-stable
```

It doesn't seem likely that we will remember to add each new PyTorch version, so maybe just say

Suggested change
For PyTorch-1.10:
```
pip install oneccl_bind_pt==1.10.0 -f https://software.intel.com/ipex-whl-stable
```
For PyTorch-1.11:
```
pip install oneccl_bind_pt==1.11.0 -f https://software.intel.com/ipex-whl-stable
```
For PyTorch-1.12:
```
pip install oneccl_bind_pt==1.12.0 -f https://software.intel.com/ipex-whl-stable
```
```bash
pip install oneccl_bind_pt=={pytorch_version} -f https://software.intel.com/ipex-whl-stable
```
where `{pytorch_version}` should be your PyTorch version, for instance 1.12.0

and add a comment about whether the micro version should always stay at 0, and/or a link to the list of supported versions you have.


# Efficient Training on Multiple CPUs

When training on a single CPU is too slow, we will use multiple CPUs, This guide focuses on PyTorch-based DDP enabling and how to do it efficiently.

Suggested change
When training on a single CPU is too slow, we will use multiple CPUs, This guide focuses on PyTorch-based DDP enabling and how to do it efficiently.
When training on a single CPU is too slow, we can use multiple CPUs. This guide focuses on PyTorch-based DDP enabling distributed CPU training efficiently.


## Intel® oneCCL Bindings for PyTorch

Intel® oneCCL (collective communications library) is a library for efficient distributed deep learning training implementing such collectives like allreduce, allgather, alltoall. For more information on oneCCL, please refer to the oneCCL documentation and oneCCL specification.

You should add links for "oneCCL documentation" and "oneCCL specification" here.



The oneccl_bindings_for_pytorch module implements the PyTorch C10D ProcessGroup API and can be dynamically loaded as an external ProcessGroup; it currently only works on the Linux platform.

Is it oneccl_bind_pt or oneccl_bindings_for_pytorch? Also what is a "ProcessGroup API"? Not sure this sentence adds anything to the doc.
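
For context (not part of the PR), a minimal sketch of what implementing the ProcessGroup API means in practice: importing the bindings registers a `ccl` backend that `torch.distributed` can use. The single-process environment variables below are assumptions for illustration; a real run would get them from mpirun or the launcher.

```python
# Sketch: after importing the bindings, "ccl" is available as a
# torch.distributed backend (an external ProcessGroup implementation).
import os

import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  # registers the "ccl" backend

# Rendezvous variables for a single-process illustration; normally provided by
# mpirun / the distributed launcher.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

dist.init_process_group(backend="ccl")
print(f"backend={dist.get_backend()}, world_size={dist.get_world_size()}")
dist.destroy_process_group()
```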

Check more approaches for [oneccl_bind_pt installation](https://github.com/intel/torch-ccl).

### Usage in Trainer
To enable DDP in Trainer with ccl backend, users should add **`--xpu_backend ccl`** in training command arguments.

Suggested change
To enable DDP in Trainer with ccl backend, users should add **`--xpu_backend ccl`** in training command arguments.
To enable multi CPU distributed training in the Trainer with the ccl backend, users should add **`--xpu_backend ccl`** in the command arguments.
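
For readers who configure the Trainer from Python rather than through the example scripts, a minimal sketch of the equivalent setting (assuming the `xpu_backend` field on `TrainingArguments` that backs the `--xpu_backend` command-line flag):

```python
from transformers import TrainingArguments

# Sketch: the --xpu_backend ccl command-line flag maps to this field when the
# example scripts parse arguments into TrainingArguments.
training_args = TrainingArguments(
    output_dir="/tmp/test-ccl",  # hypothetical output directory
    no_cuda=True,                # stay on CPU
    xpu_backend="ccl",           # use oneCCL for distributed CPU training
)
```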


Take an example of the use cases on [Transformers question-answering](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering)

following command enables **2DDP** in one Xeon node, with one process running per one socket, OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.

Suggested change
following command enables **2DDP** in one Xeon node, with one process running per one socket, OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.
The following command enables training with 2 processes on one Xeon node, with one process running per one socket. The variables OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.

2DDP won't mean anything to the user.

```
--no_cuda \
--xpu_backend ccl
```
following command enables **4DDP** in two Xeons (node0 and node1, taking node0 as the master), ppn(processes per node) is set to 2, with one process running per one socket, OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.

Suggested change
following command enables **4DDP** in two Xeons (node0 and node1, taking node0 as the master), ppn(processes per node) is set to 2, with one process running per one socket, OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.
The following command enables training with a total of four processes on two Xeons (node0 and node1, taking node0 as the main process), ppn (processes per node) is set to 2, with one process running per one socket. The variables OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.


in node0, you need to create a config file which contains ip of each node(for ex: hostfile) and pass to mpirun as a argument

Suggested change
in node0, you need to create a config file which contains ip of each node(for ex: hostfile) and pass to mpirun as a argument
In node0, you need to create a configuration file which contains the IP addresses of each node (for example hostfile) and pass that configuration file path as an argument.

```
xxx.xxx.xxx.xxx #node0 ip
xxx.xxx.xxx.xxx #node1 ip
```
run the following command in node0 and **4DDP** will be enabled in node0 and node1

Suggested change
run the following command in node0 and **4DDP** will be enabled in node0 and node1
Now, run the following command in node0 and **4DDP** will be enabled in node0 and node1:


### Intel® oneCCL Bindings for PyTorch installation:

Wheel files are avaiable for the following Python versions:

Suggested change
Wheel files are avaiable for the following Python versions:
Wheel files are available for the following Python versions:

@sywangyi (Contributor, Author)

@sgugger thanks for the careful review. The doc has been updated based on your comments.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
@sgugger (Collaborator) left a comment

Thanks for iterating on this!

@sgugger sgugger merged commit 2b81f72 into huggingface:main Jul 27, 2022
oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request Sep 26, 2022
start from 1.12, torch_ccl is renamed as oneccl_bindings_for_pytorch and should import it before use (huggingface#18229)

* start from 1.12, torch_ccl is renamed as oneccl_bindings_for_pytorch and should import it before use

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add doc for perf_train_cpu_many

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* update doc

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
@sywangyi sywangyi deleted the ccl_1.12 branch October 21, 2022 12:18