
[Feature] Support PyTorch backend on MLU #1770

Merged (6 commits into open-mmlab:master on Apr 14, 2022)

Conversation

@Qiza-lyhm (Contributor) commented on Mar 4, 2022

Motivation

  • Support Cambricon MLU device for PyTorch backend.

Modification

setup.py is modified to support MLU, adding compilation with the MLUExtension build helper provided by Cambricon PyTorch.
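
For illustration, a rough sketch of how the conditional build selection in setup.py could look; the torch_mlu.utils.cpp_extension import path and the source file names are assumptions, not necessarily what this PR uses:

# Sketch only: the MLUExtension import path below is an assumption about
# Cambricon PyTorch's packaging and may differ from the actual layout.
import os
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

if os.getenv('FORCE_MLU', 'OFF') == 'ON':
    # Build the ops with Cambricon's MLU extension helper instead of plain C++/CUDA.
    from torch_mlu.utils.cpp_extension import MLUExtension  # assumed path
    ext_module = MLUExtension(name='mmcv._ext', sources=['ops.cpp'])  # illustrative sources
else:
    ext_module = CppExtension(name='mmcv._ext', sources=['ops.cpp'])  # illustrative sources

setup(name='mmcv', ext_modules=[ext_module], cmdclass={'build_ext': BuildExtension})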

A new operator dispatch for "MLU" is ready and has been added to mlubind.py for BangC operators. The operator source code will be added in a future PR.

For distributed training, a new distributed backend named "cncl" is added for MLU.
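
As a sketch of how this backend would be used, assuming Cambricon PyTorch registers "cncl" with torch.distributed and exposes a torch.mlu namespace analogous to torch.cuda:

import os
import torch
import torch_mlu  # Cambricon PyTorch extension; assumed to register the MLU device and the "cncl" backend
import torch.distributed as dist

# One process per MLU card, e.g. launched by torch.distributed.launch,
# which sets RANK, MASTER_ADDR, and MASTER_PORT in the environment.
rank = int(os.environ['RANK'])
torch.mlu.set_device(rank % torch.mlu.device_count())  # assumed analogue of torch.cuda.set_device
dist.init_process_group(backend='cncl')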

Since MLU does not support the original DataParallel and DistributedDataParallel, new modules named MLUDataParallel and MLUDistributedDataParallel are provided. They can be used as follows:

from mmcv.device.mlu import MLUDataParallel, MLUDistributedDataParallel

BC-breaking (Optional)

No BC-breaking.

Use cases (Optional)

import torch
import torch_mlu
from mmcv.device.mlu import MLUDataParallel, MLUDistributedDataParallel
...
DP_model = MLUDataParallel(model.to('mlu'), device_ids=[0])
DDP_model = MLUDistributedDataParallel(model.to('mlu'), device_ids=[torch.mlu.current_device()])

Install MMCV with MLU

To run mmcv on an MLU device, Cambricon PyTorch (with the torch and torch_mlu packages) and Cambricon Neuware must be installed first.

Run setup.py with the environment variable FORCE_MLU=ON set, or compile with an MLU device available.
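
A minimal sanity check after installation, assuming torch_mlu patches torch with is_mlu_available (the same check referenced later in this review thread):

import torch
import torch_mlu  # noqa: F401  # Cambricon PyTorch extension; patches the torch namespace

# is_mlu_available is assumed to be added by torch_mlu.
assert hasattr(torch, 'is_mlu_available') and torch.is_mlu_available()
x = torch.randn(2, 3).to('mlu')
print(x.device)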

Checklist

Before PR:

  • I have read and followed the workflow indicated in the CONTRIBUTING.md to create this PR.
  • Pre-commit or linting tools indicated in CONTRIBUTING.md are used to fix the potential lint issues.
  • Bug fixes are covered by unit tests, the case that causes the bug should be added in the unit tests.
  • New functionalities are covered by complete unit tests. If not, please add more unit test to ensure the correctness.
  • The documentation has been modified accordingly, including docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with some of those projects, like MMDet or MMCls.
  • CLA has been signed and all committers have signed the CLA in this PR.

@CLAassistant commented on Mar 4, 2022

CLA assistant check
All committers have signed the CLA.

@zhouzaida zhouzaida requested a review from grimoire March 9, 2022 03:15
@zhouzaida zhouzaida changed the base branch from master to mlu March 9, 2022 03:19
@zhouzaida zhouzaida changed the base branch from mlu to master March 9, 2022 03:19
@grimoire (Member) commented on Mar 9, 2022

Should this PR be committed to the mlu branch? @zhouzaida

@Qiza-lyhm (Contributor, Author) replied:

Should this PR be committed to the mlu branch? @zhouzaida

This PR targets the latest master branch and is designed to support the MMCV operator dispatcher in this version.

It does not suit the mlu branch, since the mlu branch still uses the old dispatch mode.

By design, this PR will be the first commit of MLU support on master, and BangC operators will be submitted in future PRs.

@Qiza-lyhm Qiza-lyhm requested a review from grimoire March 14, 2022 06:54
@grimoire (Member) left a comment

LGTM

@teamwong111 (Contributor) left a comment

  1. Can we use some classes like DPWrapper and ScatterWrapper to support MLU without modifying the original code? Adding if hasattr(torch, 'is_mlu_available') and torch.is_mlu_available(): ... to many files doesn't seem appropriate.
  2. The DDP part is not modified. Is it because MLU training does not support multiple cards?

@Qiza-lyhm (Contributor, Author) commented on Mar 16, 2022

  1. Can we use some classes like DPWrapper and ScatterWrapper to support MLU without modifying the original code? Adding if hasattr(torch, 'is_mlu_available') and torch.is_mlu_available(): ... to many files doesn't seem appropriate.
  2. The DDP part is not modified. Is it because MLU training does not support multiple cards?

In fact, the problem is that MLU does not support DP yet, so we have to add code to MMDataParallel to work around it.

With Cambricon PyTorch, MMCV can run on MLU in DDP (and MMDDP) mode with multiple cards.

Review threads on mmcv/runner/dist_utils.py and mmcv/parallel/data_parallel.py were marked outdated and resolved.
@Qiza-lyhm (Contributor, Author) commented:

The workaround is now implemented in data_parallel.
When using the MLU backend, MMDataParallel is first constructed by the original torch.nn.DataParallel with no device info (since there are no GPUs); the MLU info, device_ids = [0] and src_device_obj = "mlu:0", is then added to MMDataParallel.

Multi-device mode is only supported by DistributedDataParallel with a distributed launcher. DataParallel only runs on a single device, and only the first card is used (the card can be selected via an environment variable, e.g. export MLU_VISIBLE_DEVICES=2).
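
A minimal sketch of the workaround described above, relying on the standard torch.nn.DataParallel attributes device_ids and src_device_obj; this is illustrative rather than the exact code in this PR:

import torch
from torch.nn.parallel import DataParallel


class MLUDataParallel(DataParallel):
    # torch.nn.DataParallel finds no GPUs, so it is constructed with no
    # device info; the single visible MLU card is then attached manually.
    def __init__(self, *args, dim=0, **kwargs):
        super().__init__(*args, dim=dim, **kwargs)
        self.device_ids = [0]
        self.src_device_obj = torch.device('mlu:0')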

  * MMCV support PyTorch backend on MLU

  * Add MLUDataParallel and MLUDistributedDataParallel

  * Add MLU operator support
@zhouzaida zhouzaida changed the title feat(MLU): Support PyTorch backend on MLU [Feature] Support PyTorch backend on MLU Apr 14, 2022
Qiza-lyhm and others added 2 commits April 14, 2022 16:59
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>
@zhouzaida zhouzaida merged commit 4826a9b into open-mmlab:master Apr 14, 2022
@zhouzaida zhouzaida mentioned this pull request Apr 14, 2022
@OpenMMLab-Assistant005 commented:
Hi @Qiza-lyhm! First of all, we want to express our gratitude for your significant PR in the MMCV project. Your contribution is highly appreciated, and we are grateful for your efforts in helping improve this open-source project during your personal time. We believe that many developers will benefit from your PR.

We would also like to invite you to join our Special Interest Group (SIG) private channel on Discord, where you can share your experiences and ideas and build connections with like-minded peers. To join the SIG channel, simply message the moderator OpenMMLab on Discord, or briefly share your open-source contributions in the #introductions channel and we will assist you. We look forward to seeing you there! Join us: https://discord.gg/UjgXkPWNqA

If you have a WeChat account, you are welcome to join our community on WeChat. You can add our assistant: openmmlabwx. Please add "mmsig + GitHub ID" as a remark when adding friends. :)
Thank you again for your contribution ❤
