[Feature] Support PyTorch backend on MLU #1770
Conversation
Should this PR be committed to the mlu branch? @zhouzaida
This PR targets the latest master branch and is designed to support the MMCV operator dispatcher for this version. It does not suit the mlu branch, since that branch uses the old dispatch mode. By design, this PR will be the first commit of MLU support on master, and BangC operators will be added in future PRs.
LGTM
- Can we use classes like DPWrapper and ScatterWrapper to support MLU without modifying the original code? Adding `if hasattr(torch, 'is_mlu_available') and torch.is_mlu_available(): ...` to many files doesn't seem appropriate.
- The DDP part is not modified; is that because MLU training does not support multiple cards?
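A minimal sketch of the centralised-wrapper idea raised above. `DPWrapper` is the reviewer's hypothetical name, not a class in this PR, and only the `torch.is_mlu_available` probe quoted above is taken from the discussion:

```python
import torch
from torch.nn.parallel import DataParallel


def is_mlu_available():
    # The probe quoted above, kept in one helper instead of being repeated per file.
    return hasattr(torch, 'is_mlu_available') and torch.is_mlu_available()


class DPWrapper(DataParallel):
    """Hypothetical wrapper: device-specific choices live here, callers stay unchanged."""

    def __init__(self, module, **kwargs):
        if is_mlu_available():
            # MLU-specific defaults (e.g. restricting to a single card) would go here.
            kwargs.setdefault('device_ids', [0])
        super().__init__(module, **kwargs)
```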
In fact, the problem is that MLU does not support DP yet, so we have to add code to MMDP to work around it. With Cambricon PyTorch, MMCV can run on MLU in DDP (and MMDDP) mode with multiple cards.
The workaround is now placed in data_parallel. Multi-device mode is only supported by DistributedDataParallel with the distributed launcher; DataParallel only runs on a single device, and only the first card is used (it can be specified by an environment variable, e.g. …).
* MMCV support for the PyTorch backend on MLU
* Add MLUDataParallel and MLUDistributedDataParallel
* Add MLU operator support
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>
Hi @Qiza-lyhm! First of all, we want to express our gratitude for your significant PR to the MMCV project. Your contribution is highly appreciated, and we are grateful for your efforts in helping improve this open-source project in your personal time. We believe that many developers will benefit from your PR.

We would also like to invite you to join our Special Interest Group (SIG) private channel on Discord, where you can share your experiences and ideas and build connections with like-minded peers. To join the SIG channel, simply message the moderator, OpenMMLab, on Discord, or briefly share your open-source contributions in the #introductions channel and we will assist you. We look forward to seeing you there! Join us: https://discord.gg/UjgXkPWNqA

If you have a WeChat account, you are welcome to join our community on WeChat. You can add our assistant: openmmlabwx. Please add "mmsig + GitHub ID" as a remark when adding friends :)
Motivation
Modification
- `setup.py` is modified to support MLU, with additions to compile with the MLUExtensions provided by Cambricon PyTorch.
- A new operator dispatch for "MLU" is ready and added to `mlubind.py` for BangC operators. The source code will be updated in the future.
- For distributed training, a new dist backend named "cncl" is added for MLU.
- Since MLU does not support the original `DataParallel` and `DistributedDataParallel`, new modules named `MLUDataParallel` and `MLUDistributedDataParallel` are provided. They can be used via `from mmcv.device.mlu import MLUDataParallel, MLUDistributedDataParallel`; see the usage sketch below.
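A minimal usage sketch of the new wrappers, assuming a toy `nn.Conv2d` model. The import path comes from this PR, while the `device_ids` arguments, the `.to('mlu')` call, and the `cncl` launcher flow are assumptions that should be checked against `mmcv/device/mlu` and Cambricon PyTorch:

```python
import torch
import torch.nn as nn
from mmcv.device.mlu import MLUDataParallel, MLUDistributedDataParallel

model = nn.Conv2d(3, 8, kernel_size=3)

if hasattr(torch, 'is_mlu_available') and torch.is_mlu_available():
    # Cambricon PyTorch is assumed to expose the 'mlu' device string.
    model = model.to('mlu')

    # DataParallel-style wrapper: single card only, as discussed above.
    dp_model = MLUDataParallel(model, device_ids=[0])

    # Multi-card training goes through the distributed wrapper together with the
    # new "cncl" backend; init_process_group is normally handled by the launcher:
    # torch.distributed.init_process_group(backend='cncl', ...)
    # ddp_model = MLUDistributedDataParallel(model, device_ids=[0])
```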
BC-breaking (Optional)
No BC-breaking.
Use cases (Optional)
Install MMCV with MLU
To run MMCV on an MLU device, Cambricon PyTorch (providing the torch and torch_mlu packages) and Cambricon Neuware need to be installed first.
Run `setup.py` with the environment variable `FORCE_MLU=ON`, or compile with an MLU device available; a minimal sketch of this switch is shown below.
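A minimal sketch of how the build switch described above could be detected. The helper name is hypothetical; only `FORCE_MLU` and the `torch.is_mlu_available` probe are taken from this PR:

```python
import os
import torch


def build_mlu_ops():
    # FORCE_MLU=ON requests the MLU extension build even if no device is visible
    # at build time; otherwise build MLU ops only when a device is detected.
    if os.getenv('FORCE_MLU', '').upper() == 'ON':
        return True
    return hasattr(torch, 'is_mlu_available') and torch.is_mlu_available()
```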
Checklist

Before PR:
After PR: