
[Feature] Support PyTorch backend on MLU #1770

Merged (6 commits into open-mmlab:master on Apr 14, 2022)

Conversation

@Qiza-lyhm (Contributor) commented on Mar 4, 2022

Motivation

  • Support Cambricon MLU device for PyTorch backend.

Modification

setup.py is modified to support MLU, adding compilation with the MLUExtension build helper provided by Cambricon PyTorch.
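
For illustration, a rough sketch of how the conditional build selection in setup.py could look; the torch_mlu.utils.cpp_extension import path and the source file names are assumptions, not necessarily what this PR uses:

# Sketch only: the MLUExtension import path below is an assumption about
# Cambricon PyTorch's packaging and may differ from the actual layout.
import os
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

if os.getenv('FORCE_MLU', 'OFF') == 'ON':
    # Build the ops with Cambricon's MLU extension helper instead of plain C++/CUDA.
    from torch_mlu.utils.cpp_extension import MLUExtension  # assumed path
    ext_module = MLUExtension(name='mmcv._ext', sources=['ops.cpp'])  # illustrative sources
else:
    ext_module = CppExtension(name='mmcv._ext', sources=['ops.cpp'])  # illustrative sources

setup(name='mmcv', ext_modules=[ext_module], cmdclass={'build_ext': BuildExtension})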

A new operator dispatch for "MLU" is ready and has been added to mlubind.py for BangC operators. The operator source code will be added in a future PR.

For distributed training, a new distributed backend named "cncl" is added for MLU.
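
As a sketch of how this backend would be used, assuming Cambricon PyTorch registers "cncl" with torch.distributed and exposes a torch.mlu namespace analogous to torch.cuda:

import os
import torch
import torch_mlu  # Cambricon PyTorch extension; assumed to register the MLU device and the "cncl" backend
import torch.distributed as dist

# One process per MLU card, e.g. launched by torch.distributed.launch,
# which sets RANK, MASTER_ADDR, and MASTER_PORT in the environment.
rank = int(os.environ['RANK'])
torch.mlu.set_device(rank % torch.mlu.device_count())  # assumed analogue of torch.cuda.set_device
dist.init_process_group(backend='cncl')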

Since MLU does not support the original DataParallel and DistributedDataParallel, new modules named MLUDataParallel and MLUDistributedDataParallel are provided. They can be used as follows:

from mmcv.device.mlu import MLUDataParallel, MLUDistributedDataParallel

BC-breaking (Optional)

No BC-breaking.

Use cases (Optional)

import torch
import torch_mlu
from mmcv.device.mlu import MLUDataParallel, MLUDistributedDataParallel
...
DP_model = MLUDataParallel(model.to('mlu'), device_ids=[0])
DDP_model = MLUDistributedDataParallel(model.to('mlu'), device_ids=[torch.mlu.current_device()])

Install MMCV with MLU

To run mmcv on an MLU device, Cambricon PyTorch (with the torch and torch_mlu packages) and Cambricon Neuware must be installed first.

Run setup.py with the environment variable FORCE_MLU=ON set, or compile with an MLU device available.
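
A minimal sanity check after installation, assuming torch_mlu patches torch with is_mlu_available (the same check referenced later in this review thread):

import torch
import torch_mlu  # noqa: F401  # Cambricon PyTorch extension; patches the torch namespace

# is_mlu_available is assumed to be added by torch_mlu.
assert hasattr(torch, 'is_mlu_available') and torch.is_mlu_available()
x = torch.randn(2, 3).to('mlu')
print(x.device)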

Checklist

Before PR:

  • I have read and followed the workflow indicated in the CONTRIBUTING.md to create this PR.
  • Pre-commit or linting tools indicated in CONTRIBUTING.md are used to fix the potential lint issues.
  • Bug fixes are covered by unit tests, the case that causes the bug should be added in the unit tests.
  • New functionalities are covered by complete unit tests. If not, please add more unit test to ensure the correctness.
  • The documentation has been modified accordingly, including docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with some of those projects, like MMDet or MMCls.
  • CLA has been signed and all committers have signed the CLA in this PR.

@CLAassistant commented on Mar 4, 2022

CLA assistant check
All committers have signed the CLA.

@zhouzaida zhouzaida requested a review from grimoire March 9, 2022 03:15
@zhouzaida zhouzaida changed the base branch from master to mlu March 9, 2022 03:19
@zhouzaida zhouzaida changed the base branch from mlu to master March 9, 2022 03:19
@grimoire (Member) commented on Mar 9, 2022

Should this PR be committed to the mlu branch? @zhouzaida

@Qiza-lyhm (Contributor, Author) replied:

Should this PR be committed to the mlu branch? @zhouzaida

This PR targets the latest master branch and is designed to support the MMCV operator dispatcher in this version.

It does not suit the mlu branch, since the mlu branch still uses the old dispatch mode.

By design, this PR will be the first commit of MLU support on master, and BangC operators will be submitted in future PRs.

@Qiza-lyhm Qiza-lyhm requested a review from grimoire March 14, 2022 06:54
@grimoire (Member) left a comment

LGTM

@teamwong111 (Contributor) left a comment

  1. Can we use some classes like DPWrapper and ScatterWrapper to support MLU without modifying the original code? Adding if hasattr(torch, 'is_mlu_available') and torch.is_mlu_available(): ... to many files doesn't seem appropriate.
  2. The DDP part is not modified. Is it because MLU training does not support multiple cards?

@Qiza-lyhm (Contributor, Author) commented on Mar 16, 2022

  1. Can we use some classes like DPWrapper and ScatterWrapper to support MLU without modifying the original code? Adding if hasattr(torch, 'is_mlu_available') and torch.is_mlu_available(): ... to many files doesn't seem appropriate.
  2. The DDP part is not modified. Is it because MLU training does not support multiple cards?

In fact, the problem is that MLU does not support DP yet, so we have to add code to MMDataParallel to work around it.

With Cambricon PyTorch, MMCV can run on MLU in DDP (and MMDDP) mode with multiple cards.

Review threads on mmcv/runner/dist_utils.py and mmcv/parallel/data_parallel.py were marked outdated and resolved.
@Qiza-lyhm (Contributor, Author) commented:

The workaround is now implemented in data_parallel.
When using the MLU backend, MMDataParallel is first constructed by the original torch.nn.DataParallel with no device info (since there are no GPUs); the MLU info, device_ids = [0] and src_device_obj = "mlu:0", is then added to MMDataParallel.

Multi-device mode is only supported by DistributedDataParallel with a distributed launcher. DataParallel only runs on a single device, and only the first card is used (the card can be selected via an environment variable, e.g. export MLU_VISIBLE_DEVICES=2).
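
A minimal sketch of the workaround described above, relying on the standard torch.nn.DataParallel attributes device_ids and src_device_obj; this is illustrative rather than the exact code in this PR:

import torch
from torch.nn.parallel import DataParallel


class MLUDataParallel(DataParallel):
    # torch.nn.DataParallel finds no GPUs, so it is constructed with no
    # device info; the single visible MLU card is then attached manually.
    def __init__(self, *args, dim=0, **kwargs):
        super().__init__(*args, dim=dim, **kwargs)
        self.device_ids = [0]
        self.src_device_obj = torch.device('mlu:0')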

  * MMCV support PyTorch backend on MLU

  * Add MLUDataParallel and MLUDistributedDataParallel

  * Add MLU operator support
@zhouzaida zhouzaida changed the title feat(MLU): Support PyTorch backend on MLU [Feature] Support PyTorch backend on MLU Apr 14, 2022
Qiza-lyhm and others added 2 commits April 14, 2022 16:59
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>
@zhouzaida zhouzaida merged commit 4826a9b into open-mmlab:master Apr 14, 2022
@zhouzaida zhouzaida mentioned this pull request Apr 14, 2022
@OpenMMLab-Assistant005 commented:
Hi @Qiza-lyhm! First of all, we want to express our gratitude for your significant PR in the MMCV project. Your contribution is highly appreciated, and we are grateful for your efforts in helping improve this open-source project during your personal time. We believe that many developers will benefit from your PR.

We would also like to invite you to join our Special Interest Group (SIG) private channel on Discord, where you can share your experiences and ideas and build connections with like-minded peers. To join the SIG channel, simply message the moderator OpenMMLab on Discord, or briefly share your open-source contributions in the #introductions channel and we will assist you. We look forward to seeing you there! Join us: https://discord.gg/UjgXkPWNqA

If you have a WeChat account, you are welcome to join our community on WeChat. You can add our assistant: openmmlabwx. Please add "mmsig + GitHub ID" as a remark when adding friends. :)
Thank you again for your contribution ❤
