Conversation
@RUAN-ZX RUAN-ZX commented Oct 26, 2023

Unit tests fail or are skipped when device=npu, and we definitely want to exercise all these wonderful features through the official unit tests.
This commit adds NPU support to the unit tests. P.S. See what we have already done in #4567.

What I do in this commit

  1. Add an NPU logic branch
    feat: Add NPU support for skip_on_arch in tests/unit/util.py
    feat: Add NPU support for skip_on_cuda in tests/unit/util.py
    feat: Add NPU support for tests/unit/common.py

  2. Call the accelerator's set_device before deepspeed.init_distributed in tests/unit/common.py
    It is friendlier and easier for other devices like NPU if we call set_device on the accelerator before init_distributed. Besides, setting the device parameter before init is more logical.

  3. Fix calling get_accelerator().random().fork_rng on non-CUDA devices
    The function train_cifar() in tests/unit/alexnet_model.py calls get_accelerator().random().fork_rng without passing device_type explicitly. Unfortunately, torch.random.fork_rng() defaults to device_type="cuda", so non-CUDA devices fail to run. My solution is to pass device_type=get_accelerator().device_name() explicitly, so that both CUDA and non-CUDA devices behave correctly.


RUAN-ZX commented Oct 26, 2023

@microsoft-github-policy-service agree

@RUAN-ZX RUAN-ZX marked this pull request as draft October 26, 2023 03:43
@RUAN-ZX RUAN-ZX changed the title [NPU] Add NPU support for unit test #4568 [WIP] [NPU] Add NPU support for unit test #4568 Oct 26, 2023
@RUAN-ZX RUAN-ZX changed the title [WIP] [NPU] Add NPU support for unit test #4568 [WIP] [NPU] Add NPU support for unit test Oct 26, 2023
@hipudding

Please add a CUDA check in "check_environment" to avoid a warning message when using NPU as the backend.
Please squash these commits into one (use git rebase -i) and describe all the changes.
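The suggested squash can be sketched as below. Interactively it is `git rebase -i HEAD~3` (mark all but the first commit as "squash"); the sketch uses a non-interactive equivalent on a throwaway repo, and the commit count and message are illustrative.

```shell
set -e
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email "you@example.com" && git config user.name "you"
# Create three work-in-progress commits to squash.
for i in 1 2 3; do echo "$i" > "file$i"; git add "file$i"; git commit -qm "wip $i"; done
# Non-interactive equivalent of squashing the last two commits into their parent:
git reset --soft HEAD~2      # keep the tree, drop the last two commit objects
git commit -qm "[NPU] Add NPU support for unit test"
git log --oneline            # two commits remain: base + squashed commit
# On a real PR branch you would then: git push --force-with-lease
```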

@RUAN-ZX RUAN-ZX changed the title [WIP] [NPU] Add NPU support for unit test [NPU] Add NPU support for unit test Oct 30, 2023
@RUAN-ZX RUAN-ZX marked this pull request as ready for review October 30, 2023 06:30

RUAN-ZX commented Nov 1, 2023

@tjruwase Would you be so kind as to review this commit, since we have some other commits based on it? Or perhaps you could invite other reviewers to do the job? Thank you.


RUAN-ZX commented Nov 1, 2023

Please add a CUDA check in "check_environment" to avoid a warning message when using NPU as the backend. Please squash these commits into one (use git rebase -i) and describe all the changes.

NPU support for check_environment will be done in another PR :)
As for squashing, I think these commits should be kept separate for clarity :)
If you have more suggestions, please let me know. Thank you for your advice.


RUAN-ZX commented Nov 1, 2023

@tjruwase It seems that the CI problem might have been fixed by loadams in #4590? If so, would you please launch CI for me again?
Several PRs (#4588, #4585, #4578, etc.) have hit the same failure shown below:
unit/inference/test_inference.py::TestMPSize::test[fp16-bloom] FAILED [ 91%]

P.S. I have pushed new code to solve the problem; see item 3 of the description in the first comment. Thanks :)


RUAN-ZX commented Nov 3, 2023

@tjruwase Perhaps #4591 has solved the CI problem? I see that the latest commit #4598 manages to pass. Would you launch CI for me again?


RUAN-ZX commented Nov 4, 2023

@tjruwase Perhaps #4591 has solved the CI problem? I see that the latest commit #4598 manages to pass. Would you launch CI for me again?

@tjruwase Could you launch CI for me again? The last two commits have already passed.


RUAN-ZX commented Nov 6, 2023

@tjruwase Would you please launch CI for this PR? I have been waiting for a long time. Or is there something I need to improve? If so, I really hope you can point it out for me so that I can fix it ASAP and do better next time. Thanks :)


tjruwase commented Nov 7, 2023

@RUAN-ZX, apologies for the delay in merging this PR. We had to push out a scheduled release last week. Unfortunately, I noticed a new CI failure https://github.com/microsoft/DeepSpeed/actions/runs/6785863907/job/18445092642?pr=4569.

Can you please take a look? It seems to have been caused by a recent merge. I will pay close attention to ensure this PR is merged this week. Thanks for your patience.


RUAN-ZX commented Nov 8, 2023

@RUAN-ZX, apologies for the delay in merging this PR. We had to push out a scheduled release last week. Unfortunately, I noticed a new CI failure https://github.com/microsoft/DeepSpeed/actions/runs/6785863907/job/18445092642?pr=4569.

Can you please take a look? It seems to have been caused by a recent merge. I will pay close attention to ensure this PR is merged this week. Thanks for your patience.

Thank you! @tjruwase About the failure, I found an assert error, CUDA_HOME does not exist, unable to compile CUDA op(s), from op_builder/builder.py; ultimately torch/utils/cpp_extension.py uses cuda_home = os.environ.get('CUDA_HOME') or os.environ.get('CUDA_PATH') to resolve CUDA_HOME.
No code changes this environment variable, so I don't understand why CUDA_HOME is None :)
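The lookup quoted above can be sketched as a tiny helper. This is a simplification for illustration (resolve_cuda_home is our name, not torch's): the real cpp_extension code additionally falls back to probing nvcc on PATH and /usr/local/cuda before giving up.

```python
# Simplified sketch of how torch's cpp_extension resolves CUDA_HOME:
# CUDA_HOME first, then CUDA_PATH, else None (which triggers the
# "unable to compile CUDA op(s)" assert seen in the CI log above).
def resolve_cuda_home(env):
    return env.get("CUDA_HOME") or env.get("CUDA_PATH")
```

Taking a dict instead of reading os.environ directly keeps the sketch deterministic and easy to exercise.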


RUAN-ZX commented Nov 13, 2023

@tjruwase I have fixed the problems raised by the pre-commit hooks; please launch CI again :)

@tjruwase tjruwase added this pull request to the merge queue Nov 13, 2023
Merged via the queue into deepspeedai:master with commit 4b7cae7 Nov 13, 2023
mrwyattii added a commit that referenced this pull request Dec 15, 2023
Our torch 1.10 tests have been failing since the merge of #4569, which
added a `device_type` kwarg to the `torch.random.fork_rng` call. That kwarg
is not compatible with older versions of torch; it was added in
pytorch/pytorch#98069

Fixes #4644, #4503
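One way to reconcile the two constraints above (explicit device_type for NPU, but no such kwarg in torch 1.10) is a small version guard. This is a hypothetical sketch, not the repository's actual follow-up fix; fork_rng_compat is our name.

```python
import inspect

import torch


def fork_rng_compat(device_type="cuda"):
    """Call torch.random.fork_rng with device_type when the running torch
    supports it (added in pytorch/pytorch#98069); on older torch such as
    1.10, fall back to the plain, CUDA-only call."""
    params = inspect.signature(torch.random.fork_rng).parameters
    if "device_type" in params:
        return torch.random.fork_rng(device_type=device_type)
    return torch.random.fork_rng()
```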
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024

Co-authored-by: ryan <ruanzhixiang1@huawei.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024