[NPU] Add NPU support for unit test #4569
Conversation
@microsoft-github-policy-service agree
Please add a CUDA check in `check_environment` to avoid the warning message when using NPU as the backend.
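A minimal sketch of the kind of guard being requested, assuming a hypothetical `check_environment` helper that currently warns about CUDA unconditionally (illustrative only, not the actual DeepSpeed code):

```python
from deepspeed.accelerator import get_accelerator


def check_environment():
    # Hypothetical guard: only run the CUDA-specific check when the active
    # accelerator is CUDA, so NPU backends do not see a spurious warning.
    if get_accelerator().device_name() == "cuda":
        import torch
        if not torch.cuda.is_available():
            print("Warning: CUDA was requested as the backend but is not available")
```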
@tjruwase Would you be so kind as to review this commit, since we have some other commits based on it? Or maybe you can invite other reviewers to do the job? Thank you.
NPU support for `check_environment` will be done in another PR :)
@tjruwase It seems that the CI problem might be fixed by loadams in #4590? If so, would you please launch CI for me again? P.S. I have pushed new code to solve the problems; see item no. 3 in the first comment. Thanks :)
@tjruwase Would you please launch CI for this PR? I have been waiting for a long time. Or is there something I need to improve? In that case I really hope you can point it out, so that I can fix it ASAP and do better next time. Thanks :)
@RUAN-ZX, apologies for the delay in merging this PR. We had to push out a scheduled release last week. Unfortunately, I notice a new CI failure: https://github.com/microsoft/DeepSpeed/actions/runs/6785863907/job/18445092642?pr=4569. Can you please take a look? It seems to have been caused by a recent merge. I will pay close attention to ensure this PR is merged this week. Thanks for your patience.
Thank you! @tjruwase About the failure, I found an assert error:
@tjruwase I have fixed the problems raised by the pre-commit hooks, please launch CI again :)
Our torch 1.10 tests have been failing since the merge of #4569, which added a `device_type` kwarg to the `torch.random.fork_rng` call. That kwarg was only introduced in pytorch/pytorch#98069, so the call is not compatible with older versions of torch. Fixes #4644, #4503
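One hedged way to keep that call compatible with older torch releases is to pass `device_type` only when the installed `torch.random.fork_rng` actually accepts it; this is a sketch of the idea, not necessarily the exact fix that was merged:

```python
import inspect

import torch
from deepspeed.accelerator import get_accelerator

# Sketch: the device_type kwarg was introduced in pytorch/pytorch#98069 and is
# missing in torch 1.10, so only pass it when this torch version supports it.
fork_rng_kwargs = {}
if "device_type" in inspect.signature(torch.random.fork_rng).parameters:
    fork_rng_kwargs["device_type"] = get_accelerator().device_name()

with get_accelerator().random().fork_rng(**fork_rng_kwargs):
    torch.manual_seed(16)  # deterministic setup goes here
```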
Unit tests would fail or be skipped when device=npu, and we definitely want to test all these wonderful features with the official unit tests.
Here comes the commit to add NPU support for the unit tests. P.S. see what we have already done in #4567.
**What I do in this commit**
1. Just add an NPU logic branch (a sketch of the `skip_on_arch` change follows after this list):
   - feat: Add npu support for `skip_on_arch` in `tests/unit/util.py`
   - feat: Add npu support for `skip_on_cuda` in `tests/unit/util.py`
   - feat: Add npu support for `tests/unit/common.py`
2. `set_device` on the accelerator before `deepspeed.init_distributed` in `tests/unit/common.py`
   It would be friendlier and easier for other devices like NPU if we call `set_device` on the accelerator before `init_distributed`. Plus, setting the device parameter before init sounds more reasonable.
3. Solve the problem of calling `get_accelerator().random().fork_rng` with a non-CUDA device
   Function `train_cifar()` in `tests/unit/alexnet_model.py` calls `get_accelerator().random().fork_rng` without passing `device_type` explicitly. Unfortunately, `torch.random.fork_rng()` defaults to `device_type="cuda"`, so non-CUDA devices fail to run. My solution is to explicitly pass `device_type=get_accelerator().device_name()`, so both CUDA and non-CUDA devices perform correctly (see the sketch below).
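A short sketch of the change in item 3, assuming the accelerator's `random()` wrapper forwards kwargs to `torch.random.fork_rng` (an illustration of the idea, not a verbatim copy of the patch):

```python
import torch
from deepspeed.accelerator import get_accelerator

# Before: fork_rng() silently assumed device_type="cuda", which breaks on NPU.
# After: the current accelerator's device name is passed explicitly, so the
# RNG state of the correct device type is forked and restored.
with get_accelerator().random().fork_rng(device_type=get_accelerator().device_name()):
    torch.manual_seed(16)  # seed changes stay local to this block
```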
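And for the NPU branch in item 1, a minimal sketch of how `skip_on_arch` can bypass the CUDA-only compute-capability check on other accelerators (illustrative; the actual helper in `tests/unit/util.py` may differ in details):

```python
import pytest
import torch
from deepspeed.accelerator import get_accelerator


def skip_on_arch(min_arch=7):
    # The compute-capability check only makes sense for CUDA devices; other
    # accelerators such as NPU simply skip the architecture requirement.
    if get_accelerator().device_name() == "cuda":
        if torch.cuda.get_device_capability()[0] < min_arch:
            pytest.skip(f"needs higher architecture than {min_arch}")
```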