{ai}[foss/2022a] PyTorch v1.13.1 w/ Python 3.10.4 w/ CUDA 11.7.0 #17156
Conversation
…hes: PyTorch-1.13.1_fix-test-ops-conf.patch, PyTorch-1.13.1_no-cuda-stubs-rpath.patch, PyTorch-1.13.1_remove-flaky-test-in-testnn.patch, PyTorch-1.13.1_skip-ao-sparsity-test-without-fbgemm.patch
Hi @branfosj. Are there issues still pending within this PR?
Update patches based on PyTorch 1.13.1
Those tests require 2 pytest plugins and a bugfix.
Fix test_ops* startup failures
…asyconfigs into 20230119093315_new_pr_PyTorch1131
Test report by @Flamefire
Test report by @Flamefire
Test report by @Flamefire
Test report by @smoors
As discussed in the last conf call, to avoid this PR becoming stale and since the number of failed tests is limited, we decided to merge this as is and create an issue for the failing tests to follow up on. @branfosj, if you agree, can you add …
Yes, we should get #17155 merged first and then make sure all the patches are synced to here.
I would also skip that test for now, and mention it in #17712
Failing tests on POWER9:
Let's not block this PR over that; those can be dealt with in a follow-up PR.
True, but on the other hand I prefer to run a test and ignore the failure rather than skip the test altogether, especially if the failure is specific to an architecture other than the one I am building on. We could add another parameter … (a rough sketch of the idea follows below).
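For illustration only, here is a minimal sketch of the two options in easyconfig terms, assuming the `excluded_tests` parameter of the PyTorch easyblock (a dict keyed by CPU architecture, with `''` meaning all architectures) as the existing skip mechanism. The `ignored_tests` name below is purely hypothetical and only stands for the idea of running a test but tolerating its failure; the test name is just an example taken from this thread:

```python
# Existing skip mechanism of the PyTorch easyblock: tests listed here are
# not run at all. The key selects the CPU architecture ('' applies everywhere).
excluded_tests = {
    'POWER': [
        'distributed/_shard/sharded_tensor/ops/test_linear',
    ],
}

# Hypothetical parameter sketched in the comment above (it does NOT exist):
# run these tests anyway, but report rather than fail on their failure.
# ignored_tests = {
#     'POWER': [
#         'distributed/_shard/sharded_tensor/ops/test_linear',
#     ],
# }
```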
Test report by @VRehnberg
Test report by @VRehnberg
I'm seeing three failed tests:
For distributed/_shard/sharded_tensor/ops/test_linear there is one bfloat16 tensor comparison where 385.5 is compared to 385 in a few different places, for three out of four GPUs (this is on the 4xA40 node). That is not much of a difference for bfloat16, so I'd say either skip the test or remove the absolute tolerance and increase the relative tolerance to at least … I'll check the other tests next.
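As a rough illustration (not taken from the test itself, and the tolerance value is an assumption), the mismatch described above would pass a relative-only comparison with torch.testing.assert_close:

```python
import torch

# The reported mismatch: a bfloat16 result of 385.5 vs an expected value of 385.
actual = torch.tensor(385.5)
expected = torch.tensor(385.0)

# bfloat16 only carries ~8 bits of mantissa, so relative errors of a few
# tenths of a percent are expected; the difference here is only ~0.13%.
# Dropping the absolute tolerance and using a relative tolerance of that
# order makes the comparison pass: |385.5 - 385| = 0.5 <= 4e-3 * 385.
torch.testing.assert_close(actual, expected, rtol=4e-3, atol=0.0)
```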
I'm not making much sense of the RPC one. I've tracked it down to https://github.com/pytorch/pytorch/blob/v1.13.1/torch/testing/_internal/distributed/rpc/rpc_test.py#L5075; it is reminiscent of pytorch/pytorch#41474, but in this case it is not sporadic.
Test report by @surak
@branfosj: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/5468268082
bleep, bloop, I'm just a bot (boegelbot v20200716.01)
Test report by @branfosj
easybuild/easyconfigs/p/PyTorch/PyTorch-1.13.1-foss-2022a-CUDA-11.7.0.eb
Test report by @branfosj
Test report by @branfosj
Closing in favour of #18305
(created using eb --new-pr)