
{ai}[foss/2022a] PyTorch v1.13.1 w/ Python 3.10.4 w/ CUDA 11.7.0 #17156

Closed

Conversation

branfosj
Member

@branfosj branfosj commented Jan 19, 2023

…hes: PyTorch-1.13.1_fix-test-ops-conf.patch, PyTorch-1.13.1_no-cuda-stubs-rpath.patch, PyTorch-1.13.1_remove-flaky-test-in-testnn.patch, PyTorch-1.13.1_skip-ao-sparsity-test-without-fbgemm.patch
@branfosj branfosj changed the title {ai}[foss/2022a] PyTorch v1.13.1 w/ Python 3.10.4 {ai}[foss/2022a] PyTorch v1.13.1 w/ Python 3.10.4 w/ CUDA 11.7.0 Jan 19, 2023
@branfosj branfosj marked this pull request as draft January 19, 2023 10:09
@satishskamath
Contributor

satishskamath commented Jan 31, 2023

Hi @branfosj. Are there still issues pending in this PR?

Flamefire and others added 4 commits February 10, 2023 11:54
Update patches based on PyTorch 1.13.1
Those tests require 2 pytest plugins and a bugfix.
@boegelbot

This comment was marked as outdated.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusi8006 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/d562b0460290df6c5c3cde89694c1311 for a full test report.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusml26 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/24a30bba9da16fb44941b172986475a0 for a full test report.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusa11 - Linux CentOS Linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), Python 2.7.5
See https://gist.github.com/bb99683f64fd9982ae70752654544d3d for a full test report.

@smoors
Contributor

smoors commented Apr 13, 2023

Test report by @smoors
FAILED
Build succeeded for 2 out of 3 (1 easyconfigs in total)
node406.hydra.os - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7282 16-Core Processor, 1 x NVIDIA NVIDIA A100-PCIE-40GB, 515.48.07, Python 3.6.8
See https://gist.github.com/smoors/cefd98a7e9a2da2b2683ffecc386d663 for a full test report.

@smoors
Contributor

smoors commented Apr 13, 2023

As discussed in the last conf call, to avoid this PR becoming stale and since the number of failed tests is limited, we decided to merge this as is and create an issue to follow up on the failing tests.

@branfosj if you agree, can you add max_failed_tests = 10 and remove the draft label?
I'll then merge this and create the issue for the failing tests.
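
For reference, the suggestion boils down to a one-line addition to the easyconfig (a minimal sketch; exact placement in the file does not matter):

# tolerate up to 10 failing tests instead of failing the whole installation
max_failed_tests = 10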

@boegel
Member

boegel commented Apr 13, 2023

@smoors Issue already created, see #17712 (I'm focusing on getting #17155 merged first)

@boegel
Member

boegel commented Apr 13, 2023

For test_ops_gradients, we probably just need to add the PyTorch-1.13.1_skip-failing-grad-test.patch as was done in 6124d4c in #17155

For the test_jit* failing tests, we should include those in excluded_tests, which is a bit more strict than just allowing 10 random tests to fail.
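
A sketch of what that could look like, assuming the usual excluded_tests dict used in PyTorch easyconfigs (keyed by architecture, with '' meaning all architectures); the exact test names would follow the reports above:

excluded_tests = {
    '': [
        # JIT test suites failing with small numerical mismatches, see the test reports above
        'test_jit',
        'test_jit_legacy',
        'test_jit_profiling',
    ],
}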

@branfosj
Member Author

Yes, we should get #17155 merged first and then make sure all the patches are synced to here.

@boegel
Member

boegel commented Apr 13, 2023

test_tensorpipe_agent is failing on broadwell and zen2 in @Flamefire's tests; it is excluded in PyTorch-1.9.0*.eb, but only for POWER.

I would also skip that test for now, and mention it in #17712
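
A sketch of the difference, assuming the same arch-keyed excluded_tests format as in the PyTorch-1.9.0 easyconfigs:

excluded_tests = {
    # as in PyTorch-1.9.0*.eb: excluded only on POWER
    'POWER': ['distributed/rpc/test_tensorpipe_agent'],
    # moving this entry under the '' key (all architectures) would skip it everywhere
}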

@boegel
Member

boegel commented Apr 13, 2023

Failing tests on POWER9:

test_ao_sparsity failed!
test_optim failed!
test_quantization failed!
distributed/rpc/test_tensorpipe_agent failed!
test_cpp_extensions_aot_ninja failed!
test_cpp_extensions_aot_no_ninja failed!
test_cpp_extensions_open_device_registration failed!
test_cuda failed!
test_ops failed!

Let's not block this PR over that, those can be dealt with in a follow-up PR.

@smoors
Contributor

smoors commented Apr 13, 2023

For the test_jit* failing tests, we should include those in excluded_tests, which is a bit more strict than just allowing 10 random tests to fail.

True, but on the other hand I prefer to run a test and ignore the failure rather than skip the test altogether, especially if the failure is specific to a different architecture than the one I am building on.

We could add another parameter ignored_tests, but that may be overkill...
Or even ignore_tests_for_architecture=<arch>.
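
Purely to illustrate the idea (neither parameter exists; the names and format below are hypothetical):

# hypothetical: run these tests, but do not fail the build if they fail
ignored_tests = ['test_jit', 'test_jit_legacy', 'test_jit_profiling']

# hypothetical alternative: only ignore failures on a given architecture
ignore_tests_for_architecture = {
    'POWER': ['distributed/rpc/test_tensorpipe_agent'],
}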

@VRehnberg
Contributor

Test report by @VRehnberg
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
alvis2-12 - Linux Rocky Linux 8.6, x86_64, Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz, 8 x NVIDIA Tesla T4, 520.61.05, Python 3.6.8
See https://gist.github.com/VRehnberg/eb6da3f62c2c703ca377f93735d71cc1 for a full test report.

@VRehnberg
Contributor

Test report by @VRehnberg
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
alvis3-22 - Linux Rocky Linux 8.6, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 520.61.05, Python 3.6.8
See https://gist.github.com/VRehnberg/ce43ced71ca294084e9e03b1f16a2b1c for a full test report.

@VRehnberg
Contributor

VRehnberg commented May 10, 2023

I'm seeing three failed tests:

Failed tests (suites/files):
* distributed/_shard/sharded_tensor/ops/test_linear
* distributed/rpc/test_tensorpipe_agent
* test_jit_legacy
distributed/_shard/sharded_tensor/ops/test_linear (3 total tests, errors=1)

For distributed/_shard/sharded_tensor/ops/test_linear there is one bfloat16 tensor comparison where 385.5 is compared to 385 in a few different places, for three out of four GPUs (this is on the 4xA40 node).

This is not much for bfloat16, so I'd say either skip the test or remove the absolute tolerance and increase the relative tolerance to at least $(2^0 + 2^{-8})/2^0$, which is about as much accuracy as can be expected with a 7-bit mantissa. However, I'm also confused, because unless I'm miscounting, the closest bfloat16 numbers should be 384 and 386 (i.e. 385 and 385.5 shouldn't be expressible), so there might be something I'm missing.
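
For reference, a quick sanity check of the spacing (a minimal sketch, assuming a working PyTorch install):

# bfloat16 keeps 8 significand bits (7 explicit + 1 implicit), so values between
# 256 and 512 are spaced 2 apart; neither 385 nor 385.5 is exactly representable
import torch

for v in (384.0, 385.0, 385.5, 386.0):
    print(v, '->', torch.tensor(v, dtype=torch.bfloat16).item())
# expected with round-to-nearest-even: 384.0, 384.0, 386.0, 386.0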

I'll check the other tests next.

@VRehnberg
Contributor

VRehnberg commented May 10, 2023

I'm not making much sense of the rpc one. I've tracked it down to https://github.com/pytorch/pytorch/blob/v1.13.1/torch/testing/_internal/distributed/rpc/rpc_test.py#L5075 and it is reminiscent of pytorch/pytorch#41474, but in this case it is not sporadic.

@boegel boegel removed this from the next release (4.7.2) milestone May 23, 2023
@boegel boegel added this to the release after 4.7.2 milestone May 23, 2023
@surak
Contributor

surak commented Jun 15, 2023

Test report by @surak
SUCCESS
Build succeeded for 0 out of 0 (1 easyconfigs in total)
haicluster1.fz-juelich.de - Linux Ubuntu 20.04, x86_64, AMD EPYC 7F72 24-Core Processor, 4 x NVIDIA NVIDIA GeForce RTX 3090, 515.65.01, Python 3.8.10
See https://gist.github.com/surak/0cf9d9fad51dc92ea82e84d50054be54 for a full test report.

@boegel boegel modified the milestones: 4.7.3, release after 4.7.3 Jul 5, 2023
@boegelbot
Collaborator

@branfosj: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/5468268082
Output from first failing test suite run:

FAIL: test__parse_easyconfig_PyTorch-1.13.1-foss-2022a-CUDA-11.7.0.eb (test.easyconfigs.easyconfigs.EasyConfigTest)
Test for easyconfig PyTorch-1.13.1-foss-2022a-CUDA-11.7.0.eb
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/test/easyconfigs/easyconfigs.py", line 1555, in innertest
    template_easyconfig_test(self, spec_path)
  File "/home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/test/easyconfigs/easyconfigs.py", line 1406, in template_easyconfig_test
    self.assertTrue(os.path.isfile(patch_full), msg)
AssertionError: False is not true : Patch file /home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/easybuild/easyconfigs/p/PyTorch/PyTorch-1.13.1_increase-tolerance-test_ops.patch is available for PyTorch-1.13.1-foss-2022a-CUDA-11.7.0.eb

----------------------------------------------------------------------
Ran 17521 tests in 802.177s

FAILED (failures=1)
ERROR: Not all tests were successful

bleep, bloop, I'm just a bot (boegelbot v20200716.01)
Please talk to my owner @boegel if you notice me acting stupid,
or submit a pull request to https://github.com/boegel/boegelbot to fix the problem.

@branfosj
Member Author

branfosj commented Jul 5, 2023

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bear-pg0203u29a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 520.61.05, Python 3.6.8
See https://gist.github.com/branfosj/e3f2805e2bf1d4e69c34a49bb6d3a671 for a full test report.

@boegel
Member

boegel commented Jul 6, 2023

@branfosj Should be synced with develop now that #17155 is merged

@branfosj
Member Author

branfosj commented Jul 6, 2023

test_jit_legacy, test_jit_profiling, and test_jit

The same test failed in all three test suites:

======================================================================
FAIL: test_freeze_conv_relu_fusion (jit.test_freezing.TestFrozenOptimizations)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.13.1/foss-2022a-CUDA-11.7.0/pytorch-v1.13.1/test/jit/test_freezing.py", line 2258, in test_freeze_conv_relu_fusion
    self.assertEqual(mod_eager(inp), frozen_mod(inp))
  File "/dev/shm/branfosj/tmp-up-EL8/eb-8fe5hmfe/tmpuilxmqz9/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2470, in assertEqual
    assert_equal(
  File "/dev/shm/branfosj/tmp-up-EL8/eb-8fe5hmfe/tmpuilxmqz9/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 10 / 30 (33.3%)
Greatest absolute difference: 3.057718276977539e-05 at index (2, 3, 0, 0, 0) (up to 1e-05 allowed)
Greatest relative difference: 8.758584417742737e-05 at index (0, 3, 0, 0, 0) (up to 1.3e-06 allowed)

----------------------------------------------------------------------

I'm patching this one out.
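
For context, a minimal sketch of the kind of skip such a patch could apply; this is an illustration on a stand-in unittest class, not the actual PyTorch-1.13.1 patch, which may instead relax the comparison tolerance:

import unittest

class TestFrozenOptimizations(unittest.TestCase):  # stand-in for the real JIT test class
    @unittest.skip("small numerical mismatches on this CUDA build, see the failure above")
    def test_freeze_conv_relu_fusion(self):
        self.fail("never runs when skipped")

if __name__ == '__main__':
    unittest.main()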

test_optim

======================================================================
FAIL: test_rprop (__main__.TestOptim)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/tmp-up-EL8/eb-8fe5hmfe/tmpuilxmqz9/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 1054, in wrapper
    fn(*args, **kwargs)
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.13.1/foss-2022a-CUDA-11.7.0/pytorch-v1.13.1/test/test_optim.py", line 1016, in test_rprop
    self._test_basic_cases(
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.13.1/foss-2022a-CUDA-11.7.0/pytorch-v1.13.1/test/test_optim.py", line 283, in _test_basic_cases
    self._test_state_dict(
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.13.1/foss-2022a-CUDA-11.7.0/pytorch-v1.13.1/test/test_optim.py", line 258, in _test_state_dict
    self.assertEqual(bias, bias_cuda)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-8fe5hmfe/tmpuilxmqz9/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2470, in assertEqual
    assert_equal(
  File "/dev/shm/branfosj/tmp-up-EL8/eb-8fe5hmfe/tmpuilxmqz9/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 10 (10.0%)
Greatest absolute difference: 0.00010061264038085938 at index (0,) (up to 1e-05 allowed)
Greatest relative difference: 6.106863159088995e-05 at index (0,) (up to 1.3e-06 allowed)

----------------------------------------------------------------------

We skip test_optim in 1.12.x due to intermittent test failures, so I am re-adding that test skip here.

@branfosj branfosj marked this pull request as ready for review July 6, 2023 08:34
@branfosj
Member Author

branfosj commented Jul 6, 2023

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bear-pg0203u29a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 520.61.05, Python 3.6.8
See https://gist.github.com/branfosj/a91c14fec2ac001a1831a8229e17a14a for a full test report.

Edit: test_jit_cuda_fuser failed! Received signal: SIGIOT

@branfosj
Member Author

branfosj commented Jul 8, 2023

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bear-pg0203u31a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 4 x NVIDIA NVIDIA A100-SXM4-40GB, 520.61.05, Python 3.6.8
See https://gist.github.com/branfosj/11e97ea919a6ce9e6b191b8a1c5870b6 for a full test report.

@branfosj
Member Author

Closing in favour of #18305.

@branfosj branfosj closed this Jul 14, 2023
@branfosj branfosj deleted the 20230119093315_new_pr_PyTorch1131 branch October 7, 2023 14:52