"CUDA error, AssertionError: Tensor-likes are not close!" on model test for cuda-resnet101 #7618

Closed · atalman opened this issue May 23, 2023 · 2 comments · Fixed by #7634

@atalman (Contributor) commented May 23, 2023

🐛 Describe the bug

Similar to: #7143

When switching CI from CUDA 11.7 to CUDA 11.8, unit tests on Linux fail: #7616

2023-05-23T13:41:31.5794788Z =================================== FAILURES ===================================
2023-05-23T13:41:31.5795148Z __________________ test_classification_model[cuda-resnet101] ___________________
2023-05-23T13:41:31.5795459Z Traceback (most recent call last):
2023-05-23T13:41:31.5795737Z   File "/work/test/test_models.py", line 705, in test_classification_model
2023-05-23T13:41:31.5796046Z     _assert_expected(out.cpu(), model_name, prec=prec)
2023-05-23T13:41:31.5796368Z   File "/work/test/test_models.py", line 155, in _assert_expected
2023-05-23T13:41:31.5796725Z     torch.testing.assert_close(output, expected, rtol=rtol, atol=atol, check_dtype=False, check_device=False)
2023-05-23T13:41:31.5797207Z   File "/opt/conda/envs/ci/lib/python3.8/site-packages/torch/testing/_comparison.py", line 1511, in assert_close
2023-05-23T13:41:31.5797525Z     raise error_metas[0].to_error(msg)
2023-05-23T13:41:31.5797814Z AssertionError: Tensor-likes are not close!
2023-05-23T13:41:31.5797970Z 
2023-05-23T13:41:31.5798060Z Mismatched elements: 1 / 50 (2.0%)
2023-05-23T13:41:31.5798341Z Greatest absolute difference: 5.10198974609375 at index (0, 22) (up to 0.2 allowed)
2023-05-23T13:41:31.5798665Z Greatest relative difference: 0.2689853608608246 at index (0, 22) (up to 0.2 allowed)
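
For context, the failing check is torch.testing.assert_close with the tolerances printed above (0.2 absolute and relative). A minimal sketch of that tolerance rule, using hypothetical values chosen to mirror the reported mismatch at index (0, 22):

import torch

# assert_close passes element-wise only if |actual - expected| <= atol + rtol * |expected|.
# These tensors are hypothetical, picked so the first element violates that bound.
expected = torch.tensor([13.87, 100.0])
actual = torch.tensor([18.97, 100.1])

try:
    torch.testing.assert_close(actual, expected, rtol=0.2, atol=0.2)
except AssertionError as e:
    print(e)  # "Tensor-likes are not close!" plus the greatest absolute/relative differences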

Versions

PyTorch nightly 2.1.0

cc @pmeier @NicolasHug @ptrblck @malfet @ngimel

@ptrblck (Contributor) commented May 25, 2023

I'm able to reproduce the issue on an A100 using a different seed than 0, which is hard-coded in the test.
Based on the failure, it seems we seed the test and compare the result against a pre-defined output stored as a .pkl file in the expect folder, as seen here.
I was also able to verify that the model parameters as well as the inputs indeed change based on the seed:

diff --git a/test/test_models.py b/test/test_models.py
index 91aa66c667..d2b312ab15 100644
--- a/test/test_models.py
+++ b/test/test_models.py
@@ -674,7 +674,10 @@ def test_vitc_models(model_fn, dev):
 @pytest.mark.parametrize("model_fn", list_model_fns(models))
 @pytest.mark.parametrize("dev", cpu_and_gpu())
 def test_classification_model(model_fn, dev):
-    set_rng_seed(0)
+    import os
+    seed = int(os.getenv("TORCH_SEED"))
+    print("using seed {}".format(seed))
+    set_rng_seed(seed)
     defaults = {
         "num_classes": 50,
         "input_shape": (1, 3, 224, 224),

So far my guess is that the PRNG behavior might have changed between CUDA 11.7 and 11.8 on the A10G, but I still need to verify it on the actual device.
The same behavior is observed on an A40 (seed=0 passes, every other seed fails, as expected).

I'll try to lease an A10G next to reproduce the actual failure with the default seed to check if my guess is correct.

@ptrblck (Contributor) commented May 26, 2023

My guess was wrong: while changing the seed also makes the test fail, the reported errors there are in a larger range.
I was now able to reproduce the issue on an A10G, and the numerical mismatch appears to be caused by TF32 being allowed in cuDNN.
The output values have a large range, reported as:

tensor([[ 8.1592e+03, -2.7165e+04,  2.9925e+03,  1.6079e+04, -6.1412e+03,
         -5.4558e+03,  8.6438e+03,  1.0517e+04,  2.7873e+04,  3.0356e+03,
         -1.1014e+04,  1.9574e+04,  7.1062e+03, -3.5376e+03,  6.9987e+03,
         -6.3800e+03, -1.8092e+04,  1.6719e+04,  2.5773e+03, -2.6049e+03,
         -1.3284e+04, -7.9999e+03,  1.3866e+01,  8.8126e+02, -6.2183e+03,
         -8.9771e+03, -1.0583e+03, -1.0977e+04,  6.3043e+03, -7.0138e+03,
         -1.6880e+04,  6.6776e+03, -1.1648e+04,  3.6115e+03,  2.0045e+04,
          7.8362e+02, -2.1655e+04, -9.3831e+03,  2.6998e+04,  1.5136e+04,
         -3.8140e+03,  1.4637e+03, -1.5687e+04, -6.0253e+03, -3.6343e+03,
          1.8916e+03,  7.9858e+03,  2.1514e+03, -1.0606e+04,  7.2659e+03]],

so an absolute error of ~5 might be expected: TF32 keeps roughly a 10-bit mantissa (about 1e-3 relative precision), and with outputs on the order of 1e4 a deviation of a few units is plausible.
Disabling TF32 for this test lets it pass again.

Additionally, I've created internal cuBLAS and cuDNN logs and compared the outputs to a CPU implementation without seeing any disallowed mismatches.
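
For reference, a minimal sketch of how TF32 can be toggled around such a comparison via PyTorch's backend flags (the actual fix in #7634 may use a different mechanism, e.g. a fixture or decorator):

import torch

# Save the current settings, then disallow TF32 for cuDNN convolutions and
# cuBLAS matmuls so float32 ops run at full float32 precision on Ampere GPUs.
prev_cudnn = torch.backends.cudnn.allow_tf32
prev_matmul = torch.backends.cuda.matmul.allow_tf32
torch.backends.cudnn.allow_tf32 = False
torch.backends.cuda.matmul.allow_tf32 = False

# ... run the affected model/test here ...

# Restore the previous settings so other tests keep their expected behavior.
torch.backends.cudnn.allow_tf32 = prev_cudnn
torch.backends.cuda.matmul.allow_tf32 = prev_matmul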
