"CUDA error, AssertionError: Tensor-likes are not close!" on model test for cuda-resnet101 #7618
Comments
I'm able to reproduce the issue on an A100 using a different seed via the following patch:

```diff
diff --git a/test/test_models.py b/test/test_models.py
index 91aa66c667..d2b312ab15 100644
--- a/test/test_models.py
+++ b/test/test_models.py
@@ -674,7 +674,10 @@ def test_vitc_models(model_fn, dev):
 @pytest.mark.parametrize("model_fn", list_model_fns(models))
 @pytest.mark.parametrize("dev", cpu_and_gpu())
 def test_classification_model(model_fn, dev):
-    set_rng_seed(0)
+    import os
+    seed = int(os.getenv("TORCH_SEED"))
+    print("using seed {}".format(seed))
+    set_rng_seed(seed)
     defaults = {
         "num_classes": 50,
         "input_shape": (1, 3, 224, 224),
```

So far I would guess the PRNG behavior might have changed between 11.7 and 11.8 for the A10G, but I still need to verify it on the actual device. I'll try to lease an A10G next to reproduce the actual failure with the default seed and check whether my guess is correct.
My guess was wrong: while changing the seed does make the test fail, the reported mismatches occur on outputs with much larger magnitudes:

```
tensor([[ 8.1592e+03, -2.7165e+04,  2.9925e+03,  1.6079e+04, -6.1412e+03,
         -5.4558e+03,  8.6438e+03,  1.0517e+04,  2.7873e+04,  3.0356e+03,
         -1.1014e+04,  1.9574e+04,  7.1062e+03, -3.5376e+03,  6.9987e+03,
         -6.3800e+03, -1.8092e+04,  1.6719e+04,  2.5773e+03, -2.6049e+03,
         -1.3284e+04, -7.9999e+03,  1.3866e+01,  8.8126e+02, -6.2183e+03,
         -8.9771e+03, -1.0583e+03, -1.0977e+04,  6.3043e+03, -7.0138e+03,
         -1.6880e+04,  6.6776e+03, -1.1648e+04,  3.6115e+03,  2.0045e+04,
          7.8362e+02, -2.1655e+04, -9.3831e+03,  2.6998e+04,  1.5136e+04,
         -3.8140e+03,  1.4637e+03, -1.5687e+04, -6.0253e+03, -3.6343e+03,
          1.8916e+03,  7.9858e+03,  2.1514e+03, -1.0606e+04,  7.2659e+03]])
```

Given these output magnitudes (~1e4), an absolute error of ~5 might be expected. Additionally, I've created internal cuBLAS and cuDNN logs and compared the outputs to a CPU implementation without seeing any disallowed mismatches.
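As a rough illustration of why an absolute error of ~5 is plausible at these magnitudes, here is a minimal sketch with made-up numbers; the tolerance values are assumptions chosen for illustration, not the ones used by torchvision's test suite:

```python
import torch

# Made-up reference output on the same order of magnitude as the values above (~1e4).
expected = torch.tensor([8159.2, -27165.0, 2992.5, 16079.0, -6141.2])
# Simulated GPU result that differs by an absolute error of roughly 5.
actual = expected + torch.tensor([4.8, -5.1, 3.9, 5.0, -4.6])

abs_err = (actual - expected).abs().max().item()
rel_err = ((actual - expected).abs() / expected.abs()).max().item()
print(f"max abs error: {abs_err:.2f}, max rel error: {rel_err:.2e}")
# With outputs around 1e4, an absolute error of ~5 corresponds to a relative
# error of ~1e-3 or less, so it is not necessarily a sign of a real numerical bug.

# Assumed tolerances, chosen only to make the point; the real test uses its own precision setting.
torch.testing.assert_close(actual, expected, rtol=1e-3, atol=10.0)
```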
🐛 Describe the bug
Similar to: #7143
When switching CI from CUDA 11.7 to CUDA 11.8, the unit tests on Linux fail:
#7616
Versions
nightly 2.1.0
cc @pmeier @NicolasHug @ptrblck @malfet @ngimel