Some model tests are failing on GPU #7098
Looking at the experiment by running the script, it seems like the model tests are very sensitive to the seed (the relative differences between different seeds vary by a lot). Also, for some models such as efficientnet, the output will be close to zero if we use random input. To address these two problems, I tried an experiment with real pretrained weights and a real image. Here is the script:

```python
import torch
import torchvision
import random
from PIL import Image

img_path = "grace_hopper_517x606.jpg"
img_pil = Image.open(img_path)

def get_cpu_gpu_model_output_maxdiff(model_fn, seed):
    torch.manual_seed(seed)
    random.seed(seed)
    # Use real weights; we use the DEFAULT weights
    weight_enum = torchvision.models.get_model_weights(model_fn)
    weight = weight_enum.DEFAULT
    preprocess = weight.transforms()
    x_cpu = preprocess(img_pil).unsqueeze(0).to("cpu")
    x_gpu = preprocess(img_pil).unsqueeze(0).to("cuda")
    m_cpu = model_fn(weights=weight).eval()
    m_gpu = model_fn(weights=weight).cuda().eval()
    y_cpu = m_cpu(x_cpu).squeeze(0)
    y_gpu = m_gpu(x_gpu).to("cpu").squeeze(0)
    abs_diff = torch.abs(y_gpu - y_cpu)
    max_abs_diff = torch.max(abs_diff)
    max_abs_idx = torch.argmax(abs_diff)
    max_rel_diff = torch.abs(max_abs_diff / y_cpu[max_abs_idx])
    max_val_gpu = torch.max(torch.abs(y_gpu))
    mean_val_gpu = torch.mean(torch.abs(y_gpu))
    prec = 1e-3
    pass_test = torch.allclose(y_gpu, y_cpu, atol=prec, rtol=prec)
    print(f"  [{seed}] max_abs_diff: {max_abs_diff},\tmax_rel_diff: {max_rel_diff},\tmax_val_gpu: {max_val_gpu},\tmean_val_gpu: {mean_val_gpu},\tpass_test: {pass_test}")

for model_fn in [torchvision.models.resnet.resnet34, torchvision.models.resnet.resnet101, torchvision.models.efficientnet.efficientnet_b0]:
    print(f"model_fn: {model_fn.__name__}")
    for seed in range(1):
        get_cpu_gpu_model_output_maxdiff(model_fn, seed)
```

I used the following image for the test: https://github.com/pytorch/vision/blob/main/test/assets/encode_jpeg/grace_hopper_517x606.jpg

Since we use a real image and real weights, there is no randomness; that's why I only use one seed (I have tried using multiple seeds and I can confirm they produce exactly the same results). Here is the output after using real weights and a real image:
Compared to a random image and random weights, we now have more comparable absolute differences between efficientnet and resnet. Note: max_rel_diff is computed by taking the index with the max_abs_diff and computing the relative difference at that index. Although the results are more consistent across models, now none of the models pass the test at that precision (1e-3). Is this difference between CPU and GPU expected? If yes, I think our TorchVision tests should use real weights and a real image, and then relax the precision constraint for GPU. Otherwise, we should investigate the biggest factor that causes the differences. cc @NicolasHug
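To make the note above concrete, here is a tiny restatement of how max_rel_diff is derived in the script (the tensors are made-up illustrative values, not real model outputs):

```python
import torch

# Made-up tensors standing in for CPU and GPU logits of one image.
y_cpu = torch.tensor([0.5000, -2.0000, 1.5000])
y_gpu = torch.tensor([0.5008, -2.0000, 1.5000])

abs_diff = torch.abs(y_gpu - y_cpu)
idx = torch.argmax(abs_diff)                         # index of the largest absolute gap
max_abs_diff = abs_diff[idx]                         # same as torch.max(abs_diff)
max_rel_diff = max_abs_diff / torch.abs(y_cpu[idx])  # relative gap at that same index
print(max_abs_diff.item(), max_rel_diff.item())      # ~8.0e-4, ~1.6e-3
```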
I compiled the statistics on all classification models in torchvision: https://docs.google.com/spreadsheets/d/162nq0p0-7Be0nBzffyEMhMF5PohNQ6Ew1OPdCcCVsGY/edit#gid=790607453
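For context, a rough sketch of how per-model statistics like these could be gathered, using the torchvision.models.list_models / get_model_builder registration API; this is only my illustration, not necessarily the script behind the spreadsheet:

```python
import torch
import torchvision
from PIL import Image

img_pil = Image.open("grace_hopper_517x606.jpg")

rows = []
for name in torchvision.models.list_models(module=torchvision.models):
    weights = torchvision.models.get_model_weights(name).DEFAULT
    model_fn = torchvision.models.get_model_builder(name)
    x = weights.transforms()(img_pil).unsqueeze(0)
    m_cpu = model_fn(weights=weights).eval()
    m_gpu = model_fn(weights=weights).cuda().eval()
    with torch.inference_mode():
        y_cpu = m_cpu(x).squeeze(0)
        y_gpu = m_gpu(x.cuda()).cpu().squeeze(0)
    rows.append((name, torch.max(torch.abs(y_gpu - y_cpu)).item()))

# Sort by max absolute CPU/GPU difference, largest first.
for name, max_abs_diff in sorted(rows, key=lambda r: r[1], reverse=True):
    print(f"{name}\t{max_abs_diff:.6f}")
```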
Currently some model tests are failing on Linux GPU on GHA.
Error observations:
Here is a sample of the errors from a run on 17 January 2023:
After tracing back, it seems the problem started around 8 or 9 December 2022. We notice that on 8 December 2022 the run succeeded, however it skipped the GPU tests and only ran the CPU tests (example of an 8 December 2022 run).
And on 9 December 2022, we notice it ran both the CPU and GPU tests, and the GPU tests failed by producing results different from their CPU counterparts (example run on 9 December 2022; notice that the failure on resnet101 has a different relative difference from the one on 17 January 2023).
Another observation: on 9 December 2022, if we look at PR #6919, we can see that although the GHA Linux GPU job failed due to the precision problem, the CircleCI GPU test succeeded.
There was no change in the model (resnet34) or the test, and the CPU tests always succeeded between 8 December 2022 and 17 January 2023.
Possible problems
Script to reproduce the problem
Here is a small script that is able to reproduce the problem:
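The script is not reproduced here; a minimal sketch of the kind of CPU-vs-GPU comparison described (fixed seed, randomly initialized resnet34, 1e-3 tolerance) might look like the following. This is an illustrative reconstruction, not the original script:

```python
import copy
import torch
import torchvision

def compare_cpu_gpu(model_fn, seed, prec=1e-3):
    torch.manual_seed(seed)
    m_cpu = model_fn(weights=None).eval()        # randomly initialized weights
    m_gpu = copy.deepcopy(m_cpu).cuda().eval()   # identical weights moved to GPU
    x = torch.rand(1, 3, 224, 224)               # random input batch
    with torch.inference_mode():
        y_cpu = m_cpu(x)
        y_gpu = m_gpu(x.cuda()).cpu()
    max_abs_diff = torch.max(torch.abs(y_gpu - y_cpu)).item()
    passed = torch.allclose(y_gpu, y_cpu, atol=prec, rtol=prec)
    print(f"{model_fn.__name__} seed={seed}: max_abs_diff={max_abs_diff:.6f} pass={passed}")

for seed in range(3):
    compare_cpu_gpu(torchvision.models.resnet34, seed)
```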
When I ran this script on an AWS cluster with CUDA 11.6 and Python 3.8 (I provide the result of collect_env.py at the end of this section), I got the following output log:
Our tests have a tolerance of 0.001, and these results are consistently bigger than the usual tolerance for resnet models, hence this is unexpected. There seems to be no change associated with resnet models in torchvision, so most likely some change in pytorch-core caused these differences.
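If the gap ultimately turns out to be expected, one possible shape for a device-dependent tolerance (only an illustration of the idea of relaxing the GPU constraint mentioned above, not the actual torchvision test code) would be:

```python
import torch

def assert_outputs_close(y_gpu, y_cpu, device):
    # Keep the usual 1e-3 tolerance on CPU, but allow a looser bound on CUDA,
    # where different cuDNN/TF32 kernel choices can change low-order bits.
    tol = 1e-3 if device == "cpu" else 5e-3  # 5e-3 is an arbitrary illustrative value
    torch.testing.assert_close(y_gpu, y_cpu, atol=tol, rtol=tol)
```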
The environment I used to run this reproduction:
cc @osalpekar @seemethere @atalman