test_quantized_models.py times out within Nightly CI #1857

Closed

seemethere opened this issue Feb 7, 2020 · 12 comments

@seemethere (Member) commented Feb 7, 2020

The nightly pipelines are currently failing when running pytest ., specifically around the test/test_ops.py tests.

Example CircleCI Logs:
https://app.circleci.com/jobs/github/pytorch/vision/85108

Log Excerpt:

+ pytest .
============================= test session starts ==============================
platform linux -- Python 3.6.10, pytest-5.3.5, py-1.8.1, pluggy-0.13.1
rootdir: $SRC_DIR
collected 286 items

test/test_backbone_utils.py ..                                           [  0%]
test/test_cpp_models.py sssssssssssssssssssssssssssssss                  [ 11%]
test/test_datasets.py ..........                                         [ 15%]
test/test_datasets_samplers.py .....                                     [ 16%]
test/test_datasets_transforms.py ..                                      [ 17%]
test/test_datasets_utils.py ..........                                   [ 20%]
test/test_datasets_video_utils.py ....                                   [ 22%]
test/test_functional_tensor.py ........                                  [ 25%]
test/test_hub.py sss                                                     [ 26%]
test/test_io.py ss............                                           [ 31%]
test/test_models.py .........s.........................................  [ 48%]
test/test_models_detection_utils.py .                                    [ 49%]
test/test_onnx.py sssssssssssss                                          [ 53%]
test/test_ops.py ..ss..ss..ss..ss..ss..ss..ss..ss.s...ss..ss             [ 68%]
Too long with no output (exceeded 10m0s): context deadline exceeded

This is what is currently affecting pytorch/pytorch#33103.

This may also be related to #1528.

cc @fmassa
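
For context, the 10m0s limit in the log is CircleCI's default no_output_timeout for a run step. A sketch of how it could be raised in .circleci/config.yml (the step name and command here are illustrative, not the repo's actual config):

    - run:
        name: run unit tests
        command: pytest .
        no_output_timeout: 30m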

@fmassa (Member) commented Feb 10, 2020

Thanks for the report!

I think this might be due to @zou3519's changes in pytorch/pytorch#32495, which also caused the nested tensors CI to fail; see pytorch/pytorch#33091.

@zou3519 (Contributor) commented Feb 11, 2020

@fmassa does test_ops.py run C++ extensions?

@zou3519 (Contributor) commented Feb 11, 2020

I don't think pytorch/pytorch#32495 is related because test_ops.py doesn't run C++ extensions directly. However, there's an easy way to test, assuming the CI runs on each PR: send a PR that disables building with ninja. I'll do that and see what happens.

EDIT: Here's another reason why I think pytorch/pytorch#32495 is unrelated. Consider #1850. The log for the first failing job, if we download it completely, doesn't say it is using ninja to build, which means that the CI ran and failed before the change in pytorch/pytorch#32495 was committed.

EDIT2: Here is the test PR where we build without ninja. This one also fails in the same way as reported in this issue, so ninja is probably unrelated.
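
For anyone reproducing, a minimal sketch of what "build without ninja" can look like for a torch.utils.cpp_extension build, assuming the use_ninja option that shipped around pytorch/pytorch#32495 (the extension name and source file below are hypothetical):

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="example_ext",  # hypothetical extension, for illustration only
    ext_modules=[CppExtension("example_ext", ["example.cpp"])],
    # Fall back to the plain distutils backend instead of ninja
    cmdclass={"build_ext": BuildExtension.with_options(use_ninja=False)},
)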

@seemethere (Member Author)

Do we have any other leads on why this is happening? Perhaps adding some debugging information to the pytest run would provide some insight into which test it's actually timing out on.
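
For example, standard pytest flags can surface this: -v prints each test name as it runs, and --durations=25 reports the 25 slowest tests at the end of the session:

pytest . -v --durations=25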

@seemethere (Member Author)

After submitting #1880, I now believe the culprit is something that runs after test/test_quantized_models.py::ModelTester::test_googlenet.

From these logs:

test/test_ops.py::DeformConvTester::test_forward_cpu_contiguous PASSED   [ 67%]
test/test_ops.py::DeformConvTester::test_forward_cpu_non_contiguous PASSED [ 68%]
test/test_ops.py::DeformConvTester::test_forward_cuda_contiguous SKIPPED [ 68%]
test/test_ops.py::DeformConvTester::test_forward_cuda_non_contiguous SKIPPED [ 68%]
test/test_quantized_models.py::ModelTester::test_googlenet PASSED        [ 69%]

Too long with no output (exceeded 10m0s): context deadline exceeded

seemethere changed the title from "test/test_ops.py times out within Nightly CI" to "test_quantized_models.py times out within Nightly CI" on Feb 13, 2020
@zou3519 (Contributor) commented Feb 13, 2020

Is this reproducible locally?

@seemethere (Member Author)

I think it's highly dependent on performance: if the test takes longer than 10 minutes to run, it'll time out like this. Were there any tests within that file that were changed recently?
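
One way to check locally is to time every test in the suspect file; --durations=0 lists the runtime of each test, and the third-party pytest-timeout plugin can enforce a per-test limit similar to the CI one:

pytest test/test_quantized_models.py -v --durations=0
pytest test/test_quantized_models.py --timeout=600  # requires pytest-timeout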

@fmassa (Member) commented Feb 13, 2020

test_ops.py does call C++ extensions, but as you mentioned, test_ops might not be the culprit.

If the issue is in test_quantized_models, I'd recommend checking with @raghuramank100 to see if anything has changed on the quantization side.

@seemethere (Member Author) commented Feb 13, 2020

From my local testing, the test that takes the longest is the Inception V3 test in test_quantized_models.py, which takes about 846s (~14 minutes) on a devserver.

I got this result from running on my branch that outputs junit logs:

CU_VERSION=cpu PYTHON_VERSION=3.8 packaging/build_conda.sh

[Screenshot from Feb 13, 2020 omitted]

Results are here: results.zip
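
For reference, pytest can emit junit XML with a built-in flag; the branch above presumably does something equivalent to the following (output path assumed):

pytest . --junitxml=test-results/junit.xml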

This would lead me to believe that this test is the culprit.

I've submitted #1885 to skip the test for now and get the nightly build pipeline back on track.
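
A skip along those lines might look like the following sketch (illustrative only, not the actual diff in #1885; the method name is assumed, while ModelTester comes from the test IDs above):

import unittest

class ModelTester(unittest.TestCase):
    @unittest.skip("takes ~14 min on CI and exceeds the 10-minute no-output timeout, see #1857")
    def test_inception_v3(self):
        ...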

@seemethere (Member Author) commented Feb 26, 2020

@fmassa @zou3519 Who would be the right person to assign this issue to so we can re-add this test to the nightly matrix?

Assigning @fmassa in the interim.

@fmassa (Member) commented Mar 20, 2020

Assigning this to @raghuramank100, who leads the quantization efforts.

fmassa assigned raghuramank100 and unassigned fmassa on Mar 20, 2020
@datumbox (Contributor)

This should be fixed by #3196.
