Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unit-tests for M1 #6111

Closed
wants to merge 16 commits into from
Closed

Add unit-tests for M1 #6111

wants to merge 16 commits into from

Conversation

datumbox
Copy link
Contributor

@datumbox datumbox commented May 31, 2022

Addressing #5948 (comment)

This PR:

  • Adds binary build execution runs for M1 on main branch
  • Adds unit-test execution for M1 on main and nightly branches

Copy link
Contributor

@atalman atalman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -6,6 +6,7 @@ on:
push:
branches:
- nightly
- main
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want to execute this on main ? Since you already compiling vision inside test-m1.yml.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test-m1 only tests on py3.8 and executes only part of the checks (no wheel validation). I thought it would be best to do this at least once in each PR on main to be able to catch the issues easier. It's a middle ground tradeoff between executing the tests in every PR/commit and doing so only once a day. Hopefully the volume of commits that get merged every day on TorchVision isn't significant to affect the overall resources.

.github/workflows/build-m1-binaries.yml Show resolved Hide resolved
.github/workflows/test-m1.yml Show resolved Hide resolved
@datumbox
Copy link
Contributor Author

datumbox commented May 31, 2022

As suspected there are failures on some of the m1 tests:

FAILED test/test_datasets.py::DatasetFolderTestCase::test_classes - datasets_...
FAILED test/test_datasets.py::DatasetFolderTestCase::test_feature_types - dat...
FAILED test/test_datasets.py::DatasetFolderTestCase::test_is_valid_file - dat...
FAILED test/test_datasets.py::DatasetFolderTestCase::test_num_examples - data...
FAILED test/test_datasets.py::DatasetFolderTestCase::test_smoke - datasets_ut...
FAILED test/test_datasets.py::DatasetFolderTestCase::test_str_smoke - dataset...
FAILED test/test_datasets.py::DatasetFolderTestCase::test_transforms - datase...
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-grace_hopper_517x606.jpg]
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-gray_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-rgb_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-cmyk_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[L-ImageReadMode.GRAY-grace_hopper_517x606.jpg]
FAILED test/test_image.py::test_decode_jpeg[L-ImageReadMode.GRAY-gray_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[L-ImageReadMode.GRAY-rgb_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[RGB-ImageReadMode.RGB-grace_hopper_517x606.jpg]
FAILED test/test_image.py::test_decode_jpeg[RGB-ImageReadMode.RGB-gray_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[RGB-ImageReadMode.RGB-rgb_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg_errors - AssertionError: Regex pa...
FAILED test/test_image.py::test_decode_bad_huffman_images - RuntimeError: dec...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt34_4.jpg] - Ass...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt.jpg] - Asserti...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt34_3.jpg] - Ass...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt34_2.jpg] - Ass...
FAILED test/test_image.py::test_encode_jpeg_errors - AssertionError: Regex pa...
= 24 failed, 15515 passed, 14011 skipped, 57 xfailed, 456 warnings in 1054.68s (0:17:34) =

I'm going to rerun the test with more verbosity show more details.

@pmeier Any initial ideas what could be the problem with the Dataset test?

@NicolasHug We had quite a few issues with the specific jpeg decoding tests. I remember you did an investigation previously on Windows and the main issue was the underlying JPEG lib? Any ideas whether we should take these failures seriously?

@NicolasHug
Copy link
Member

@NicolasHug We had quite a few issues with the specific jpeg decoding tests. I remember you did an investigation previously on Windows and the main issue was the underlying JPEG lib? Any ideas whether we should take these failures seriously?

I don't think it's related to comparison tests. From the log, the error is

 _______ test_decode_jpeg[None-ImageReadMode.UNCHANGED-cmyk_pytorch.jpg] ________
Traceback (most recent call last):
  File "/Users/ec2-user/runner/_work/vision/vision/test/test_image.py", line 95, in test_decode_jpeg
    img_ljpeg = decode_image(data, mode=mode)
  File "/Users/ec2-user/runner/_work/vision/vision/torchvision/io/image.py", line 227, in decode_image
    output = torch.ops.image.decode_image(input, mode.value)
RuntimeError: decode_jpeg: torchvision not compiled with libjpeg support

@NicolasHug
Copy link
Member

The Dataset failures:

datasets_utils.UsageError: Failed to import module 'av'. This probably means that the current test case needs 'av' installed, but it is not a dependency of torchvision. You need to install it manually, for example 'pip install av'.

@datumbox
Copy link
Contributor Author

@NicolasHug Thanks for checking. You are right, the Dataset is related to the AV dependency. The log is a bit hard to read because it truncates the output. I've added the dependency.

@atalman @malfet It seems on #5948, you installed openjpeg instead of libjpeg. Could you provide some context on why this change? Note that if we are to use openjpeg we would need to make sure TorchVision compiles against it. Currently this is not configured and thus the jpeg related functions don't work.

@malfet
Copy link
Contributor

malfet commented May 31, 2022

@atalman @malfet It seems on #5948, you installed openjpeg instead of libjpeg. Could you provide some context on why this change?

No reason, I just copy-n-pasted requirements from https://github.com/pytorch/builder/blob/bfce31f0d6af712081504ccda5c8d6c6d9ee7c94/build_m1_domains.sh#L12

But I'm a bit surprised that it is not used, as libjpeg.dylib is indeed bundled into whl. Investigating

@malfet
Copy link
Contributor

malfet commented May 31, 2022

Hmm, so I can read both png and jpeg images on M1 using torchvision pip wheel (which were built using OpenJPEG):

% python -c 'import torchvision; print(torchvision.__version__, torchvision.io.read_image("tux.png").shape, torchvision.io.read_image("hummingbird.jpeg").shape)'
0.12.0 torch.Size([4, 479, 400]) torch.Size([3, 668, 1140])

But I can not do it using nightly:

% python -c 'import torchvision; print(torchvision.__version__, torchvision.io.read_image("tux.png").shape, torchvision.io.read_image("hummingbird.jpeg").shape)'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/malfet/miniconda3/envs/py39-test/lib/python3.9/site-packages/torchvision/io/image.py", line 249, in read_image
    data = read_file(path)
  File "/Users/malfet/miniconda3/envs/py39-test/lib/python3.9/site-packages/torchvision/io/image.py", line 47, in read_file
    data = torch.ops.image.read_file(path)
  File "/Users/malfet/miniconda3/envs/py39-test/lib/python3.9/site-packages/torch/_ops.py", line 143, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: isString()INTERNAL ASSERT FAILED at "/Users/ec2-user/runner/_work/vision/vision/conda-env-2414653305/lib/python3.9/site-packages/torch/include/ATen/core/ivalue_inl.h":2088, please report a bug to PyTorch. Expected String but got Blob

Ok, and the problem should have been obvious, considering https://github.com/pytorch/vision/runs/6669674559?check_suite_focus=true#step:3:126 and https://github.com/pytorch/pytorch/blob/release/1.11/aten/src/ATen/core/ivalue_inl.h#L2088

@datumbox
Copy link
Contributor Author

datumbox commented Jun 1, 2022

@malfet I'm sorry, I don't think I follow you. Here is my understanding of the situation:

  1. The M1 builds use a openjpeg instead of libjpeg. In your earlier comment, you indicated it was accidentally selected due to copy-pasting but I suspect it was chosen because libjpeg is not available on M1. The problem is openjpeg is not a tested solution and it doesn't seem to be picked up by our compilation scripts.
  2. Not all tests of TorchVision pass on M1. In your previous comment, you seem to indicate that things don't work on nightly. It is not obvious to me what the problem is and if it needs to be addressed on Core or on TorchVision.

Please note that we are currently 5 calendar days away from the RC cut on Domains (6/6). Since in the UK the 2nd and 3rd of June are bank holidays, we effectively have 2 biz days (today and the 6th). We currently don't have any proof that the M1 build of TorchVision works properly and unfortunately we are running out of time to test and fix things.

This is where we could use your help:

  1. Fix the JPEG compilation issues identified with the M1 builds.
  2. Add a CI job that runs the unit-tests of TorchVision on M1.
  3. Fix outstanding issues with the CI jobs of M1 such as the hanging of jobs due to Secret source: None on the S3 upload step. @atalman is aware of the issue and had to manually kick things to restart the queue. This PR attempts to mitigate it partially but I think a guard might be required.

If you have the bandwidth to provide support on the above, once our team is back on the 6th of June we will try to address any broken unit-tests on the M1, cherrypick the fixes and move them to the release branch. If you are not able to provide support due to bandwidth issues, then our team will revert all the M1 PRs to avoid releasing a broken binary in the next release and we can review adding M1 support to TorchVision in the weeks ahead. (cc @NicolasHug who is the PoC for this release)

I'll close this PR but feel free to copy it and use any part that you find useful to help you in your work. Thanks!

@datumbox datumbox closed this Jun 1, 2022
@datumbox datumbox deleted the ci/m1 branch June 1, 2022 08:52
@NicolasHug NicolasHug mentioned this pull request Jun 6, 2022
NicolasHug added a commit to YosuaMichael/vision that referenced this pull request Jun 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants