Add unit-tests for M1 #6111

datumbox · 2022-05-31T07:46:56Z

This PR:

Adds binary build execution runs for M1 on main branch
Adds unit-test execution for M1 on main and nightly branches

atalman

LGTM

atalman · 2022-05-31T13:33:05Z

.github/workflows/build-m1-binaries.yml

@@ -6,6 +6,7 @@ on:
  push:
    branches:
      - nightly
+      - main


Why do we want to execute this on main ? Since you already compiling vision inside test-m1.yml.

The test-m1 only tests on py3.8 and executes only part of the checks (no wheel validation). I thought it would be best to do this at least once in each PR on main to be able to catch the issues easier. It's a middle ground tradeoff between executing the tests in every PR/commit and doing so only once a day. Hopefully the volume of commits that get merged every day on TorchVision isn't significant to affect the overall resources.

.github/workflows/build-m1-binaries.yml

.github/workflows/test-m1.yml

datumbox · 2022-05-31T13:40:44Z

As suspected there are failures on some of the m1 tests:

FAILED test/test_datasets.py::DatasetFolderTestCase::test_classes - datasets_...
FAILED test/test_datasets.py::DatasetFolderTestCase::test_feature_types - dat...
FAILED test/test_datasets.py::DatasetFolderTestCase::test_is_valid_file - dat...
FAILED test/test_datasets.py::DatasetFolderTestCase::test_num_examples - data...
FAILED test/test_datasets.py::DatasetFolderTestCase::test_smoke - datasets_ut...
FAILED test/test_datasets.py::DatasetFolderTestCase::test_str_smoke - dataset...
FAILED test/test_datasets.py::DatasetFolderTestCase::test_transforms - datase...
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-grace_hopper_517x606.jpg]
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-gray_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-rgb_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-cmyk_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[L-ImageReadMode.GRAY-grace_hopper_517x606.jpg]
FAILED test/test_image.py::test_decode_jpeg[L-ImageReadMode.GRAY-gray_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[L-ImageReadMode.GRAY-rgb_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[RGB-ImageReadMode.RGB-grace_hopper_517x606.jpg]
FAILED test/test_image.py::test_decode_jpeg[RGB-ImageReadMode.RGB-gray_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[RGB-ImageReadMode.RGB-rgb_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg_errors - AssertionError: Regex pa...
FAILED test/test_image.py::test_decode_bad_huffman_images - RuntimeError: dec...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt34_4.jpg] - Ass...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt.jpg] - Asserti...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt34_3.jpg] - Ass...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt34_2.jpg] - Ass...
FAILED test/test_image.py::test_encode_jpeg_errors - AssertionError: Regex pa...
= 24 failed, 15515 passed, 14011 skipped, 57 xfailed, 456 warnings in 1054.68s (0:17:34) =

I'm going to rerun the test with more verbosity show more details.

@pmeier Any initial ideas what could be the problem with the Dataset test?

@NicolasHug We had quite a few issues with the specific jpeg decoding tests. I remember you did an investigation previously on Windows and the main issue was the underlying JPEG lib? Any ideas whether we should take these failures seriously?

NicolasHug · 2022-05-31T14:43:29Z

@NicolasHug We had quite a few issues with the specific jpeg decoding tests. I remember you did an investigation previously on Windows and the main issue was the underlying JPEG lib? Any ideas whether we should take these failures seriously?

I don't think it's related to comparison tests. From the log, the error is

 _______ test_decode_jpeg[None-ImageReadMode.UNCHANGED-cmyk_pytorch.jpg] ________
Traceback (most recent call last):
  File "/Users/ec2-user/runner/_work/vision/vision/test/test_image.py", line 95, in test_decode_jpeg
    img_ljpeg = decode_image(data, mode=mode)
  File "/Users/ec2-user/runner/_work/vision/vision/torchvision/io/image.py", line 227, in decode_image
    output = torch.ops.image.decode_image(input, mode.value)
RuntimeError: decode_jpeg: torchvision not compiled with libjpeg support

NicolasHug · 2022-05-31T14:44:24Z

The Dataset failures:

datasets_utils.UsageError: Failed to import module 'av'. This probably means that the current test case needs 'av' installed, but it is not a dependency of torchvision. You need to install it manually, for example 'pip install av'.

datumbox · 2022-05-31T15:16:21Z

@NicolasHug Thanks for checking. You are right, the Dataset is related to the AV dependency. The log is a bit hard to read because it truncates the output. I've added the dependency.

@atalman @malfet It seems on #5948, you installed openjpeg instead of libjpeg. Could you provide some context on why this change? Note that if we are to use openjpeg we would need to make sure TorchVision compiles against it. Currently this is not configured and thus the jpeg related functions don't work.

malfet · 2022-05-31T17:02:03Z

@atalman @malfet It seems on #5948, you installed openjpeg instead of libjpeg. Could you provide some context on why this change?

No reason, I just copy-n-pasted requirements from https://github.com/pytorch/builder/blob/bfce31f0d6af712081504ccda5c8d6c6d9ee7c94/build_m1_domains.sh#L12

But I'm a bit surprised that it is not used, as libjpeg.dylib is indeed bundled into whl. Investigating

malfet · 2022-05-31T17:53:35Z

Hmm, so I can read both png and jpeg images on M1 using torchvision pip wheel (which were built using OpenJPEG):

% python -c 'import torchvision; print(torchvision.__version__, torchvision.io.read_image("tux.png").shape, torchvision.io.read_image("hummingbird.jpeg").shape)'
0.12.0 torch.Size([4, 479, 400]) torch.Size([3, 668, 1140])

But I can not do it using nightly:

% python -c 'import torchvision; print(torchvision.__version__, torchvision.io.read_image("tux.png").shape, torchvision.io.read_image("hummingbird.jpeg").shape)'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/malfet/miniconda3/envs/py39-test/lib/python3.9/site-packages/torchvision/io/image.py", line 249, in read_image
    data = read_file(path)
  File "/Users/malfet/miniconda3/envs/py39-test/lib/python3.9/site-packages/torchvision/io/image.py", line 47, in read_file
    data = torch.ops.image.read_file(path)
  File "/Users/malfet/miniconda3/envs/py39-test/lib/python3.9/site-packages/torch/_ops.py", line 143, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: isString()INTERNAL ASSERT FAILED at "/Users/ec2-user/runner/_work/vision/vision/conda-env-2414653305/lib/python3.9/site-packages/torch/include/ATen/core/ivalue_inl.h":2088, please report a bug to PyTorch. Expected String but got Blob

Ok, and the problem should have been obvious, considering https://github.com/pytorch/vision/runs/6669674559?check_suite_focus=true#step:3:126 and https://github.com/pytorch/pytorch/blob/release/1.11/aten/src/ATen/core/ivalue_inl.h#L2088

datumbox · 2022-06-01T08:52:48Z

@malfet I'm sorry, I don't think I follow you. Here is my understanding of the situation:

The M1 builds use a openjpeg instead of libjpeg. In your earlier comment, you indicated it was accidentally selected due to copy-pasting but I suspect it was chosen because libjpeg is not available on M1. The problem is openjpeg is not a tested solution and it doesn't seem to be picked up by our compilation scripts.
Not all tests of TorchVision pass on M1. In your previous comment, you seem to indicate that things don't work on nightly. It is not obvious to me what the problem is and if it needs to be addressed on Core or on TorchVision.

Please note that we are currently 5 calendar days away from the RC cut on Domains (6/6). Since in the UK the 2nd and 3rd of June are bank holidays, we effectively have 2 biz days (today and the 6th). We currently don't have any proof that the M1 build of TorchVision works properly and unfortunately we are running out of time to test and fix things.

This is where we could use your help:

Fix the JPEG compilation issues identified with the M1 builds.
Add a CI job that runs the unit-tests of TorchVision on M1.
Fix outstanding issues with the CI jobs of M1 such as the hanging of jobs due to Secret source: None on the S3 upload step. @atalman is aware of the issue and had to manually kick things to restart the queue. This PR attempts to mitigate it partially but I think a guard might be required.

If you have the bandwidth to provide support on the above, once our team is back on the 6th of June we will try to address any broken unit-tests on the M1, cherrypick the fixes and move them to the release branch. If you are not able to provide support due to bandwidth issues, then our team will revert all the M1 PRs to avoid releasing a broken binary in the next release and we can review adding M1 support to TorchVision in the weeks ahead. (cc @NicolasHug who is the PoC for this release)

I'll close this PR but feel free to copy it and use any part that you find useful to help you in your work. Thanks!

Add unit-tests for M1

6f63829

facebook-github-bot added the cla signed label May 31, 2022

Remove S3 upload from PRs

833f214

datumbox added the module: ci label May 31, 2022

datumbox added 3 commits May 31, 2022 09:23

Depend on build m1

3bb85fa

Switch to conda

f64872a

Install test dependencies

bafae2f

datumbox mentioned this pull request May 31, 2022

Add TorchVision wheel builds for M1 #5948

Merged

datumbox added 5 commits May 31, 2022 09:50

remove scipy

e7394f4

Run pytest via conda env

2d462f5

Show output and separate tasks

1403635

Remove unnecessary var

7ec1a71

Restrict to main and nightly branches

63f9e62

datumbox added the enhancement label May 31, 2022

datumbox mentioned this pull request May 31, 2022

[Cherrypick 0.13] Add M1 build and test jobs #6112

Closed

Add conda exec

d35029b

atalman approved these changes May 31, 2022

View reviewed changes

datumbox added 3 commits May 31, 2022 14:52

Changing verbosity

2757787

Address review comments.

f5678e4

Adding verbosity flag

00d2de4

datumbox force-pushed the ci/m1 branch from 169e9e7 to ec8647c Compare May 31, 2022 14:54

YosuaMichael mentioned this pull request May 31, 2022

[v0.13] Release Tracker #6081

Closed

Add more dependencies

377a0f4

datumbox force-pushed the ci/m1 branch from ec8647c to 377a0f4 Compare May 31, 2022 15:02

Switch openjpeg to libjpeg

632cba4

datumbox closed this Jun 1, 2022

datumbox deleted the ci/m1 branch June 1, 2022 08:52

NicolasHug mentioned this pull request Jun 6, 2022

Add unit-tests for M1 #6132

Merged

NicolasHug added a commit to YosuaMichael/vision that referenced this pull request Jun 6, 2022

Cherry pick pytorch#6122 and pytorch#6132 which supersedes pytorch#6111

d3b1d3c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add unit-tests for M1 #6111

Add unit-tests for M1 #6111

datumbox commented May 31, 2022 •

edited

Loading

atalman left a comment

atalman May 31, 2022

datumbox May 31, 2022

datumbox commented May 31, 2022 •

edited

Loading

NicolasHug commented May 31, 2022

NicolasHug commented May 31, 2022

datumbox commented May 31, 2022

malfet commented May 31, 2022

malfet commented May 31, 2022 •

edited

Loading

datumbox commented Jun 1, 2022

Add unit-tests for M1 #6111

Add unit-tests for M1 #6111

Conversation

datumbox commented May 31, 2022 • edited Loading

atalman left a comment

Choose a reason for hiding this comment

atalman May 31, 2022

Choose a reason for hiding this comment

datumbox May 31, 2022

Choose a reason for hiding this comment

datumbox commented May 31, 2022 • edited Loading

NicolasHug commented May 31, 2022

NicolasHug commented May 31, 2022

datumbox commented May 31, 2022

malfet commented May 31, 2022

malfet commented May 31, 2022 • edited Loading

datumbox commented Jun 1, 2022

datumbox commented May 31, 2022 •

edited

Loading

datumbox commented May 31, 2022 •

edited

Loading

malfet commented May 31, 2022 •

edited

Loading