
Rebuild ci-arm, ci-cpu, and ci-gpu container #8177

Closed · 26 tasks done
areusch opened this issue Jun 2, 2021 · 54 comments
@areusch (Contributor) commented Jun 2, 2021

This is a tracking issue for the process of updating the TVM ci- containers to reflect the following PRs:

#8169

Steps:

areusch self-assigned this on Jun 2, 2021
@areusch (Contributor, Author) commented Jun 2, 2021

cc @d-smirnov @u99127

areusch changed the title from "Rebuild ci-arm container" to "Rebuild ci-arm and ci-cpu container" on Jun 2, 2021
@areusch (Contributor, Author) commented Jun 4, 2021

also #8088 needs to be a part of this

@areusch (Contributor, Author) commented Jun 16, 2021

expanding to ci-gpu and including #8268

areusch changed the title from "Rebuild ci-arm and ci-cpu container" to "Rebuild ci-arm, ci-cpu, and ci-gpu container" on Jun 16, 2021
@Lunderberg (Contributor):

While implementing #8029, I found a few version incompatibilities in tlcpack/ci_gpu:v0.75 (see the reproduction sketch below the list).

  • The current version, tensorflow==2.3.1, is incompatible with cuda 11.0. Based on this table, we'll want to switch to tensorflow==2.4.0. The warning can be reproduced by running import tensorflow in a python shell.

  • WARNING:root:scikit-learn version 0.24.2 is not supported. Minimum required version: 0.17. Maximum required version: 0.19.2. Disabling scikit-learn conversion API. The warning can be reproduced by running import coremltools in a python shell.

    The current version of coremltools==4.1 requires scikit-learn<=0.19.2. Unfortunately, the current version of gluoncv==0.10.1 has a conflicting minimum requirement of scikit-learn>=0.23.2, so it won't be a simple update of a single package.
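A quick way to reproduce both warnings (a minimal sketch, assuming the tlcpack/ci_gpu:v0.75 image mentioned above is available locally; python3/pip3 inside the image are assumptions):

docker run --rm tlcpack/ci_gpu:v0.75 bash -c '
  python3 -c "import tensorflow"    # prints the TF/CUDA compatibility warning
  python3 -c "import coremltools"   # prints the scikit-learn version warning
  pip3 check                        # reports the conflicting pins, e.g. gluoncv vs. coremltools
'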

@areusch (Contributor, Author) commented Jun 21, 2021

I won't have time to get to this this week. If someone else wants to push the images, go for it. #8193 could also potentially join this merge train.

If this isn't closed by next Monday, I'll handle it then.

@mbrookhart (Contributor):

I can probably take this. It looks like there are a number of things we need to merge to main before regenerating the images?

It looks like the follow-up to #8169 was merged as #8245. Do we need to merge #8088 and a CI-only version of #8193 before generating the images? It looks like #8268 failed a flaky test in the TF frontend; @areusch, can you rebase?

@Lunderberg Do you want to make a PR for the TF update?

@areusch (Contributor, Author) commented Jun 22, 2021

@mbrookhart rebased

@Lunderberg (Contributor):

@mbrookhart #8306 is now open; it includes the TF version update.

Regarding coremltools, we're already installing the latest release, coremltools==4.1, and even the beta version coremltools==5.0b1 gives warnings about the scikit-learn and tensorflow versions. That said, all the tests in tests/python/frontend/coreml and tests/python/contrib/test_coreml_codegen.py pass despite the warnings. If somebody with an iOS device available could test whether tests/python/contrib/test_coreml_runtime.py passes, I think that would be a good additional check.
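For reference, a sketch of the test invocations named above, run from a TVM checkout inside the image (pytest is assumed to be the runner):

python3 -m pytest tests/python/frontend/coreml tests/python/contrib/test_coreml_codegen.py
# and, on a machine with an iOS device attached:
python3 -m pytest tests/python/contrib/test_coreml_runtime.py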

@mbrookhart (Contributor):

I've merged everything I think we want, but unfortunately, none of the docker images are building.

ci_arm fails with this: /install/ubuntu_download_arm_compute_lib_binaries.sh: line 54: curl: command not found. I think that's an issue with #8245. @d-smirnov @leandron, can you debug and fix? You probably need to install curl in the download script or use wget.
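A minimal sketch of the two workarounds suggested here (the real fix is discussed below; exact wget/curl arguments depend on the script):

# option 1: make sure curl is present before the download script runs
apt-get update && apt-get install -y --no-install-recommends curl
# option 2: swap the failing curl call in ubuntu_download_arm_compute_lib_binaries.sh for wget,
# roughly: wget -O "<output-file>" "<url>"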

Everything else is failing with:

E: Failed to fetch https://apt.llvm.org/bionic/dists/llvm-toolchain-bionic/main/source/Sources.gz  File has unexpected size (2207 != 2205). Mirror sync in progress? [IP: 199.232.194.49 443]
   Hashes of expected file:
    - Filesize:2205 [weak]
    - SHA256:ad346b2d1b4cffcf27cad9cdd14f8d5dd047cfc10cd09e6e17671d78eeac3501
    - SHA1:a4cb9420123a78efb36e79f4cbe8acda7c27ecf7 [weak]
    - MD5Sum:ae153a86b6a5f0f7ee88f5c97ab13a36 [weak]
   Release file created at: Tue, 22 Jun 2021 23:51:39 +0000
E: Some index files failed to download. They have been ignored, or old ones used instead.
The command '/bin/sh -c bash /install/ubuntu1804_install_llvm.sh' returned a non-zero code: 100

I wonder if there's something wrong with the llvm release at the moment?

@mbrookhart (Contributor):

I'm trying to rebuild the images one at a time instead of in parallel on this machine to see if that fixes the LLVM issue

@u99127 (Contributor) commented Jun 23, 2021

I've merged everything I think we want, but unfortunately, none of the docker images are building.

ci_arm fails with this: /install/ubuntu_download_arm_compute_lib_binaries.sh: line 54: curl: command not found. I think that's an issue with #8245. @d-smirnov @leandron, can you debug and fix? You probably need to install curl in the download script or use wget.

I've had a look and probably have a fix for it - it appears that curl is installed in individual install files rather than in a common place, which is where I'll look to move it.

Everything else is failing with:

E: Failed to fetch https://apt.llvm.org/bionic/dists/llvm-toolchain-bionic/main/source/Sources.gz  File has unexpected size (2207 != 2205). Mirror sync in progress? [IP: 199.232.194.49 443]
   Hashes of expected file:
    - Filesize:2205 [weak]
    - SHA256:ad346b2d1b4cffcf27cad9cdd14f8d5dd047cfc10cd09e6e17671d78eeac3501
    - SHA1:a4cb9420123a78efb36e79f4cbe8acda7c27ecf7 [weak]
    - MD5Sum:ae153a86b6a5f0f7ee88f5c97ab13a36 [weak]
   Release file created at: Tue, 22 Jun 2021 23:51:39 +0000
E: Some index files failed to download. They have been ignored, or old ones used instead.
The command '/bin/sh -c bash /install/ubuntu1804_install_llvm.sh' returned a non-zero code: 100

I wonder if there's something wrong with the llvm release at the moment?

That happens when the LLVM mirrors are updating to pick up new packages. Nothing much can be done but wait, sadly. @leandron probably knows more.

Ramana
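Since the failure is transient, one generic workaround (just a sketch, not TVM-specific) is to retry the apt step after a delay:

for attempt in 1 2 3; do
  apt-get update && break
  echo "apt-get update failed (attempt ${attempt}); mirror may still be syncing" >&2
  sleep 60
done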

@u99127 (Contributor) commented Jun 23, 2021

@mbrookhart - see #8310

@mbrookhart (Contributor):

Thanks, @u99127. I'll retry building this morning with these changes.

@mbrookhart (Contributor):

@areusch is out this week, but his Sphinx fix had a syntax error. If someone could review this, I'd appreciate it:
#8316

With that PR and the other two linked above, I can build the ci_cpu, ci_gpu, and ci_arm images. ci_qemu is failing with a GPG error that arose with one of these PRs (#8156, #8190); I've asked @mehrdadh to debug.

@mehrdadh (Member):

@mbrookhart looks like http://keys.gnupg.net/ is down. I tried replacing it with hkp://keyserver.ubuntu.com:80 and it worked. The failing command was:
gpg --keyserver keys.gnupg.net --recv-keys 0x3353C9CEF108B584
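i.e., the working replacement for the command above, fetching the same key from the Ubuntu keyserver instead:

gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 0x3353C9CEF108B584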

@mehrdadh (Member):

@mbrookhart sent a PR to fix this issue: #8319

@Lunderberg (Contributor):

A question on the rebuilds overall. Beyond just the version dependencies, we've had a few cases where the built images would result in failing unit tests (e.g. #8339). Would it be good to add running the unit tests to the Dockerfiles? Since we need to test the built images anyway before using them, that would make testing each image, with the unit tests appropriate for that image, an automatic part of the build process.

@leandron (Contributor) commented Jun 25, 2021

A question on the rebuilds overall. Beyond just the version dependencies, we've had a few cases where the built images would result in failing unit tests (e.g. #8339). Would it be good to add running the unit tests to the Dockerfiles? Since we need to test the built images anyway before using them, that would make testing each image, with the unit tests appropriate for that image, an automatic part of the build process.

I think running the TVM unit tests would imply also building TVM as part of the Docker image rebuild, which doesn't make much sense. Based on that, I think we should avoid adding the unit tests to the Dockerfiles themselves.

IMHO, it would be better to be able to qualify images in a systematic way as part of the release of new images.

We (mainly myself and @areusch) are currently working on a way to improve the image rebuild testing and make it a daily Jenkins job in the upstream CI. The next logical step is to validate the images using the regular TVM build process, pointing at "staging" docker images.
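As a rough illustration of that flow (the staging tag and script names here are illustrative, not the final Jenkins job):

docker/build.sh ci_gpu                                    # rebuild the candidate image locally
docker tag <built-image-id> tlcpack/ci-gpu:staging        # publish it under a "staging" tag (placeholder name)
docker/bash.sh tlcpack/ci-gpu:staging ./tests/scripts/task_config_build_gpu.sh   # exercise the regular build steps against it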

@mbrookhart (Contributor) commented Jun 25, 2021

I think a nightly job would be great; trying to do it manually this week has been a mess :)

@leandron (Contributor):

I think a nightly job would be great; trying to do it manually this week has been a mess :)

Yeah, it won't solve the problems per se, but once we get it into a stable state, we'll be able to take one problem at a time, rather than all of them at once, when we decide to update the images - that's the advantage.

@mbrookhart (Contributor):

I built the images and pushed them to a PR this morning. I expect the job to fail due to #8339, but we'll see if there's anything else.

@mbrookhart (Contributor):

Okay, I see failures with tflite, but I also saw this:

WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested

standard_init_linux.go:219: exec user process caused: exec format error

script returned exit code 1

I built the arm image on an amd64 system; do I need to build it on an ARM system instead?

@areusch (Contributor, Author) commented Jun 29, 2021

The platform error is due to building ci-arm on the wrong arch (I use an m6g instance on AWS). I'm rebuilding it for you now.
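A quick sanity check for this kind of mismatch (the image tag is a placeholder):

uname -m                                                               # host arch: aarch64 on an AWS m6g, x86_64 on an amd64 box
docker image inspect <ci-arm-image:tag> --format '{{.Architecture}}'   # should report arm64 for the ci-arm image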

@areusch (Contributor, Author) commented Jun 29, 2021

 ---> Running in 6b00dc0292ec
Hit:1 https://apt.llvm.org/xenial llvm-toolchain-xenial-4.0 InRelease
Hit:2 https://apt.llvm.org/xenial llvm-toolchain-xenial-7 InRelease
Hit:3 https://apt.llvm.org/xenial llvm-toolchain-xenial-8 InRelease
Hit:4 https://apt.llvm.org/xenial llvm-toolchain-xenial-9 InRelease
Hit:5 https://apt.llvm.org/xenial llvm-toolchain-xenial InRelease
Hit:6 http://ports.ubuntu.com/ubuntu-ports bionic InRelease
Hit:7 http://ports.ubuntu.com/ubuntu-ports bionic-updates InRelease
Hit:8 http://ports.ubuntu.com/ubuntu-ports bionic-backports InRelease
Hit:9 http://ports.ubuntu.com/ubuntu-ports bionic-security InRelease
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
Package gcc-aarch64-linux-gnu is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

Package g++-aarch64-linux-gnu is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'g++-aarch64-linux-gnu' has no installation candidate
E: Package 'gcc-aarch64-linux-gnu' has no installation candidate
The command '/bin/sh -c bash /install/ubuntu_download_arm_compute_lib_binaries.sh' returned a non-zero code: 100
ERROR: docker build failed.

@u99127 (Contributor) commented Jun 29, 2021

Looks like I hadn't tested it properly. I think #8371 should fix it.
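For context, the failure above comes from trying to install the aarch64 cross-compilers on an aarch64 host, where those packages aren't available; a minimal sketch of the kind of guard that avoids it (the actual change is whatever lands in #8371):

if [ "$(uname -m)" != "aarch64" ]; then
  # only install the cross-compilers when building on a non-aarch64 (e.g. amd64) host
  apt-get install -y --no-install-recommends gcc-aarch64-linux-gnu g++-aarch64-linux-gnu
fi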

@mbrookhart (Contributor):

I think we have fixes in for the tflite error and the arm build error (thanks @u99127!). I'll regenerate the cpu/gpu/qemu images this morning off 8d4df91; @areusch, could you try the ARM image on that commit?

@mbrookhart (Contributor):

The staging job failed because Keras didn't seem to make it into the installed packages - did TF drop Keras in 2.4?

Also, @Lunderberg, there are a couple more TF failures:

https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/107/pipeline

@Lunderberg (Contributor):

@mbrookhart Whoops, I was mistaken. I thought that keras was a dependency of TF, because it was in the ubuntu_install_tensorflow.sh script, and that the removal of the pinned keras version was part of the same issue that required the pinned h5py version. I was wrong: the keras installation is separate from the TF installation, but depends on TF.

It looks like the standalone keras package is only used if we're importing a keras model that isn't a tensorflow model. Based on this announcement, support for non-tensorflow models is being dropped in keras 2.4, so I think we should pin it to keras<2.4.0. @leandron, were the issues you ran into in #6810 based on using keras 2.3.1 alongside cuda 10.2, and do you know if they would still be present with cuda 11.0?

@mbrookhart (Contributor):

Is this something we can fix, or do we need to revert the TF upgrade for this round of CI upgrades?

@leandron (Contributor) commented Jul 2, 2021

@leandron, were the issues you ran into in #6810 based on using keras 2.3.1 alongside cuda 10.2, and do you know if they would still be present with cuda 11.0?

The original discussion was kept in #6754, but my local tests are CPU-only, so I don't have any data on CUDA versions w.r.t. the referenced issue.

@leandron (Contributor) commented Jul 2, 2021

I think we should pin it to keras<2.4.0

I tried this, and just to report back, it fails with AttributeError: module 'tensorflow.python.framework.ops' has no attribute '_TensorLike', which is sort of expected with that combination.

@leandron (Contributor) commented Jul 2, 2021

On a CPU-only machine, all frontend tests pass with the following combination:
tensorflow==2.4.2 keras==2.4.3

There is no need to pin h5py (which, out of curiosity, resolves to version 2.10.0).

cc @mbrookhart @Lunderberg

@mbrookhart (Contributor):

Thank you!

It looks like which version of h5py you get might depend on your pip version:

-> % pip install h5py  
Collecting h5py
  Downloading h5py-3.3.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (4.5 MB)

I think we should keep that constraint in there.
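Putting the two points together, the install step would look roughly like this (the exact h5py bound below is illustrative; the real pin is whatever ends up in ubuntu_install_tensorflow.sh):

pip3 install "tensorflow==2.4.2" "keras==2.4.3"   # combination reported to pass the frontend tests on CPU
pip3 install "h5py<3.0"                           # keep an explicit h5py constraint, since the version pip resolves
                                                  # otherwise varies with the pip version (2.10.0 vs 3.3.0 above)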

@areusch (Contributor, Author) commented Jul 2, 2021

It may also depend on the order in which things are installed.

@mbrookhart (Contributor) commented Jul 2, 2021

I'm going to rebuild the images on 7b898d0 and try again. Thanks, @leandron and @areusch!

@mbrookhart (Contributor):

The images are built and running, but I'm hitting a few more TF issues. Anyone care to take a look? #8193

@Lunderberg (Contributor):

It looks like the latest failures are from running out of GPU memory. Does the new TF increase memory usage, or change when Python garbage collection runs?

@mbrookhart (Contributor):

2021-07-07 02:54:43.479442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1632 MB memory) -> physical GPU (device: 0, name: GeForce GTX 750 Ti, pci bus id: 0000:01:00.0, compute capability: 5.0)

Anyone know how a 750 Ti got into CI?
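A quick way to check which GPU (and how much memory) a given CI executor actually has, run on the node in question:

nvidia-smi --query-gpu=name,memory.total --format=csv,noheader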

@areusch (Contributor, Author) commented Jul 8, 2021

+ echo 'INFO: NODE_NAME=sampl.aladdin.cuda1 EXECUTOR_NUMBER=0'
INFO: NODE_NAME=sampl.aladdin.cuda1 EXECUTOR_NUMBER=0

Looks like another SAMPL node, @tqchen. Do we want to keep this in for diversity of testing or remove it? My vote is to remove it, since it is a probabilistic failure scenario rather than a deterministic one. I think our CI is so long that we can't afford to allow code to merge that isn't expected to deterministically pass a following run.

@mbrookhart (Contributor):

Hmm, now I hit an out of memory issue on a T4: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/111/pipeline/393

@mbrookhart (Contributor):

Still hitting the Tensorflow OOM issue on the T4. Any ideas?

@mbrookhart (Contributor):

I don't see any excessive memory use when I put the test on a loop locally or run the full tensorflow test suite locally with this version of TF...
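(A local loop of that kind might look like the following; the test path and -k filter are illustrative, not the exact command used:)

for i in $(seq 1 20); do
  python3 -m pytest tests/python/frontend/tensorflow/test_forward.py -k pooling || break
done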

@leandron (Contributor) commented Jul 13, 2021

I don't see any excessive memory use when I put the test on a loop locally or run the full tensorflow test suite locally with this version of TF...

Does Jenkins have more than one executor running on that same machine?

@areusch (Contributor, Author) commented Jul 13, 2021 via email

@mbrookhart (Contributor):

@trevor-m I'm hitting an error in this test you added:

if package_version.parse(tf.VERSION) >= package_version.parse("2.4.1"):
    _test_pooling(
        input_shape=[2, 9, 10, 2],
        window_shape=[4, 4],
        padding=[[0, 0], [0, 1], [2, 3], [0, 0]],
        pooling_type="MAX",
        dilation_rate=[1, 1],
        strides=[1, 1],
    )

When updating the TF version in CI.

The error is

E       ValueError: Nonzero explicit padding in the batch or depth dimensions is not supported for '{{node max_pool}} = MaxPool[T=DT_FLOAT, data_format="NCHW", explicit_paddings=[0, 0, 0, 1, 2, 3, 0, 0], ksize=[1, 1, 4, 4], padding="EXPLICIT", strides=[1, 1, 1, 1]](Placeholder)' with input shapes: [2,2,9,10].

Did you expect that test to be run in NHWC?

@trevor-m (Contributor):

@trevor-m I'm hitting an error in this test you added:

if package_version.parse(tf.VERSION) >= package_version.parse("2.4.1"):
    _test_pooling(
        input_shape=[2, 9, 10, 2],
        window_shape=[4, 4],
        padding=[[0, 0], [0, 1], [2, 3], [0, 0]],
        pooling_type="MAX",
        dilation_rate=[1, 1],
        strides=[1, 1],
    )

When updating the TF version in CI.

The error is

E       ValueError: Nonzero explicit padding in the batch or depth dimensions is not supported for '{{node max_pool}} = MaxPool[T=DT_FLOAT, data_format="NCHW", explicit_paddings=[0, 0, 0, 1, 2, 3, 0, 0], ksize=[1, 1, 4, 4], padding="EXPLICIT", strides=[1, 1, 1, 1]](Placeholder)' with input shapes: [2,2,9,10].

Did you expect that test to be run in NHWC?

Hmm, yeah, it looks like it was meant for NHWC. I guess we may need to explicitly state the format now? Or, if it is easier, we could change it to NCHW.

@mbrookhart (Contributor) commented Jul 14, 2021

def _test_pooling(input_shape, **kwargs):
    _test_pooling_iteration(input_shape, **kwargs)
    if is_gpu_available():
        if len(input_shape) == 4:
            # When a GPU is available, the same test is rerun in NCHW by permuting the
            # input shape, but the explicit padding passed via kwargs is left in NHWC order.
            input_shape = [input_shape[ii] for ii in (0, 3, 1, 2)]
            kwargs["data_format"] = "NCHW"
            _test_pooling_iteration(input_shape, **kwargs)

It looks like that utility is doing something a little odd with the GPU path. Let me add the explicit padding to the GPU transformation.

@mbrookhart (Contributor):

c6064b0 fixed it.

@mbrookhart (Contributor):

@mehrdadh (Member):

@mbrookhart trying to reproduce the issue now.

@mbrookhart (Contributor):

Hey everyone, thanks for all the help! This merged this morning. Hopefully @leandron's nightly builds will prevent this level of pain in the future :D
