
Rebuild ci-arm, ci-cpu, and ci-gpu container #8177

Closed · 26 tasks done
areusch opened this issue Jun 2, 2021 · 54 comments
@areusch (Contributor) commented Jun 2, 2021

This is a tracking issue for the process of updating the TVM ci- containers to reflect the following PRs:

#8169

Steps:

areusch self-assigned this on Jun 2, 2021
@areusch (Contributor, Author) commented Jun 2, 2021

cc @d-smirnov @u99127

areusch changed the title from "Rebuild ci-arm container" to "Rebuild ci-arm and ci-cpu container" on Jun 2, 2021
@areusch (Contributor, Author) commented Jun 4, 2021

also #8088 needs to be a part of this

@areusch (Contributor, Author) commented Jun 16, 2021

expanding to ci-gpu and including #8268

areusch changed the title from "Rebuild ci-arm and ci-cpu container" to "Rebuild ci-arm, ci-cpu, and ci-gpu container" on Jun 16, 2021
@Lunderberg (Contributor):

While implementing #8029, I found a few version incompatibilities in tlcpack/ci_gpu:v0.75 (see the reproduction sketch below the list).

  • The current version, tensorflow==2.3.1, is incompatible with cuda 11.0. Based on this table, we'll want to switch to tensorflow==2.4.0. The warning can be reproduced by running import tensorflow in a python shell.

  • WARNING:root:scikit-learn version 0.24.2 is not supported. Minimum required version: 0.17. Maximum required version: 0.19.2. Disabling scikit-learn conversion API. The warning can be reproduced by running import coremltools in a python shell.

    The current version of coremltools==4.1 requires scikit-learn<=0.19.2. Unfortunately, the current version of gluoncv==0.10.1 has a conflicting minimum requirement of scikit-learn>=0.23.2, so it won't be a simple update of a single package.
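A quick way to reproduce both warnings (a minimal sketch, assuming the tlcpack/ci_gpu:v0.75 image mentioned above is available locally; python3/pip3 inside the image are assumptions):

docker run --rm tlcpack/ci_gpu:v0.75 bash -c '
  python3 -c "import tensorflow"    # prints the TF/CUDA compatibility warning
  python3 -c "import coremltools"   # prints the scikit-learn version warning
  pip3 check                        # reports the conflicting pins, e.g. gluoncv vs. coremltools
'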

@areusch (Contributor, Author) commented Jun 21, 2021

I won't have time to get to this this week. If someone else wants to push the images, go for it. #8193 could also potentially join this merge train.

If this isn't closed by next Monday, I'll handle it then.

@mbrookhart (Contributor):

I can probably take this. It looks like there are a number of things we need to merge to main before regenerating the images?

It looks like the follow-up to #8169 was merged as #8245. Do we need to merge #8088 and a CI-only version of #8193 before generating the images? It looks like #8268 failed a flaky test in the TF frontend; @areusch, can you rebase?

@Lunderberg Do you want to make a PR for the TF update?

@areusch (Contributor, Author) commented Jun 22, 2021

@mbrookhart rebased

@Lunderberg (Contributor):

@mbrookhart #8306 is now open; it includes the TF version update.

Regarding coremltools, we're already installing the latest release, coremltools==4.1, and even the beta version coremltools==5.0b1 gives warnings about the scikit-learn and tensorflow versions. That said, all the tests in tests/python/frontend/coreml and tests/python/contrib/test_coreml_codegen.py pass despite the warnings. If somebody with an iOS device available could test whether tests/python/contrib/test_coreml_runtime.py passes, I think that would be a good additional check.
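For reference, a sketch of the test invocations named above, run from a TVM checkout inside the image (pytest is assumed to be the runner):

python3 -m pytest tests/python/frontend/coreml tests/python/contrib/test_coreml_codegen.py
# and, on a machine with an iOS device attached:
python3 -m pytest tests/python/contrib/test_coreml_runtime.py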

@mbrookhart (Contributor):

I've merged everything I think we want, but unfortunately, none of the docker images are building.

ci_arm fails with this: /install/ubuntu_download_arm_compute_lib_binaries.sh: line 54: curl: command not found. I think that's an issue with #8245. @d-smirnov @leandron, can you debug and fix? You probably need to install curl in the download script or use wget.
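A minimal sketch of the two workarounds suggested here (the real fix is discussed below; exact wget/curl arguments depend on the script):

# option 1: make sure curl is present before the download script runs
apt-get update && apt-get install -y --no-install-recommends curl
# option 2: swap the failing curl call in ubuntu_download_arm_compute_lib_binaries.sh for wget,
# roughly: wget -O "<output-file>" "<url>"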

Everything else is failing with:

E: Failed to fetch https://apt.llvm.org/bionic/dists/llvm-toolchain-bionic/main/source/Sources.gz  File has unexpected size (2207 != 2205). Mirror sync in progress? [IP: 199.232.194.49 443]
   Hashes of expected file:
    - Filesize:2205 [weak]
    - SHA256:ad346b2d1b4cffcf27cad9cdd14f8d5dd047cfc10cd09e6e17671d78eeac3501
    - SHA1:a4cb9420123a78efb36e79f4cbe8acda7c27ecf7 [weak]
    - MD5Sum:ae153a86b6a5f0f7ee88f5c97ab13a36 [weak]
   Release file created at: Tue, 22 Jun 2021 23:51:39 +0000
E: Some index files failed to download. They have been ignored, or old ones used instead.
The command '/bin/sh -c bash /install/ubuntu1804_install_llvm.sh' returned a non-zero code: 100

I wonder if there's something wrong with the llvm release at the moment?

@mbrookhart (Contributor):

I'm trying to rebuild the images one at a time instead of in parallel on this machine to see if that fixes the LLVM issue

@u99127 (Contributor) commented Jun 23, 2021

I've merged everything I think we want, but unfortunately, none of the docker images are building.

ci_arm fails with this: /install/ubuntu_download_arm_compute_lib_binaries.sh: line 54: curl: command not found. I think that's an issue with #8245. @d-smirnov @leandron, can you debug and fix? You probably need to install curl in the download script or use wget.

I've had a look and probably have a fix for it - it appears that curl is installed in individual install files rather than in a common place, which is where I'll look to move it.

Everything else is failing with:

E: Failed to fetch https://apt.llvm.org/bionic/dists/llvm-toolchain-bionic/main/source/Sources.gz  File has unexpected size (2207 != 2205). Mirror sync in progress? [IP: 199.232.194.49 443]
   Hashes of expected file:
    - Filesize:2205 [weak]
    - SHA256:ad346b2d1b4cffcf27cad9cdd14f8d5dd047cfc10cd09e6e17671d78eeac3501
    - SHA1:a4cb9420123a78efb36e79f4cbe8acda7c27ecf7 [weak]
    - MD5Sum:ae153a86b6a5f0f7ee88f5c97ab13a36 [weak]
   Release file created at: Tue, 22 Jun 2021 23:51:39 +0000
E: Some index files failed to download. They have been ignored, or old ones used instead.
The command '/bin/sh -c bash /install/ubuntu1804_install_llvm.sh' returned a non-zero code: 100

I wonder if there's something wrong with the llvm release at the moment?

That happens when the LLVM mirrors are updating to pick up new packages. Nothing much can be done but wait, sadly. @leandron probably knows more.

Ramana
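Since the failure is transient, one generic workaround (just a sketch, not TVM-specific) is to retry the apt step after a delay:

for attempt in 1 2 3; do
  apt-get update && break
  echo "apt-get update failed (attempt ${attempt}); mirror may still be syncing" >&2
  sleep 60
done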

@u99127 (Contributor) commented Jun 23, 2021

@mbrookhart - see #8310

@mbrookhart (Contributor):

Thanks, @u99127. I'll retry building this morning with these changes.

@mbrookhart (Contributor):

@areusch is out this week, but his Sphinx fix had a syntax error. If someone could review this, I'd appreciate it:
#8316

With that PR and the other two linked above, I can build the ci_cpu, ci_gpu, and ci_arm images. ci_qemu is failing with a GPG error that arose with one of these PRs (#8156, #8190); I've asked @mehrdadh to debug.

@mehrdadh (Member):

@mbrookhart looks like http://keys.gnupg.net/ is down. I tried replacing it with hkp://keyserver.ubuntu.com:80 and it worked. The failing command was:
gpg --keyserver keys.gnupg.net --recv-keys 0x3353C9CEF108B584
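i.e., the working replacement for the command above, fetching the same key from the Ubuntu keyserver instead:

gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 0x3353C9CEF108B584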

@mehrdadh (Member):

@mbrookhart sent a PR to fix this issue: #8319

@Lunderberg (Contributor):

A question on the rebuilds overall. Beyond just the version dependencies, we've had a few cases where the built images would result in failing unit tests (e.g. #8339). Would it be good to add running the unit tests to the Dockerfiles? Since we need to test the built images anyway before using them, that would make testing each image, with the unit tests appropriate for that image, an automatic part of the build process.

@leandron (Contributor) commented Jun 25, 2021

A question on the rebuilds overall. Beyond just the version dependencies, we've had a few cases where the built images would result in failing unit tests (e.g. #8339). Would it be good to add running the unit tests to the Dockerfiles? Since we need to test the built images anyway before using them, that would make testing each image, with the unit tests appropriate for that image, an automatic part of the build process.

I think running the TVM unit tests would imply also building TVM as part of the Docker image rebuild, which doesn't make much sense. Based on that, I think we should avoid adding the unit tests to the Dockerfiles themselves.

IMHO, it would be better to be able to qualify images in a systematic way as part of the release of new images.

We (mainly myself and @areusch) are currently working on a way to improve the image rebuild testing and make it a daily Jenkins job in the upstream CI. The next logical step is to validate the images using the regular TVM build process, pointing at "staging" docker images.
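As a rough illustration of that flow (the staging tag and script names here are illustrative, not the final Jenkins job):

docker/build.sh ci_gpu                                    # rebuild the candidate image locally
docker tag <built-image-id> tlcpack/ci-gpu:staging        # publish it under a "staging" tag (placeholder name)
docker/bash.sh tlcpack/ci-gpu:staging ./tests/scripts/task_config_build_gpu.sh   # exercise the regular build steps against it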

@mbrookhart (Contributor) commented Jun 25, 2021

I think a nightly job would be great; trying to do it manually this week has been a mess :)

@leandron (Contributor):

I think a nightly job would be great; trying to do it manually this week has been a mess :)

Yeah, it won't solve the problems per se, but once we get it into a stable state, we'll be able to take one problem at a time, rather than all of them at once, when we decide to update the images - that's the advantage.

@mbrookhart (Contributor):

I built the images and pushed them to a PR this morning. I expect the job to fail due to #8339, but we'll see if there's anything else.

@mbrookhart (Contributor):

Okay, I see failures with tflite, but I also saw this:

WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested

standard_init_linux.go:219: exec user process caused: exec format error

script returned exit code 1

I built the arm image on an amd64 system; do I need to build it on an ARM system instead?

@areusch (Contributor, Author) commented Jun 29, 2021

The platform error is due to building ci-arm on the wrong arch (I use an m6g instance on AWS). I'm rebuilding it for you now.
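A quick sanity check for this kind of mismatch (the image tag is a placeholder):

uname -m                                                               # host arch: aarch64 on an AWS m6g, x86_64 on an amd64 box
docker image inspect <ci-arm-image:tag> --format '{{.Architecture}}'   # should report arm64 for the ci-arm image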

@areusch (Contributor, Author) commented Jun 29, 2021

 ---> Running in 6b00dc0292ec
Hit:1 https://apt.llvm.org/xenial llvm-toolchain-xenial-4.0 InRelease
Hit:2 https://apt.llvm.org/xenial llvm-toolchain-xenial-7 InRelease
Hit:3 https://apt.llvm.org/xenial llvm-toolchain-xenial-8 InRelease
Hit:4 https://apt.llvm.org/xenial llvm-toolchain-xenial-9 InRelease
Hit:5 https://apt.llvm.org/xenial llvm-toolchain-xenial InRelease
Hit:6 http://ports.ubuntu.com/ubuntu-ports bionic InRelease
Hit:7 http://ports.ubuntu.com/ubuntu-ports bionic-updates InRelease
Hit:8 http://ports.ubuntu.com/ubuntu-ports bionic-backports InRelease
Hit:9 http://ports.ubuntu.com/ubuntu-ports bionic-security InRelease
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
Package gcc-aarch64-linux-gnu is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

Package g++-aarch64-linux-gnu is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'g++-aarch64-linux-gnu' has no installation candidate
E: Package 'gcc-aarch64-linux-gnu' has no installation candidate
The command '/bin/sh -c bash /install/ubuntu_download_arm_compute_lib_binaries.sh' returned a non-zero code: 100
ERROR: docker build failed.

@u99127 (Contributor) commented Jun 29, 2021

Looks like I hadn't tested it properly. I think #8371 should fix it.
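For context, the failure above comes from trying to install the aarch64 cross-compilers on an aarch64 host, where those packages aren't available; a minimal sketch of the kind of guard that avoids it (the actual change is whatever lands in #8371):

if [ "$(uname -m)" != "aarch64" ]; then
  # only install the cross-compilers when building on a non-aarch64 (e.g. amd64) host
  apt-get install -y --no-install-recommends gcc-aarch64-linux-gnu g++-aarch64-linux-gnu
fi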

@mbrookhart (Contributor):

I think we have fixes in for the tflite error and the arm build error (thanks @u99127!). I'll regenerate the cpu/gpu/qemu images this morning off 8d4df91; @areusch, could you try the ARM image on that commit?

@mbrookhart (Contributor):

The staging job failed because Keras didn't seem to make it into the installed packages - did TF drop Keras in 2.4?

Also, @Lunderberg, there are a couple more TF failures:

https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/107/pipeline

@Lunderberg (Contributor):

@mbrookhart Whoops, I was mistaken. I thought that keras was a dependency of TF, because it was in the ubuntu_install_tensorflow.sh script, and that the removal of the pinned keras version was part of the same issue that required the pinned h5py version. I was wrong: the keras installation is separate from the TF installation, but depends on TF.

It looks like the standalone keras package is only used if we're importing a keras model that isn't a tensorflow model. Based on this announcement, support for non-tensorflow models is being dropped in keras 2.4, so I think we should pin it to keras<2.4.0. @leandron, were the issues you ran into in #6810 based on using keras 2.3.1 alongside cuda 10.2, and do you know if they would still be present with cuda 11.0?

@mbrookhart (Contributor):

Is this something we can fix, or do we need to revert the TF upgrade for this round of CI upgrades?

@leandron (Contributor) commented Jul 2, 2021

@leandron, were the issues you ran into in #6810 based on using keras 2.3.1 alongside cuda 10.2, and do you know if they would still be present with cuda 11.0?

The original discussion was kept in #6754, but my local tests are CPU-only, so I don't have any data on CUDA versions w.r.t. the referenced issue.

@leandron (Contributor) commented Jul 2, 2021

I think we should pin it to keras<2.4.0

I tried this, and just to report back, it fails with AttributeError: module 'tensorflow.python.framework.ops' has no attribute '_TensorLike', which is sort of expected with that combination.

@leandron (Contributor) commented Jul 2, 2021

On a CPU-only machine, all frontend tests pass with the following combination:
tensorflow==2.4.2 keras==2.4.3

There is no need to pin h5py (which, out of curiosity, resolves to version 2.10.0).

cc @mbrookhart @Lunderberg

@mbrookhart (Contributor):

Thank you!

It looks like which version of h5py you get might depend on your pip version:

-> % pip install h5py  
Collecting h5py
  Downloading h5py-3.3.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (4.5 MB)

I think we should keep that constraint in there.
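Putting the two points together, the install step would look roughly like this (the exact h5py bound below is illustrative; the real pin is whatever ends up in ubuntu_install_tensorflow.sh):

pip3 install "tensorflow==2.4.2" "keras==2.4.3"   # combination reported to pass the frontend tests on CPU
pip3 install "h5py<3.0"                           # keep an explicit h5py constraint, since the version pip resolves
                                                  # otherwise varies with the pip version (2.10.0 vs 3.3.0 above)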

@areusch (Contributor, Author) commented Jul 2, 2021

It may also depend on the order in which things are installed.

@mbrookhart (Contributor) commented Jul 2, 2021

I'm going to rebuild the images on 7b898d0 and try again. Thanks, @leandron and @areusch!

@mbrookhart (Contributor):

The images are built and running, but I'm hitting a few more TF issues. Anyone care to take a look? #8193

@Lunderberg (Contributor):

It looks like the latest failures are from running out of GPU memory. Does the new TF increase memory usage, or change when Python garbage collection runs?

@mbrookhart (Contributor):

2021-07-07 02:54:43.479442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1632 MB memory) -> physical GPU (device: 0, name: GeForce GTX 750 Ti, pci bus id: 0000:01:00.0, compute capability: 5.0)

Anyone know how a 750 Ti got into CI?
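A quick way to check which GPU (and how much memory) a given CI executor actually has, run on the node in question:

nvidia-smi --query-gpu=name,memory.total --format=csv,noheader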

@areusch (Contributor, Author) commented Jul 8, 2021

+ echo 'INFO: NODE_NAME=sampl.aladdin.cuda1 EXECUTOR_NUMBER=0'
INFO: NODE_NAME=sampl.aladdin.cuda1 EXECUTOR_NUMBER=0

Looks like another SAMPL node, @tqchen. Do we want to keep this in for diversity of testing or remove it? My vote is to remove it, since it is a probabilistic failure scenario rather than a deterministic one. I think our CI is so long that we can't afford to allow code to merge that isn't expected to deterministically pass a following run.

@mbrookhart (Contributor):

Hmm, now I hit an out of memory issue on a T4: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/111/pipeline/393

@mbrookhart (Contributor):

Still hitting the Tensorflow OOM issue on the T4. Any ideas?

@mbrookhart (Contributor):

I don't see any excessive memory use when I put the test on a loop locally or run the full tensorflow test suite locally with this version of TF...
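(A local loop of that kind might look like the following; the test path and -k filter are illustrative, not the exact command used:)

for i in $(seq 1 20); do
  python3 -m pytest tests/python/frontend/tensorflow/test_forward.py -k pooling || break
done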

@leandron (Contributor) commented Jul 13, 2021

I don't see any excessive memory use when I put the test on a loop locally or run the full tensorflow test suite locally with this version of TF...

Does Jenkins have more than one executor running on that same machine?

@areusch (Contributor, Author) commented Jul 13, 2021 via email

@mbrookhart (Contributor):

@trevor-m I'm hitting an error in this test you added:

if package_version.parse(tf.VERSION) >= package_version.parse("2.4.1"):
    _test_pooling(
        input_shape=[2, 9, 10, 2],
        window_shape=[4, 4],
        padding=[[0, 0], [0, 1], [2, 3], [0, 0]],
        pooling_type="MAX",
        dilation_rate=[1, 1],
        strides=[1, 1],
    )

When updating the TF version in CI.

The error is

E       ValueError: Nonzero explicit padding in the batch or depth dimensions is not supported for '{{node max_pool}} = MaxPool[T=DT_FLOAT, data_format="NCHW", explicit_paddings=[0, 0, 0, 1, 2, 3, 0, 0], ksize=[1, 1, 4, 4], padding="EXPLICIT", strides=[1, 1, 1, 1]](Placeholder)' with input shapes: [2,2,9,10].

Did you expect that test to be run in NHWC?

@trevor-m (Contributor):

@trevor-m I'm hitting an error in this test you added:

if package_version.parse(tf.VERSION) >= package_version.parse("2.4.1"):
    _test_pooling(
        input_shape=[2, 9, 10, 2],
        window_shape=[4, 4],
        padding=[[0, 0], [0, 1], [2, 3], [0, 0]],
        pooling_type="MAX",
        dilation_rate=[1, 1],
        strides=[1, 1],
    )

When updating the TF version in CI.

The error is

E       ValueError: Nonzero explicit padding in the batch or depth dimensions is not supported for '{{node max_pool}} = MaxPool[T=DT_FLOAT, data_format="NCHW", explicit_paddings=[0, 0, 0, 1, 2, 3, 0, 0], ksize=[1, 1, 4, 4], padding="EXPLICIT", strides=[1, 1, 1, 1]](Placeholder)' with input shapes: [2,2,9,10].

Did you expect that test to be run in NHWC?

Hmm, yeah, it looks like it was meant for NHWC. I guess we may need to explicitly state the format now? Or, if it is easier, we could change it to NCHW.

@mbrookhart (Contributor) commented Jul 14, 2021

def _test_pooling(input_shape, **kwargs):
    _test_pooling_iteration(input_shape, **kwargs)
    if is_gpu_available():
        if len(input_shape) == 4:
            # When a GPU is available, the same test is rerun in NCHW by permuting the
            # input shape, but the explicit padding passed via kwargs is left in NHWC order.
            input_shape = [input_shape[ii] for ii in (0, 3, 1, 2)]
            kwargs["data_format"] = "NCHW"
            _test_pooling_iteration(input_shape, **kwargs)

It looks like that utility is doing something a little odd with the GPU path. Let me add the explicit padding to the GPU transformation.

@mbrookhart (Contributor):

c6064b0 fixed it.

@mbrookhart (Contributor):

@mehrdadh (Member):

@mbrookhart trying to reproduce the issue now.

@mbrookhart (Contributor):

Hey everyone, thanks for all the help! This merged this morning. Hopefully @leandron's nightly builds will prevent this level of pain in the future :D
