Rebuild ci-arm, ci-cpu, and ci-gpu container #8177
Also, #8088 needs to be a part of this.
Expanding to ci-gpu and including #8268.
Found while implementing #8029: there are a few version incompatibilities in …
I won't have time to get to this during this week. If someone else wants to push the images, go for it. #8193 could also potentially join this merge train. If this isn't closed by next Monday, I'll handle it then.
I can probably take this. It looks like there are a number of things we need to merge to main before regenerating the images? It looks like the follow-up to #8169 was merged as #8245. Do we need to merge #8088 and a CI-only version of #8193 before generating the images? It looks like #8268 failed a flaky test in the TF frontend; @areusch, can you rebase? @Lunderberg, do you want to make a PR for the TF update?
@mbrookhart rebased.
@mbrookhart #8306 is now open and includes the TF version update. Regarding …
I've merged everything I think we want, but unfortunately none of the docker images are building. ci_arm fails with this: … Everything else is failing with: …
I wonder if there's something wrong with the LLVM release at the moment?
I'm trying to rebuild the images one at a time instead of in parallel on this machine, to see if that fixes the LLVM issue.
I've had a look and probably have a fix for it: it appears that curl is installed in individual install files rather than in a common place; moving it there is what I'll look to do.
That happens when the LLVM mirrors are updating to get new packages. Nothing much can be done but wait, sadly. @leandron probably knows more.
Ramana
@mbrookhart, see #8310.
Thanks, @u99127. I'll retry building this morning with these changes.
@areusch is out this week, but his sphinx fix had a syntax error. If someone could review this, I'd appreciate it: … With that PR and the other two linked above, I can build the ci_cpu, ci_gpu, and ci_arm images. ci_qemu is failing with a gpg error that arose with one of these PRs; I've asked @mehrdadh to debug. #8156 #8190
@mbrookhart looks like …
@mbrookhart sent a PR to fix this issue: #8319
A question on the rebuilds overall. Beyond just the version dependencies, we've had a few cases where the built images would result in failing unit tests (e.g. #8339). Would it be good to add running the unit tests to the Dockerfiles? Since we need to test the built images anyway before using them, that would make testing each image, with unit tests appropriate for that image, automatic as part of the build process.
I think running TVM unit tests would imply also building TVM as part of the Docker image rebuild, which doesn't make much sense. Based on that, I think we should avoid adding the unit tests to the Dockerfiles themselves. IMHO, it would be better to be able to qualify images in a systematic way as part of the release of new images. We (myself and @areusch, mainly) are currently working on a way to improve the image rebuild testing and make it a daily Jenkins job in the upstream CI. The next logical step is to validate the images using the regular TVM build process, pointing to "staging" docker images.
I think a nightly job would be great; trying to do it manually this week has been a mess :)
Yeah, it won't solve the problems per se, but once we get it into a stable state, we'll be able to take one problem at a time, rather than all of them at once, when we decide to update the images. That's the advantage.
I built the images and pushed them to a PR this morning. I expect the job to fail due to #8339, but we'll see if there's anything else.
Okay, I see failures with tflite, but I also saw this: …
I built the arm image on an amd64 system; do I need to build it on an ARM system instead?
The platform error is due to building ci-arm on the wrong arch (I use an m6g on AWS). I'm rebuilding it for you now.
Looks like I hadn't tested it properly. I think #8371 should fix it.
The staging job failed because Keras didn't seem to make it into the packages? Did TF drop Keras in 2.4? Also, @Lunderberg, there are a couple more TF failures: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/107/pipeline
@mbrookhart Whoops, I was mistaken. I thought that … It looks like the standalone keras is only used if we're importing a keras model that isn't a tensorflow model. Based on this announcement, support for non-tensorflow models is being dropped in keras 2.4, so I think we should pin it to …
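(A side note, not from the thread itself: a minimal sketch, assuming TF 2.x with the standalone keras package also installed, showing that tf.keras and standalone keras are separate distributions that can resolve to different versions, which is why pinning one does not constrain the other.)

```python
# Minimal sketch (assumes TF 2.x and that a standalone keras package is installed).
# tf.keras ships inside the tensorflow wheel; `keras` is a separate PyPI distribution,
# so the two can report different versions and must be pinned independently.
import tensorflow as tf
import keras  # standalone package; only importable if installed separately

print("bundled tf.keras:   ", tf.keras.__version__)
print("standalone keras:   ", keras.__version__)
```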
Is this something we can fix, or do we need to revert the TF upgrade for this round of CI upgrades?
The original discussion was kept in #6754, but my local tests are CPU-only, so I don't have any data on CUDA versions with respect to the referenced issue.
I tried this, and just to report back, it fails with …
On a CPU-only machine, all frontend tests pass with the following combination: … There is no need to pin …
Thank you! It looks like which version of h5py you get might depend on your pip version: …
I think we should keep that constraint in there.
It may also depend on the order in which things are installed.
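(A small diagnostic sketch, not from the thread: printing the versions pip actually resolved makes it easier to compare a freshly built image against a known-good combination. Only the package names already discussed above are assumed.)

```python
# Sketch: report the installed versions of the packages under discussion, so a rebuilt
# image can be checked against a known-good combination.
from importlib.metadata import version, PackageNotFoundError  # Python 3.8+

for pkg in ("tensorflow", "keras", "h5py", "pip"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```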
The images are built and running, but I'm hitting a few more TF issues. Anyone care to take a look? #8193
It looks like the latest issues are running out of GPU memory. Does the new TF increase memory usage, or change when Python garbage collection runs?
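(For anyone debugging this kind of OOM, a hedged sketch assuming TF 2.x; this is not the fix the thread settled on. By default a TF process can reserve most of a GPU's memory up front, which is painful when more than one job shares a card; memory growth makes allocation incremental.)

```python
# Sketch (TF 2.x): ask TF to allocate GPU memory incrementally instead of grabbing
# nearly all device memory at startup. Must run before any GPU op executes.
import tensorflow as tf

for gpu in tf.config.list_physical_devices("GPU"):
    # Raises RuntimeError if the GPUs have already been initialized by an earlier op.
    tf.config.experimental.set_memory_growth(gpu, True)
```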
Anyone know how a 750 Ti got into CI?
Looks like another SAMPL node, @tqchen. Do we want to keep this in for diversity of testing, or remove it? My vote is to remove it, since it is a probabilistic failure scenario rather than a deterministic one. I think our CI is so long that we can't afford to allow code to merge that isn't expected to deterministically pass a following run.
Hmm, now I hit an out-of-memory issue on a T4: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/111/pipeline/393
Still hitting the TensorFlow OOM issue on the T4. Any ideas?
I don't see any excessive memory use when I put the test in a loop locally or run the full TensorFlow test suite locally with this version of TF...
Does Jenkins have more than one executor running on that same machine?
It may have that; I think there are 2 executors per GPU node.
@trevor-m I'm hitting an error in this test you added (tests/python/frontend/tensorflow/test_forward.py, lines 460 to 468 at e1b3ff4) when updating the TF version in CI. The error is: …
Did you expect that test to be run in NHWC?
Hmm, yeah, it looks like it was meant for NHWC. I guess we may need to explicitly state the format now? Or, if it's easier, we could change it to NCHW.
That utility (tests/python/frontend/tensorflow/test_forward.py, lines 315 to 322 at e1b3ff4) looks like it's doing something a little crazy with GPU. Let me add the explicit padding to the GPU transformation.
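(A hypothetical illustration of the point above, not the actual test code: giving the op an explicit NHWC data_format plus explicit per-dimension padding removes the dependence on whatever default layout the device path picks.)

```python
# Hypothetical illustration (not the test in question): pin the layout and padding
# explicitly so the op no longer depends on the device's default data format.
import numpy as np
import tensorflow as tf

x = tf.constant(np.random.uniform(size=(1, 8, 8, 3)).astype("float32"))  # NHWC input
w = tf.constant(np.random.uniform(size=(3, 3, 3, 4)).astype("float32"))  # HWIO filter

# Explicit per-dimension padding for NHWC: [[batch], [height], [width], [channels]]
pad = [[0, 0], [1, 1], [1, 1], [0, 0]]
y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding=pad, data_format="NHWC")
print(y.shape)  # (1, 8, 8, 4)
```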
c6064b0 fixed it.
@areusch @mehrdadh @Lunderberg I'm now hitting issues with the QEMU tflite test: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/115/pipeline
@mbrookhart trying to reproduce the issue now.
Hey everyone, thanks for all the help! This merged this morning. Hopefully @leandron's nightly builds will prevent this level of pain in the future :D
This is a tracking issue for the process of updating the TVM ci-* containers to reflect the following PRs:
#8169
Steps: