
[CI Image] Update PyTorch to v1.10 in GPU image #9713

Closed · 7 tasks done
masahi opened this issue Dec 11, 2021 · 19 comments · Fixed by #9866

Comments

masahi (Member) commented Dec 11, 2021:

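# Promotion steps: pull an image from the tlcpackstaging (staging) Docker Hub org, retag it,
# and push it to the tlcpack org. IMAGE_NAME, TAG, and VERSION are placeholders.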
docker pull tlcpackstaging/IMAGE_NAME:TAG
docker tag tlcpackstaging/IMAGE_NAME:TAG tlcpack/IMAGE_NAME:VERSION
docker push tlcpack/IMAGE_NAME:VERSION
manupak (Contributor) commented Dec 13, 2021:

Hi @masahi,

FYI, I'd be careful with S3.

I was trying to use the image from S2 in S3 as mentioned here: #9659. Unfortunately it timed out, leaving the node with a leftover image. Only then did I realize that our jobs don't start with clean CI nodes in terms of images, and now the node does not have enough memory.

The ci_cpu upgrade is currently blocked on this, and we don't have access to any of the nodes. cc @leandron
#9705.

masahi (Member, Author) commented Dec 13, 2021:

@manupa-arm Thanks, yes, I'm aware of the recent CI outage. I'll probably work on this next week.

leandron (Contributor) commented:

I can deal with this request in #9762.

leandron (Contributor) commented:

#9762

masahi (Member, Author) commented Dec 17, 2021:

@leandron This is for updating the GPU image, so I think it should be independent.

masahi reopened this Dec 17, 2021
masahi (Member, Author) commented Dec 20, 2021:

@manupa-arm @leandron @areusch I'm not familiar with the new CI image update protocol. How does one push an image to tlcpackstaging? I've tried

docker push tlcpackstaging/ci_gpu:20211219-100400-c0d326dbd

But got

denied: requested access to the resource is denied

Previously I would just create an image like ci_gpu:v0.76 and push it to the tlcpack Docker Hub org.

manupak (Contributor) commented Dec 21, 2021:

@masahi you shouldn't need to push images there. The ci-docker-build nightly job automatically builds all the images from a nightly checkout of main and pushes them there. The last component of the tag is the commit hash used when building the image.
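(As a concrete illustration of that naming scheme, using the staging tag masahi tried to push above; reading the first two components as a build date and time is an assumption, while the trailing commit hash is stated in the comment.)

# Staging tags look like <build date>-<build time>-<short commit hash>, e.g.:
docker pull tlcpackstaging/ci_gpu:20211219-100400-c0d326dbd

# The trailing component is the apache/tvm commit the image was built from,
# so it can be looked up in a local checkout:
git -C tvm show --no-patch c0d326dbd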

masahi (Member, Author) commented Dec 21, 2021:

Interesting! I guess I don't even need to build a new image locally anymore... So should I just send a PR updating docker/install/ubuntu_install_onnx.sh so that a new nightly image gets built with the new PyTorch version (see the sketch below)?

I have more CI questions I want to ask. I recently joined Discord to learn about CI issues. Can we continue there?
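(For illustration, the sketch referenced above: a minimal example of the kind of change meant here, assuming docker/install/ubuntu_install_onnx.sh pins torch and torchvision via pip3. The exact package set, versions, and wheel index in the real script may differ.)

# docker/install/ubuntu_install_onnx.sh (illustrative excerpt, not the real file)
# Bump the pinned PyTorch version so the nightly ci_gpu rebuild picks it up.
pip3 install \
    torch==1.10.1 \
    torchvision==0.11.2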

manupak (Contributor) commented Dec 21, 2021:

Yes, that should do it.

S3 should catch any issues if the updated image breaks the current tests anyway.

As a next step, we want to explore the possibility of making every PR rebuild the images, using tlcpackstaging images as a cache. That way we don't sacrifice correctness in the verification, but PRs that involve Docker changes will be a bit slower due to cache invalidation; however, once we push another tlcpackstaging image, it should go back to the normal behavior of using the cache (see the sketch below). Something to discuss in the next meetup: @leandron @areusch
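(A rough sketch of that caching idea, assuming a plain docker build with --cache-from and the Dockerfile layout used in the TVM repo; the real CI scripts may wire this up differently.)

# Pull the most recent staging image so its layers can seed the build cache.
docker pull tlcpackstaging/ci_gpu:20211219-100400-c0d326dbd

# Rebuild the image for the PR, reusing any layers the PR did not change.
# A PR that touches the Docker install scripts invalidates the cache from that
# layer onwards, which is why such PRs would be slower.
docker build \
    --cache-from tlcpackstaging/ci_gpu:20211219-100400-c0d326dbd \
    -t ci_gpu:pr-test \
    -f docker/Dockerfile.ci_gpu docker/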

masahi (Member, Author) commented Dec 27, 2021:

@jiangjiajun I'm testing a new image with PT 1.10 and getting an error from paddle: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/182/pipeline/

Any idea what's going on? I can run from_paddle.py locally without issues. The free(): invalid pointer error looks similar to a known issue with recent PyTorch + LLVM discussed in #9362.

jiangjiajun (Contributor) commented:

It looks strange; I'll check this today.

jiangjiajun (Contributor) commented:

@masahi I'm not sure why this error occurred.

I found there was a core-dump problem while running AutoTVM with the Paddle frontend, and the cause is PaddlePaddle's system signal capturing.

To solve the problem, we released a new version, 2.1.3, which adds the function paddle.disable_signal_handler() to disable signal capturing.

This function is called in the paddle frontend and the tvmc frontend; we have tested it in TVM and it works with no problem:
https://github.com/apache/tvm/blob/main/python/tvm/relay/frontend/paddlepaddle.py#L2270
https://github.com/apache/tvm/blob/main/python/tvm/driver/tvmc/frontends.py#L279

I checked all the code in TVM that imports paddle and found that the test code didn't call paddle.disable_signal_handler(), so I sent a new PR, #9809. Could you test this modification with PyTorch v1.10?
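(A quick sanity check that can be run in a candidate CI image; the 2.1.3 requirement is taken from the comment above, and the one-liner simply exercises the function mentioned there.)

# Verify the installed PaddlePaddle is new enough to provide
# paddle.disable_signal_handler() (added in 2.1.3 per the comment above).
python3 -c "import paddle; print(paddle.__version__); paddle.disable_signal_handler()"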

masahi (Member, Author) commented Dec 29, 2021:

@jiangjiajun Unfortunately it didn't help: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/184/pipeline

Were you able to reproduce the issue locally?

jiangjiajun (Contributor) commented:

Do you know how to reproduce the problem, or which script I should run?

masahi (Member, Author) commented Dec 29, 2021:

It's not clear which script caused the failure. The error message seems to indicate that the issue happens during make html (the tutorial scripts), and the only clue is:

Extension error (sphinx_gallery.docs_resolv):
Handler <function embed_code_links at 0x7f60e1b15f28> for event 'build-finished' threw an exception (exception: list indices must be integers or slices, not str)

free(): invalid pointer

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::framework::SignalHandle(char const*, int)
1   paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

I thought the error was coming from from_paddle.py, but I couldn't reproduce it in a non-docker environment. I'll try running this script under our gpu container.
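(One way to attempt that reproduction, assuming a TVM checkout with the docker/bash.sh helper and the tutorial at the path shown; both the helper's behaviour and the path are assumptions rather than something confirmed in this thread.)

# From a TVM checkout, run the Paddle tutorial inside the staging GPU image;
# docker/bash.sh is TVM's helper for running a command inside a CI container.
./docker/bash.sh tlcpackstaging/ci_gpu:20211219-100400-c0d326dbd \
    python3 gallery/how_to/compile_models/from_paddle.py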

jiangjiajun (Contributor) commented:

Okay, I will test in my environment.

masahi (Member, Author) commented Dec 31, 2021:

@jiangjiajun I was able to reproduce the error under the new gpu container.

Does paddle use PyTorch internally, or link against libtorch?

masahi (Member, Author) commented Dec 31, 2021:

@jiangjiajun OK, it turned out the error has nothing to do with paddle: after applying the mitigation for the PyTorch + LLVM symbol conflict issue #9362 (comment), there is no longer a free(): invalid pointer error or a backtrace from paddle. https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/185/pipeline/267
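(For context, the linked issue is about LLVM symbols that exist both in the PyTorch wheel and in a libtvm built against LLVM. One commonly used mitigation for that class of clash, which may or may not be the exact change applied here, is to build TVM with its private symbols hidden:)

# In a TVM checkout, with build/config.cmake copied from cmake/config.cmake:
# hide private symbols so TVM's LLVM does not collide with PyTorch's bundled LLVM.
echo 'set(HIDE_PRIVATE_SYMBOLS ON)' >> build/config.cmake
cmake -S . -B build
cmake --build build --parallel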

jiangjiajun (Contributor) commented:

Paddle does not depend on PyTorch or LibTorch.

Have you solved the problem now? There is still an error at this link: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/185/pipeline/267/
