
[CI Image] Update PyTorch to v1.10 in GPU image #9713

Closed · 7 tasks done
masahi opened this issue Dec 11, 2021 · 19 comments · Fixed by #9866

Comments

masahi (Member) commented Dec 11, 2021:

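# Promotion steps: pull an image from the tlcpackstaging (staging) Docker Hub org, retag it,
# and push it to the tlcpack org. IMAGE_NAME, TAG, and VERSION are placeholders.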
docker pull tlcpackstaging/IMAGE_NAME:TAG
docker tag tlcpackstaging/IMAGE_NAME:TAG tlcpack/IMAGE_NAME:VERSION
docker push tlcpack/IMAGE_NAME:VERSION
manupak (Contributor) commented Dec 13, 2021:

Hi @masahi,

FYI, I'd be careful with S3.

I was trying to use the image from S2 in S3 as mentioned here: #9659. Unfortunately it timed out, leaving the node with a leftover image. Only then did I realize that our jobs don't start with clean CI nodes in terms of images, and now the node does not have enough memory.

The ci_cpu upgrade is currently blocked on this, and we don't have access to any of the nodes. cc @leandron
#9705.

masahi (Member, Author) commented Dec 13, 2021:

@manupa-arm Thanks, yes, I'm aware of the recent CI outage. I'll probably work on this next week.

leandron (Contributor) commented:

I can deal with this request in #9762.

leandron (Contributor) commented:

#9762

masahi (Member, Author) commented Dec 17, 2021:

@leandron This is for updating the GPU image, so I think it should be independent.

masahi reopened this Dec 17, 2021
masahi (Member, Author) commented Dec 20, 2021:

@manupa-arm @leandron @areusch I'm not familiar with the new CI image update protocol. How does one push an image to tlcpackstaging? I've tried

docker push tlcpackstaging/ci_gpu:20211219-100400-c0d326dbd

But got

denied: requested access to the resource is denied

Previously I would just create an image like ci_gpu:v0.76 and push it to the tlcpack Docker Hub org.

manupak (Contributor) commented Dec 21, 2021:

@masahi you shouldn't need to push images there. The ci-docker-build nightly job automatically builds all the images from a nightly checkout of main and pushes them there. The last component of the tag is the commit hash used when building the image.
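(As a concrete illustration of that naming scheme, using the staging tag masahi tried to push above; reading the first two components as a build date and time is an assumption, while the trailing commit hash is stated in the comment.)

# Staging tags look like <build date>-<build time>-<short commit hash>, e.g.:
docker pull tlcpackstaging/ci_gpu:20211219-100400-c0d326dbd

# The trailing component is the apache/tvm commit the image was built from,
# so it can be looked up in a local checkout:
git -C tvm show --no-patch c0d326dbd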

masahi (Member, Author) commented Dec 21, 2021:

Interesting! I guess I don't even need to build a new image locally anymore... So should I just send a PR updating docker/install/ubuntu_install_onnx.sh so that a new nightly image gets built with the new PyTorch version (see the sketch below)?

I have more CI questions I want to ask. I recently joined Discord to learn about CI issues. Can we continue there?
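(For illustration, the sketch referenced above: a minimal example of the kind of change meant here, assuming docker/install/ubuntu_install_onnx.sh pins torch and torchvision via pip3. The exact package set, versions, and wheel index in the real script may differ.)

# docker/install/ubuntu_install_onnx.sh (illustrative excerpt, not the real file)
# Bump the pinned PyTorch version so the nightly ci_gpu rebuild picks it up.
pip3 install \
    torch==1.10.1 \
    torchvision==0.11.2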

manupak (Contributor) commented Dec 21, 2021:

Yes, that should do it.

S3 should catch any issues if the updated image breaks the current tests anyway.

As a next step, we want to explore the possibility of making every PR rebuild the images, using tlcpackstaging images as a cache. That way we don't sacrifice correctness in the verification, but PRs that involve Docker changes will be a bit slower due to cache invalidation; however, once we push another tlcpackstaging image, it should go back to the normal behavior of using the cache (see the sketch below). Something to discuss in the next meetup: @leandron @areusch
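(A rough sketch of that caching idea, assuming a plain docker build with --cache-from and the Dockerfile layout used in the TVM repo; the real CI scripts may wire this up differently.)

# Pull the most recent staging image so its layers can seed the build cache.
docker pull tlcpackstaging/ci_gpu:20211219-100400-c0d326dbd

# Rebuild the image for the PR, reusing any layers the PR did not change.
# A PR that touches the Docker install scripts invalidates the cache from that
# layer onwards, which is why such PRs would be slower.
docker build \
    --cache-from tlcpackstaging/ci_gpu:20211219-100400-c0d326dbd \
    -t ci_gpu:pr-test \
    -f docker/Dockerfile.ci_gpu docker/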

masahi (Member, Author) commented Dec 27, 2021:

@jiangjiajun I'm testing a new image with PT 1.10 and getting an error from paddle: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/182/pipeline/

Any idea what's going on? I can run from_paddle.py locally without issues. The free(): invalid pointer error looks similar to a known issue with recent PyTorch + LLVM discussed in #9362.

jiangjiajun (Contributor) commented:

It looks strange; I'll check this today.

jiangjiajun (Contributor) commented:

@masahi I'm not sure why this error occurred.

I found there was a core-dump problem while running AutoTVM with the Paddle frontend, and the cause is PaddlePaddle's system signal capturing.

To solve the problem, we released a new version, 2.1.3, which adds the function paddle.disable_signal_handler() to disable signal capturing.

This function is called in the paddle frontend and the tvmc frontend; we have tested it in TVM and it works with no problem:
https://github.com/apache/tvm/blob/main/python/tvm/relay/frontend/paddlepaddle.py#L2270
https://github.com/apache/tvm/blob/main/python/tvm/driver/tvmc/frontends.py#L279

I checked all the code in TVM that imports paddle and found that the test code didn't call paddle.disable_signal_handler(), so I sent a new PR, #9809. Could you test this modification with PyTorch v1.10?
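(A quick sanity check that can be run in a candidate CI image; the 2.1.3 requirement is taken from the comment above, and the one-liner simply exercises the function mentioned there.)

# Verify the installed PaddlePaddle is new enough to provide
# paddle.disable_signal_handler() (added in 2.1.3 per the comment above).
python3 -c "import paddle; print(paddle.__version__); paddle.disable_signal_handler()"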

masahi (Member, Author) commented Dec 29, 2021:

@jiangjiajun Unfortunately it didn't help: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/184/pipeline

Were you able to reproduce the issue locally?

jiangjiajun (Contributor) commented:

Do you know how to reproduce the problem, or which script I should run?

masahi (Member, Author) commented Dec 29, 2021:

It's not clear which script caused the failure. The error message seems to indicate that the issue happens during make html (the tutorial scripts), and the only clue is:

Extension error (sphinx_gallery.docs_resolv):
Handler <function embed_code_links at 0x7f60e1b15f28> for event 'build-finished' threw an exception (exception: list indices must be integers or slices, not str)

free(): invalid pointer

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::framework::SignalHandle(char const*, int)
1   paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

I thought the error was coming from from_paddle.py, but I couldn't reproduce it in a non-docker environment. I'll try running this script under our gpu container.
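(One way to attempt that reproduction, assuming a TVM checkout with the docker/bash.sh helper and the tutorial at the path shown; both the helper's behaviour and the path are assumptions rather than something confirmed in this thread.)

# From a TVM checkout, run the Paddle tutorial inside the staging GPU image;
# docker/bash.sh is TVM's helper for running a command inside a CI container.
./docker/bash.sh tlcpackstaging/ci_gpu:20211219-100400-c0d326dbd \
    python3 gallery/how_to/compile_models/from_paddle.py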

jiangjiajun (Contributor) commented:

Okay, I will test in my environment.

masahi (Member, Author) commented Dec 31, 2021:

@jiangjiajun I was able to reproduce the error under the new gpu container.

Does paddle use PyTorch internally, or link against libtorch?

masahi (Member, Author) commented Dec 31, 2021:

@jiangjiajun OK, it turned out the error has nothing to do with paddle: after applying the mitigation for the PyTorch + LLVM symbol conflict issue #9362 (comment), there is no longer a free(): invalid pointer error or a backtrace from paddle. https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/185/pipeline/267
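(For context, the linked issue is about LLVM symbols that exist both in the PyTorch wheel and in a libtvm built against LLVM. One commonly used mitigation for that class of clash, which may or may not be the exact change applied here, is to build TVM with its private symbols hidden:)

# In a TVM checkout, with build/config.cmake copied from cmake/config.cmake:
# hide private symbols so TVM's LLVM does not collide with PyTorch's bundled LLVM.
echo 'set(HIDE_PRIVATE_SYMBOLS ON)' >> build/config.cmake
cmake -S . -B build
cmake --build build --parallel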

jiangjiajun (Contributor) commented:

Paddle does not depend on PyTorch or LibTorch.

Have you solved the problem now? There is still an error at this link: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/185/pipeline/267/
