Problems caused by launching multiple pods at the same time #25

Open
pidb opened this issue Feb 28, 2022 · 6 comments

pidb commented Feb 28, 2022

Why do I get an error when I start multiple GPU-resource pods simultaneously (concurrently) using vcuda?

In vcuda's loader.c, I added ferror to print the errno-related error message, and this is what I got:

[image: error output]

But when I start the pods sequentially, I don't have this problem. So I guess it may be caused by a gap between the kubelet starting the container and gpu-manager placing the libcuda.so file.

Author

pidb commented Feb 28, 2022

cc @mYmNeo

rainfd commented Mar 21, 2022

It seems that vcuda copies the library into the container after the container starts up. When multiple GPU-resource pods start simultaneously, this copy is not fast enough. You can try changing your command to: sh -c "sleep 5 && your command"
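A variant of this workaround that depends less on timing is to poll for the library instead of sleeping for a fixed 5 seconds. The following is only a minimal sketch: wait_for_file is a hypothetical helper, and /path/to/libcuda.so is a placeholder, since the thread does not say where the library ends up inside the container.

```shell
#!/bin/sh
# wait_for_file PATH [TIMEOUT_SECONDS]
# Hypothetical helper: polls once per second until PATH exists or the
# timeout (default 30s) elapses. Returns 0 if the file exists, 1 on timeout.
wait_for_file() {
    path=$1
    timeout=${2:-30}
    elapsed=0
    while [ ! -e "$path" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out after ${timeout}s waiting for $path" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Assumed usage as the container command (placeholder path and command):
#   wait_for_file /path/to/libcuda.so 30 && exec your-command
```

Note that this only checks that the file exists, not that the copy has finished, so it narrows the race rather than removing it.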

Author

pidb commented Mar 23, 2022

> It seems that vcuda copies the library into the container after the container starts up. When multiple GPU-resource pods start simultaneously, this copy is not fast enough. You can try changing your command to: sh -c "sleep 5 && your command"

Oh, thanks rainfd. I knew about this solution, but it feels like a hack to me.

Contributor

mYmNeo commented Mar 23, 2022

Which version of gpu-manager are you using? I've fixed a problem on the master branch but haven't released an image yet.

rainfd commented Mar 24, 2022

@mYmNeo my version is v1.0.4. What is the commit?

Contributor

mYmNeo commented Mar 25, 2022

> @mYmNeo my version is v1.0.4. What is the commit?

tkestack/gpu-manager#130

3 participants