Problems caused by launching multiple pods at the same time #25

pidb · 2022-02-28T02:40:24Z

Why do I get an error when I start multiple GPU-resource pods simultaneously using vcuda?

In vcuda's loader.c, I added ferror to print the errno-related error message, and the message is printed.

But when I start the pods sequentially, I don't have this problem. So I suspect it is caused by a timing gap between the kubelet starting the container and gpu-manager placing the libcuda.so file.

pidb · 2022-02-28T07:09:28Z

cc @mYmNeo

rainfd · 2022-03-21T12:17:04Z

It seems that vcuda copies the library into the container after the container starts. When multiple GPU-resource pods start simultaneously, this copy is not fast enough. You can try changing your command to sh -c "sleep 5 && <your command>".
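A fixed sleep only guesses how long the copy takes. A variant of the same workaround is to poll until the library file shows up and then start the real command. The following is a minimal sketch, not something taken from gpu-manager or vcuda: `wait_for_lib` is a hypothetical helper name, and the path you pass to it is an assumption you have to fill in yourself, since the thread does not say where libcuda.so lands inside the container.

```shell
#!/bin/sh
# wait_for_lib FILE [TIMEOUT_SECONDS]
# Hypothetical helper: polls once per second until FILE exists.
# Returns 0 as soon as FILE is present, 1 if TIMEOUT_SECONDS (default 30) elapse first.
# FILE is supplied by the caller; where libcuda.so is placed is deployment-specific
# and is not stated in this thread.
wait_for_lib() {
    lib_path=$1
    timeout=${2:-30}
    elapsed=0
    while [ ! -e "$lib_path" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out after ${timeout}s waiting for $lib_path" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

With this saved in the image as, say, /wait.sh, the container command could become sh -c '. /wait.sh && wait_for_lib "$LIBCUDA_PATH" 30 && exec <your command>', where LIBCUDA_PATH is a variable you define. Note that the check only confirms the file exists, not that the copy has finished, so it is still a workaround and not a fix for the underlying race.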

pidb · 2022-03-23T01:35:33Z

Oh, thanks rainfd. I knew about this workaround, but it feels like a hack to me.

mYmNeo · 2022-03-23T12:37:55Z

Which version of gpu-manager are you using? I've fixed a problem on the master branch but have not released an image yet.

rainfd · 2022-03-24T02:32:27Z

@mYmNeo my version is v1.0.4. What is the commit?

mYmNeo · 2022-03-25T02:31:18Z

tkestack/gpu-manager#130

hzliangbin mentioned this issue Oct 19, 2022

/tmp/cuda-control/src/loader.c:865 can't find library libcuda.so by use image thomassong/gpu-manager:master tkestack/gpu-manager#150

Open
