multi-gpu at runtime error #988

Open

ecilay opened this issue Jan 31, 2024 · 4 comments

Comments

ecilay commented Jan 31, 2024

Say I have two AIT-converted models: model0 on cuda0 and model1 on cuda1. Even though I used cudaSetDevice to load each model onto its own device, at run time, after running inference with model0 on cuda0, model1 fails to run. Once I move both models onto the same device, the problem is resolved.

Is this expected? Is there a possible short-term fix? I ran the experiment on an A10G machine with 4 GPUs.
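
For reference, a minimal sketch of the load pattern described above, assuming the standard aitemplate.compiler.Model runtime and using torch.cuda.set_device (which calls cudaSetDevice under the hood) to pick the GPU before each load; the .so paths are hypothetical:

```python
import torch
from aitemplate.compiler import Model

def load_on(device_idx: int, lib_path: str) -> Model:
    # Make the target GPU the current device (cudaSetDevice) before
    # loading the compiled AIT module on it.
    torch.cuda.set_device(device_idx)
    return Model(lib_path)

model0 = load_on(0, "model0/test.so")  # hypothetical compiled-module path
model1 = load_on(1, "model1/test.so")  # hypothetical compiled-module path
```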

chenyang78 (Contributor) commented
Hi @ecilay, thanks for reporting the issue. What's the error message that you got?

ecilay (Author) commented Feb 7, 2024

File "/home/test/runtime/runtime/ait/eps_ait.py", line 485, in __call__ return self.forward(
File "/home/test/runtime/runtime/ait/eps_ait.py", line 791, in forward noise_pred = self.dispatch_resolution_forward(inputs)
File "/home/test/runtime/runtime/ait/eps_ait.py", line 890, in dispatch_resolution_forward cur_engines[f"{h}x{w}"].run_with_tensors(inputs, ys, graph_mode=False)
File "/opt/conda/envs/test/lib/python3.10/site-packages/aitemplate/compiler/model.py", line 587, in run_with_tensors outputs_ait = self.run(
File "/opt/conda/envs/test/lib/python3.10/site-packages/aitemplate/compiler/model.py", line 490, in run return self._run_impl(
File "/opt/conda/envs/test/lib/python3.10/site-packages/aitemplate/compiler/model.py", line 429, in _run_impl self.DLL.AITemplateModelContainerRun(
File "/opt/conda/envs/test/lib/python3.10/site-packages/aitemplate/compiler/model.py", line 196, in _wrapped_func raise RuntimeError(f"Error in function: {method.__name__}")
RuntimeError: Error in function: AITemplateModelContainerRun

chenyang78 (Contributor) commented
Thanks, @ecilay! Hmm, I don't have any clue yet. If it's possible, could you share a small repro that would help us investigate? Thanks!

ecilay (Author) commented Feb 7, 2024

@chenyang78 I think you can repro this by taking any two AIT models (they could even be the same model), loading them on different GPUs, and running inference to see if it works. If it does work for you, I would appreciate you sharing your inference script. Thanks.
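
A minimal repro sketch along those lines, assuming the aitemplate.compiler.Model runtime shown in the traceback above; the .so paths, tensor names, shapes, and dtype are hypothetical placeholders that would need to match the compiled graphs:

```python
import torch
from aitemplate.compiler import Model

def load_on(device_idx: int, lib_path: str) -> Model:
    # Select the GPU (cudaSetDevice) before loading the compiled module.
    torch.cuda.set_device(device_idx)
    return Model(lib_path)

model0 = load_on(0, "model0/test.so")  # hypothetical path
model1 = load_on(1, "model1/test.so")  # hypothetical path

# Placeholder I/O tensors; the names "x"/"y" and the shape/dtype must
# match whatever the modules were compiled with.
x0 = torch.randn(1, 3, 64, 64, dtype=torch.float16, device="cuda:0")
y0 = torch.empty(1, 3, 64, 64, dtype=torch.float16, device="cuda:0")
x1 = torch.randn(1, 3, 64, 64, dtype=torch.float16, device="cuda:1")
y1 = torch.empty(1, 3, 64, 64, dtype=torch.float16, device="cuda:1")

torch.cuda.set_device(0)
model0.run_with_tensors({"x": x0}, {"y": y0})  # runs fine

torch.cuda.set_device(1)
model1.run_with_tensors({"x": x1}, {"y": y1})  # reported to fail here
```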
