multi-gpu at runtime error #988

Open

ecilay opened this issue Jan 31, 2024 · 4 comments

Comments

ecilay commented Jan 31, 2024

Say I have two AIT-converted models: model0 on cuda0 and model1 on cuda1. Even though I used cudaSetDevice to load each model onto its own device, at run time, after running inference with model0 on cuda0, model1 fails to run. Once I move both models onto the same device, the problem is resolved.

Is this expected? Is there a possible short-term fix? I ran the experiment on an A10G machine with 4 GPUs.
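
For reference, a minimal sketch of the load pattern described above, assuming the standard aitemplate.compiler.Model runtime and using torch.cuda.set_device (which calls cudaSetDevice under the hood) to pick the GPU before each load; the .so paths are hypothetical:

```python
import torch
from aitemplate.compiler import Model

def load_on(device_idx: int, lib_path: str) -> Model:
    # Make the target GPU the current device (cudaSetDevice) before
    # loading the compiled AIT module on it.
    torch.cuda.set_device(device_idx)
    return Model(lib_path)

model0 = load_on(0, "model0/test.so")  # hypothetical compiled-module path
model1 = load_on(1, "model1/test.so")  # hypothetical compiled-module path
```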

chenyang78 (Contributor) commented
Hi @ecilay, thanks for reporting the issue. What's the error message that you got?

ecilay (Author) commented Feb 7, 2024

File "/home/test/runtime/runtime/ait/eps_ait.py", line 485, in __call__ return self.forward(
File "/home/test/runtime/runtime/ait/eps_ait.py", line 791, in forward noise_pred = self.dispatch_resolution_forward(inputs)
File "/home/test/runtime/runtime/ait/eps_ait.py", line 890, in dispatch_resolution_forward cur_engines[f"{h}x{w}"].run_with_tensors(inputs, ys, graph_mode=False)
File "/opt/conda/envs/test/lib/python3.10/site-packages/aitemplate/compiler/model.py", line 587, in run_with_tensors outputs_ait = self.run(
File "/opt/conda/envs/test/lib/python3.10/site-packages/aitemplate/compiler/model.py", line 490, in run return self._run_impl(
File "/opt/conda/envs/test/lib/python3.10/site-packages/aitemplate/compiler/model.py", line 429, in _run_impl self.DLL.AITemplateModelContainerRun(
File "/opt/conda/envs/test/lib/python3.10/site-packages/aitemplate/compiler/model.py", line 196, in _wrapped_func raise RuntimeError(f"Error in function: {method.__name__}")
RuntimeError: Error in function: AITemplateModelContainerRun

chenyang78 (Contributor) commented
Thanks, @ecilay! Hmm, I don't have any clue yet. If it's possible, could you share a small repro that would help us investigate? Thanks!

ecilay (Author) commented Feb 7, 2024

@chenyang78 I think you can repro this by taking any two AIT models (they could even be the same model), loading them on different GPUs, and running inference to see if it works. If it does work for you, I would appreciate you sharing your inference script. Thanks.
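
A minimal repro sketch along those lines, assuming the aitemplate.compiler.Model runtime shown in the traceback above; the .so paths, tensor names, shapes, and dtype are hypothetical placeholders that would need to match the compiled graphs:

```python
import torch
from aitemplate.compiler import Model

def load_on(device_idx: int, lib_path: str) -> Model:
    # Select the GPU (cudaSetDevice) before loading the compiled module.
    torch.cuda.set_device(device_idx)
    return Model(lib_path)

model0 = load_on(0, "model0/test.so")  # hypothetical path
model1 = load_on(1, "model1/test.so")  # hypothetical path

# Placeholder I/O tensors; the names "x"/"y" and the shape/dtype must
# match whatever the modules were compiled with.
x0 = torch.randn(1, 3, 64, 64, dtype=torch.float16, device="cuda:0")
y0 = torch.empty(1, 3, 64, 64, dtype=torch.float16, device="cuda:0")
x1 = torch.randn(1, 3, 64, 64, dtype=torch.float16, device="cuda:1")
y1 = torch.empty(1, 3, 64, 64, dtype=torch.float16, device="cuda:1")

torch.cuda.set_device(0)
model0.run_with_tensors({"x": x0}, {"y": y0})  # runs fine

torch.cuda.set_device(1)
model1.run_with_tensors({"x": x1}, {"y": y1})  # reported to fail here
```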
