Stops at "Started a local Ray instance." when using the Nvidia Nsight Compute tool to capture a kernel in the multi-GPU case. #116

Open
david-beckham-315 opened this issue Dec 16, 2024 · 1 comment
Labels: question (Further information is requested)

Comments

@david-beckham-315

david-beckham-315 commented Dec 16, 2024

A100 8x GPU machine, using the Nsight Compute tool to capture a kernel during Mochi inference.

Test environment:
CUDA 12.4
Python 3.10.12
torch 2.5.1
ncu 2024.1.1.0 / 2024.3.0.0 (both show the same issue)
ray 2.40.0

Steps:

  1. export CUDA_VISIBLE_DEVICES=0,1
  2. ncu --set full -o A100_mochi_2gpu_240P_step2_fps24_frame163_flash_fwd_kernel.ncu-rep -f -s 1 -c 5 --kernel-name "flash_fwd_kernel" --target-processes all python3 ./demos/cli.py --model_dir /data/NousResearch/mochi-1-preview --num_steps 2 --width 422 --height 240

The result is shown below:
==PROF== Connected to process 640812 (/usr/bin/nvidia-smi)
==PROF== Disconnected from process 640812
Launching with 2 GPUs. If you want to force single GPU mode use CUDA_VISIBLE_DEVICES=0.
==PROF== Connected to process 640742 (/usr/bin/python3.10)
Attention mode: flash
==PROF== Connected to process 641010 (/usr/bin/nvidia-smi)
==PROF== Disconnected from process 641010
==PROF== Connected to process 641005 (/usr/bin/nvidia-smi)
==PROF== Disconnected from process 641005
==PROF== Connected to process 641131 (/usr/bin/nvidia-smi)
==PROF== Disconnected from process 641131
2024-12-13 10:12:58,746 INFO worker.py:1819 -- Started a local Ray instance.

From pipeline.py, it stops at ray.init():
class MochiMultiGPUPipeline:
    def __init__(
        self,
        *,
        text_encoder_factory: ModelFactory,
        dit_factory: ModelFactory,
        decoder_factory: ModelFactory,
        world_size: int,
    ):
        ray.init()
        RemoteClass = ray.remote(MultiGPUContext)
        self.ctxs = [
            RemoteClass.options(num_gpus=1).remote(
                text_encoder_factory=text_encoder_factory,
                dit_factory=dit_factory,
                decoder_factory=decoder_factory,
                world_size=world_size,
                device_id=0,
                local_rank=i,
            )
            for i in range(world_size)
        ]
        for ctx in self.ctxs:
            ray.get(ctx.ray_ready.remote())
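
For what it's worth, a minimal standalone repro along these lines (hypothetical code, not from the Mochi repo) could show whether ncu can follow Ray GPU worker processes at all, independent of Mochi:

# Hypothetical minimal repro: run a trivial CUDA kernel inside Ray actors.
# Launching it under ncu with --target-processes all should show whether the
# hang at ray.init() reproduces with Ray alone.
import ray
import torch

ray.init()

@ray.remote(num_gpus=1)
class Worker:
    def run(self):
        # Launch a simple CUDA kernel inside the Ray worker process.
        x = torch.randn(1024, 1024, device="cuda")
        return (x @ x).sum().item()

workers = [Worker.remote() for _ in range(2)]
print(ray.get([w.run.remote() for w in workers]))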

Do you know whether Mochi supports Nsight Compute in the multi-GPU case?

Note: Mochi supports Nsight Compute in the single-GPU case.

@ved-genmo
Contributor

Interesting. Is the NousResearch team working on Mochi? :)

I haven't tested Mochi with Nsight so it's not officially supported. The compatibility issue likely stems from Ray. If you need Nsight integration, you might want to try the diffusers version of Mochi instead - it doesn't use Ray and might work better.
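
For reference, a minimal sketch of the diffusers route (model id and arguments assumed from the diffusers MochiPipeline docs; untested with Nsight Compute):

# Minimal sketch, assuming the diffusers MochiPipeline API; single process, no Ray.
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # CPU offload instead of multi-GPU sharding
pipe.enable_vae_tiling()

frames = pipe("a short test prompt", num_inference_steps=2).frames[0]
export_to_video(frames, "mochi_test.mp4", fps=24)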

@ajayjain added the question label Dec 20, 2024