How can I debug your code inference_magicdrive.py with pycharm? #15

Open
zhujiagang opened this issue Dec 19, 2024 · 2 comments

Comments

@zhujiagang

zhujiagang commented Dec 19, 2024

Thanks for sharing your excellent work.
I usually debug code with PyCharm. After copying your inference_magicdrive.py into MagicDriveDiT/, I want to run the code on a single GPU, but I encounter the following error:

codes/MagicDriveDiT_code/MagicDriveDiT/magicdrivedit/acceleration/parallel_states.py", line 13, in get_data_parallel_group
    raise RuntimeError("data_parallel_group is None")
RuntimeError: data_parallel_group is None

It seems the code never enters this branch:
if is_distributed():
    dist.init_process_group(backend="nccl", timeout=timedelta(hours=1))
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    coordinator = DistCoordinator()
    cfg.sp_size = dist.get_world_size()
    if cfg.sp_size > 1:
        DP_AXIS, SP_AXIS = 0, 1
        dp_size = dist.get_world_size() // cfg.sp_size
        pg_mesh = ProcessGroupMesh(dp_size, cfg.sp_size)
        dp_group = pg_mesh.get_group_along_axis(DP_AXIS)
        sp_group = pg_mesh.get_group_along_axis(SP_AXIS)
        set_sequence_parallel_group(sp_group)
        print(f"Using sp_size={cfg.sp_size}")
    else:
        # TODO: sequence_parallel_group unset!
        dp_group = dist.group.WORLD
    set_data_parallel_group(dp_group)
    enable_sequence_parallelism = cfg.sp_size > 1
else:
    # dist.init_process_group(backend="nccl", timeout=timedelta(hours=1))
    # torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    # coordinator = DistCoordinator()
    cfg.sp_size = 1
    coordinator = FakeCoordinator()
    enable_sequence_parallelism = False
set_random_seed(seed=cfg.get("seed", 1024))

Looking forward to your reply. Thanks a lot.

@flymin
Owner

flymin commented Dec 20, 2024

You should launch the program with torchrun, which sets the env params used by is_distributed. I don't know whether it also works with manually set env params, but I think it should be OK.
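
A minimal sketch of setting the env params that torchrun would normally export, assuming is_distributed keys off the standard torch.distributed variables (RANK, WORLD_SIZE, etc.); these can be set in a PyCharm run configuration or at the very top of the script:

# Sketch: mimic torchrun's environment for a single-process debug run.
# Assumes is_distributed() reads the standard torch.distributed variables.
import os

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "12355")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("LOCAL_RANK", "0")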

Another workaround is to delete the check that raises the error. Sorry, this may not work, since our dataloader relies on it:

codes/MagicDriveDiT_code/MagicDriveDiT/magicdrivedit/acceleration/parallel_states.py", line 13, in get_data_parallel_group
    raise RuntimeError("data_parallel_group is None")
RuntimeError: data_parallel_group is None

This is a fail-safe design for different parallel groups. However, if you only have one process, it should always be safe to use the default process group.
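
As an assumed illustration of "use the default process group" for a single-process debug run (not the repository's actual code): initialize a one-process group and register it as the data-parallel group before the error is hit.

# Assumed single-process workaround, not the actual MagicDriveDiT code:
# initialize a 1-process group and register it as the data-parallel group,
# so get_data_parallel_group() no longer raises.
import torch.distributed as dist
from magicdrivedit.acceleration.parallel_states import set_data_parallel_group

if not dist.is_initialized():
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://localhost:12355",
        rank=0,
        world_size=1,
    )
set_data_parallel_group(dist.group.WORLD)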

flymin added a commit that referenced this issue Dec 20, 2024
hard-coded launch from localhost:12355 when not provided

Ref #15
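
The commit title suggests a fallback along these lines; an illustrative guess, not the actual diff:

# Illustrative guess at the commit's idea (not the actual patch):
# fall back to a hard-coded rendezvous address when the launcher
# (e.g. torchrun) has not provided one.
import os

if "MASTER_ADDR" not in os.environ:
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"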
@flymin
Owner

flymin commented Dec 20, 2024

Please try the above PR, which should support launching with a plain python command.
