more notes on ddp
Jacob Sela committed Dec 27, 2024
1 parent d1ffeca commit e08a447
21 changes: 16 additions & 5 deletions fiftyone/utils/torch.py
@@ -1497,15 +1497,26 @@ class FiftyOneTorchDataset(Dataset):
- When using persistent_workers=True, the overhead of 'spawn' and 'forkserver' is low.
- When using Jupyter notebooks, if you want all of your code to live in the notebook
rather than in a separate Python file, you'll have to use 'fork'. Do so at your own risk.
You can easily set the start method for all of your torch code with the following command
(see the sketch at the end of this list):
`torch.multiprocessing.set_start_method('forkserver')`
- Make sure not to touch `self.samples` or subscript this object until after
all workers are initialized. This avoids unnecessary memory usage, and if
you're using DDP, it prevents crashes.
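For example, a minimal sketch of setting the start method before any DataLoader
workers are created (the __main__ guard is required because 'spawn' and
'forkserver' re-import the main module):

    import torch.multiprocessing

    def main():
        # build your dataset, DataLoader, and training loop here
        ...

    if __name__ == "__main__":
        # must run before any DataLoader workers are created
        torch.multiprocessing.set_start_method('forkserver')
        main()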
DDP:
- When training with DDP, make sure to load your FiftyOne dataset sequentially
across the local ranks in your main training script. This avoids the processes
competing over the same port of the backing dataset server. An easy way to do
this is to loop over the local world size with a barrier at the end:
    for i in range(local_world_size):
        # only one local rank loads the dataset at a time
        if get_local_rank(local_process_group) == i:
            dataset = fo.load_dataset(dataset_name)
        torch.distributed.barrier(local_process_group)
- torchrun doesn't natively allow the inter-process communication required for
the `cache_fields` argument to work. When using torchrun, call the function
`local_broadcast_process_authkey` on each local process group to facilitate
this communication (see the sketch below).
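Putting the two DDP notes together, here is a minimal sketch of a torchrun
entrypoint. This assumes a single-node job (so the default process group is
also the local process group); the no-argument call to
`local_broadcast_process_authkey` and the dataset name are assumptions for
illustration, not a verified API:

    import os

    import torch.distributed as dist
    import fiftyone as fo
    import fiftyone.utils.torch as fout

    def load_dataset_for_ddp(dataset_name):
        # torchrun sets these environment variables for each worker
        local_rank = int(os.environ["LOCAL_RANK"])
        local_world_size = int(os.environ["LOCAL_WORLD_SIZE"])

        # needed under torchrun for cache_fields to work (see the note
        # above); the exact signature is an assumption here
        fout.local_broadcast_process_authkey()

        # load the dataset one local rank at a time so the processes
        # don't compete over the backing dataset server's port
        dataset = None
        for i in range(local_world_size):
            if local_rank == i:
                dataset = fo.load_dataset(dataset_name)
            dist.barrier()
        return dataset

    if __name__ == "__main__":
        dist.init_process_group(backend="nccl")
        dataset = load_dataset_for_ddp("my_dataset")  # hypothetical name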
torch.utils.data.DataLoader use:
- DO NOT use torch.Tensor.to in this function; do so in the train loop, or use
the `pin_memory` argument of your torch.utils.data.DataLoader (see the sketch below).
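For illustration, a minimal sketch of the recommended pattern, with pinned host
memory and the device transfer done in the train loop (the dataset and the
tensor-valued batches are assumptions):

    import torch
    from torch.utils.data import DataLoader

    loader = DataLoader(
        dataset,  # e.g. a FiftyOneTorchDataset
        batch_size=16,
        num_workers=4,
        persistent_workers=True,
        pin_memory=True,  # pin host memory instead of calling .to() in the dataset
    )

    device = torch.device("cuda")
    for batch in loader:
        # move tensors to the GPU in the train loop, not in the dataset
        batch = batch.to(device, non_blocking=True)
        # ... forward/backward/optimizer step ...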
