
Memory leak using download_file with DDP or FSDP #758

Closed
nagadit opened this issue Aug 16, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@nagadit

nagadit commented Aug 16, 2024

Environment

  • OS: Ubuntu 20.04
  • Hardware (GPU, or instance type): H100 × 16

To reproduce

Steps to reproduce the behavior:

  1. Use this dataset class (a rough sketch of this kind of class is shown below, after the screenshot)
  2. Set the remote path to videos or images
  3. Start training with FSDP or DDP on N_GPUS > 8, with batch_size > 128 and multiple dataloader workers (n_workers)

[screenshot of memory usage attached]
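The dataset class linked in step 1 is not reproduced in this thread. Below is a minimal sketch of the general shape of such a StreamingDataset subclass; the class name, the 'frame' field, and the PIL decoding step are illustrative assumptions, not the reporter's actual code.

import io

from PIL import Image
from streaming import StreamingDataset

class VideoFrameStreamingDataset(StreamingDataset):
    """Illustrative only: streams image/video samples from a remote bucket."""

    def __init__(self, remote: str, local: str, batch_size: int, **kwargs):
        # remote: e.g. an s3:// URI holding MDS shards; local: on-disk cache dir
        super().__init__(remote=remote, local=local, batch_size=batch_size,
                         shuffle=True, **kwargs)

    def __getitem__(self, idx: int):
        sample = super().__getitem__(idx)
        # Assumes each sample stores encoded image bytes under a 'frame' key.
        return Image.open(io.BytesIO(sample['frame'])).convert('RGB')

Each dataloader worker process on each rank downloads shards through its own remote client, which is where the leak discussed later in the thread turned out to originate.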

@nagadit added the bug label on Aug 16, 2024
@mvpatel2000
Contributor

Just to confirm, do you not observe this if you avoid using streaming?

@XiaohanZhangCMU
Member

@nagadit thanks for bringing up the issue.

As far as I can tell, peak memory going from 13.75 to ~14 is normal, roughly a 2% increase. One possibility is that the dataset is not fully shuffled, so some samples are larger and require more memory at runtime; PyTorch allocates that memory but does not free it until garbage collection. Overall, some accumulation of peak memory during training is expected.

Can you provide more details? For example: whether this has already been isolated as a streaming issue, which trainer you use, a code snippet showing how your StreamingDataset instance is created, and whether the memory leak keeps growing. Thanks.

@nagadit
Author

nagadit commented Aug 17, 2024

> Can you provide more details? […]

I apologize, I uploaded the wrong screenshot by mistake.
Corrected: [updated screenshot attached]

@nagadit
Author

nagadit commented Aug 17, 2024

> Just to confirm, do you not observe this if you avoid using streaming?

There is still a leak, but only a minimal one.

@nagadit
Author

nagadit commented Aug 17, 2024

For example, you can create a dataloader as shown below; the Trainer then wraps the dataloader and model in DDP.

# Trainer here is Composer's Trainer; StreamingOutsideGIWebVid and Model are the
# reporter's own classes and are not shown in this thread.
s3_dataloader = StreamingOutsideGIWebVid(
    batch_size=360,
    extra_local="path/to/local",
    extra_remote="s3://",
)

trainer = Trainer(
    model=Model(),
    train_dataloader=s3_dataloader,
    max_duration="2ep",
)
trainer.fit()

You will get a big memory leak when working with images or video files.
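For context (an assumption about the setup, since the reporter's class is not shown): a streaming dataset is normally wrapped in a standard torch DataLoader with several worker processes per rank, and each worker process downloads shards through its own copy of the remote client. A minimal sketch with placeholder paths:

from streaming import StreamingDataset
from torch.utils.data import DataLoader

# Placeholder remote/local locations; substitute your own.
dataset = StreamingDataset(remote='s3://my-bucket/videos',
                           local='/tmp/streaming_cache',
                           batch_size=360, shuffle=True)

# Each DataLoader worker is a separate process; every one of them downloads
# shards with its own client state, which is why per-process client handling
# matters for the leak described in this thread.
loader = DataLoader(dataset, batch_size=360, num_workers=8, pin_memory=True)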

@AugustDev

Any confirmation on this? It seems like a deal breaker if there's a memory leak in MosaicML.

@mvpatel2000
Contributor

What's the size of your dataset? Note that streaming does not evict data by default, since you might need it for multiple passes, but if your dataset is large you can limit the cache size: https://docs.mosaicml.com/projects/streaming/en/stable/dataset_configuration/shard_retrieval.html#cache-limit
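For anyone landing here later, a minimal sketch of what that looks like (the remote/local paths and the '50gb' value are placeholders, not recommendations):

from streaming import StreamingDataset

# Cap the local shard cache; once the limit is reached, cached shards are
# evicted as new ones are downloaded.
dataset = StreamingDataset(
    remote='s3://my-bucket/my-dataset',
    local='/tmp/streaming_cache',
    batch_size=128,
    cache_limit='50gb',
)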

@nagadit
Author

nagadit commented Sep 8, 2024

Hello everyone! The memory leak problem has been solved (a custom boto3 session handler was written for multiprocessing and multithreading). You can learn more about the problem here: boto/boto3#1670

This issue can be closed, or kept as the main one for future searches on this problem.
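The reporter's handler isn't posted here, but the pattern discussed in boto/boto3#1670 is to give each process/thread its own boto3 Session instead of sharing the module-level default across forked dataloader workers. A minimal sketch of that idea (the helper name get_s3_client and the caching scheme are illustrative, not the reporter's code):

import os
import threading

import boto3

_clients = {}
_lock = threading.Lock()

def get_s3_client():
    """Return an S3 client owned by the current (process, thread) pair."""
    key = (os.getpid(), threading.get_ident())
    with _lock:
        client = _clients.get(key)
        if client is None:
            # Each worker gets its own Session (and connection pool) rather than
            # sharing boto3's default session across forked worker processes.
            client = boto3.session.Session().client('s3')
            _clients[key] = client
    return client

Workers would then call get_s3_client().download_file(bucket, key, local_path) instead of reusing a client created in the parent process.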

@mvpatel2000
Contributor

Nice! Glad to see it wasn't a bug on our end :). Thanks for hunting it down and flagging it.
