
Memory leak using download_file with DDP or FSDP #758

Closed
nagadit opened this issue Aug 16, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@nagadit

nagadit commented Aug 16, 2024

Environment

  • OS: Ubuntu 20.04
  • Hardware (GPU, or instance type): H100 × 16

To reproduce

Steps to reproduce the behavior:

  1. Use this dataset class (a rough sketch of this kind of class is shown below, after the screenshot)
  2. Set the remote path to videos or images
  3. Start training with FSDP or DDP on N_GPUS > 8, with batch_size > 128 and multiple dataloader workers (n_workers)

[screenshot of memory usage attached]
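The dataset class linked in step 1 is not reproduced in this thread. Below is a minimal sketch of the general shape of such a StreamingDataset subclass; the class name, the 'frame' field, and the PIL decoding step are illustrative assumptions, not the reporter's actual code.

import io

from PIL import Image
from streaming import StreamingDataset

class VideoFrameStreamingDataset(StreamingDataset):
    """Illustrative only: streams image/video samples from a remote bucket."""

    def __init__(self, remote: str, local: str, batch_size: int, **kwargs):
        # remote: e.g. an s3:// URI holding MDS shards; local: on-disk cache dir
        super().__init__(remote=remote, local=local, batch_size=batch_size,
                         shuffle=True, **kwargs)

    def __getitem__(self, idx: int):
        sample = super().__getitem__(idx)
        # Assumes each sample stores encoded image bytes under a 'frame' key.
        return Image.open(io.BytesIO(sample['frame'])).convert('RGB')

Each dataloader worker process on each rank downloads shards through its own remote client, which is where the leak discussed later in the thread turned out to originate.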

@nagadit added the bug label on Aug 16, 2024
@mvpatel2000
Contributor

Just to confirm, do you not observe this if you avoid using streaming?

@XiaohanZhangCMU
Member

@nagadit thanks for bringing up the issue.

As far as I can tell, peak memory going from 13.75 to ~14 is normal, roughly a 2% increase. One possibility is that the dataset is not fully shuffled, so some samples are larger and require more memory at runtime; PyTorch allocates that memory but does not free it until garbage collection. Overall, some accumulation of peak memory during training is expected.

Can you provide more details? For example: whether this has already been isolated as a streaming issue, which trainer you use, a code snippet showing how your StreamingDataset instance is created, and whether the memory leak keeps growing. Thanks.

@nagadit
Author

nagadit commented Aug 17, 2024

> Can you provide more details? […]

I apologize, I uploaded the wrong screenshot by mistake.
Corrected: [updated screenshot attached]

@nagadit
Author

nagadit commented Aug 17, 2024

> Just to confirm, do you not observe this if you avoid using streaming?

There is still a leak, but only a minimal one.

@nagadit
Author

nagadit commented Aug 17, 2024

For example, you can create a dataloader as shown below; the Trainer then wraps the dataloader and model in DDP.

# Trainer here is Composer's Trainer; StreamingOutsideGIWebVid and Model are the
# reporter's own classes and are not shown in this thread.
s3_dataloader = StreamingOutsideGIWebVid(
    batch_size=360,
    extra_local="path/to/local",
    extra_remote="s3://",
)

trainer = Trainer(
    model=Model(),
    train_dataloader=s3_dataloader,
    max_duration="2ep",
)
trainer.fit()

You will get a big memory leak when working with images or video files.
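For context (an assumption about the setup, since the reporter's class is not shown): a streaming dataset is normally wrapped in a standard torch DataLoader with several worker processes per rank, and each worker process downloads shards through its own copy of the remote client. A minimal sketch with placeholder paths:

from streaming import StreamingDataset
from torch.utils.data import DataLoader

# Placeholder remote/local locations; substitute your own.
dataset = StreamingDataset(remote='s3://my-bucket/videos',
                           local='/tmp/streaming_cache',
                           batch_size=360, shuffle=True)

# Each DataLoader worker is a separate process; every one of them downloads
# shards with its own client state, which is why per-process client handling
# matters for the leak described in this thread.
loader = DataLoader(dataset, batch_size=360, num_workers=8, pin_memory=True)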

@AugustDev

Any confirmation on this? It seems like a deal breaker if there's a memory leak in MosaicML.

@mvpatel2000
Contributor

What's the size of your dataset? Note that streaming does not evict data by default, since you might need it for multiple passes, but if your dataset is large you can limit the cache size: https://docs.mosaicml.com/projects/streaming/en/stable/dataset_configuration/shard_retrieval.html#cache-limit
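For anyone landing here later, a minimal sketch of what that looks like (the remote/local paths and the '50gb' value are placeholders, not recommendations):

from streaming import StreamingDataset

# Cap the local shard cache; once the limit is reached, cached shards are
# evicted as new ones are downloaded.
dataset = StreamingDataset(
    remote='s3://my-bucket/my-dataset',
    local='/tmp/streaming_cache',
    batch_size=128,
    cache_limit='50gb',
)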

@nagadit
Author

nagadit commented Sep 8, 2024

Hello everyone! The memory leak problem has been solved (a custom boto3 session handler was written for multiprocessing and multithreading). You can learn more about the problem here: boto/boto3#1670

This issue can be closed, or kept as the main one for future searches on this problem.
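The reporter's handler isn't posted here, but the pattern discussed in boto/boto3#1670 is to give each process/thread its own boto3 Session instead of sharing the module-level default across forked dataloader workers. A minimal sketch of that idea (the helper name get_s3_client and the caching scheme are illustrative, not the reporter's code):

import os
import threading

import boto3

_clients = {}
_lock = threading.Lock()

def get_s3_client():
    """Return an S3 client owned by the current (process, thread) pair."""
    key = (os.getpid(), threading.get_ident())
    with _lock:
        client = _clients.get(key)
        if client is None:
            # Each worker gets its own Session (and connection pool) rather than
            # sharing boto3's default session across forked worker processes.
            client = boto3.session.Session().client('s3')
            _clients[key] = client
    return client

Workers would then call get_s3_client().download_file(bucket, key, local_path) instead of reusing a client created in the parent process.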

@mvpatel2000
Contributor

Nice! Glad to see it wasn't a bug on our end :). Thanks for hunting it down and flagging it.
