Memory leak using download_file with DDP or FSDP #758
Comments
Just to confirm, do you not observe this if you avoid using streaming?
@nagadit thanks for bringing up the issue. As far as I can tell, peak memory going from 13.75 to ~14 is normal, about a 2% increase. One possibility is that the dataset is not fully shuffled, so some samples are larger and require more memory at runtime, and PyTorch allocated that memory but did not free it until garbage collection. Overall, it is expected for peak memory to accumulate during training. Can you provide more details? For example: whether this has already been isolated as a streaming issue, which trainer you use, a code snippet showing how your StreamingDataset instance is created, and whether the memory leak continues. Thanks.
I apologize, I uploaded the wrong screenshot by mistake.
There is a leak, but in minimal quantities.
For example, you can create a dataloader as shown below and wrap the dataloader and model in DDP.
s3_dataloader = StreamingOutsideGIWebVid(batch_size=360, extra_local="path/to/local", extra_remote="s3://")
trainer = Trainer(
    model=Model(),
    train_dataloader=s3_dataloader,
    max_duration="2ep",
)
trainer.fit()
You will get a big memory leak when working with images or video files.
Any confirmation on this? It seems like a deal breaker if there's a memory leak in MosaicML.
What's the size of your dataset? Note that Streaming does not evict data by default, since you might need it for multiple passes, but if your dataset is large you can limit the cache size: https://docs.mosaicml.com/projects/streaming/en/stable/dataset_configuration/shard_retrieval.html#cache-limit
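For anyone landing here from a search, a minimal sketch of setting the cache limit follows. It assumes the `cache_limit` argument of `StreamingDataset` described in the linked docs; the `build_dataset` helper, its `dataset_cls` parameter, the `"100gb"` value, and the paths in the usage note are hypothetical and only for illustration.

```python
def build_dataset(remote, local, cache_limit="100gb", dataset_cls=None):
    """Create a streaming dataset whose local shard cache is capped.

    `dataset_cls` is a hypothetical hook so a custom StreamingDataset
    subclass can be passed in; it defaults to streaming.StreamingDataset.
    """
    if dataset_cls is None:
        # Imported lazily so the helper can be defined without the package.
        from streaming import StreamingDataset
        dataset_cls = StreamingDataset
    # With a cache limit set, shards may be evicted from `local` once the
    # limit is reached, instead of every downloaded shard being kept.
    return dataset_cls(remote=remote, local=local, cache_limit=cache_limit)
```

Usage would look like `build_dataset("s3://my-bucket/data", "/tmp/cache", cache_limit="50gb")`, where the bucket and directory are placeholders. Note that this bounds disk usage of the shard cache; it would not by itself address a leak in process memory.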
Hello everyone! The memory leak problem has been solved (a custom boto3 session handler has been written for multiprocessing and multithreading). You can learn more about the problem here: boto/boto3#1670 This issue can be closed or kept as the main one for future searches about this problem.
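The handler itself was not posted in this thread, so the following is only a sketch of one common approach to the problem described in boto/boto3#1670: create a client once per process and thread and reuse it, rather than building a new session or client for every download. The names `get_s3_client` and `factory` are hypothetical.

```python
import os
import threading

# Per-thread storage: each thread sees its own attributes on this object.
_local = threading.local()


def _default_factory():
    # Imported lazily so the caching logic can be exercised without boto3.
    import boto3
    # One Session per thread, since boto3 Session objects are not thread safe.
    return boto3.session.Session().client("s3")


def get_s3_client(factory=_default_factory):
    """Return an S3 client cached for the current process and thread."""
    pid = os.getpid()
    # A forked worker inherits the parent's cached client; comparing the
    # stored pid with the current one makes the child build its own.
    if getattr(_local, "pid", None) != pid:
        _local.client = factory()
        _local.pid = pid
    return _local.client
```

In a dataloader worker, each download would call `get_s3_client()` and use the returned client, so the number of clients stays bounded by the number of worker processes and threads.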
Nice! Glad to see it wasn't a bug on our end :). Thanks for hunting it down and flagging it.