CPU memory leak when training with CsvDataset-like dataset #849

Hi, I tried to train with the SigLIP loss on a large dataset and found that during training (not evaluation) CPU memory usage kept increasing until the program was finally killed by the system. The data loading process is nothing special, similar to what CsvDataset does. Has anyone encountered a similar problem?

Comments
@estherxue does it behave differently than with the normal CLIP (InfoNCE) loss on the exact same setup?
Below is the running script for SigLIP:

Below is the running script for CLIP:
Did you try training with the standard CLIP loss?
Hi, are there any updates?
I tried training with the standard CLIP loss. I was wrong: the standard CLIP loss also has the memory leak problem.
I finally solved this problem by training on only a limited number of batches per epoch, since the memory usage goes back down once the code finishes training for one epoch.
It seems that this has nothing to do with the loss. The memory leak exists when doing evaluation.
I did some research on CPU memory leaks. People say that most of the time they appear either when tensors are accumulated without being detached (so they carry the entire computational graph with them), or because of data loader issues such as copy-on-access: storing plain Python objects in the dataset definition whose reference counts get incremented when they are accessed by multiple dataloader worker processes, which forces the copy-on-write pages those objects live on to be duplicated in every worker. The resources I found on this might help you debug.
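To make the copy-on-access behaviour described above concrete, here is a minimal sketch (not from this repo; it assumes PyTorch and psutil, and the class and function names are made up) that watches resident memory while dataloader workers iterate over a dataset backed by a plain Python list:

```python
import os

import psutil
from torch.utils.data import DataLoader, Dataset


class ListBackedDataset(Dataset):
    """Illustrative only: samples held in a plain Python list.

    With the default fork start method on Linux, each worker inherits this list
    via copy-on-write pages; merely reading an element bumps its refcount, which
    dirties the page it lives on and makes the worker privately copy it.
    """

    def __init__(self, n_samples: int = 2_000_000):
        self.samples = [f"caption for sample {i}" for i in range(n_samples)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return len(self.samples[idx])  # touching the object is enough to dirty its page


def total_rss_mb() -> float:
    """RSS of this process plus its dataloader workers, in MiB.

    RSS double-counts pages still shared with the parent; the upward trend
    across steps, not the absolute number, is the signal to look for.
    """
    main = psutil.Process(os.getpid())
    procs = [main, *main.children(recursive=True)]
    return sum(p.memory_info().rss for p in procs) / 2**20


if __name__ == "__main__":
    loader = DataLoader(ListBackedDataset(), batch_size=256, num_workers=4)
    for step, _batch in enumerate(loader):
        if step % 500 == 0:
            print(f"step {step:6d}  total RSS: {total_rss_mb():.0f} MiB")
```

With the list-backed dataset the combined RSS of the workers typically creeps up over the first pass through the data as more of the parent's pages get copied; with a numpy- or arrow-backed store it stays roughly flat.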
If you find something, let me know. I am experiencing RAM memory leaks too when fine-tuning CLIP using LoRA and the standard distributed CLIP loss implemented in this repo, but I use the torch Lightning Fabric launcher and the MosaicML streaming dataset instead of WebDataset.
We've done a lot of large scale training, long durations, big datasets, and never found any noteworthy issues with dataloader memory leaks and the webdataset code. We don't use CSV datasets though, so possibly an issue there.

There is significant memory churn when you're plowing through really large datasets, and some allocators have issues with fragmentation over time. I usually patch the allocator to use tcmalloc.

Should point out that normal 'validation' is VERY memory intensive if you have a lot of samples in your val dataset, since it computes a full similarity matrix. The val set should be treated as a 'gallery' style dataset, i.e. a hand-picked, limited set of test samples; a large val set could really spike memory. We usually use zero-shot eval to gauge progress, as it's more sane to run across larger val sets and is often the metric most focus on (though there are valid arguments for preferring other val metrics too). A batch-wise evaluation (average over batched similarities) would be possible but is not implemented.
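The batch-wise alternative mentioned at the end of the comment above is not implemented in open_clip; the following is only a sketch of what it could look like, assuming paired, L2-normalized image/text feature batches (all names are illustrative):

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def batchwise_recall_at_k(feature_batches, k: int = 1) -> float:
    """Average image->text recall@k computed within each batch.

    Only a [B, B] similarity matrix is ever materialized, instead of the
    [N, N] matrix a full-gallery evaluation over the whole val set needs.
    """
    hits, total = 0, 0
    for image_features, text_features in feature_batches:
        logits = image_features @ text_features.T              # [B, B]
        targets = torch.arange(logits.size(0), device=logits.device)
        topk = logits.topk(k, dim=1).indices                   # best-matching texts per image
        hits += (topk == targets[:, None]).any(dim=1).sum().item()
        total += logits.size(0)
    return hits / max(total, 1)


if __name__ == "__main__":
    # Toy usage: identical (already normalized) image/text features -> recall@1 is 1.0.
    feats = [F.normalize(torch.randn(256, 512), dim=-1) for _ in range(4)]
    print(batchwise_recall_at_k([(f, f) for f in feats], k=1))
```

The caveat from the comment above still applies: within-batch retrieval is an easier task than retrieval over the whole val set, so the numbers are not comparable to a full similarity-matrix evaluation.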
Using CSV datasets with the native implementation will lead to an increase in memory. As @miguelalba96 linked, it is not a bug but expected behavior. The solution is either to store the CSV columns in objects that don't trigger Python reference counting on access (e.g. numpy arrays) or to switch to a sharded format such as WebDataset.
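As an illustration of the first option (this is not the repo's CsvDataset; the column names, separator, and the omitted image loading/transform step are placeholders), a CSV-backed dataset can keep its columns in fixed-width numpy string arrays so that workers never touch per-element Python refcounts:

```python
import numpy as np
import pandas as pd
from torch.utils.data import Dataset


class ArrayBackedCsvDataset(Dataset):
    """Sketch of a CSV-style dataset that avoids copy-on-access memory growth.

    The captions and file paths live in two contiguous, fixed-width numpy
    string arrays; indexing them returns fresh str objects instead of bumping
    refcounts on millions of long-lived Python strings, so the pages the
    dataloader workers inherit from the parent stay shared.
    """

    def __init__(self, csv_path: str, sep: str = "\t",
                 img_key: str = "filepath", caption_key: str = "title"):
        df = pd.read_csv(csv_path, sep=sep)
        # np.array on a list of str infers a '<U...' fixed-width unicode dtype.
        self.image_paths = np.array(df[img_key].astype(str).tolist())
        self.captions = np.array(df[caption_key].astype(str).tolist())

    def __len__(self):
        return len(self.captions)

    def __getitem__(self, idx):
        # Image loading / preprocessing omitted; a real dataset would open the
        # file at self.image_paths[idx] and apply its transforms here.
        return str(self.image_paths[idx]), str(self.captions[idx])
```

The trade-off is that fixed-width unicode arrays pad every entry to the longest string; encoding captions to bytes, or using an Arrow/Parquet-backed table, keeps memory lower when caption lengths vary a lot.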