CPU memory leak when training with CsvDataset-like dataset #849

Hi, I tried to train with the SigLIP loss on a large dataset and found that during training (not evaluation) CPU memory usage kept increasing until the program was finally killed by the system. The data loading process is nothing special, similar to what CsvDataset does. Has anyone encountered a similar problem?

Comments
@estherxue does it behave differently than with the normal CLIP (InfoNCE) loss on the exact same setup?
Below is the running script for SigLIP:

Below is the running script for CLIP:
Did you try training with the standard CLIP loss?
Hi, are there any updates?
I tried training with the standard CLIP loss. I was wrong: the standard CLIP loss also has the memory leak problem.
I finally solved this problem by training on only a limited number of batches per epoch, since the memory usage goes back down once the code finishes training for one epoch.
It seems that this has nothing to do with the loss. The memory leak exists when doing evaluation.
I did some research on CPU memory leaks. People say that most of the time they appear either when tensors are accumulated without being detached (so they carry the entire computational graph with them), or because of data loader issues such as copy-on-access: storing plain Python objects in the dataset definition whose reference counts get incremented when they are accessed by multiple dataloader worker processes, which forces the copy-on-write pages those objects live on to be duplicated in every worker. The resources I found on this might help you debug.
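To make the copy-on-access behaviour described above concrete, here is a minimal sketch (not from this repo; it assumes PyTorch and psutil, and the class and function names are made up) that watches resident memory while dataloader workers iterate over a dataset backed by a plain Python list:

```python
import os

import psutil
from torch.utils.data import DataLoader, Dataset


class ListBackedDataset(Dataset):
    """Illustrative only: samples held in a plain Python list.

    With the default fork start method on Linux, each worker inherits this list
    via copy-on-write pages; merely reading an element bumps its refcount, which
    dirties the page it lives on and makes the worker privately copy it.
    """

    def __init__(self, n_samples: int = 2_000_000):
        self.samples = [f"caption for sample {i}" for i in range(n_samples)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return len(self.samples[idx])  # touching the object is enough to dirty its page


def total_rss_mb() -> float:
    """RSS of this process plus its dataloader workers, in MiB.

    RSS double-counts pages still shared with the parent; the upward trend
    across steps, not the absolute number, is the signal to look for.
    """
    main = psutil.Process(os.getpid())
    procs = [main, *main.children(recursive=True)]
    return sum(p.memory_info().rss for p in procs) / 2**20


if __name__ == "__main__":
    loader = DataLoader(ListBackedDataset(), batch_size=256, num_workers=4)
    for step, _batch in enumerate(loader):
        if step % 500 == 0:
            print(f"step {step:6d}  total RSS: {total_rss_mb():.0f} MiB")
```

With the list-backed dataset the combined RSS of the workers typically creeps up over the first pass through the data as more of the parent's pages get copied; with a numpy- or arrow-backed store it stays roughly flat.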
If you find something, let me know. I am experiencing RAM memory leaks too when fine-tuning CLIP using LoRA and the standard distributed CLIP loss implemented in this repo, but I use the torch Lightning Fabric launcher and the MosaicML streaming dataset instead of WebDataset.
We've done a lot of large scale training, long durations, big datasets, and never found any noteworthy issues with dataloader memory leaks and the webdataset code. We don't use CSV datasets though, so possibly an issue there.

There is significant memory churn when you're plowing through really large datasets, and some allocators have issues with fragmentation over time. I usually patch the allocator to use tcmalloc.

Should point out that normal 'validation' is VERY memory intensive if you have a lot of samples in your val dataset, since it computes a full similarity matrix. The val set should be treated as a 'gallery' style dataset, i.e. a hand-picked, limited set of test samples; a large val set could really spike memory. We usually use zero-shot eval to gauge progress, as it's more sane to run across larger val sets and is often the metric most focus on (though there are valid arguments for preferring other val metrics too). A batch-wise evaluation (average over batched similarities) would be possible but is not implemented.
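The batch-wise alternative mentioned at the end of the comment above is not implemented in open_clip; the following is only a sketch of what it could look like, assuming paired, L2-normalized image/text feature batches (all names are illustrative):

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def batchwise_recall_at_k(feature_batches, k: int = 1) -> float:
    """Average image->text recall@k computed within each batch.

    Only a [B, B] similarity matrix is ever materialized, instead of the
    [N, N] matrix a full-gallery evaluation over the whole val set needs.
    """
    hits, total = 0, 0
    for image_features, text_features in feature_batches:
        logits = image_features @ text_features.T              # [B, B]
        targets = torch.arange(logits.size(0), device=logits.device)
        topk = logits.topk(k, dim=1).indices                   # best-matching texts per image
        hits += (topk == targets[:, None]).any(dim=1).sum().item()
        total += logits.size(0)
    return hits / max(total, 1)


if __name__ == "__main__":
    # Toy usage: identical (already normalized) image/text features -> recall@1 is 1.0.
    feats = [F.normalize(torch.randn(256, 512), dim=-1) for _ in range(4)]
    print(batchwise_recall_at_k([(f, f) for f in feats], k=1))
```

The caveat from the comment above still applies: within-batch retrieval is an easier task than retrieval over the whole val set, so the numbers are not comparable to a full similarity-matrix evaluation.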
Using CSV datasets with the native implementation will lead to an increase in memory. As @miguelalba96 linked, it is not a bug but expected behavior. The solution is either to store the CSV columns in objects that don't trigger Python reference counting on access (e.g. numpy arrays) or to switch to a sharded format such as WebDataset.
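As an illustration of the first option (this is not the repo's CsvDataset; the column names, separator, and the omitted image loading/transform step are placeholders), a CSV-backed dataset can keep its columns in fixed-width numpy string arrays so that workers never touch per-element Python refcounts:

```python
import numpy as np
import pandas as pd
from torch.utils.data import Dataset


class ArrayBackedCsvDataset(Dataset):
    """Sketch of a CSV-style dataset that avoids copy-on-access memory growth.

    The captions and file paths live in two contiguous, fixed-width numpy
    string arrays; indexing them returns fresh str objects instead of bumping
    refcounts on millions of long-lived Python strings, so the pages the
    dataloader workers inherit from the parent stay shared.
    """

    def __init__(self, csv_path: str, sep: str = "\t",
                 img_key: str = "filepath", caption_key: str = "title"):
        df = pd.read_csv(csv_path, sep=sep)
        # np.array on a list of str infers a '<U...' fixed-width unicode dtype.
        self.image_paths = np.array(df[img_key].astype(str).tolist())
        self.captions = np.array(df[caption_key].astype(str).tolist())

    def __len__(self):
        return len(self.captions)

    def __getitem__(self, idx):
        # Image loading / preprocessing omitted; a real dataset would open the
        # file at self.image_paths[idx] and apply its transforms here.
        return str(self.image_paths[idx]), str(self.captions[idx])
```

The trade-off is that fixed-width unicode arrays pad every entry to the longest string; encoding captions to bytes, or using an Arrow/Parquet-backed table, keeps memory lower when caption lengths vary a lot.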