High Memory Usage in DataLoader Workers Leading to Out-of-Memory (OOM) #196

Closed
inigopm opened this issue Aug 30, 2024 · 11 comments


inigopm commented Aug 30, 2024

I'm experiencing high memory usage in the DataLoader workers when using a custom dataset class for lazy loading large datasets. This leads to Out-of-Memory (OOM) errors during training. I've observed that the MaxRSS (maximum resident set size) steadily increases during training, indicating potential memory leaks or improper memory management in the DataLoader or dataset preprocessing.

Error Message Example:
RuntimeError: DataLoader worker (pid XXXX) is killed by signal: Killed

Setup: Distributed training with 3 nodes, 4 GPUs per node
Memory: 512 GB RAM

Training Configuration
Here are the relevant training configurations used:

#!/bin/bash
source use_env.sh

NNODES=$SLURM_NNODES
WORLD_SIZE=$((NNODES * NUM_GPUS))
NODE_RANK=$SLURM_NODEID
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=12802

ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node="${NUM_GPUS}" --nnodes="${NNODES}" --node_rank="${NODE_RANK}" --master_addr="${MASTER_ADDR}" --master_port="${MASTER_PORT}" \
    llava/train/train_mem.py \
    --deepspeed scripts/zero3.json \
    --model_name_or_path ${CKPT_PATH} \
    --version ${PROMPT_VERSION} \
    --data_path ./playground/data/llava_v1_5_mix665k_no-ocr.json \
    --image_folder ./playground/data \
    --pretrain_mm_mlp_adapter="./checkpoints/projectors/${BASE_RUN_NAME}/checkpoint-5/mm_projector.bin" \
    --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
    --mm_vision_tower_lr=2e-6 \
    --vision_tower ${VISION_MODEL_VERSION} \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --group_by_modality_length True \
    --image_aspect_ratio anyres \
    --image_grid_pinpoints "[(384, 768), (768, 384), (768, 768), (1152, 384), (384, 1152)]" \
    --mm_patch_merge_type spatial_unpad \
    --bf16 True \
    --run_name $MID_RUN_NAME \
    --output_dir "./checkpoints/${MID_RUN_NAME}" \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --dataloader_num_workers 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 1 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 32768 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb \
    --torch_compile True \
    --torch_compile_backend "inductor" \
    --dataloader_drop_last True

• Could the lazy preprocessing in the dataset be failing to release memory properly?
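
For reference on the MaxRSS observation above, here is a minimal sketch (not part of the training code; the helper name and call sites are my own) of how peak RSS can be logged per process, so the growth can be attributed to the main process versus individual DataLoader workers:

    import os
    import resource

    def log_maxrss(tag: str = "") -> None:
        # ru_maxrss is reported in kilobytes on Linux
        rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print(f"[pid {os.getpid()}] {tag} MaxRSS: {rss_kb / 1024:.1f} MiB")

    # Example: call this every N samples from Dataset.__getitem__ (runs inside a
    # DataLoader worker) and every N steps from the training loop (runs in the
    # main process) to see which side is actually growing.
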
@mylesgoose

--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \

Why is your gradient accumulation so low? Also, what bucket sizes does your DeepSpeed config use? For example:

    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "15099494",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
},

What is stage3_prefetch_bucket_size set to in yours? Try reducing it.

@leexinhao

Same problem here!


Luodian (Contributor) commented Sep 10, 2024

[screenshot of wandb logs]

I checked our wandb logs (randomly selecting runs longer than 300 minutes), and there is indeed a sign of leakage (mainly due to decord).

But we didn't run into this ourselves. One reason is that our instances have 1.8 TB of memory, so it's hard to hit that limit. Another reason may be a coincidence: our infra is unstable and runs are repeatedly stopped by NCCL timeouts, so we have to use shorter checkpointing intervals, and each time a run restarts, memory usage drops back to a normal level.


inigopm commented Sep 10, 2024

Thank you for your response and for checking the logs. I've been investigating further, and I was able to mitigate the issue somewhat by reducing the number of workers in the DataLoader. Although the MaxRSS still increases over time, it's now stable enough that the process doesn't hit OOM before 5000 steps, which is manageable for my experiment.

inigopm closed this as completed Sep 10, 2024

ftgreat commented Oct 13, 2024


@Luodian: “I checked our wandb logs (randomly selecting runs longer than 300 minutes), and there is indeed a sign of leakage (mainly due to decord).”

What does “decord” refer to? Thanks.

@guanyanchu


I found this problem with mid_stage training as well, and did not use decord.


slyforce commented Nov 7, 2024

I don't have time to provide an MR to fix this, but I found the issue on this line:

https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/llava/train/train.py#L566

Don't deepcopy the tokeniser and instead just add the token on the fly / pass an appropriate tokeniser object that you want to modify.

Good luck!
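
To make that concrete, here is a minimal sketch of the pattern being pointed at, assuming the preprocessing routine only deep-copies the tokenizer in order to register an extra special token; the function names and the "<image>" token are illustrative, not the repository's actual code:

    import copy
    from transformers import AutoTokenizer

    # Leaky pattern: deep-copying the tokenizer on every call allocates a large
    # object inside each DataLoader worker again and again.
    def preprocess_leaky(texts, tokenizer):
        tok = copy.deepcopy(tokenizer)                   # fresh copy per call
        tok.add_tokens(["<image>"], special_tokens=True)
        return tok(texts).input_ids

    # Suggested pattern: mutate (or pre-configure) the tokenizer that is passed
    # in and reuse it, so no per-call copies pile up in the workers.
    def preprocess_reuse(texts, tokenizer):
        if "<image>" not in tokenizer.get_vocab():
            tokenizer.add_tokens(["<image>"], special_tokens=True)
        return tokenizer(texts).input_ids

    # Hypothetical usage:
    # tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
    # ids = preprocess_reuse(["hello world"], tokenizer)

The second form is what the comment suggests: pass in a tokenizer that is already set up the way the preprocessing needs, instead of cloning it on every call.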

@CuriousCat-7


Are you sure? It seems that the deepcopy is OK there; the copied tokenizer should be released once execution leaves the scope of preprocess_qwen.

@slyforce


It was definitely a source of memory leakage for me. Give it a try; sadly, I'm still unable to prepare the MR :(

@CuriousCat-7


We added:

    del tokenizer

and the issue is solved. I hardly know why, but it works. Ahhh, I guess I have to say "amazing".
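
For readers following along, a sketch of where that workaround sits, assuming the deep-copy pattern discussed above (the function body is illustrative, not the actual preprocess_qwen):

    import copy

    def preprocess_qwen_sketch(texts, tokenizer):
        tokenizer = copy.deepcopy(tokenizer)              # local copy suspected of leaking
        tokenizer.add_tokens(["<image>"], special_tokens=True)
        data_dict = {"input_ids": tokenizer(texts).input_ids}
        del tokenizer                                     # the reported workaround: drop the copy explicitly
        return data_dict
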

@goodstudent9


It seems that del tokenizer doesn't work for me; the memory usage still grows higher and higher.
Do you know why?
I added the del tokenizer at the end of preprocess_qwen(). At first the memory usage is 40 GB, but after 7 hours it reaches 60 GB, so every new epoch adds about 10 GB of memory.
Why? That is so weird!
