OOM with RAM with Lhotse #11303

Open
riqiang-dp opened this issue Nov 15, 2024 · 1 comment
Labels: bug (Something isn't working)


riqiang-dp commented Nov 15, 2024

Describe the bug

When training a larger model that consumes more memory, I noticed that training would stop after a roughly constant number of epochs. Upon further investigation, I found that during training/validation, CPU memory (RAM) usage keeps rising and is never released, which leads to an OOM after a number of epochs. This happens with the Lhotse dataloader: with the same version of NeMo I can train a small fast-conformer CTC model for hundreds of epochs, but an XL fast-conformer CTC model only runs for ~20 epochs (~110 epochs if I use 1/4 as many dataloader workers). So somehow the dataloader is not releasing the memory for the data it loads.
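A minimal sketch of how the growth can be observed (illustrative only, not part of NeMo): a plain PyTorch Lightning callback that uses psutil to log the resident set size (RSS) of the trainer process and its dataloader workers at the end of each epoch.

```python
# Illustrative sketch only: logs RSS of the trainer process and its dataloader
# worker children at the end of each train/validation epoch.
import os

import psutil
import pytorch_lightning as pl


class RssLogger(pl.Callback):
    def _log(self, tag):
        proc = psutil.Process(os.getpid())
        main_gib = proc.memory_info().rss / 2**30
        workers_gib = sum(
            c.memory_info().rss for c in proc.children(recursive=True)
        ) / 2**30
        print(f"[{tag}] main RSS: {main_gib:.2f} GiB, workers RSS: {workers_gib:.2f} GiB")

    def on_train_epoch_end(self, trainer, pl_module):
        self._log(f"train epoch {trainer.current_epoch}")

    def on_validation_epoch_end(self, trainer, pl_module):
        self._log(f"val epoch {trainer.current_epoch}")
```

Attached via `pl.Trainer(callbacks=[RssLogger()])`, the workers' RSS should keep climbing from epoch to epoch (rather than plateauing) if the dataloader is what is holding on to memory.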

Steps/Code to reproduce bug

  • the leak shows up with both the XL fast-conformer CTC model and the medium fast-conformer CTC/RNNT hybrid
  • standard hyperparameter configs for these models
  • other Lhotse-related config (trainer and train_ds sections):
trainer:
  devices: -1
  num_nodes: 1
  max_epochs: 150
  max_steps: 150000
  val_check_interval: 1000
  accelerator: auto
  strategy: ddp
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
  precision: bf16-mixed
  log_every_n_steps: 200
  enable_progress_bar: true
  num_sanity_val_steps: 1
  check_val_every_n_epoch: 1
  sync_batchnorm: true
  enable_checkpointing: false
  logger: false
  benchmark: false
  use_distributed_sampler: false
  limit_train_batches: 1000
train_ds:
  manifest_filepath: null
  sample_rate: 16000
  batch_size: null
  shuffle: true
  num_workers: 8
  pin_memory: true
  max_duration: 45
  min_duration: 1
  is_tarred: false
  tarred_audio_filepaths: null
  shuffle_n: 2048
  bucketing_strategy: synced_randomized
  bucketing_batch_size: null
  shar_path:
  - xxxxx
  use_lhotse: true
  bucket_duration_bins:
  - xxx
  batch_duration: 600
  quadratic_duration: 30
  num_buckets: 30
  bucket_buffer_size: 10000
  shuffle_buffer_size: 10000
  num_cuts_for_bins_estimate: 10000
  use_bucketing: true
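For scale, a rough illustration of how many cuts sit in the sampler buffers under this config, assuming each of the 8 dataloader workers keeps its own copy of the shuffle and bucketing buffers (the usual behavior when an iterable-style dataset is replicated into PyTorch worker processes). How much RAM that translates to depends on whether each buffered cut also holds decoded audio, so treat this only as a sketch:

```python
# Rough illustration only, using the values from the train_ds config above.
num_workers = 8
shuffle_buffer_size = 10_000
bucket_buffer_size = 10_000

# If every worker holds its own shuffle and bucketing buffers, the number of
# cuts resident in RAM at any moment is roughly:
buffered_cuts = num_workers * (shuffle_buffer_size + bucket_buffer_size)
print(buffered_cuts)  # 160000 cuts with 8 workers; 40000 with num_workers=2
```

That 4x difference between 8 and 2 workers would at least be consistent with the observation above that using 1/4 of the workers pushes the OOM out to ~110 epochs.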

Expected behavior

Training should continue until the specified stopping point (max_epochs / max_steps).

Environment overview (please complete the following information)

  • GCP
  • Method of NeMo install: poetry: nemo-toolkit = {version = "2.0.0rc1", extras = ["asr"]}

Environment details

  • nemo-toolkit 2.0.0rc1 / 2.0.0
  • PyTorch-lightning 2.4.0
  • PyTorch 2.2.2+cu121
  • Python 3.11.8

Additional context

GPU: A100 40G.

@nithinraok suggested trying limit_validation_batches, using shorter-duration audio, and using fully_randomized; I haven't fully tested these yet. I'll report back once they are tested, but so far the issue persists.
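For reference, a sketch of where those suggestions would land in the config above. limit_val_batches is the PyTorch Lightning trainer option; fully_randomized is assumed to be an accepted value of the existing bucketing_strategy key (which currently holds synced_randomized), and the concrete values below are illustrative:

```yaml
# Sketch of the attempted overrides (exact option names/values may differ):
trainer:
  limit_val_batches: 100          # cap the number of validation batches per check

train_ds:
  max_duration: 20                # shorter-duration audio (was 45)
  bucketing_strategy: fully_randomized
```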

Edit: tried these suggestions, still OOM.

riqiang-dp added the bug label on Nov 15, 2024
nithinraok (Collaborator) commented:

@pzelasko fyi
