OOM with RAM with Lhotse #11303

Open
riqiang-dp opened this issue Nov 15, 2024 · 1 comment
Labels: bug (Something isn't working)


riqiang-dp commented Nov 15, 2024

Describe the bug

When training a larger model that consumes more memory, I noticed that training would stop after a roughly constant number of epochs. Upon further investigation, I found that during training/validation, CPU memory (RAM) usage keeps rising and is never released, which leads to an OOM after a number of epochs. This happens with the Lhotse dataloader: with the same version of NeMo I can train a small fast-conformer CTC model for hundreds of epochs, but an XL fast-conformer CTC model only runs for ~20 epochs (~110 epochs if I use 1/4 as many dataloader workers). So somehow the dataloader is not releasing the memory for the data it loads.
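A minimal sketch of how the growth can be observed (illustrative only, not part of NeMo): a plain PyTorch Lightning callback that uses psutil to log the resident set size (RSS) of the trainer process and its dataloader workers at the end of each epoch.

```python
# Illustrative sketch only: logs RSS of the trainer process and its dataloader
# worker children at the end of each train/validation epoch.
import os

import psutil
import pytorch_lightning as pl


class RssLogger(pl.Callback):
    def _log(self, tag):
        proc = psutil.Process(os.getpid())
        main_gib = proc.memory_info().rss / 2**30
        workers_gib = sum(
            c.memory_info().rss for c in proc.children(recursive=True)
        ) / 2**30
        print(f"[{tag}] main RSS: {main_gib:.2f} GiB, workers RSS: {workers_gib:.2f} GiB")

    def on_train_epoch_end(self, trainer, pl_module):
        self._log(f"train epoch {trainer.current_epoch}")

    def on_validation_epoch_end(self, trainer, pl_module):
        self._log(f"val epoch {trainer.current_epoch}")
```

Attached via `pl.Trainer(callbacks=[RssLogger()])`, the workers' RSS should keep climbing from epoch to epoch (rather than plateauing) if the dataloader is what is holding on to memory.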

Steps/Code to reproduce bug

  • the leak shows up with both the XL fast-conformer CTC model and the medium fast-conformer CTC/RNNT hybrid
  • standard hyperparameter configs for these models
  • other Lhotse-related config (trainer and train_ds sections):
trainer:
  devices: -1
  num_nodes: 1
  max_epochs: 150
  max_steps: 150000
  val_check_interval: 1000
  accelerator: auto
  strategy: ddp
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
  precision: bf16-mixed
  log_every_n_steps: 200
  enable_progress_bar: true
  num_sanity_val_steps: 1
  check_val_every_n_epoch: 1
  sync_batchnorm: true
  enable_checkpointing: false
  logger: false
  benchmark: false
  use_distributed_sampler: false
  limit_train_batches: 1000
train_ds:
  manifest_filepath: null
  sample_rate: 16000
  batch_size: null
  shuffle: true
  num_workers: 8
  pin_memory: true
  max_duration: 45
  min_duration: 1
  is_tarred: false
  tarred_audio_filepaths: null
  shuffle_n: 2048
  bucketing_strategy: synced_randomized
  bucketing_batch_size: null
  shar_path:
  - xxxxx
  use_lhotse: true
  bucket_duration_bins:
  - xxx
  batch_duration: 600
  quadratic_duration: 30
  num_buckets: 30
  bucket_buffer_size: 10000
  shuffle_buffer_size: 10000
  num_cuts_for_bins_estimate: 10000
  use_bucketing: true
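For scale, a rough illustration of how many cuts sit in the sampler buffers under this config, assuming each of the 8 dataloader workers keeps its own copy of the shuffle and bucketing buffers (the usual behavior when an iterable-style dataset is replicated into PyTorch worker processes). How much RAM that translates to depends on whether each buffered cut also holds decoded audio, so treat this only as a sketch:

```python
# Rough illustration only, using the values from the train_ds config above.
num_workers = 8
shuffle_buffer_size = 10_000
bucket_buffer_size = 10_000

# If every worker holds its own shuffle and bucketing buffers, the number of
# cuts resident in RAM at any moment is roughly:
buffered_cuts = num_workers * (shuffle_buffer_size + bucket_buffer_size)
print(buffered_cuts)  # 160000 cuts with 8 workers; 40000 with num_workers=2
```

That 4x difference between 8 and 2 workers would at least be consistent with the observation above that using 1/4 of the workers pushes the OOM out to ~110 epochs.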

Expected behavior

Training should continue until the specified stopping point (max_epochs / max_steps).

Environment overview (please complete the following information)

  • GCP
  • Method of NeMo install: poetry: nemo-toolkit = {version = "2.0.0rc1", extras = ["asr"]}

Environment details

  • nemo-toolkit 2.0.0rc1 / 2.0.0
  • PyTorch-lightning 2.4.0
  • PyTorch 2.2.2+cu121
  • Python 3.11.8

Additional context

GPU: A100 40G.

@nithinraok suggested trying limit_validation_batches, using shorter-duration audio, and using fully_randomized; I haven't fully tested these yet. I'll report back once they are tested, but so far the issue persists.
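For reference, a sketch of where those suggestions would land in the config above. limit_val_batches is the PyTorch Lightning trainer option; fully_randomized is assumed to be an accepted value of the existing bucketing_strategy key (which currently holds synced_randomized), and the concrete values below are illustrative:

```yaml
# Sketch of the attempted overrides (exact option names/values may differ):
trainer:
  limit_val_batches: 100          # cap the number of validation batches per check

train_ds:
  max_duration: 20                # shorter-duration audio (was 45)
  bucketing_strategy: fully_randomized
```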

Edit: tried these suggestions, still OOM.

riqiang-dp added the bug label on Nov 15, 2024
nithinraok (Collaborator) commented:

@pzelasko fyi
