Lhotse and the best way to use it #10087
-
This is mainly a question for @pzelasko 😀 I understand that Lhotse was used in recent very large-scale training runs. But I have a few questions:
-
Hi @FredSRichardson, it's been a while!
You can use your existing Lhotse data; we support all Lhotse formats and all NeMo formats. You may find this doc helpful to navigate the relevant options: https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html#enabling-lhotse-via-configuration
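For reference, enabling the Lhotse dataloader through the NeMo config looks roughly like the sketch below. This is a minimal, hedged example: the path is a placeholder, and the exact option names and defaults depend on your NeMo version, so verify them against the doc linked above.

```yaml
model:
  train_ds:
    use_lhotse: true        # switch the dataloader to the Lhotse-based one
    cuts_path: /data/train_cuts.jsonl.gz  # placeholder: your existing Lhotse manifest
    batch_duration: 600     # dynamic batching by total audio duration (seconds)
    use_bucketing: true     # bucket utterances of similar length together
    num_buckets: 30
```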
We mainly use tarred formats to leverage optimized I/O. You can either use NeMo tarred manifests or Lhotse Shar format (you should get very similar if not identical performance). If you end up going for the NeMo tarred format, I recommend not splitting the data into separate buckets on disk; Lhotse will blend your datasets and bucket your data dynamically with no noticeable overhead.
Yes, the Lhotse workflows here are built primarily with sharding in mind, because it helps us preserve a good degree of randomness (especially when blending multiple datasets).
With Lhotse there are no specific requirements on the number of shards; I generally aim for a shard_size of about 1000. Lhotse makes the data iterator infinite (but gives a different seed to each DDP rank / dataloading worker), so you won't run into the uneven-number-of-steps-per-GPU issue. Please refer to the doc linked above for more details, and LMK if you have any other questions. As a bonus: you may be interested in our most recent work that (significantly) increases the batch sizes in training; the PR contains new documentation on how to leverage this: #9763
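The infinite-iterator idea above can be sketched in plain Python. This is an illustrative toy, not Lhotse's actual implementation: each DDP rank derives its own RNG seed, so ranks walk the shard list in different orders, and because the stream never ends, every rank always draws the same number of batches per training run regardless of dataset size.

```python
import random
from itertools import islice

def infinite_shard_iterator(shards, base_seed, rank):
    """Toy sketch of per-rank infinite shard sampling (not Lhotse's code).

    Each rank seeds its own RNG, so shard orders differ across ranks,
    and the loop never terminates, so step counts stay even across GPUs.
    """
    rng = random.Random(base_seed + rank)  # distinct, reproducible seed per rank
    while True:  # "epochs" are defined by step count, not by data exhaustion
        order = list(shards)
        rng.shuffle(order)
        yield from order

shards = [f"shard-{i:06d}.tar" for i in range(4)]

# Both ranks draw exactly 6 shards, even though there are only 4 on disk.
rank0 = list(islice(infinite_shard_iterator(shards, base_seed=0, rank=0), 6))
rank1 = list(islice(infinite_shard_iterator(shards, base_seed=0, rank=1), 6))
assert len(rank0) == len(rank1) == 6
```

In the real dataloader, what is yielded are cuts read from the shards rather than shard names, but the seeding and the "never run dry" property are the relevant mechanics here.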