Lhotse and the best way to use it #10087
-
This is mainly a question for @pzelasko 😀 I understand that Lhotse was used in recent very large-scale training runs. But I have a few questions:
-
Hi @FredSRichardson, it's been a while!
You can use your existing Lhotse data; we support all Lhotse formats and all NeMo formats. You may find this doc helpful to navigate the relevant options: https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html#enabling-lhotse-via-configuration
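For reference, enabling the Lhotse dataloader through the NeMo config looks roughly like the sketch below. This is a minimal, hedged example: the path is a placeholder, and the exact option names and defaults depend on your NeMo version, so verify them against the doc linked above.

```yaml
model:
  train_ds:
    use_lhotse: true        # switch the dataloader to the Lhotse-based one
    cuts_path: /data/train_cuts.jsonl.gz  # placeholder: your existing Lhotse manifest
    batch_duration: 600     # dynamic batching by total audio duration (seconds)
    use_bucketing: true     # bucket utterances of similar length together
    num_buckets: 30
```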
We mainly use tarred formats to leverage optimized I/O. You can either use NeMo tarred manifests or Lhotse Shar format (you should get very similar if not identical performance). If you end up going for the NeMo tarred format, I recommend not splitting the data into separate buckets on disk; Lhotse will blend your datasets and bucket your data dynamically with no noticeable overhead.
Yes, the Lhotse workflows here are built primarily with sharding in mind, because it helps us preserve a good degree of randomness (especially when blending multiple datasets).
With Lhotse there are no specific requirements on the number of shards; I generally aim for a shard_size of about 1000. Lhotse makes the data iterator infinite (but gives a different seed to each DDP rank / dataloading worker), so you won't run into the uneven-number-of-steps-per-GPU issue. Please refer to the doc linked above for more details, and LMK if you have any other questions. As a bonus: you may be interested in our most recent work that (significantly) increases the batch sizes in training; the PR contains new documentation on how to leverage this: #9763
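The infinite-iterator idea above can be sketched in plain Python. This is an illustrative toy, not Lhotse's actual implementation: each DDP rank derives its own RNG seed, so ranks walk the shard list in different orders, and because the stream never ends, every rank always draws the same number of batches per training run regardless of dataset size.

```python
import random
from itertools import islice

def infinite_shard_iterator(shards, base_seed, rank):
    """Toy sketch of per-rank infinite shard sampling (not Lhotse's code).

    Each rank seeds its own RNG, so shard orders differ across ranks,
    and the loop never terminates, so step counts stay even across GPUs.
    """
    rng = random.Random(base_seed + rank)  # distinct, reproducible seed per rank
    while True:  # "epochs" are defined by step count, not by data exhaustion
        order = list(shards)
        rng.shuffle(order)
        yield from order

shards = [f"shard-{i:06d}.tar" for i in range(4)]

# Both ranks draw exactly 6 shards, even though there are only 4 on disk.
rank0 = list(islice(infinite_shard_iterator(shards, base_seed=0, rank=0), 6))
rank1 = list(islice(infinite_shard_iterator(shards, base_seed=0, rank=1), 6))
assert len(rank0) == len(rank1) == 6
```

In the real dataloader, what is yielded are cuts read from the shards rather than shard names, but the seeding and the "never run dry" property are the relevant mechanics here.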