sharded manifests docs (#6751)

Signed-off-by: Dima Rekesh <drekesh@nvidia.com>
Co-authored-by: Dima Rekesh <drekesh@nvidia.com>
NVIDIA · May 29, 2023 · c1abc04
1 parent 8b814bc
Showing 1 changed file with 14 additions and 2 deletions.
diff --git a/docs/source/asr/datasets.rst b/docs/source/asr/datasets.rst
@@ -216,7 +216,12 @@ of filepaths, e.g. ``['/data/shard1.tar', '/data/shard2.tar']``, or in a single
   tag ``_CL_``. For SLURM based tasks, we suggest the use of the special tags for ease of use.
 
 As with non-tarred datasets, the manifest file should be passed in ``manifest_filepath``. The dataloader assumes that the length
-of the manifest after filtering is the correct size of the dataset for reporting training progress.
+of the manifest after filtering is the correct size of the dataset for reporting training progress. 
+
+If the manifest is large, you can reference sharded manifest files instead of a single manifest file. The naming convention
+is identical to that of the audio tarballs, and each audio tar shard should have exactly one corresponding manifest shard; e.g.,
+``'/data/sharded_manifests/manifest__OP_1..64_CL_'`` in the above example. Using sharded manifests improves job startup time and
+reduces memory usage, because each worker loads only the manifest shards for its own audio shards rather than the entire manifest.
 
 The ``tarred_shard_strategy`` field of the config file can be set if you have multiple shards and are running an experiment with
 multiple workers. It defaults to ``scatter``, which preallocates a set of shards per worker which do not change during runtime.
@@ -266,13 +271,20 @@ The files in the target directory should look similar to the following:
   ├── audio_2.tar
   ├── ...
   ├── metadata.yaml
-  └── tarred_audio_manifest.json
+  ├── tarred_audio_manifest.json
+  └── sharded_manifests/
+      ├── manifest_1.json
+      ├── ...
+      └── manifest_N.json
+
 
 Note that file structures are flattened such that all audio files are at the top level in each tarball. This ensures that
 filenames are unique in the tarred dataset and the filepaths do not contain "-sub" and forward slashes in each ``audio_filepath`` are
 simply converted to underscores. For example, a manifest entry for ``/data/directory1/file.wav`` would be ``_data_directory1_file.wav``
 in the tarred dataset manifest, and ``/data/directory2/file.wav`` would be converted to ``_data_directory2_file.wav``.
 
+Sharded manifests are generated by default; this behavior can be disabled with the ``no_shard_manifests`` flag.
+
 Bucketing Datasets
 ------------------
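
Note on the change above: the 1:1 pairing between audio shards and manifest shards can be checked before launching a job. The following is a minimal sketch, not NeMo's own implementation. The helper names `expand_range` and `pair_shards` are hypothetical, the paths are illustrative, and the `.tar` / `.json` suffixes are assumed from the directory listing in the diff. It handles only a single numeric `_OP_a..b_CL_` range per pattern.

```python
import re

# Hypothetical helper (not part of NeMo): matches one numeric range
# written with the _OP_ / _CL_ special tags, e.g. "_OP_1..64_CL_".
_RANGE = re.compile(r"_OP_(\d+)\.\.(\d+)_CL_")


def expand_range(pattern):
    """Expand a single _OP_a..b_CL_ range into a list of concrete paths."""
    match = _RANGE.search(pattern)
    if match is None:
        # No range tag present: treat the pattern as one literal path.
        return [pattern]
    start, stop = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [f"{prefix}{i}{suffix}" for i in range(start, stop + 1)]


def pair_shards(audio_pattern, manifest_pattern):
    """Pair each audio shard with its manifest shard, enforcing a 1:1 match."""
    audio = expand_range(audio_pattern)
    manifests = expand_range(manifest_pattern)
    if len(audio) != len(manifests):
        raise ValueError(
            f"{len(audio)} audio shards but {len(manifests)} manifest shards"
        )
    return list(zip(audio, manifests))


# Illustrative paths only; adjust to the actual dataset layout.
pairs = pair_shards(
    "/data/audio__OP_1..64_CL_.tar",
    "/data/sharded_manifests/manifest__OP_1..64_CL_.json",
)
```

A mismatch in shard counts raises `ValueError` early, which is cheaper than discovering a missing manifest shard after workers have started.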