sharded manifests docs (#6751)
Signed-off-by: Dima Rekesh <drekesh@nvidia.com>
Co-authored-by: Dima Rekesh <drekesh@nvidia.com>
bmwshop and Dima Rekesh authored May 29, 2023
1 parent 8b814bc commit c1abc04
Showing 1 changed file with 14 additions and 2 deletions.
16 changes: 14 additions & 2 deletions docs/source/asr/datasets.rst
@@ -216,7 +216,12 @@ of filepaths, e.g. ``['/data/shard1.tar', '/data/shard2.tar']``, or in a single
tag ``_CL_``. For SLURM based tasks, we suggest the use of the special tags for ease of use.

As with non-tarred datasets, the manifest file should be passed in ``manifest_filepath``. The dataloader assumes that the length
of the manifest after filtering is the correct size of the dataset for reporting training progress.

If the manifest is large, you may wish to reference sharded manifest files instead of a single manifest file. The naming convention
is identical to the audio tarballs and there should be a 1:1 relationship between a sharded audio tarfile and its manifest shard; e.g.
``'/data/sharded_manifests/manifest__OP_1..64_CL_'`` in the above example. Using sharded manifests improves job startup times and
decreases memory usage, as each worker only loads manifest shards for the corresponding audio shards instead of the entire manifest.
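
As a rough illustration of the naming convention described above, the following sketch expands a single ``_OP_a..b_CL_`` range tag into explicit paths and pairs each audio shard with its manifest shard. The helper ``expand_shard_pattern`` and the audio path ``/data/audio__OP_1..64_CL_.tar`` are hypothetical and used only for illustration; this is not the dataloader's own expansion logic, which may handle more cases.

```python
import re
from typing import List

# Matches a single numeric range tag such as "_OP_1..64_CL_".
_RANGE_TAG = re.compile(r"_OP_(\d+)\.\.(\d+)_CL_")


def expand_shard_pattern(pattern: str) -> List[str]:
    """Expand one ``_OP_a..b_CL_`` tag into explicit paths (hypothetical helper)."""
    match = _RANGE_TAG.search(pattern)
    if match is None:
        # No range tag found: treat the pattern as a single path.
        return [pattern]
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [f"{prefix}{index}{suffix}" for index in range(start, end + 1)]


# The audio path below is an assumed example, not taken from the docs.
audio_shards = expand_shard_pattern("/data/audio__OP_1..64_CL_.tar")
manifest_shards = expand_shard_pattern("/data/sharded_manifests/manifest__OP_1..64_CL_")

# The 1:1 relationship means both patterns must expand to the same number of shards.
assert len(audio_shards) == len(manifest_shards)
shard_pairs = list(zip(audio_shards, manifest_shards))
print(shard_pairs[0])
```

Checking that both patterns expand to the same number of entries before launching a job is a cheap way to catch a mismatched range.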

The ``tarred_shard_strategy`` field of the config file can be set if you have multiple shards and are running an experiment with
multiple workers. It defaults to ``scatter``, which preallocates a set of shards per worker that does not change during runtime.
@@ -266,13 +271,20 @@ The files in the target directory should look similar to the following:
├── audio_2.tar
├── ...
├── metadata.yaml
├── tarred_audio_manifest.json
├── sharded_manifests/
    ├── manifest_1.json
    ├── ...
    └── manifest_N.json

Note that file structures are flattened such that all audio files are at the top level in each tarball. This ensures that
filenames are unique in the tarred dataset and that the filepaths do not contain "-sub". Forward slashes in each ``audio_filepath``
are simply converted to underscores. For example, a manifest entry for ``/data/directory1/file.wav`` would be ``_data_directory1_file.wav``
in the tarred dataset manifest, and ``/data/directory2/file.wav`` would be converted to ``_data_directory2_file.wav``.
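
The slash-to-underscore conversion described above can be sketched in a few lines. The function name ``flatten_audio_filepath`` is hypothetical and covers only the slash replacement; the actual conversion script may apply further handling that is not shown here.

```python
def flatten_audio_filepath(audio_filepath: str) -> str:
    """Replace forward slashes with underscores (hypothetical helper)."""
    return audio_filepath.replace("/", "_")


# Two files with the same basename in different directories stay distinct.
first = flatten_audio_filepath("/data/directory1/file.wav")
second = flatten_audio_filepath("/data/directory2/file.wav")
print(first)
print(second)
```

Because the directory components are kept in the flattened name, files that share a basename do not collide at the top level of the tarball.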

Sharded manifests are generated by default; this behavior can be toggled via the ``no_shard_manifests`` flag.

Bucketing Datasets
------------------
