enabling diverse datasets in val / test #6306

bmwshop · 2023-03-27T23:04:23Z

What does this PR do ?

Allows specifying multiple datasets, each with its own config, in validation and test.

Collection: ALL

Changelog

When processing val / test dataset configs, if a dict is detected instead of a list, override the entire dataset config, not just the manifest path.
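
The behavior described in this changelog entry can be sketched roughly as follows. This is a minimal illustration on plain Python dicts, not the code from this PR: the function name `expand_ds_items` and its `key` parameter are hypothetical, and the real implementation works on the model's config objects rather than on dicts.

```python
import copy


def expand_ds_items(ds_cfg, key="ds_item"):
    """Return one standalone dataset config per entry of ds_cfg[key].

    Hypothetical helper for illustration only.
    """
    expanded = []
    for item in ds_cfg[key]:
        if isinstance(item, dict):
            # Nested entry: it replaces the entire dataset config.
            expanded.append(copy.deepcopy(item))
        else:
            # Plain entry (e.g. a manifest path): only this one key is swapped,
            # every other setting is inherited from the parent config.
            cfg = copy.deepcopy(ds_cfg)
            cfg[key] = item
            expanded.append(cfg)
    return expanded
```

With a list of plain paths, every resulting config shares the parent's settings; with a list of dicts, each entry carries its own `batch_size`, `max_duration`, and so on.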

Usage

Specify multiple, different dataset configs as shown below.

  validation_ds:
    ds_item:
    - name: en
      manifest_filepath:
      - ${d}/en/val_test/fisher/audio_manifest_dev_clean_en.json
      - ${d}/en/val_test/europarl/audio_manifest_dev_clean_en.json
      sample_rate: ${model.sample_rate}
      batch_size: 16
      shuffle: false
      num_workers: 8
      pin_memory: true
      max_duration: 20.0
      min_duration: 0.1
      use_start_end_token: false
    - name: es
      manifest_filepath:
      - ${d}/es/nemo_sp_asr_set_3pt0/dev/fisher/dev_fisher_manifest_es.json
      - ${d}/es/nemo_sp_asr_set_3pt0/dev/mcv12/dev_mcv12_manifest_es.json
      - ${d}/es/nemo_sp_asr_set_3pt0/dev/mls/dev_mls_manifest_es.json
      - ${d}/es/nemo_sp_asr_set_3pt0/dev/voxpopuli/dev_voxpopuli_manifest_es.json
      sample_rate: ${model.sample_rate}
      batch_size: 32
      shuffle: false
      num_workers: 8
      pin_memory: true
      max_duration: 10.0
      min_duration: 1.0
      use_start_end_token: false

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

[x] New Feature
[ ] Bugfix
[ ] Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

titu1994

Very neat way of handling this!
Two things:

- Why sample the dev and test sets? It will give non-deterministic results for comparison. How about disabling sampling in the default config example? This should be added to the PR's example.
- This capability needs to be fully documented in the ASR dataset / config docs.

Preferably, there should be a 2-3 model multilingual config with this commented into the dev/test section.

bmwshop · 2023-03-28T01:10:18Z

Actually, concat should give deterministic results when shuffle is turned off and the seed is set! The reason to enable concat / sampling is to save time; in the overall multilingual example the validation dataset is over 400 hours long. However, I understand the point and will provide a simpler example.

titu1994 · 2023-03-28T03:22:59Z

But is the subset of the dataset reproducible between two runs? I.e., is the seed set to the same value per dataset, so that the slice of samples is always exactly the same whether it is the first run or the Nth / final chained run?

bmwshop · 2023-04-01T16:11:31Z

Yes, the result is the same across multiple validation runs if the config does not change - when concat_sampling is set to random but concat_sampling_seed is set and concat_shuffle is set to False.

However, I have removed the concat flags from this example for simplicity and because this PR does not touch the concat dataset code. It just deals with how validation and test dataset configs are processed.
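
The reproducibility argument above rests on a general property of seeded random sampling, which can be illustrated with a small sketch. This is not the concat dataset code (which this PR does not touch); `sample_subset` is a hypothetical stand-in, and its `seed` argument only plays the role that `concat_sampling_seed` plays in the discussion.

```python
import random


def sample_subset(num_items, num_samples, seed):
    """Pick a fixed-size subset of item indices using a private, seeded RNG."""
    # A dedicated generator keeps the draw independent of the global random state,
    # so the same seed yields the same subset on every call and in every run.
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_items), num_samples))


first_run = sample_subset(num_items=1000, num_samples=10, seed=42)
later_run = sample_subset(num_items=1000, num_samples=10, seed=42)
```

Here `first_run` and `later_run` hold the same indices, which is the sense in which a seeded, unshuffled subset is comparable across validation runs.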


titu1994

Ok, looks fine then

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

orena1 · 2024-02-21T21:28:02Z

Thanks @bmwshop, is it feasible to only append the names and paths of the new datasets while maintaining uniform configurations for all datasets?

E.g:

  validation_ds:
    batch_size: 16
    shuffle: false
    num_workers: 8
    pin_memory: true
    max_duration: 20.0
    min_duration: 0.1
    use_start_end_token: false
    sample_rate: ${model.sample_rate}
    ds_item:
    - name: en
      manifest_filepath:
      - ${d}/en/val_test/fisher/audio_manifest_dev_clean_en.json
    - name: es
      manifest_filepath:
      - ${d}/es/nemo_sp_asr_set_3pt0/dev/fisher/dev_fisher_manifest_es.json
      - ${d}/es/nemo_sp_asr_set_3pt0/dev/mcv12/dev_mcv12_manifest_es.json
      - ${d}/es/nemo_sp_asr_set_3pt0/dev/mls/dev_mls_manifest_es.json
      - ${d}/es/nemo_sp_asr_set_3pt0/dev/voxpopuli/dev_voxpopuli_manifest_es.json
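
Since, per the changelog, a dict entry replaces the whole dataset config, one possible way to get the layout asked about here is to merge the shared keys into each entry before the config is processed. The following is a minimal sketch on plain dicts under that assumption; `with_shared_defaults` is hypothetical and not part of NeMo.

```python
import copy


def with_shared_defaults(ds_cfg, key="ds_item"):
    """Copy the top-level settings into every entry of ds_cfg[key].

    Hypothetical helper: values set inside an entry win over the shared ones.
    """
    shared = {k: v for k, v in ds_cfg.items() if k != key}
    merged_items = []
    for item in ds_cfg[key]:
        merged = copy.deepcopy(shared)
        merged.update(copy.deepcopy(item))  # per-dataset values take precedence
        merged_items.append(merged)
    return {key: merged_items}
```

The result has the same shape as the fully spelled-out example in the PR description, with the shared settings repeated in each entry.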

enabling nesting in val / test datasets

b0216fe

bmwshop requested a review from titu1994 March 27, 2023 23:04

titu1994 reviewed Mar 27, 2023


Merge branch 'main' into nest_valtest_ds

8832f52

titu1994 approved these changes Apr 1, 2023


bmwshop merged commit 93f9a93 into main Apr 1, 2023

bmwshop deleted the nest_valtest_ds branch April 1, 2023 17:56

bmwshop mentioned this pull request Apr 3, 2023

docs on the use of heterogeneous test / val manifests #6352

Merged


hsiehjackson pushed a commit to hsiehjackson/NeMo that referenced this pull request Jun 2, 2023

enabling heterogeneous val / test datasets (NVIDIA#6306)

15f2d25

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

orena1 mentioned this pull request May 30, 2024

Issue Resuming Training from Checkpoint with Small Validation Dataset #9317

Closed
