enabling diverse datasets in val / test #6306
Conversation
Very neat way of handling this!
Two things -
- Why sample the dev and test sets? It will give non-deterministic results for comparison. How about disabling sampling in the default config example? This should be added to the PR's example.
- This capability needs to be fully documented in the ASR dataset / config docs.
Preferably, there should be 2-3 multilingual model configs with this commented into the dev/test section.
Actually, concat should give deterministic results when shuffle is turned off and the seed is set! The reason to enable concat / sampling is to save time; in the overall multilingual example, the validation dataset is over 400 hours long. However, I understand the point and will provide a simpler example.
But is the subset of the dataset replicable between two runs? I.e., the seed is set the same per dataset, so the slice of samples is always exactly the same whether it is the first run or the Nth / final chained run?
Yes, the result is the same across multiple validation runs if the config does not change. However, I have removed the concat flags from this example for simplicity and because this PR does not touch the concat dataset code. It just deals with how validation and test dataset configs are processed.
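The determinism point above can be illustrated with a small sketch (not the NeMo implementation; `sample_subset` is a hypothetical helper): if the RNG is re-seeded with the same seed on every validation run, the sampled slice of the manifest is identical each time.

```python
import random

def sample_subset(manifest, seed=42, k=3):
    # Hypothetical helper: re-seeding the RNG per call means the
    # same k entries are drawn on every run, so sampled validation
    # results stay comparable as long as the config is unchanged.
    rng = random.Random(seed)
    return rng.sample(manifest, k)

manifest = [f"utt_{i}.wav" for i in range(10)]
first = sample_subset(manifest)
second = sample_subset(manifest)
assert first == second  # deterministic across runs
```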
Ok, looks fine then
Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>
Thanks @bmwshop, is it feasible to only append the names and paths of the new datasets while maintaining uniform configurations for all datasets? E.g.:

```yaml
validation_ds:
  batch_size: 16
  shuffle: false
  num_workers: 8
  pin_memory: true
  max_duration: 20.0
  min_duration: 0.1
  use_start_end_token: false
  sample_rate: ${model.sample_rate}
  ds_item:
    - name: en
      manifest_filepath:
        - ${d}/en/val_test/fisher/audio_manifest_dev_clean_en.json
    - name: es
      manifest_filepath:
        - ${d}/es/nemo_sp_asr_set_3pt0/dev/fisher/dev_fisher_manifest_es.json
        - ${d}/es/nemo_sp_asr_set_3pt0/dev/mcv12/dev_mcv12_manifest_es.json
        - ${d}/es/nemo_sp_asr_set_3pt0/dev/mls/dev_mls_manifest_es.json
        - ${d}/es/nemo_sp_asr_set_3pt0/dev/voxpopuli/dev_voxpopuli_manifest_es.json
```
What does this PR do ?
Allows specifying multiple datasets with different configs in validation and test.
ALL
Changelog
When processing val / test dataset configs, if a dict is detected instead of a list, overload the entire dataset config, not just the manifest path.
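The changelog rule above can be sketched as follows. This is a hypothetical illustration, not the actual NeMo code: `resolve_ds_configs` and its arguments are made-up names. The idea is that a plain entry (a path string) only overrides `manifest_filepath`, while a dict entry overloads the entire per-dataset config on top of the shared defaults.

```python
import copy

def resolve_ds_configs(base_cfg, ds_item):
    # Hypothetical sketch: expand a list of dataset items into full
    # per-dataset configs, starting from the shared base config.
    configs = []
    for item in ds_item:
        cfg = copy.deepcopy(base_cfg)
        if isinstance(item, dict):
            # dict detected: overload the entire dataset config,
            # not just the manifest path
            cfg.update(item)
        else:
            # legacy behavior: the item is just a manifest path
            cfg["manifest_filepath"] = item
        configs.append(cfg)
    return configs

base = {"batch_size": 16, "shuffle": False}
out = resolve_ds_configs(base, [{"name": "en", "batch_size": 8}, "dev_es.json"])
# out[0] uses batch_size 8 (dict overload); out[1] keeps the base
# config and only sets its manifest path.
```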
Usage
Specify multiple, different dataset configs as shown in the example above.
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.
Additional Information