Tar codec #7867
Conversation
```python
@staticmethod
def get_dataset(cfg):
    if 'is_sharded' in cfg.dataset:
        is_sharded = cfg.dataset.is_sharded
        del cfg.dataset.is_sharded
    else:
        is_sharded = False

    if is_sharded:
        cfg.dataset._target_ = 'nemo.collections.tts.data.vocoder_dataset.TarredVocoderDataset'
        dataset = instantiate(cfg.dataset)
        sampler = None
    else:
        dataset = instantiate(cfg.dataset)
        sampler = dataset.get_sampler(cfg.dataloader_params.batch_size)
    return dataset, sampler
```
Is this the same outcome as defining `get_sampler()` in `TarredVocoderDataset` so that it simply returns `None`?
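For illustration, a minimal sketch of that alternative (class and attribute names follow the diff above; the hydra import and the rest are assumptions):

```python
# Hypothetical sketch: if the tarred dataset itself reports "no sampler needed",
# the is_sharded branch in get_dataset() collapses to a single code path.
from hydra.utils import instantiate


class TarredVocoderDataset:
    def get_sampler(self, batch_size):
        # The webdataset pipeline handles shard assignment and shuffling
        # internally, so no DataLoader sampler is required.
        return None


def get_dataset(cfg):
    dataset = instantiate(cfg.dataset)  # resolves cfg.dataset._target_
    sampler = dataset.get_sampler(cfg.dataloader_params.batch_size)
    return dataset, sampler
```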
Good point, I will create a simpler case here. I created this function to set the target so that existing config files can be reused, and to provide a toggle option for sharding.
Given that these are experimental recipes that are undergoing breaking changes and have no public models, I think it should be OK to require config updates. The tarred dataset also requires different inputs to instantiate it.
I am against creating one new config for each feature.
Do you mean you are against creating separate config files for new features in general (e.g. sampling rate, new architectures), or specifically against needing a separate config for tarred vs. non-tarred datasets?
The former, creating lots of different overlapping configs, is expected across NeMo. The latter can be addressed in several different ways, if needed.
We don't want to create multiple configs with the same model architecture but different settings (like tarred/sampling_rate, etc.).
For the same model architecture with different configs, the main options I have heard before are:
1. Create different example configs.
2. Have one example config and provide web documentation and tutorials clearly indicating to users how to manually change fields for different valid configs (e.g. sample rate, language).

From past discussions around this, generally (1) was preferable, especially for changes like sample rate which are relatively complicated and error-prone if you attempt to update all of the needed parameters manually. A combination of (1) and (2) is needed if you have too many configs to support, such as when having to support multiple sample rates and languages at the same time.
In this particular case, there are two problems with the existing approach.
1. The model class takes the data loader class/config as input, but the constructor requires passing `global_rank` and `world_size`, which are known only to the model class and trainer.
2. Different data loaders require significantly different inputs to instantiate (and this will diverge more as we add additional sampling strategies).

Perhaps the cleanest way to address these would be to change it so the model is expected to receive a subset of dataset parameters, and then add a module which maps those parameters to the logic for creating each dataset.
For example:

In the `.yaml`:

```yaml
train_ds:
  dataset_type: "vocoder_tarred"
  dataset_args:
    batch_size: ${batch_size}
    ...
```

In `vocoder_dataset.py`:

```python
def create_dataset(dataset_type, dataset_args, global_rank, world_size, ...):
    if dataset_type == "vocoder":
        return VocoderDataset(**dataset_args)
    elif dataset_type == "vocoder_tarred":
        return TarredVocoderDataset(**dataset_args, global_rank=global_rank, world_size=world_size)
    ...
```
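A hedged sketch of how the model side might then consume such a factory (the `num_nodes * num_devices` computation appears later in this PR's diff; the method and attribute names here are otherwise illustrative):

```python
# Illustrative only: the model resolves rank/world size from the trainer and
# forwards the declared dataset_type/dataset_args to the factory above.
def setup_training_data(self, train_ds_cfg):
    world_size = 1
    global_rank = 0
    if self.trainer is not None:
        world_size = self.trainer.num_nodes * self.trainer.num_devices
        global_rank = self.trainer.global_rank

    return create_dataset(
        dataset_type=train_ds_cfg.dataset_type,
        dataset_args=train_ds_cfg.dataset_args,
        global_rank=global_rank,
        world_size=world_size,
    )
```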
Good catch, missed it. Updated to use those as well.
For logging, testing, and validation we rely only on non-tarred datasets.
```python
self.world_size = 1
if trainer is not None:
    self.world_size = trainer.num_nodes * trainer.num_devices
```
In general we need to figure out how to make the model class agnostic to the data loader, especially given that the logic will need to be kept consistent across all vocoder recipes (audio codec, hifigan, and univnet at least).
If we think upgrading the webdataset version is a blocker to streamlining the code, then we can document that with some TODOs and leave these workarounds. Otherwise, we should redesign the contract between the dataset and the model classes so that they work with both.
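One way to phrase that contract, as a purely illustrative Protocol sketch (not from the PR):

```python
# Illustrative contract: every vocoder dataset exposes the same hook, so model
# classes stay agnostic to tarred vs. non-tarred implementations.
from typing import Optional, Protocol

from torch.utils.data import Sampler


class VocoderDatasetContract(Protocol):
    def get_sampler(self, batch_size: int) -> Optional[Sampler]:
        """Return a sampler for map-style datasets, or None for webdataset pipelines."""
        ...
```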
```python
self._dataset.rename(audio=VALID_FILE_FORMATS, key='__key__')
.to_tuple('audio', 'key')
```
Nitpick: it is not very important for an audio-only dataset, but the code would be more readable and extensible using the original dictionary format instead of converting to a tuple.
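For instance, a hedged sketch of the dictionary-based variant (the identifiers come from the diff above; the shard paths and the file-format string are placeholders):

```python
# Keeping each sample as a dict means new keys can be added later without
# changing downstream unpacking. Pipeline construction only; decoding and
# collation are omitted here.
import webdataset as wd

VALID_FILE_FORMATS = "wav;flac"        # assumed value, for illustration
audio_tar_filepaths = ["audio_0.tar"]  # placeholder shard list

dataset = (
    wd.WebDataset(audio_tar_filepaths)
    .rename(audio=VALID_FILE_FORMATS, key="__key__")
    # no .to_tuple(...): downstream code reads sample["audio"], sample["key"]
)
```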
The benefit of replication is that it allows each node to sample data points from the entire
dataset independently of other nodes, and reduces dependence on value of `shuffle_n`.
If I am not mistaken, `shuffle_n` is the size of the shuffle buffer as the GPU reads in data from its shards. Allocating more shards to the GPU would not really reduce its dependence on `shuffle_n`. Intuitively, I would expect it to become more important (a larger dataset requires more shuffling).
Somewhat related, `wd.WebDataset()` has a `shardshuffle=True` parameter that ensures that the shard list itself is fully shuffled.
From the webdataset docs (https://webdataset.github.io/webdataset/gettingstarted/):
`shuffle(n)`: shuffle the dataset with a buffer of size n; also shuffles shards (see below)
Based on the thread linked below, it sounds like the comment is effectively saying "larger datasets are less sensitive to how they are shuffled".
Updated to have both buffer and initial arguments for more control over shuffling.
```python
self._dataset = wd.WebDataset(audio_tar_filepaths, nodesplitter=None)

if shuffle_n > 0:
    self._dataset = self._dataset.shuffle(shuffle_n)
```
Should we do something like `dataset.shuffle(size=shuffle_n, initial=shuffle_n)`? The initial buffer size of 100 is really small, and I saw it slow convergence a lot on my small test datasets (though it might be irrelevant on very large datasets).
If the set is small, then yes, the number of samples per worker becomes small and `shuffle` might not have an effect; it wouldn't be a problem for larger sets, since the number of samples per worker would be large. As per this discussion, it looks like if we want a true shuffle over all sets, it's better to `unbatch()` on the loader and shuffle with a larger number.
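A hedged sketch that combines these points (buffer sizes, worker counts, and shard paths are illustrative; only pipeline construction is shown, since collating variable-length audio would additionally need a custom collate function):

```python
# Shuffle the shard list, keep a larger in-memory sample buffer with an
# explicit initial fill, then re-shuffle across workers at the loader level.
import webdataset as wd

shuffle_n = 1000
batch_size = 16
audio_tar_filepaths = ["audio_0.tar"]  # placeholder shard list

dataset = (
    wd.WebDataset(audio_tar_filepaths, shardshuffle=True)
    .shuffle(shuffle_n, initial=shuffle_n)  # buffer size and initial fill
)

loader = wd.WebLoader(dataset, batch_size=batch_size, num_workers=4)
# unbatch -> shuffle -> rebatch mixes samples that came from different workers
loader = loader.unbatched().shuffle(shuffle_n).batched(batch_size)
```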
```diff
- sample_rate: int,
+ sample_rate: int = 24000,
```
Nitpick: other classes in the TTS collection either provide no default `sample_rate` or default to 22050.
The default for our audio codec models is 24000. Did we have a 22050 Hz model?
No, but this is a data loader for all TTS vocoders. We have never used 24 kHz in TTS before (and when I brought up the question before of whether we should do so to synchronize with SpeechLM models, the consensus was to stick to 22.05 kHz). So I suppose I would favor not having a default sample rate.
updated
LGTM. Thanks a lot for the feature and changes.
Add webdataset support for VocoderDataset
Collection: TTS
Usage: set `is_sharded=True` in the cfg for `train_ds` to load the webdataset.
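A minimal, hypothetical illustration of that toggle from Python (the config file name and the `model.train_ds.dataset` layout are assumptions for the example, not taken from this PR):

```python
# Hypothetical: flip the dataset to the tarred/webdataset path before training.
from omegaconf import OmegaConf

cfg = OmegaConf.load("audio_codec.yaml")      # assumed config file
cfg.model.train_ds.dataset.is_sharded = True  # select TarredVocoderDataset
cfg.model.train_ds.dataset.audio_tar_filepaths = "audio_{000000..000127}.tar"  # illustrative shards
```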