add core code of valle #4
Adding the comments can be left as a follow-up improvement.
Replace the complex name with an easy-to-understand one.
from utils.tokenizer import G2PModule, tokenize_text
from utils.symbol_table import SymbolTable
from text.g2p import preprocess_english, read_lexicon

"""
Extractor for content features
Provide comments for the extract_phoneme-related code.
@@ -539,3 +541,49 @@ def extract_utt_content_features_dataloader(cfg, metadata, num_workers):
)
for index, utt in enumerate(_metadata):
    extractor.save_feature(utt, batch_content_features[index])

if cfg.preprocess.extract_phoneme:
The current code entangles SVC and TTS unnecessarily. Move lines 545-589 into a new function.
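A rough sketch of the split, in case it helps. The shared content-feature function stays free of TTS logic, and the phoneme step lives in its own function that only the TTS path calls. Everything except `extract_utt_content_features_dataloader` and `cfg.preprocess.extract_phoneme` is hypothetical here: the function name `extract_utt_phone_sequence`, the `"Uid"`/`"Text"` metadata keys, and the `phonemize` callable are stand-ins, and the function bodies are placeholders rather than the PR's real logic.

```python
from types import SimpleNamespace


def extract_utt_content_features_dataloader(cfg, metadata, num_workers):
    """Shared content-feature path (body replaced by a placeholder)."""
    return [utt["Uid"] for utt in metadata]


def extract_utt_phone_sequence(cfg, metadata, phonemize):
    """Hypothetical TTS-only step: map each utterance's text to phonemes."""
    return {utt["Uid"]: phonemize(utt["Text"]) for utt in metadata}


def preprocess(cfg, metadata, num_workers=1, phonemize=str.split):
    """Caller decides whether the TTS-only step runs; the shared path does not."""
    results = {
        "content": extract_utt_content_features_dataloader(
            cfg, metadata, num_workers
        )
    }
    if cfg.preprocess.extract_phoneme:
        results["phone"] = extract_utt_phone_sequence(cfg, metadata, phonemize)
    return results


# Toy inputs; str.split stands in for a real grapheme-to-phoneme module.
cfg_tts = SimpleNamespace(preprocess=SimpleNamespace(extract_phoneme=True))
cfg_svc = SimpleNamespace(preprocess=SimpleNamespace(extract_phoneme=False))
metadata = [{"Uid": "u1", "Text": "hello world"}]

tts_out = preprocess(cfg_tts, metadata)
svc_out = preprocess(cfg_svc, metadata)
```

With this layout the SVC path never touches the phoneme code, so later TTS changes cannot affect it.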
02bff49 to d333659
print("args: ", args)

parser = build_parser()
VALLEInference.add_arguments(parser)
It looks like VALLEInference is used regardless of the model type.
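One possible way to avoid this, as a sketch only: parse the shared options first with `argparse`'s `parse_known_args`, then let the class selected by the model type register its own options. The `--model_type` flag, the registry dict, `OtherTTSInference`, and the `--top_k`/`--speed` options are assumptions made for illustration; the classes below are stand-ins, not the real inference classes.

```python
import argparse


class VALLEInference:
    """Stand-in for the real class; only the add_arguments hook is modelled."""

    @staticmethod
    def add_arguments(parser):
        parser.add_argument("--top_k", type=int, default=-100)  # hypothetical option


class OtherTTSInference:
    """Hypothetical second model, included to show the dispatch."""

    @staticmethod
    def add_arguments(parser):
        parser.add_argument("--speed", type=float, default=1.0)  # hypothetical option


# Hypothetical registry: model type -> inference class.
INFERENCE_CLASSES = {"VALLE": VALLEInference, "OtherTTS": OtherTTSInference}


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_type", choices=sorted(INFERENCE_CLASSES), required=True
    )
    parser.add_argument("--text", help="Text to be synthesized")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    # First pass: read only the shared options and ignore anything unknown.
    known, _ = parser.parse_known_args(argv)
    # Second pass: add the selected model's options, then parse strictly.
    INFERENCE_CLASSES[known.model_type].add_arguments(parser)
    return parser.parse_args(argv)


args = parse_args(["--model_type", "VALLE", "--text", "hi", "--top_k", "5"])
```

Options that belong to a different model are then rejected by the strict second pass instead of being silently accepted.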
if 'test' not in types:
    types.append('test')
if "eval" in dataset:
    types = ["test"]
These lines repeat lines 32-39.
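If both places need the same rule, one option is to pull it into a small helper and call it from each site. This is a sketch that only mirrors the four lines quoted above; the helper name `resolve_dataset_types` and the example dataset names are hypothetical.

```python
def resolve_dataset_types(dataset, types):
    """Return the dataset splits to process, using the rule from the diff.

    Always include "test"; datasets whose name contains "eval" use only "test".
    The input list is copied so the caller's list is not mutated.
    """
    types = list(types)
    if "test" not in types:
        types.append("test")
    if "eval" in dataset:
        types = ["test"]
    return types


train_types = resolve_dataset_types("libritts", ["train"])
eval_types = resolve_dataset_types("my_eval_set", ["train", "valid"])
```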
metadata = []
for dataset_type in types:
    dataset_output = os.path.join(output_path, dataset)
    # dataset_file = os.path.join(dataset_output, "{}.json".format(dataset_type))
    dataset_file = os.path.join(dataset_output, "{}.json".format(dataset_type))
Does this duplicate line 78?
@@ -77,12 +93,13 @@ def main():
 new_datasets_list.extend(filter(None, new_datasets))
 cfg.dataset.extend(new_datasets_list)

-# CUDA settings
+# # CUDA settings
No need to add one more '#'
We should provide demos/samples in a PR
Move all the TTS-related configs into a base config JSON for TTS.
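To illustrate the idea, here is a minimal sketch of a model config layered on top of a TTS base config. It assumes the config loader supports some form of base-config inheritance; the `merge_config` function, the dict layout, and the `model_type` key are hypothetical, and only the `preprocess.extract_phoneme` and `preprocess.extract_acoustic_token` keys come from the diff.

```python
def merge_config(base, override):
    """Recursively overlay `override` on `base` without mutating either."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical contents of a TTS base config shared by all TTS models.
tts_base = {
    "preprocess": {
        "extract_phoneme": True,
        "extract_acoustic_token": False,
    }
}

# Hypothetical VALL-E config: only the values that differ from the base.
valle = {
    "model_type": "VALLE",
    "preprocess": {"extract_acoustic_token": True},
}

cfg = merge_config(tts_base, valle)
```

Keeping the TTS defaults in one base file means the global base config no longer needs TTS-only keys, and each model config stays short.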
@@ -211,6 +212,11 @@ def __extract_utt_acoustic_features(dataset_output, cfg, utt):
label = audio_to_label(wav, cfg.preprocess.bits)
save_feature(dataset_output, cfg.preprocess.label_dir, uid, label)

if cfg.preprocess.extract_acoustic_token:
Do not modify __extract_utt_acoustic_features anymore. It is no longer a common extraction pipeline. See extract_utt_acoustic_features_tts (line 221) and extract_utt_acoustic_features_vocoder (line 233) for reference. Please move all of the TTS acoustic feature extraction into the function at line 221.
bins/tts/inference.py
Outdated
@@ -75,9 +73,9 @@ def build_parser():
 )
 parser.add_argument(
     "--text",
-    help="Text to be synthesized",
+    help="Text",
'Text to be synthesized' is more informative than 'Text'
Vall-E is a zero-shot TTS architecture that uses a neural codec language model with discrete codes. This PR adds support for Vall-E in Amphion.