[Hackathon 7th] Fix s2t example errors #3950

Merged: 2 commits, Dec 18, 2024

Conversation

megemini
Contributor

PR types

Bug fixes

PR changes

Others

Describe

Fix s2t example errors:

  • In paddlespeech/s2t/io/dataloader.py, when the train branch is taken, the config ends up missing many configuration items.
  • In paddlespeech/s2t/models/u2_st/u2_st.py, the model uses TransformerDecoder, whose forward returns 3 values, so *_ is used to discard everything after the first one. The reason for not writing decoder_out, _, _ = self.decode... is that TransformerDecoder's forward may previously have returned only 2 values (its earlier typing hint listed two return values; it is updated to three here), so *_ is used for compatibility (see the sketch after this list).
  • The input to paddlespeech/s2t/frontend/featurizer/text_featurizer.py may be a nested list, so a check for that case is added as well (also sketched below).
  • Fixed other issues found during testing.
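
A minimal sketch of the second and third fixes above; the helper names (old_decoder_forward, new_decoder_forward, flatten_tokens) and the toy data are illustrative only, not the actual PaddleSpeech code:

```python
# Sketch only: why `decoder_out, *_ = ...` is used instead of `decoder_out, _, _ = ...`.
# The decoder's forward used to be typed with 2 return values and now returns 3,
# so star-unpacking works with either shape.

def old_decoder_forward(x):
    return x, "olens"                       # two return values (old typing)

def new_decoder_forward(x):
    return x, "r_decoder_out", "olens"      # three return values (current)

for forward in (old_decoder_forward, new_decoder_forward):
    decoder_out, *_ = forward("decoder_out")   # works for both shapes
    assert decoder_out == "decoder_out"

# Sketch of the nested-list handling for the text featurizer: flatten
# sub-lists of tokens before converting them to ids.
def flatten_tokens(tokens):
    flat = []
    for item in tokens:
        if isinstance(item, list):
            flat.extend(flatten_tokens(item))
        else:
            flat.append(item)
    return flat

print(flatten_tokens(["hello", ["nested", "tokens"], "world"]))
# -> ['hello', 'nested', 'tokens', 'world']
```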

Testing looks fine so far. Logs:

aistudio@jupyter-942478-8657745:~/PaddleSpeech/examples/ted_en_zh/st0$ bash run.sh --stage 0 --stop_stage 0
checkpoint name transformer_mtl_noam
Creating manifest data/manifest ...
train Processed: 1000
train Processed: 2000
train Processed: 3000
train Processed: 4000
manifest prepare done!
Complete raw data pre-process.
/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:686: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
  warnings.warn(warning_message)
----------- compute_mean_std.py Configuration Arguments -----------
delta_delta: 0
feat_dim: 80
manifest_path: data/manifest.train.raw
num_samples: -1
num_workers: 24
output_path: data/mean_std.json
sample_rate: 16000
spectrum_type: fbank
stride_ms: 10
target_dB: -20
use_dB_normalization: 0
window_ms: 25
-----------------------------------------------------------
2024-12-12 21:34:00.167 | INFO     | paddlespeech.s2t.frontend.augmentor.augmentation:__init__:122 - Augmentation: []
/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:686: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
  warnings.warn(warning_message)
----------- build_vocab.py Configuration Arguments -----------
count_threshold: 0
manifest_paths: ['data/manifest.train.raw']
spm_character_coverage: 1.0
spm_mode: unigram
spm_model_prefix: data/lang_char/bpe_unigram_8000
spm_vocab_size: 8000
text_keys: ['text']
unit_type: spm
vocab_path: data/lang_char/vocab.txt
-----------------------------------------------------------
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: /tmp/tmpliv9c922
  input_format: 
  model_prefix: data/lang_char/bpe_unigram_8000
  model_type: UNIGRAM
  vocab_size: 8000
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 100000000
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy: 0
  differential_privacy_noise_level: 0
  differential_privacy_clipping_threshold: 0
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv: 
}
denormalizer_spec {}
trainer_interface.cc(353) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(185) LOG(INFO) Loading corpus: /tmp/tmpliv9c922
trainer_interface.cc(409) LOG(INFO) Loaded all 9996 sentences
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(430) LOG(INFO) Normalizing sentences...
trainer_interface.cc(539) LOG(INFO) all chars count=722011
trainer_interface.cc(560) LOG(INFO) Alphabet size=2614
trainer_interface.cc(561) LOG(INFO) Final character coverage=1
trainer_interface.cc(592) LOG(INFO) Done! preprocessed 9996 sentences.
unigram_model_trainer.cc(265) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(269) LOG(INFO) Extracting frequent sub strings... node_num=330954
unigram_model_trainer.cc(312) LOG(INFO) Initialized 28324 seed sentencepieces
trainer_interface.cc(598) LOG(INFO) Tokenizing input sentences with whitespace: 9996
trainer_interface.cc(609) LOG(INFO) Done! 18607
unigram_model_trainer.cc(602) LOG(INFO) Using 18607 sentences for EM training
unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=0 size=13035 obj=11.1349 num_tokens=35546 num_tokens/piece=2.72697
unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=1 size=11469 obj=9.40258 num_tokens=35695 num_tokens/piece=3.1123
unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=0 size=8799 obj=9.68791 num_tokens=39376 num_tokens/piece=4.47505
unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=1 size=8797 obj=9.63443 num_tokens=39452 num_tokens/piece=4.48471
trainer_interface.cc(687) LOG(INFO) Saving model: data/lang_char/bpe_unigram_8000.model
trainer_interface.cc(699) LOG(INFO) Saving vocabs: data/lang_char/bpe_unigram_8000.vocab
2024-12-12 21:35:45.976 | WARNING  | paddlespeech.s2t.frontend.featurizer.text_featurizer:__init__:58 - TextFeaturizer: not have vocab file or vocab list. Only Tokenizer can use, can not convert to token idx
/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:686: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
  warnings.warn(warning_message)
/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:686: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
  warnings.warn(warning_message)
/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:686: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
  warnings.warn(warning_message)
----------- format_data.py Configuration Arguments -----------
cmvn_path: data/mean_std.json
manifest_paths: ['data/manifest.train.raw']
output_path: data/manifest.train
spm_model_prefix: data/lang_char/bpe_unigram_8000
unit_type: spm
vocab_path: data/lang_char/vocab.txt
-----------------------------------------------------------
Feature dim: 80
Vocab size: 7953
----------- format_data.py Configuration Arguments -----------
cmvn_path: data/mean_std.json
manifest_paths: ['data/manifest.test.raw']
output_path: data/manifest.test
spm_model_prefix: data/lang_char/bpe_unigram_8000
unit_type: spm
vocab_path: data/lang_char/vocab.txt
-----------------------------------------------------------
Feature dim: 80
Vocab size: 7953
['data/manifest.test.raw'] Examples number: 0
----------- format_data.py Configuration Arguments -----------
cmvn_path: data/mean_std.json
manifest_paths: ['data/manifest.dev.raw']
output_path: data/manifest.dev
spm_model_prefix: data/lang_char/bpe_unigram_8000
unit_type: spm
vocab_path: data/lang_char/vocab.txt
-----------------------------------------------------------
Feature dim: 80
Vocab size: 7953
['data/manifest.dev.raw'] Examples number: 0
['data/manifest.train.raw'] Examples number: 4998
Ted En-Zh Data preparation done.


aistudio@jupyter-942478-8657745:~/PaddleSpeech/examples/ted_en_zh/st0$ CUDA_VISIBLE_DEVICES=0 ./local/train.sh conf/transformer_mtl_noam.yaml transformer_mtl_noam
...
2024-12-12 22:11:13.705 | INFO     | paddlespeech.s2t.exps.u2_st.model:valid:163 - Valid: Rank: 0, epoch: 1, step: 590, batch: 300/313, val_loss: 175.348162, val_att_loss: 151.819175, val_ctc_loss: 417.401444, val_history_st_loss: 175.311639
2024-12-12 22:11:15.551 | INFO     | paddlespeech.s2t.exps.u2_st.model:valid:165 - Rank 0 Val info st_val_loss 170.31878152974346
2024-12-12 22:11:15.553 | INFO     | paddlespeech.s2t.training.timer:__exit__:44 - Eval Time Cost: 0:02:21.742158
2024-12-12 22:11:15.553 | INFO     | paddlespeech.s2t.exps.u2_st.model:do_train:234 - Epoch 1 Val info val_loss 170.31878152974346
2024-12-12 22:11:16.311 | INFO     | paddlespeech.s2t.utils.checkpoint:_save_parameters:286 - Saved model to exp/transformer_mtl_noam/checkpoints/1.pdparams
2024-12-12 22:11:17.600 | INFO     | paddlespeech.s2t.utils.checkpoint:_save_parameters:292 - Saved optimzier state to exp/transformer_mtl_noam/checkpoints/1.pdopt
2024-12-12 22:11:19.459 | INFO     | paddlespeech.s2t.utils.checkpoint:_save_parameters:286 - Saved model to exp/transformer_mtl_noam/checkpoints/1.pdparams
2024-12-12 22:11:22.560 | INFO     | paddlespeech.s2t.utils.checkpoint:_save_parameters:292 - Saved optimzier state to exp/transformer_mtl_noam/checkpoints/1.pdopt
2024-12-12 22:11:22.563 | INFO     | paddlespeech.s2t.training.timer:__exit__:44 - Training Done: 0:10:24.706460
LAUNCH INFO 2024-12-12 22:11:25,656 Pod completed
LAUNCH INFO 2024-12-12 22:11:25,656 Exit code 0


aistudio@jupyter-942478-8657745:~/PaddleSpeech/examples/ted_en_zh/st0$ avg.sh best exp/transformer_mtl_noam/checkpoints 2
/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:686: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
  warnings.warn(warning_message)
Namespace(dst_model='exp/transformer_mtl_noam/checkpoints/avg_2.pdparams', ckpt_dir='exp/transformer_mtl_noam/checkpoints', val_best=True, num=2, min_epoch=0, max_epoch=65536)
selected val scores = [170.31878153 191.35575619]
selected epochs = [1 0]
averaged val score = 180.8372688606325
['exp/transformer_mtl_noam/checkpoints/1.pdparams', 'exp/transformer_mtl_noam/checkpoints/0.pdparams']
Processing exp/transformer_mtl_noam/checkpoints/1.pdparams
Processing exp/transformer_mtl_noam/checkpoints/0.pdparams
Saving to exp/transformer_mtl_noam/checkpoints/avg_2.pdparams


aistudio@jupyter-942478-8657745:~/PaddleSpeech/examples/ted_en_zh/st0$ CUDA_VISIBLE_DEVICES=0 ./local/test.sh conf/transformer_mtl_noam.yaml conf/tuning/decode.yaml exp/transformer_mtl_noam/checkpoints/avg_2
...
2024-12-12 22:38:38.203 | INFO     | paddlespeech.s2t.exps.u2_st.model:compute_translation_metrics:400 - Hyp: 
2024-12-12 22:38:38.204 | INFO     | paddlespeech.s2t.exps.u2_st.model:compute_translation_metrics:401 - One example BLEU = 0.0/0.0/0.0/0.0
2024-12-12 22:38:38.207 | INFO     | paddlespeech.s2t.exps.u2_st.model:test:441 - RTF: 0.000048, instance (78), batch BELU   = 0.000000
2024-12-12 22:38:39.578 | INFO     | paddlespeech.s2t.exps.u2_st.model:compute_translation_metrics:398 - Utt: 127247_0517890-0539637
2024-12-12 22:38:39.579 | INFO     | paddlespeech.s2t.exps.u2_st.model:compute_translation_metrics:399 - Ref: 科学 只能 暂时 改变 我们 自动 生成 的 假设 但是 我们 知道 如果 让 你 拿出 一张 照片 , 上面 是 一个 你 知道 的 、 可恶 的 白人 然后 你 把 这张 照片 贴 到 一个 有色人种 旁边 贴 到 一位 出色 的 黑人 旁边 有时候 这样 做 , 也 可以 帮助 我们 解除 脑内 自动 生成 的 联系
2024-12-12 22:38:39.579 | INFO     | paddlespeech.s2t.exps.u2_st.model:compute_translation_metrics:400 - Hyp: 
2024-12-12 22:38:39.580 | INFO     | paddlespeech.s2t.exps.u2_st.model:compute_translation_metrics:401 - One example BLEU = 0.0/0.0/0.0/0.0
2024-12-12 22:38:39.583 | INFO     | paddlespeech.s2t.exps.u2_st.model:test:441 - RTF: 0.000048, instance (79), batch BELU   = 0.000000
^C2024-12-12 22:38:40.270 | INFO     | paddlespeech.s2t.training.timer:__exit__:44 - Test/Decode Done: 0:01:44.328190

@zxcd @Liyulingyue @GreatV @enkilee @yinfan98


paddle-bot bot commented Dec 12, 2024

Thanks for your contribution!

@@ -404,6 +404,12 @@ def get_dataloader(mode: str, config, args):
config['subsampling_factor'] = 1
config['num_encs'] = 1
config['shortest_first'] = False
config['minibatches'] = 0
Collaborator

load the params from config?

Contributor Author

Could you elaborate?
The config here is already cloned, and these keys simply aren't in it to begin with, so where would they be loaded from? It would be best to have default values ~

Collaborator

OK.
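
As a side note on the "default values" point discussed above, a minimal sketch of filling defaults without overwriting keys that already came from the YAML config; this assumes config behaves like a plain dict, and the setdefault approach is only an illustration, not what the PR does:

```python
# Illustration only: provide defaults for keys that may be missing from the
# cloned config, without touching values that were loaded from YAML.
DEFAULTS = {
    "subsampling_factor": 1,
    "num_encs": 1,
    "shortest_first": False,
    "minibatches": 0,
}

def apply_defaults(config: dict) -> dict:
    for key, value in DEFAULTS.items():
        config.setdefault(key, value)   # keep existing values, fill gaps
    return config

config = {"num_encs": 2}                # pretend this came from the YAML file
apply_defaults(config)
print(config)                           # num_encs stays 2, the rest are filled in
```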

ys_in_lens: paddle.Tensor,
r_ys_in_pad: paddle.Tensor=paddle.empty([0]),
reverse_weight: float=0.0) -> Tuple[paddle.Tensor, paddle.Tensor]:
def forward(self,
Collaborator

only code style changed?

Contributor Author

typing hint for the output changed from

-> Tuple[paddle.Tensor, paddle.Tensor]:

to

-> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:

Collaborator

Got it.
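
For reference, a sketch of the updated annotation discussed in this thread; the parameters before ys_in_lens are assumed here, and only the return annotation is the point:

```python
from typing import Tuple

import paddle

class DecoderSketch:
    # Before: ... -> Tuple[paddle.Tensor, paddle.Tensor]
    # After:  ... -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]
    def forward(self,
                memory: paddle.Tensor,
                memory_mask: paddle.Tensor,
                ys_in_pad: paddle.Tensor,
                ys_in_lens: paddle.Tensor,
                r_ys_in_pad: paddle.Tensor=paddle.empty([0]),
                reverse_weight: float=0.0
                ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
        # In the real model this returns three tensors; callers can unpack
        # with `decoder_out, *_ = ...` to stay compatible with both shapes.
        raise NotImplementedError
```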

@megemini requested a review from zxcd on December 16, 2024, 13:46
Collaborator

@zxcd left a comment

LGTM

@zxcd merged commit b4c2f3b into PaddlePaddle:develop on Dec 18, 2024
5 checks passed