A popular text-based VC approach is to use an automatic speech recognition (ASR) model to extract phonetic posteriorgrams (PPGs) as the content representation.
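As a rough sketch of what a PPG is (my own illustration, not taken from any specific paper): the ASR acoustic model emits per-frame logits over phoneme classes, and the PPG is simply the per-frame softmax of those logits. The `logits` array below is stand-in data rather than real ASR output.

```python
import numpy as np

def logits_to_ppg(logits):
    """Convert frame-level ASR logits of shape (T, n_phones) into a PPG:
    each row becomes a posterior distribution over phoneme classes."""
    # Subtract the per-frame max before exponentiating, for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Stand-in logits: 4 frames, 5 phoneme classes.
logits = np.random.randn(4, 5)
ppg = logits_to_ppg(logits)
print(ppg.shape)        # (4, 5)
print(ppg.sum(axis=1))  # each frame sums to 1
```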
Typical text-free approaches include information bottlenecks, vector quantization, instance normalization, etc.
However, text-free approaches generally lag behind text-based ones. This can be attributed to the fact that the content representations they extract are more prone to leaking source-speaker information.
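To make the leakage argument concrete, here is a minimal sketch (my own illustration, not AutoVC's actual architecture) of a temporal information bottleneck: downsampling the content code along time forces fine-grained, often speaker-specific, detail to be discarded, since the decoder only sees a repeated coarse code.

```python
import numpy as np

def bottleneck(content, factor=8):
    """Illustrative temporal bottleneck: keep only every `factor`-th frame
    of the content code, then restore the original length by repetition.
    The discarded frame-level detail cannot be recovered by the decoder."""
    kept = content[::factor]
    restored = np.repeat(kept, factor, axis=0)[:len(content)]
    return kept, restored

content = np.random.randn(64, 16)  # (frames, dims), stand-in content code
kept, restored = bottleneck(content)
print(kept.shape)  # (8, 16): only 1/8 of the frames survive the bottleneck
```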
The pitch extraction mainly follows SpeechSplit's procedure (need to check that model's training time), which requires male/female gender information; the Mel extractor and vocoder follow AutoVC.
The proposed model outperforms all the baseline models in terms of speech naturalness, and performs comparably to VQMIVC in terms of speaker similarity.
In FastSpeech VC, the duration predictor and the length regulator are removed from the original FastSpeech network, so that the input PPG sequence and the output LPCNet feature sequence have the same length.
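Why equal lengths are achievable: if the PPG extractor and the LPCNet feature extractor use the same hop size, they emit the same number of frames for any waveform. A tiny sanity check under assumed parameters (16 kHz audio, 10 ms hop, center-padded framing):

```python
def num_frames(n_samples, hop_size):
    """Frame count for a feature extractor emitting one frame per hop
    (center-padded convention: one extra frame at the end)."""
    return n_samples // hop_size + 1

fs = 16000              # assumed sampling rate
hop = int(0.010 * fs)   # 10 ms hop -> 160 samples
n = fs * 2              # a 2-second utterance
# Same hop on both sides -> PPG frame count == vocoder feature frame count.
print(num_frames(n, hop))  # 201
```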
1 Basic Concepts
1.1 By whether speech features are disentangled (e.g., timbre / content / pitch / rhythm)
Feature Disentangle
Direct Transformation
1.2 By the number of source and target speakers a single VC system can support
one-to-one
many-to-one
many-to-many
any-to-many
any-to-any (one-shot / free shot VC)
FreeVC distinguishes between text-based VC and text-free VC
2 References
3 Feature Disentangle
3.1 Text-Free Voice Conversion
VITS-VC
[27 Oct 2022] FreeVC
[2 Jun 2021] NVC-Net
[Interspeech 2022 18 Aug 2022] SRD-VC
[ICASSP 2022] SpeechSplit2
[PMLR 2020] SpeechSplit
[31 Mar 2022] DYGANVC
[29 Sep 2021] ClsVC
[Interspeech 2021] VQMIVC
cascaded ASR+TTS
[Interspeech 2020] SkipVQVC
(VQVC+)
[31 Oct 2020] AGAIN-VC
[27 Oct 2020] FragmentVC
[ICML 2019 14 May 2019] AutoVC
[Interspeech 2019 10 Apr 2019] AdaIN-VC
3.2 Text-based Voice Conversion (PPG-based VC)
[12 Oct 2021] S3PRL-VC
[TASLP 2021] BNE-PPG-VC
[ICASSP 2021 3 Feb 2021] FastSpeech VC
Q:
1. Was the PPG extractor trained on a Chinese or an English dataset, and is it phoneme-level or character-level?
2. Is a log-F0 feature used, and which tool extracts it?
3. Do the acoustic parameters for PPG extraction (fs, hop_size, window_size) match the TTS acoustic parameters, and is any length mapping applied (since the output mel sequence must match the PPG length)?
A:
1. For English scenarios an English ASR is used, at the phoneme level. In practice, using the model's intermediate hidden states works even better.
2. Pitch is used, extracted with pyworld; log was apparently not taken, though it should have been.
3. Yes, they match: the ASR and TTS share the same hop_size, so the frame counts are identical (10 ms or 12 ms, I forget which). If they differed, a simple interpolation would align the lengths; fs and window_size matter little, what matters is the hop duration.
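Answers 2 and 3 above can be sketched as follows. This is my own illustration: the pyworld F0 track is emulated by a stand-in array (pyworld marks unvoiced frames with 0, so the log must be guarded), and length alignment is plain per-dimension linear interpolation.

```python
import numpy as np

def log_f0(f0):
    """Take log of F0, leaving unvoiced frames (f0 == 0) at 0.
    pyworld marks unvoiced frames with 0, so the log must be guarded."""
    out = np.zeros_like(f0)
    voiced = f0 > 0
    out[voiced] = np.log(f0[voiced])
    return out

def align_length(feat, target_len):
    """Linearly interpolate a (T, D) feature sequence to target_len frames,
    e.g. to match the mel length to the PPG length when hop sizes differ."""
    src = np.linspace(0.0, 1.0, num=len(feat))
    dst = np.linspace(0.0, 1.0, num=target_len)
    return np.stack([np.interp(dst, src, feat[:, d])
                     for d in range(feat.shape[1])], axis=1)

f0 = np.array([0.0, 200.0, 220.0, 0.0])  # stand-in pyworld F0 track
print(log_f0(f0))

mel = np.random.randn(80, 3)             # stand-in (T=80, D=3) features
print(align_length(mel, 100).shape)       # (100, 3)
```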
[Interspeech 2020 16 Oct 2020] Tacotron2 VC
PaddleSpeech/paddlespeech/s2t/models/u2/u2.py, line 737 (commit 96d76c8)
PaddleSpeech/paddlespeech/s2t/models/u2/u2.py, line 379 (commit 96d76c8)
4 Direct Transformation
[Interspeech 2021 Best Paper Award] StarGANv2-VC
5 Listening Impressions from the Audio Demos
Impressions from the SRD-VC demo:
SRDVC > ClsVC (clearer, but the intonation does not match the source) > VQMIVC > SkipVQVC (slurred articulation) > AutoVC (hoarse) > AdaIN-VC (strained voicing)
VQMIVC is the closest in timbre similarity.
5.1 One-Shot
Impressions from the AGAIN-VC demo:
Not very natural.
Impressions from the DYGANVC demo:
DYGAN-VC sounds good > cascaded ASR+TTS
Impressions from the AutoVC demo:
AutoVC > StarGAN VC
Comparison of LIMIVC and FragmentVC:
LIMIVC > FragmentVC (hoarse)
良杰's assessment:
SRDVC > LIMIVC > FragmentVC
Reproduce the best among SRD-VC, DYGANVC, StarGANv2-VC, VQMIVC, and LIMIVC.
5.2 Text-based
Impressions from the BNE-PPG-VC demo:
The any-to-many timbre is not very similar, and the any-to-any audio quality is worse than any-to-many.
Impressions from the FastSpeech VC demo:
For same-language conversion, FastSpeech VC's timbre sounds slightly closer than BNE-PPG-VC's.
VITS-VC -> this is GlowTTS's capability
PR:
#2268
Related details:
TODO
Choose the best among FreeVC, FastSpeech VC, and VQMIVC.