
Cross-Language Ability of Different Training Sets

RVC-Boss edited this page Oct 23, 2024 · 1 revision


I. Cross-Language Definition


1. The reference audio and reference text are in Language A; the text to be synthesized is in Language B, where A != B.

2. The training set is in Language A; the text to be synthesized is in Language B, where A != B.


For now, we do not consider reference audio whose voice is outside the training set, so definitions 1 and 2 are treated as equivalent.


The base model supports five languages. Can a model still cross languages after it has been fine-tuned?


If the fine-tuning training set is relatively small (e.g., 1-30 minutes): training on any single Language A still allows inference on text in all five languages, because the base model already has cross-language capability and a small fine-tuning set does not override it.


If the fine-tuning training set is relatively large, it may overwrite the base model's cross-language capability. For example:


If the training set includes Languages A, B, and C, the voice (and its corresponding reference audio) can cross languages among A, B, and C, but it loses the ability to cross into the remaining languages (D and E).


If the training set includes only Language A, the fine-tuned model has no cross-language capability.
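
The rules above can be written down as a small lookup, which may help when planning a fine-tuning set. This is only an illustrative sketch, not part of GPT-SoVITS: the function name `synthesizable_languages`, the placeholder language labels `A` to `E`, and the 30-minute cutoff are all assumptions. The page gives 1-30 minutes as an example of a "small" set but states no exact threshold, and it says a large set *may* overwrite the base model's ability, so the sketch models the pessimistic case.

```python
from typing import FrozenSet, Iterable

# Placeholder labels for the five languages of the base model.
# This page does not name them, so the labels are purely illustrative.
BASE_LANGUAGES: FrozenSet[str] = frozenset({"A", "B", "C", "D", "E"})

# Assumption: sets up to 30 minutes count as "small". The page only gives
# "1-30 minutes" as an example and defines no exact cutoff.
SMALL_SET_MAX_MINUTES = 30.0


def synthesizable_languages(
    training_languages: Iterable[str],
    training_minutes: float,
    small_set_max_minutes: float = SMALL_SET_MAX_MINUTES,
) -> FrozenSet[str]:
    """Return the languages a fine-tuned voice is expected to synthesize.

    Small set: the base model's cross-language ability is kept, so all
    five languages remain available.
    Large set: pessimistic case, only the languages present in the
    training set remain available.
    """
    languages = frozenset(training_languages)
    if not languages:
        raise ValueError("the training set must contain at least one language")
    unknown = languages - BASE_LANGUAGES
    if unknown:
        raise ValueError(f"not supported by the base model: {sorted(unknown)}")

    if training_minutes <= small_set_max_minutes:
        return BASE_LANGUAGES
    return languages


if __name__ == "__main__":
    # 10 minutes of Language A only: all five languages stay available.
    print(sorted(synthesizable_languages({"A"}, 10)))
    # 10 hours covering A, B, and C: D and E are assumed lost.
    print(sorted(synthesizable_languages({"A", "B", "C"}, 600)))
    # 10 hours of Language A only: no cross-language ability.
    print(sorted(synthesizable_languages({"A"}, 600)))
```

In practice the boundary between "small" and "large" is gradual rather than a hard cutoff, so the threshold parameter is best treated as something to tune by listening tests.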