How to evaluate the performance of Voice Conversion systems #44637
Replies: 11 comments
-
In voice conversion (VC), it is highly desirable to obtain transformed speech signals that are perceptually close to a target speaker's voice. To this end, the proposed VC scheme adopts a perceptually meaningful criterion that takes the human auditory system into account when measuring the distance between the converted and the target voices. The conversion rules for the spectral-envelope features and the pitch modification factor were constructed jointly so that this perceptual distance was minimized. The minimization problem was solved using a deep neural network (DNN) framework in which input features were derived from source speech signals and target features from a time-aligned version of the target speech signals. Validation tests were carried out on the CMU ARCTIC database to evaluate the effectiveness of the proposed method, especially in terms of perceptual quality. The experimental results showed that the proposed method yielded perceptually preferred results compared with independent conversion using the conventional mean-square error (MSE) criterion. The maximum improvement in perceptual evaluation of speech quality (PESQ) over the conventional VC method was 0.312.
-
One proposal is a new performance evaluation measure for assessing the capacity of voice conversion systems to modify the speech of one speaker (source) so that it sounds as if it was uttered by another speaker (target). This measure relies on a GMM-UBM-based likelihood estimator that estimates the degree of proximity between an utterance of the converted voice and the predefined models of the source and target voices. Existing objective evaluation metrics for voice conversion (VC) do not always correlate well with human perception, so training VC models with such criteria may not effectively improve the naturalness and similarity of converted speech. A second line of work therefore proposes deep-learning-based assessment models that predict human ratings of converted speech, adopting convolutional and recurrent neural network models to build a mean opinion score (MOS) predictor termed MOSNet. Also, in my project, changing the power spectrum and the genes brought the result to the desired level, and a random CNN changed the sound style.
-
Emotional Voice Conversion, or emotional VC, is a technique of
-
I have used it.
-
They introduced a new performance evaluation measure for assessing the capacity of voice conversion systems to modify the speech of one speaker (source) so that it sounds as if it was uttered by another speaker (target). This measure relies on a GMM-UBM-based likelihood estimator that estimates the degree of proximity between an utterance of the converted voice and the predefined models of the source and target voices. The approach yields an objective criterion that can be used both to evaluate the merit of a single system and to compare (benchmark) different voice conversion systems directly. To illustrate the functionality and practical usefulness of the measure, the authors contrast it with four well-known objective evaluation criteria.
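As a rough illustration of the likelihood-based idea described above, here is a minimal sketch in plain Python. It assumes the source and target speaker models are already available as diagonal-covariance GMMs (the UBM training and adaptation steps are not shown), and it uses a simple difference of average log-likelihoods as the proximity score. The function names, the `(weight, mean, var)` component layout, and the toy numbers are all hypothetical and are not taken from the paper.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_avg_loglik(frames, gmm):
    """Average per-frame log-likelihood of frames under a GMM.

    gmm is a list of (weight, mean, var) components (assumed layout).
    """
    total = 0.0
    for x in frames:
        comp = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        peak = max(comp)  # log-sum-exp trick for numerical stability
        total += peak + math.log(sum(math.exp(c - peak) for c in comp))
    return total / len(frames)

def proximity_score(converted_frames, target_gmm, source_gmm):
    """Positive when the converted frames fit the target model better
    than the source model (assumed scoring rule, for illustration only)."""
    return (gmm_avg_loglik(converted_frames, target_gmm)
            - gmm_avg_loglik(converted_frames, source_gmm))

# Toy 2-D speaker models and a few "converted" feature frames.
source_gmm = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 0.0], [1.0, 1.0])]
target_gmm = [(0.5, [4.0, 4.0], [1.0, 1.0]), (0.5, [5.0, 4.0], [1.0, 1.0])]
converted = [[4.2, 3.9], [4.8, 4.1], [4.5, 4.0]]

score = proximity_score(converted, target_gmm, source_gmm)
print("proximity score:", score)
```

With real data, the frames would be spectral features (for example cepstral coefficients) extracted from the converted utterance, and the same score could be computed for several systems to rank them.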
-
In the completed project, a new one-shot VC approach is proposed that can perform VC with only one example utterance each from the source and target speakers, and neither speaker needs to have been seen during training. The algorithm works by separating speaker and content representations with instance normalization (IN). Objective and subjective evaluations show that the proposed model is capable of producing a voice similar to the target speaker. In addition to the performance measurements, we have also shown that the model can learn meaningful speaker representations without any supervision.
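To make the instance normalization step more concrete, here is a minimal sketch of the operation itself in plain Python. It is not the model from the project: it only shows how normalizing each feature channel over time removes the per-utterance mean and scale (which tend to carry global, speaker-like information) and leaves a normalized sequence behind. The function name, the channels-by-time layout, and the toy values are assumptions made for illustration.

```python
import math

def instance_norm(features, eps=1e-5):
    """Normalize each channel of a single utterance over time.

    features: list of channels, each a list of values over time frames.
    Returns (normalized, means, stds); means and stds are the per-channel
    statistics that the normalization removes.
    """
    normalized, means, stds = [], [], []
    for channel in features:
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        std = math.sqrt(var + eps)
        normalized.append([(v - mean) / std for v in channel])
        means.append(mean)
        stds.append(std)
    return normalized, means, stds

# Two toy "utterances" with the same temporal pattern but different offsets.
utt_a = [[1.0, 2.0, 3.0], [10.0, 10.5, 11.0]]
utt_b = [[101.0, 102.0, 103.0], [20.0, 20.5, 21.0]]

norm_a, mean_a, std_a = instance_norm(utt_a)
norm_b, mean_b, std_b = instance_norm(utt_b)

print("removed means:", mean_a, mean_b)
print("normalized channel 0:", norm_a[0], norm_b[0])
```

In this toy case the two normalized sequences coincide even though the removed means differ, which is the intuition behind using IN to strip global statistics from a content representation.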
-
We use convolutional and recurrent neural network models to build a mean opinion score (MOS) predictor, called MOSNet.
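For context on what such a predictor is compared against, here is a small sketch (plain Python, not MOSNet itself) that computes a MOS with an approximate 95% confidence interval from listener ratings, then checks how well a set of predicted scores tracks the human MOS using Pearson correlation. The system names, the ratings, and the predicted scores are made-up placeholder values, and the normal-approximation interval is an assumption that is only rough for small listener counts.

```python
import math

def mean_opinion_score(ratings):
    """Return (MOS, half-width of a normal-approximation 95% CI)."""
    n = len(ratings)
    mos = sum(ratings) / n
    var = sum((r - mos) ** 2 for r in ratings) / (n - 1)  # sample variance
    return mos, 1.96 * math.sqrt(var / n)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Placeholder listener ratings (1 to 5) per system, and placeholder
# scores from some automatic predictor for the same systems.
human_ratings = {
    "system_a": [4, 5, 4, 4, 3],
    "system_b": [2, 3, 2, 3, 2],
    "system_c": [3, 4, 3, 3, 4],
}
predicted = {"system_a": 3.9, "system_b": 2.5, "system_c": 3.2}

systems = sorted(human_ratings)
human_mos = [mean_opinion_score(human_ratings[s])[0] for s in systems]
r = pearson(human_mos, [predicted[s] for s in systems])

for s in systems:
    mos, ci = mean_opinion_score(human_ratings[s])
    print(f"{s}: MOS = {mos:.2f} +/- {ci:.2f}, predicted = {predicted[s]:.2f}")
print(f"system-level Pearson correlation: {r:.3f}")
```

The same correlation check can be applied to any objective metric to see how closely it follows listener judgments on a given set of systems.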
-
Voice conversion systems can be evaluated along several dimensions, such as speaker similarity, quality, and intelligibility. In your opinion, which objective or subjective tests reliably correspond to human judgments? What method, if any, have you used in your repository to evaluate a voice conversion system?