Hey, firstly - thank you very much for sharing your work, it really is interesting.
I have a few questions regarding the implementation of the paper "AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss":
In section 3.1 "Problem Formulation", it is explained (and showed in figure 1) that the output from the speaker encoder (input was target speaker utterance) is fed directly into the decoder (after the bottleneck).
In the code, on the other hand, it seems that the speaker encoder output is actually concatenated with the mel spectrogram and fed into the content encoder, rather than only being injected after the bottleneck (see the sketch below).
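To make sure I'm reading this right, here is a minimal sketch of the two wirings as I understand them. The tensor names, shapes, and the commented-out `content_encoder` calls are my own assumptions for illustration, not your actual code:

```python
import torch

# Hypothetical shapes: mel (batch, T, n_mels), spk_emb (batch, d_spk).
mel = torch.randn(2, 128, 80)
spk_emb = torch.randn(2, 256)

# Paper / Figure 1 as I read it: the speaker embedding joins only AFTER the
# bottleneck, i.e. it is concatenated with the content codes at the decoder input:
#   codes      = content_encoder(mel)
#   decoder_in = torch.cat([codes, spk_emb_broadcast], dim=-1)

# Code as I read it: the speaker embedding is also concatenated with the mel
# spectrogram BEFORE the content encoder:
spk_broadcast = spk_emb.unsqueeze(1).expand(-1, mel.size(1), -1)   # (2, 128, 256)
encoder_in = torch.cat([mel, spk_broadcast], dim=-1)               # (2, 128, 80 + 256)
#   codes = content_encoder(encoder_in)
```

Is that early concatenation intentional, and does it change the disentanglement argument from the paper?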
Also, Figure 1 shows that during training the style embedding is taken from the same speaker but from a different utterance/segment. Is that implemented in the code too? It didn't seem like it, but I might be missing something.
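For reference, this is roughly the sampling I would have expected if the style input came from a different utterance of the same speaker; the data layout and function name here are purely illustrative:

```python
import random

# Hypothetical layout: utterances_by_speaker maps a speaker id to a list of
# mel-spectrogram file paths for that speaker.
def sample_training_pair(utterances_by_speaker):
    speaker = random.choice(list(utterances_by_speaker))
    utts = utterances_by_speaker[speaker]
    content_utt = random.choice(utts)
    # Style / speaker-encoder input drawn from a DIFFERENT utterance of the SAME speaker.
    other = [u for u in utts if u != content_utt]
    style_utt = random.choice(other) if other else content_utt
    return content_utt, style_utt
```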
In Table 1 (page 8), you present speaker classification results on the output of the content encoder. Is there a way I can reproduce those results? (Could you share that part of the code as well?)
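In case it helps clarify what I mean by reproducing: I imagine the experiment as a small probe classifier trained to predict the speaker from the content codes, something like the sketch below (all dimensions and names are assumed, since I don't know your setup):

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: code_dim for the content codes, n_speakers for the probe.
code_dim, n_speakers = 64, 20

# A simple linear probe predicting speaker identity from content codes;
# near-chance accuracy would indicate the codes carry little speaker information.
probe = nn.Linear(code_dim, n_speakers)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def probe_step(codes, speaker_ids):
    """One training step; codes: (batch, code_dim), speaker_ids: (batch,) long tensor."""
    logits = probe(codes)
    loss = criterion(logits, speaker_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Is that close to what was done for Table 1, or did you use a different classifier?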
Thanks very much,
Aaron