Hey, firstly - thank you very much for sharing your work, it really is interesting.
I have a few questions regarding the implementation of the paper "AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss":
In section 3.1 "Problem Formulation", it is explained (and showed in figure 1) that the output from the speaker encoder (input was target speaker utterance) is fed directly into the decoder (after the bottleneck).
In the code, on the other hand, it seems that the speaker encoder output is actually concatenated with the mel spectrogram and fed into the content encoder, rather than only being injected after the bottleneck (see the sketch below).
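To make sure I'm reading this right, here is a minimal sketch of the two wirings as I understand them. The tensor names, shapes, and the commented-out `content_encoder` calls are my own assumptions for illustration, not your actual code:

```python
import torch

# Hypothetical shapes: mel (batch, T, n_mels), spk_emb (batch, d_spk).
mel = torch.randn(2, 128, 80)
spk_emb = torch.randn(2, 256)

# Paper / Figure 1 as I read it: the speaker embedding joins only AFTER the
# bottleneck, i.e. it is concatenated with the content codes at the decoder input:
#   codes      = content_encoder(mel)
#   decoder_in = torch.cat([codes, spk_emb_broadcast], dim=-1)

# Code as I read it: the speaker embedding is also concatenated with the mel
# spectrogram BEFORE the content encoder:
spk_broadcast = spk_emb.unsqueeze(1).expand(-1, mel.size(1), -1)   # (2, 128, 256)
encoder_in = torch.cat([mel, spk_broadcast], dim=-1)               # (2, 128, 80 + 256)
#   codes = content_encoder(encoder_in)
```

Is that early concatenation intentional, and does it change the disentanglement argument from the paper?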
Also, Figure 1 shows that during training the style embedding is taken from the same speaker but from a different utterance/segment. Is that implemented in the code too? It didn't seem like it, but I might be missing something.
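For reference, this is roughly the sampling I would have expected if the style input came from a different utterance of the same speaker; the data layout and function name here are purely illustrative:

```python
import random

# Hypothetical layout: utterances_by_speaker maps a speaker id to a list of
# mel-spectrogram file paths for that speaker.
def sample_training_pair(utterances_by_speaker):
    speaker = random.choice(list(utterances_by_speaker))
    utts = utterances_by_speaker[speaker]
    content_utt = random.choice(utts)
    # Style / speaker-encoder input drawn from a DIFFERENT utterance of the SAME speaker.
    other = [u for u in utts if u != content_utt]
    style_utt = random.choice(other) if other else content_utt
    return content_utt, style_utt
```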
In Table 1 (page 8), you present speaker classification results on the output of the content encoder. Is there a way I can reproduce those results? (Could you share that part of the code as well?)
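In case it helps clarify what I mean by reproducing: I imagine the experiment as a small probe classifier trained to predict the speaker from the content codes, something like the sketch below (all dimensions and names are assumed, since I don't know your setup):

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: code_dim for the content codes, n_speakers for the probe.
code_dim, n_speakers = 64, 20

# A simple linear probe predicting speaker identity from content codes;
# near-chance accuracy would indicate the codes carry little speaker information.
probe = nn.Linear(code_dim, n_speakers)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def probe_step(codes, speaker_ids):
    """One training step; codes: (batch, code_dim), speaker_ids: (batch,) long tensor."""
    logits = probe(codes)
    loss = criterion(logits, speaker_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Is that close to what was done for Table 1, or did you use a different classifier?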
Thanks very much,
Aaron