Why not adopt BERT like MaskGIT to reconstruct tokens? #52
In fact we started from MaskGIT's BERT architecture, but we found that both linear probing and unconditional generation performance were poor (57.4% accuracy, 20.7 FID). We then found that adopting an encoder-decoder architecture similar to MAE largely improves performance. My assumption is that such an encoder-decoder design is better for representation learning, and that a good representation in turn also helps with generation.
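The encoder-decoder design mentioned here can be sketched as follows. This is a minimal numpy sketch under assumptions, not the authors' implementation: single linear layers stand in for the Transformer encoder and decoder, and all shapes and variable names are illustrative. The key point it shows is that, unlike a BERT that processes mask tokens alongside visible ones, the encoder here sees only the visible tokens, and mask tokens are introduced only at the decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                   # toy embedding width
tokens = rng.random((16, D))            # 16 token embeddings
mask = rng.random(16) < 0.75            # mask ~75% of positions

# Encoder (a toy linear layer here) sees ONLY the visible tokens,
# as in MAE -- this is what differs from a BERT-style model.
enc_w = rng.random((D, D))
encoded = tokens[~mask] @ enc_w

# Decoder receives the encoded visible tokens plus a shared learnable
# mask token at every masked position, and reconstructs all positions.
mask_token = rng.random(D)
dec_in = np.empty((16, D))
dec_in[~mask] = encoded
dec_in[mask] = mask_token
dec_w = rng.random((D, D))
reconstructed = dec_in @ dec_w          # predictions for all 16 positions
```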
Thanks for your reply! But I have another question. In the second stage, would better results be obtained if the masked images were used as input to reconstruct the tokens? Table 4 of your paper shows that training from scratch on pixels leads to better performance.
We must use image tokens as both input and output to enable image generation, because image generation takes multiple steps. In the middle of generation, only part of the tokens have been generated, and these partial tokens cannot be decoded into an image. If we only consider representation learning, using masked images as input and tokens as output is similar to BEiT.
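The multi-step generation the author describes can be sketched as follows. This is a toy sketch under assumptions: the real model predicts token logits with a Transformer, while here random proposals and random confidences stand in for it, and all names and parameters are illustrative. It shows the control flow that motivates the answer: at intermediate steps some positions are still masked, so the sequence cannot yet be decoded into an image.

```python
import numpy as np

def iterative_decode(seq_len=16, vocab=8, steps=4, seed=0):
    """Toy MaskGIT-style parallel decoding loop (random stand-in model)."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, -1)              # -1 marks a masked position
    for step in range(steps):
        masked = np.where(tokens == -1)[0]
        if masked.size == 0:
            break
        # stand-in predictor: random token proposals and confidences
        proposals = rng.integers(0, vocab, size=masked.size)
        conf = rng.random(masked.size)
        # unmask only the most confident positions this step; the rest
        # remain masked until a later step
        k = int(np.ceil(masked.size / (steps - step)))
        idx = np.argsort(conf)[-k:]
        tokens[masked[idx]] = proposals[idx]
    return tokens
```

After the final step every position holds a real token id, and only then can the sequence be passed to the tokenizer's decoder to produce pixels.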
I got it! Thanks!
Dear author,
Thanks for sharing the code. I am greatly interested in your work. I have a question for you and would like your reply.
In the second stage, you adopt an encoder-decoder Transformer to reconstruct tokens. Why not directly adopt the bidirectional Transformer used in MaskGIT? I would like to know what the advantages of the encoder-decoder Transformer are.
Waiting for your reply!