Why not adopt bert like MaskGIT to reconstruct Tokens? #52

Open
tanbuzheng opened this issue Jan 27, 2024 · 4 comments

tanbuzheng commented Jan 27, 2024

Dear author,
Thanks for sharing the code. I am greatly interested in your work and have a question I'd like to ask.
In the second stage, you adopt an encoder-decoder Transformer to reconstruct tokens. Why not directly adopt the bidirectional Transformer used in MaskGIT? In other words, what are the advantages of the encoder-decoder Transformer?

Looking forward to your reply!

LTH14 (Owner) commented Jan 27, 2024

In fact, we started from MaskGIT's BERT architecture, but we found that both linear probing and unconditional generation performance were poor (57.4% accuracy, 20.7 FID). We then found that adopting an encoder-decoder architecture similar to MAE largely improves performance. My assumption is that the encoder-decoder design is better for representation learning, and a good representation in turn also helps generation.
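As a concrete illustration, here is a rough PyTorch sketch of the difference (made-up depths, dimensions, and names, not the actual MAGE code): the BERT/MaskGIT-style model feeds the full sequence, [MASK] tokens included, through one bidirectional Transformer, whereas the MAE-style design encodes only the visible tokens and re-inserts [MASK] embeddings before a shallow decoder.

```python
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, DIM = 1024, 256, 512


class BertStyle(nn.Module):
    """MaskGIT-style: one bidirectional Transformer sees the full sequence,
    with masked positions replaced by a learnable [MASK] embedding."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, DIM)           # last id = [MASK]
        self.pos = nn.Parameter(torch.zeros(1, SEQ_LEN, DIM))
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=8)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens, mask):                         # mask: True = masked
        x = tokens.masked_fill(mask, VOCAB)                  # substitute [MASK] id
        x = self.embed(x) + self.pos
        return self.head(self.blocks(x))                     # logits for every position


class EncDecStyle(nn.Module):
    """MAE-style: the encoder only sees visible tokens; [MASK] embeddings are
    inserted afterwards and a shallower decoder reconstructs every position."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.pos = nn.Parameter(torch.zeros(1, SEQ_LEN, DIM))
        enc = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        dec = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=6)
        self.decoder = nn.TransformerEncoder(dec, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, DIM))
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens, mask):
        b = tokens.shape[0]
        x = self.embed(tokens) + self.pos
        # Encoder: drop masked positions entirely (assumes the same number of
        # visible tokens per sample, i.e. a fixed masking ratio).
        visible = self.encoder(x[~mask].reshape(b, -1, DIM))
        # Decoder: scatter the encoded tokens back, fill the holes with [MASK].
        full = self.mask_token.expand(b, SEQ_LEN, DIM).clone()
        full[~mask] = visible.reshape(-1, DIM)
        return self.head(self.decoder(full + self.pos))


# Example: mask exactly 75% of the positions in a batch of two token maps.
tokens = torch.randint(0, VOCAB, (2, SEQ_LEN))
ranks = torch.rand(2, SEQ_LEN).argsort(dim=1).argsort(dim=1)
mask = ranks < int(0.75 * SEQ_LEN)
logits = EncDecStyle()(tokens, mask)                          # (2, 256, 1024)
```

The practical difference is that the encoder in the second design never attends to [MASK] tokens, so its features have to come from the visible content alone, which is consistent with the better linear-probing result mentioned above.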

tanbuzheng (Author) commented

Thanks for your reply! But I have another question. In the second stage, would better results be obtained if masked images were used as the input to reconstruct the tokens? Table 4 of your paper shows that operating on pixels leads to better performance.

LTH14 (Owner) commented Jan 29, 2024

We must use image tokens as both the input and the output to enable image generation, because generation takes multiple steps: in the middle of generation, only part of the tokens have been generated, and such a partially generated sequence cannot be decoded into an image. If we only considered representation learning, using masked images as the input and tokens as the output would be similar to BEiT.
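To spell out why the model has to map tokens to tokens, here is a hypothetical sketch of MaskGIT-style parallel iterative decoding (the mask id, confidence rule, and cosine schedule are simplified and not taken from this repo; `model` is any callable mapping a token-id sequence to per-position logits). Until the final step, part of the sequence is still [MASK], so there is nothing a VQGAN decoder could turn into pixels yet.

```python
import math
import torch


@torch.no_grad()
def iterative_generate(model, seq_len=256, mask_id=1024, steps=8, device="cpu"):
    # Start from a fully masked sequence of token ids.
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        logits = model(tokens)                        # tokens in -> token logits out
        conf, pred = logits.softmax(-1).max(-1)       # per-position best guess
        still_masked = tokens.eq(mask_id)
        conf = conf.masked_fill(~still_masked, -1.0)  # never overwrite decided spots

        # Cosine schedule: how many positions should remain masked after this step.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        n_fill = max(int(still_masked.sum()) - keep_masked, 0)

        # Commit the n_fill most confident predictions; the rest stay [MASK].
        idx = conf.topk(n_fill, dim=-1).indices
        tokens.scatter_(1, idx, pred.gather(1, idx))
    # Only after the last step does every position hold a real token id,
    # so a VQGAN decoder could map the sequence back to pixels.
    return tokens
```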

tanbuzheng (Author) commented

I got it! Thanks!
