
Query regarding model capabilities #1

Open
vishaal27 opened this issue May 23, 2022 · 1 comment

@vishaal27
Hey!

I just finished reading your paper -- amazing work, and the results look awesome!

I had one query regarding the model's capabilities. As I understand it, at inference time you retrieve the most similar images from the cached index and attend over their image features as keys and values. I wanted to know whether the model could be repurposed for prompt-specific image captioning of a given image. For example, given an image of an elephant standing near a lake next to a tree, could the model be prompted with something like "Describe the background of the image" or "In the distance, we can see" to output a caption that describes only the background of the image (the lake and tree) rather than the foreground (the elephant)?

Since your model is trained auto-regressively, it seems to me that this should be feasible. Please let me know your thoughts!
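To make the idea concrete, here is a rough sketch of the decoding loop I have in mind. Everything below is an assumption for illustration: the `model`/`tokenizer` call signatures, the `visual_feats` keyword, and `image_index.topk_features` are hypothetical names, not your actual API.

```python
import torch

def prompted_caption(model, tokenizer, image_index, query_image, prompt,
                     max_new_tokens=30):
    # Retrieve the k most similar cached images; their features act as
    # cross-attention keys/values, mirroring the retrieval step described above.
    image_feats = image_index.topk_features(query_image, k=4)  # (k, d)

    # Seed the autoregressive decoder with the textual prompt,
    # e.g. "In the distance, we can see".
    ids = tokenizer.encode(prompt, return_tensors="pt")

    for _ in range(max_new_tokens):
        logits = model(input_ids=ids, visual_feats=image_feats)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy decoding
        if next_id.item() == tokenizer.eos_token_id:
            break
        ids = torch.cat([ids, next_id], dim=-1)

    return tokenizer.decode(ids[0])
```

If the prompt tokens condition the decoder like any other prefix, the continuation should stay focused on whatever the prompt asks about.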

@Victorwz
Owner

Thanks for the great comments and ideas! We are currently working on adapting VaLM to vision-language tasks, especially image captioning and VQA. We will add more experimental results in a later version of VaLM. Thank you again for brainstorming with us!
