Hey!

I just finished reading your paper -- amazing work, and the results look awesome!

I had one question about the model's capabilities. As I understand it, at inference time you retrieve the most similar images from the cached index and attend over their image features as keys and values. Could the model be repurposed for prompt-specific captioning of a given image? For example, given an image of an elephant standing near a lake next to a tree, could it be prompted with something like "Describe the background of the image" or "In the distance, we can see" so that the generated caption describes only the background (the lake and the tree) rather than the foreground elephant?

Since the model is trained autoregressively, this seems feasible to me. Please let me know your thoughts!
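To make the idea concrete, here is a rough sketch of what I have in mind. All names here (`load_valm_model`, `encode_image_with_clip`, `visual_kv`, etc.) are placeholders rather than your actual API; the point is just that the cached-index retrieval step is bypassed, the given image's features are fed directly as the keys/values for the visual attention, and the LM then continues a steering prompt:

```python
# Minimal sketch (not VaLM's real API): feed ONE target image's features to the
# visual attention instead of retrieved ones, then continue a steering prompt.

import torch

# Hypothetical helpers -- placeholder names, not actual VaLM functions.
from valm import load_valm_model, encode_image_with_clip, tokenize, detokenize

model = load_valm_model("valm-base")                        # placeholder checkpoint loader
image_feats = encode_image_with_clip("elephant_lake.jpg")   # [num_patches, dim]

# Steering prompt: the continuation should describe the background only.
generated = tokenize("In the distance, we can see")         # 1-D tensor of token ids

with torch.no_grad():
    for _ in range(30):  # greedy decoding, 30 new tokens
        # Pass the target image's features directly as the keys/values of the
        # visual attention, skipping the cached-index retrieval.
        logits = model(generated.unsqueeze(0), visual_kv=image_feats)
        next_id = logits[0, -1].argmax()
        generated = torch.cat([generated, next_id.unsqueeze(0)])

print(detokenize(generated))  # hopefully a caption focused on the lake and tree
```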
Thanks for the great comments and ideas! We are currently working on adapting VaLM to vision-language tasks, especially image captioning and VQA, and we will add more experimental results in a later version of VaLM. Thank you again for brainstorming with us!