Hey!

I just finished reading your paper -- amazing work, and the results look awesome!

I had one question about the model's capabilities. As I understand it, at inference time you retrieve the most similar images from the cached index and attend over their image features as keys and values. Could the model be repurposed for prompt-specific captioning of a given image? For example, given an image of an elephant standing near a lake next to a tree, could it be prompted with something like "Describe the background of the image" or "In the distance, we can see" so that the generated caption describes only the background (the lake and the tree) rather than the foreground elephant?

Since the model is trained autoregressively, this seems feasible to me. Please let me know your thoughts!
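To make the idea concrete, here is a rough sketch of what I have in mind. All names here (`load_valm_model`, `encode_image_with_clip`, `visual_kv`, etc.) are placeholders rather than your actual API; the point is just that the cached-index retrieval step is bypassed, the given image's features are fed directly as the keys/values for the visual attention, and the LM then continues a steering prompt:

```python
# Minimal sketch (not VaLM's real API): feed ONE target image's features to the
# visual attention instead of retrieved ones, then continue a steering prompt.

import torch

# Hypothetical helpers -- placeholder names, not actual VaLM functions.
from valm import load_valm_model, encode_image_with_clip, tokenize, detokenize

model = load_valm_model("valm-base")                        # placeholder checkpoint loader
image_feats = encode_image_with_clip("elephant_lake.jpg")   # [num_patches, dim]

# Steering prompt: the continuation should describe the background only.
generated = tokenize("In the distance, we can see")         # 1-D tensor of token ids

with torch.no_grad():
    for _ in range(30):  # greedy decoding, 30 new tokens
        # Pass the target image's features directly as the keys/values of the
        # visual attention, skipping the cached-index retrieval.
        logits = model(generated.unsqueeze(0), visual_kv=image_feats)
        next_id = logits[0, -1].argmax()
        generated = torch.cat([generated, next_id.unsqueeze(0)])

print(detokenize(generated))  # hopefully a caption focused on the lake and tree
```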
Thanks for the great comments and ideas! We are currently working on adapting VaLM to vision-language tasks, especially image captioning and VQA, and we will add more experimental results in a later version of VaLM. Thank you again for brainstorming with us!