Multi-Image or Multi-Video Inference Example #97
Hello, I think in the VILA code the images are embedded specifically at the locations of the <image> tokens; see VILA/llava/model/llava_arch.py, lines 371 to 391 (commit 0085724).
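In rough terms, that section does something like the following (a simplified sketch, not the actual VILA code; the IMAGE_TOKEN_INDEX value and the tensor shapes here are only illustrative):

```python
import torch

IMAGE_TOKEN_INDEX = -200  # illustrative placeholder id for the <image> token

def splice_image_features(input_ids, image_features, embed_tokens):
    # input_ids: LongTensor of shape (seq_len,) with IMAGE_TOKEN_INDEX at every
    # <image> position; image_features: list of (n_patches, hidden) tensors,
    # one per <image>; embed_tokens: the LLM's token embedding layer.
    positions = (input_ids == IMAGE_TOKEN_INDEX).nonzero(as_tuple=True)[0].tolist()
    pieces, prev = [], 0
    for i, pos in enumerate(positions):
        pieces.append(embed_tokens(input_ids[prev:pos]))  # text chunk before this image
        pieces.append(image_features[i])                  # visual tokens spliced in place
        prev = pos + 1
    pieces.append(embed_tokens(input_ids[prev:]))         # trailing text after the last image
    return torch.cat(pieces, dim=0)                       # (new_seq_len, hidden)
```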
Thank you @DtYXs for the clarification about the embedding locations in llava_arch.py.
Hello, and thanks for such a great contribution to the field of interleaved LMMs! This is really great work. I was wondering if there is an example of the format for multi-image or multi-video inference (similar to what is shown in the in-context learning examples)? Does it involve appending multiple <image> tokens at the specified locations? And are the images and videos then inserted sequentially?

From my understanding of the run_vila.py script, the way to build an ICL input for images (and the corresponding structure for videos, of course) would be as follows:
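Something along these lines (a sketch of the prompt I have in mind; the file names are placeholders and the exact newline handling is my own guess rather than something taken from the repo):

```python
# Sketch of an interleaved in-context prompt: one <image> placeholder per image,
# with the images supplied in the same order the placeholders appear.
icl_prompt = (
    "<image>\n This is a photo of a golden retriever.\n"
    "<image>\n This is a photo of a tabby cat.\n"
    "<image>\n This is a photo of"
)
image_files = ["dog.png", "cat.png", "query.png"]  # hypothetical paths, one per <image>
```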
However, I am not sure whether the positions of the <image> tokens are considered by the model during generation, because looking at llava_llama.py, the method for preparing the multimodal inputs is inherited from LLaVA, which I believe just concatenates the image features and does not embed them specifically at the locations of the <image> tokens.

I may have missed something, as I am still new to the codebase and exploring the model more deeply. I would appreciate any clarification on the point about multi-image and multi-video inputs. Thanks!
Edit: After looking more deeply, it seems to me (at least) that the way I have formatted the prompt (with '\n' included) aligns with your code. However, I see in your paper that the image tokens are enumerated.
Edit 2:
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results. Setting pad_token_id to eos_token_id:128001 for open-end generation.
I think the pad token is fine, as it is automatically set to the eos_token. But what about the mask? I see no mention of it when I try to evaluate on datasets like SEEDBench. I do seem to get uncharacteristically low accuracy on these benchmarks, and I am trying to find out why.
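For what it is worth, this is how I would try passing the mask explicitly (a minimal sketch assuming a standard Hugging Face tokenizer and generate() call, with tokenizer, model, and prompt standing in for whatever the eval script actually uses):

```python
# Minimal sketch: pass attention_mask and pad_token_id explicitly to generate()
# so the warning disappears and padded batches are masked correctly.
inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)
output_ids = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,  # explicit rather than relying on the default
    max_new_tokens=64,
)
```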