Seems video token isn't used in the model during video inference #87

Closed
Virtualexistence opened this issue Jul 9, 2024 · 2 comments


Virtualexistence commented Jul 9, 2024

Great work! I just have a question about using the `<video>` token during video inference.
The defined constants don't seem to include any video token, only an image token:

```python
DEFAULT_IMAGE_TOKEN = "<image>"
```

Am I missing something, or does the model treat `<video>` as plain text context while the `<image>` token is automatically added to the query N times, i.e. once per sampled frame?

```python
if IMAGE_PLACEHOLDER in qs:
```

```bash
python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/VILA1.5-3b \
    --conv-mode vicuna_v1 \
    --query "<video>\n Please describe this video." \
    --video-file "demo.mp4"
```
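In other words, the behavior I'd expect looks roughly like the minimal sketch below. The names `expand_video_token` and `DEFAULT_VIDEO_TOKEN` are hypothetical, not part of the released constants; only `DEFAULT_IMAGE_TOKEN` is quoted from the code above:

```python
import re

DEFAULT_IMAGE_TOKEN = "<image>"  # defined in llava/constants.py, as quoted above
DEFAULT_VIDEO_TOKEN = "<video>"  # hypothetical: no such constant exists in the release

def expand_video_token(query: str, num_frames: int) -> str:
    """Hypothetical helper: rewrite a literal <video> placeholder into one
    <image> token per sampled frame, so the model sees N image slots."""
    frame_tokens = "\n".join([DEFAULT_IMAGE_TOKEN] * num_frames)
    return re.sub(re.escape(DEFAULT_VIDEO_TOKEN), frame_tokens, query)

# Example: a 6-frame clip turns <video> into six <image> tokens.
print(expand_video_token("<video>\n Please describe this video.", num_frames=6))
```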

Lyken17 (Collaborator) commented Jul 10, 2024

You are right, the current open source version does not handle the `<video>` token.

Virtualexistence (Author) commented

Aw, it's still a great repository. Thank you for the reply!

gheinrich pushed a commit to gheinrich/VILA that referenced this issue Dec 16, 2024
VILA Benchmark using GPT-4 for evaluation