Seems video token isn't used in the model during video inference #87

Closed
Virtualexistence opened this issue Jul 9, 2024 · 2 comments


Virtualexistence commented Jul 9, 2024

Great work! I just have a question about using the `<video>` token during video inference.
The defined constants don't seem to include any video token, only an image token:

```python
DEFAULT_IMAGE_TOKEN = "<image>"
```

Am I missing something, or does the model treat `<video>` as plain text context while the `<image>` token is automatically added to the query N times, i.e. once per sampled frame?

```python
if IMAGE_PLACEHOLDER in qs:
```

```bash
python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/VILA1.5-3b \
    --conv-mode vicuna_v1 \
    --query "<video>\n Please describe this video." \
    --video-file "demo.mp4"
```
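In other words, the behavior I'd expect looks roughly like the minimal sketch below. The names `expand_video_token` and `DEFAULT_VIDEO_TOKEN` are hypothetical, not part of the released constants; only `DEFAULT_IMAGE_TOKEN` is quoted from the code above:

```python
import re

DEFAULT_IMAGE_TOKEN = "<image>"  # defined in llava/constants.py, as quoted above
DEFAULT_VIDEO_TOKEN = "<video>"  # hypothetical: no such constant exists in the release

def expand_video_token(query: str, num_frames: int) -> str:
    """Hypothetical helper: rewrite a literal <video> placeholder into one
    <image> token per sampled frame, so the model sees N image slots."""
    frame_tokens = "\n".join([DEFAULT_IMAGE_TOKEN] * num_frames)
    return re.sub(re.escape(DEFAULT_VIDEO_TOKEN), frame_tokens, query)

# Example: a 6-frame clip turns <video> into six <image> tokens.
print(expand_video_token("<video>\n Please describe this video.", num_frames=6))
```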

Lyken17 (Collaborator) commented Jul 10, 2024

You are right, the current open source version does not handle the `<video>` token.

Virtualexistence (Author) commented

Aw, it's still a great repository. Thank you for the reply!

gheinrich pushed a commit to gheinrich/VILA that referenced this issue Dec 16, 2024
VILA Benchmark using GPT-4 for evaluation