Cannot reproduce the results on MSVD-QA and TGIF-QA #197

Open

Jingchensun opened this issue Nov 7, 2024 · 0 comments

Jingchensun commented Nov 7, 2024


First, thank you for the amazing work.

I am using the checkpoint LanguageBind/Video-LLaVA-7B with do_sample=False and temperature=0.0 set in run_inference_video_qa.py. Inference on the MSVD-QA dataset ran on 4× A6000 GPUs and took about one hour. However, when I evaluated the predictions with GPT-3.5 (default setting), I only reached an accuracy of 36.27% and a score of 2.87, significantly lower than the results reported in the paper. Similarly, on the TGIF-QA dataset I got an accuracy of only 19.6% and a score of 2.4. For other evaluation tasks, such as VQA (e.g., VQAv2, GQA), my results matched those reported in the paper exactly.

Could the authors comment on the evaluation setup for the video-QA task? And is there an alternative to GPT-based evaluation, e.g. something like the exact-match sketch after the generation snippet below?

import torch

# Greedy decoding: with do_sample=False, generation is deterministic,
# so temperature=0.0 has no effect on the output.
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=[video_tensor],
        do_sample=False,           # greedy search, no sampling
        temperature=0.0,           # ignored when do_sample=False
        max_new_tokens=1024,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )
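For reference, here is the kind of GPT-free check I have in mind: a minimal exact-match accuracy sketch. The file path and the "pred"/"answer" field names are assumptions for illustration, not the actual output schema of run_inference_video_qa.py.

import json

# Minimal sketch of a GPT-free metric: count a prediction as correct if
# the ground-truth answer appears as a substring of the model output.
# NOTE: the JSONL path and the "pred"/"answer" keys are hypothetical.
def exact_match_accuracy(pred_path):
    with open(pred_path) as f:
        records = [json.loads(line) for line in f]  # one JSON object per line
    correct = sum(
        r["answer"].strip().lower() in r["pred"].strip().lower()
        for r in records
    )
    return correct / len(records)

print(f"Accuracy: {exact_match_accuracy('msvd_qa_preds.jsonl'):.2%}")

A string match like this is stricter and noisier than GPT scoring (it misses paraphrases), so I would expect it to underestimate accuracy, but it is deterministic and free.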