ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding

This repository contains code and data for "ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding" (Hasegawa et al., arXiv 2024).

News

  • 2024/10/29: 401 QA pairs are now available.

Overview

ProMQA is an evaluation QA dataset for multimodal procedural activity understanding.

[Figure: Dataset overview]

Given a recipe (text), a recording (video), and a question (text), the task is to predict an answer (text).

[Figure: Task formulation]
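Concretely, each example pairs one recipe and one recording with a question about how the recorded activity relates to the recipe. As a rough illustration only (the field names below are hypothetical, not the released schema), an instance can be thought of as:

# Hypothetical sketch of a single ProMQA instance; the actual field names in
# the released QA file may differ.
example = {
    "recipe": "1. Preheat the oven to 180C. 2. Mix flour and sugar. ...",  # procedural text
    "video": "recordings/<recording_id>.mp4",                              # cooking recording
    "question": "Which step did the person skip?",                         # free-form question (text)
    "answer": "They never preheated the oven.",                            # free-form answer (text)
}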

Environment Setup

Virtual environment

conda create -y -n promqa python=3.11
conda activate promqa
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
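To confirm the CUDA-enabled PyTorch install is working, a quick sanity check (assuming a GPU machine):

import torch

print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True if the CUDA 12.1 build can see a GPU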

Download video data

  1. Follow CaptainCook4D/downloader to download recordings.
    • e.g., python download_gopro_data.py --resolution4K --output_dir <dirpath_original>
  2. Segment the original recordings & sample frames (a rough sketch of the frame-sampling step follows the command):
bash script/preprocess/video.sh \
    <dirpath_original> \
    <dirpath_preprocessed>
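The frame-sampling part of this step roughly amounts to uniformly sampling frames from each recording segment. The sketch below is an illustration only; the actual logic lives in script/preprocess/video.sh and the code it calls, and the output layout under <dirpath_preprocessed>/sampled-frames/ may differ.

import cv2
from pathlib import Path

def sample_frames(video_path: str, out_dir: str, num_frames: int = 10) -> None:
    """Uniformly sample `num_frames` frames from a video and save them as JPEGs."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for i in range(num_frames):
        # Seek to an evenly spaced frame index, then decode and save that frame.
        idx = int(i * total / num_frames)
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            cv2.imwrite(str(Path(out_dir) / f"frame_{i:03d}.jpg"), frame)
    cap.release()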

Benchmarking

Prediction

  1. Set an API key for each provider you plan to use, e.g., export OPENAI_API_KEY=<your_key>
  2. Run multimodal models from OpenAI, Google, and Anthropic (a hedged sketch of one such API call follows the command):
bash script/benchmark/predict.sh \
    <dirpath_sampled_frames> \  # e.g., <dirpath_preprocessed>/sampled-frames/<resolution>/
    <model_id> \  # e.g., gpt-4o-2024-08-06
    <num_frames>  # e.g., 10
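For intuition, a single prediction request to an OpenAI model might look like the sketch below: the recipe and question go in as text, and the sampled frames are attached as base64-encoded images. This is an assumption about the general shape of the call; the actual prompt and frame handling are in script/benchmark/predict.sh and the code it invokes.

import base64
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Hypothetical paths and texts for one example.
frames = [encode_image(f"sampled-frames/360p/example/frame_{i:03d}.jpg") for i in range(10)]
recipe = "..."    # recipe text for this example
question = "..."  # question text for this example

messages = [{
    "role": "user",
    "content": (
        [{"type": "text", "text": f"Recipe:\n{recipe}\n\nQuestion: {question}"}]
        + [{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b}"}} for b in frames]
    ),
}]
response = client.chat.completions.create(model="gpt-4o-2024-08-06", messages=messages)
print(response.choices[0].message.content)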

Evaluation

  1. Set an API key for the judge model, e.g., export OPENAI_API_KEY=<your_key>
  2. Run LLM-as-a-judge (a rough sketch of the judging call follows the command):
bash script/benchmark/evaluate.sh \
    <filepath_prediction>  # e.g., gpt-4o-2024-08-06_50_examples.json
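An LLM-as-a-judge evaluation compares each model answer against the gold answer with a strong LLM acting as the grader. The sketch below shows the general idea under assumed prompt wording and field values; the actual rubric, prompt, and score parsing are in script/benchmark/evaluate.sh and the code it invokes.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Hypothetical values for one prediction being judged.
question = "Which step did the person skip?"   # from the QA file
reference = "They never preheated the oven."   # gold answer
prediction = "The oven was not preheated."     # model output under evaluation

judge_prompt = (
    "You are judging an answer to a question about a cooking recording.\n"
    f"Question: {question}\n"
    f"Reference answer: {reference}\n"
    f"Model answer: {prediction}\n"
    "Is the model answer correct with respect to the reference? "
    "Reply 'correct' or 'incorrect' with a brief reason."
)

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": judge_prompt}],
)
print(response.choices[0].message.content)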

Data Annotation

[Figure: Annotation interface]

ToDo

  • Add data annotation code (preprocess, QA generation, verification)
  • Add prediction code for other baselines (unimodal, socratic, open multimodal models)

Citation

If you find this work helpful in your research, please consider citing it:

@article{hasegawa-etal-2024-promqa,
    title     = {ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding},
    author    = {Hasegawa, Kimihiro and Imrattanatrai, Wiradee and Cheng, Zhi-Qi and Asada, Masaki and Holm, Susan and Wang, Yuran and Fukuda, Ken and Mitamura, Teruko},
    publisher = {arXiv},
    year      = {2024},
    url       = {https://arxiv.org/abs/2410.22211},
}

Issues/Questions

For any issues, questions, or requests, please create a GitHub Issue.
