This repository contains code and data for "ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding" (Hasegawa et al., arXiv 2024).
- 2024/10/29: 401 QA pairs are now available.
ProMQA is an evaluation QA dataset for multimodal procedural activity understanding.
Given a recipe (text), a recording (video), and a question (text), the task is to predict an answer (text).
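Each QA pair references a CaptainCook4D recording and comes with a reference answer. As a minimal sketch of how the released file might be inspected, the snippet below assumes a hypothetical path (`data/promqa.json`) and hypothetical field names (`recording_id`, `question`, `answer`); check the released data for the actual layout.

```python
import json

# Hypothetical path and field names -- check the released data for the actual schema.
with open("data/promqa.json") as f:
    examples = json.load(f)

print(len(examples))           # e.g., 401 QA pairs
sample = examples[0]
print(sample["recording_id"])  # which CaptainCook4D recording the question is about (hypothetical key)
print(sample["question"])      # question text (hypothetical key)
print(sample["answer"])        # reference answer text (hypothetical key)
```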
To set up the environment:
```bash
conda create -y -n promqa python=3.11
conda activate promqa
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
```
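Optionally, as a quick sanity check that the environment above works, confirm that PyTorch sees the GPU:

```python
import torch

# Should print the installed PyTorch version and True on a CUDA-enabled machine.
print(torch.__version__)
print(torch.cuda.is_available())
```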
- Follow CaptainCook4D/downloader to download recordings, e.g.,
```bash
python download_gopro_data.py --resolution4K --output_dir <dirpath_original>
```
- Segment the original recordings and sample frames (see the sketch below for the general idea), e.g.,
```bash
bash script/preprocess/video.sh \
    <dirpath_original> \
    <dirpath_preprocessed>
```
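Preprocessing produces the sampled frames that the prediction step consumes. Purely for intuition, the snippet below sketches uniform frame sampling with OpenCV; this is an assumption about the general idea, not the actual implementation in `script/preprocess/video.sh`.

```python
import cv2  # opencv-python

def sample_frames(video_path: str, num_frames: int = 10):
    """Uniformly sample `num_frames` frames from a video (illustrative sketch only)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        # Jump to an evenly spaced frame index and decode that frame.
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```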
- Set an API key for each provider, e.g.,
```bash
export OPENAI_API_KEY=<your_key>
```
- Run multimodal models from OpenAI, Google, and Anthropic (see the sketch below for what a single call looks like):
```bash
bash script/benchmark/predict.sh \
    <dirpath_sampled_frames> \ # e.g., <dirpath_preprocessed>/sampled-frames/<resolution>/
    <model_id> \ # e.g., gpt-4o-2024-08-06
    <num_frames> # e.g., 10
```
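Under the hood, prediction amounts to sending the recipe, the question, and the sampled frames to a multimodal chat model. The sketch below is an illustrative, simplified version of such a call using the OpenAI Python SDK; the prompt wording, paths, and frame count are placeholders, not the prompt or code used by `script/benchmark/predict.sh`.

```python
import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

recipe = "..."    # recipe text for this recording (placeholder)
question = "..."  # question text (placeholder)

def encode_image(path: Path) -> str:
    """Base64-encode one sampled frame for the data-URL image format."""
    return base64.b64encode(path.read_bytes()).decode()

# Placeholder path following the <dirpath_sampled_frames> convention above.
frames = sorted(Path("<dirpath_sampled_frames>/<recording_id>").glob("*.jpg"))[:10]

# One user message containing the text prompt followed by the sampled frames.
content = [{"type": "text", "text": f"Recipe:\n{recipe}\n\nQuestion: {question}"}]
content += [
    {"type": "image_url",
     "image_url": {"url": f"data:image/jpeg;base64,{encode_image(p)}"}}
    for p in frames
]

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```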
- Set an API key for the judge model, e.g.,
```bash
export OPENAI_API_KEY=<your_key>
```
- Run LLM-as-a-judge evaluation (sketched below):
```bash
bash script/benchmark/evaluate.sh \
    <filepath_prediction> # e.g., gpt-4o-2024-08-06_50_examples.json
```
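Here, LLM-as-a-judge means prompting a strong model to grade each prediction against the reference answer. The snippet below is a minimal sketch with an invented rubric and prompt; the actual judge prompt, model, and scoring scheme are defined by `script/benchmark/evaluate.sh`.

```python
from openai import OpenAI

client = OpenAI()

def judge(question: str, reference: str, prediction: str) -> str:
    """Ask a judge model whether the prediction answers the question correctly.

    The prompt and output format here are illustrative, not the repository's.
    """
    prompt = (
        "You are grading an answer to a question about a cooking recording.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Reply with 'correct', 'partially correct', or 'incorrect'."
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```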
Planned updates:
- Add data annotation code (preprocessing, QA generation, and verification)
- Add prediction code for other baselines (unimodal, Socratic, and open multimodal models)
If you find this work helpful in your research, please consider citing it:
```bibtex
@article{hasegawa-etal-2024-promqa,
  title={ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding},
  author={Hasegawa, Kimihiro and Imrattanatrai, Wiradee and Cheng, Zhi-Qi and Asada, Masaki and Holm, Susan and Wang, Yuran and Fukuda, Ken and Mitamura, Teruko},
  publisher={arXiv},
  year={2024},
  url={https://arxiv.org/abs/2410.22211},
}
```
For any issues, questions, or requests, please create a GitHub Issue.