Official implementation of our EMNLP 2024 paper "Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant"
paper | arxiv | project page
To set up the environment
We recommend using the following Docker image from here:
nvcr.io/nvidia/pytorch:23.12-py3
conda env create -n kalma --file kalma.yml
conda activate kalma
Download all three splits of the Text-KVQA data (Singh et al., ICCV'19) from here.
To run VisTEL training and inference, the following files are required:
- Images.
- Knowledge Base with all the entities.
- OCR output for every image, to be stored at `dataset/{split_name}/ocr_output.json`. [OCR pipeline: DBNet + ParSeq. Refer to the DBNet implementation here and the ParSeq implementation here.]
- Top-5 candidates for each image, ranked by normalised edit distance between the OCR-ed text of the image and the entity names in the knowledge base, to be stored at `dataset/{split_name}/filtered_ranked_titles_train.json` (see the sketch after this list for one possible way to produce this file).
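A minimal sketch of how the top-5 candidate file could be produced. The target paths follow the layout above, but the JSON structures (image ID to OCR strings, entity name to facts), the knowledge-base file name `kb.json`, the example split name, and the `normalised_edit_distance` helper are assumptions for illustration, not the released pipeline:

```python
# Hypothetical sketch: rank knowledge-base entities for each image by
# normalised edit distance to its OCR-ed text.
import json

def normalised_edit_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer string."""
    a, b = a.lower(), b.lower()
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1] / max(len(a), len(b), 1)

split_name = "book"  # e.g. one of the three Text-KVQA splits
# Assumed layouts: {image_id: [ocr_string, ...]} and {entity_name: facts}
ocr = json.load(open(f"dataset/{split_name}/ocr_output.json"))
kb = json.load(open(f"dataset/{split_name}/kb.json"))  # assumed KB file name
entity_names = list(kb.keys())

top5 = {}
for image_id, tokens in ocr.items():
    text = " ".join(tokens) if isinstance(tokens, list) else str(tokens)
    ranked = sorted(entity_names, key=lambda e: normalised_edit_distance(text, e))
    top5[image_id] = ranked[:5]

with open(f"dataset/{split_name}/filtered_ranked_titles_train.json", "w") as f:
    json.dump(top5, f, indent=2)
```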
Set the respective paths in `train.json` and `test.json` in the `configs/vistel_configs/{split_name}/` folder.
python src/vistel/train.py configs/vistel_configs/{split_name}/train.json
python src/vistel/test.py configs/vistel_configs/{split_name}/test.json
Results should be postprocessed as follows: the text between 'ASSISTANT:' and '[END]' is the linked entity. The linked entities have to be saved at `dataset/{split_name}/vistel_titles.json` (a minimal parsing sketch is given below).
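For reference, a minimal postprocessing sketch. The regular expression simply takes the span between 'ASSISTANT:' and '[END]' as stated above; the raw-output file name `test_outputs.json` and its image-ID-to-string layout are assumptions about the inference dump, not guaranteed by the scripts:

```python
# Hypothetical postprocessing sketch: pull the linked entity out of each raw
# VisTEL generation and save it as dataset/{split_name}/vistel_titles.json.
import json
import re

def extract_prediction(generation: str) -> str:
    """Return the text between 'ASSISTANT:' and '[END]' (empty if absent)."""
    match = re.search(r"ASSISTANT:(.*?)\[END\]", generation, flags=re.DOTALL)
    return match.group(1).strip() if match else ""

split_name = "book"  # e.g. one of the three Text-KVQA splits
# Assumed layout of the raw inference dump: {image_id: generated_string}
raw = json.load(open(f"dataset/{split_name}/test_outputs.json"))

vistel_titles = {image_id: extract_prediction(gen) for image_id, gen in raw.items()}

with open(f"dataset/{split_name}/vistel_titles.json", "w") as f:
    json.dump(vistel_titles, f, indent=2)
```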
To run KaLMA training and inference, the following files are required:
- Images.
- Knowledge Base with all the entities.
- QA files.
- `vistel_titles.json` for all splits (produced by the VisTEL postprocessing step above).
python src/kalma/train.py configs/kalma/{split_name}/train.json
python src/kalma/test.py configs/kalma/{split_name}/test.json
Results should be postprocessed in the same way: the text between 'ASSISTANT:' and '[END]' is the generated answer (the parsing sketch above applies unchanged).
This code and data are released under the MIT license.
If you find this data/code/paper useful for your research, please consider citing:
@inproceedings{retvqa,
author = {Abhirama Subramanyam Penamakuri and
Anand Mishra},
title = {Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant},
booktitle = {EMNLP},
year = {2024},
}
- We used the code bases and pre-trained models of LLaVA, DBNet, and ParSeq.
- Abhirama S. Penamakuri is supported by the Prime Minister's Research Fellowship (PMRF), Ministry of Education, Government of India.
- This work was partly supported by the IIT Jodhpur Seed Research Grant and the National Language Translation Mission (NLTM): Bhashini project of MeitY, Government of India.