
Official Implementation of our EMNLP 2024 Paper "Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant"

paper | arXiv | project page

Environment

To set up the environment

We recommend using the following Docker image, available here:

nvcr.io/nvidia/pytorch:23.12-py3
conda env create -n kalma --file kalma.yml
conda activate kalma

Data

Download all three splits of the Text-KVQA data (Singh et al., ICCV'19) from here.

VisTEL

Preprocessing the data

To run VisTEL training and inference, the following files are required:

  1. Images.
  2. Knowledge Base with all the entities.
  3. OCR output of every image (to be stored at dataset/{split_name}/ocr_output.json). [OCR pipeline: DBNet + ParSeq. Refer to DBNet implementation here and ParSeq implementation here.]
  4. Top-5 candidates for each image, ranked by normalised edit distance between the OCR-ed text of the image and the entity names in the knowledge base (to be stored at dataset/{split_name}/filtered_ranked_titles_train.json); a candidate-ranking sketch follows this list.
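
The candidate-ranking step is not tied to a particular library. Below is a minimal Python sketch, assuming ocr_output.json maps image ids to OCR-ed text, the knowledge base is a flat list of entity names, and the split is called "scene"; the file layouts and the knowledge_base.json name are assumptions, not part of this repo.

import json

def normalised_edit_distance(a, b):
    # Levenshtein distance between a and b, divided by the longer length.
    a, b = a.lower(), b.lower()
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1] / max(len(a), len(b))

split_name = "scene"  # hypothetical split name; substitute your actual split
ocr = json.load(open(f"dataset/{split_name}/ocr_output.json"))   # assumed: image_id -> OCR-ed text
entities = json.load(open("knowledge_base.json"))                # assumed: flat list of entity names

ranked = {
    image_id: sorted(entities, key=lambda e: normalised_edit_distance(text, e))[:5]
    for image_id, text in ocr.items()
}
json.dump(ranked, open(f"dataset/{split_name}/filtered_ranked_titles_train.json", "w"), indent=2)

For large knowledge bases, a dedicated library such as python-Levenshtein can replace the pure-Python distance above.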

Set the respective paths in train.json and test.json in the configs/vistel_configs/{split_name}/ folder.

Training

python src/vistel/train.py configs/vistel_configs/{split_name}/train.json

Testing

python src/vistel/test.py configs/vistel_configs/{split_name}/test.json

Post processing results

Post-process the results as follows: the text between 'ASSISTANT:' and '[END]' is the linked entity. The post-processed results must be saved at dataset/{split_name}/vistel_titles.json, as sketched below.
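
A minimal post-processing sketch, assuming the raw VisTEL outputs are stored as a JSON mapping from image id to the generated string; the predictions.json name and its layout are assumptions, while the 'ASSISTANT:' / '[END]' delimiters and the output path come from this README.

import json, re

def extract_prediction(text):
    # Keep only the span between 'ASSISTANT:' and '[END]'.
    match = re.search(r"ASSISTANT:(.*?)\[END\]", text, flags=re.S)
    return match.group(1).strip() if match else text.strip()

split_name = "scene"  # hypothetical split name
raw = json.load(open(f"dataset/{split_name}/predictions.json"))  # assumed: image_id -> raw model output
titles = {image_id: extract_prediction(output) for image_id, output in raw.items()}
json.dump(titles, open(f"dataset/{split_name}/vistel_titles.json", "w"), indent=2)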

KaLMA

Required data

To run KaLMA training and inference, the following files are required:

  1. Images.
  2. Knowledge Base with all the entities.
  3. QA files.
  4. vistel_titles.json for all splits (a quick file-presence check sketch follows this list).
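
Before launching training, it can help to verify that the per-split files are in place; a small sketch follows. Only vistel_titles.json and the dataset/{split_name}/ layout come from this README; the QA and knowledge-base file names are placeholders to adapt to your setup.

import os

split_name = "scene"  # hypothetical split name
required = [
    f"dataset/{split_name}/vistel_titles.json",  # linked entities produced by VisTEL post-processing
    f"dataset/{split_name}/qa_train.json",       # placeholder name for the QA file
    "knowledge_base.json",                       # placeholder name for the knowledge base
]
missing = [path for path in required if not os.path.exists(path)]
if missing:
    raise FileNotFoundError(f"Missing files for split '{split_name}': {missing}")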

Training

python src/kalma/train.py configs/kalma/{split_name}/train.json

Testing

python src/kalma/test.py configs/kalma/{split_name}/test.json

Post processing results

Post-process the results as follows: the text between 'ASSISTANT:' and '[END]' is the generated answer (the same extraction as for VisTEL applies).

License

This code and data are released under the MIT license.

Cite

If you find this data/code/paper useful for your research, please consider citing:

@inproceedings{retvqa,
  author       = {Abhirama Subramanyam Penamakuri and
                  Anand Mishra},
  title        = {Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant},
  booktitle    = {EMNLP},
  year         = {2024},
}

Acknowledgements

  1. We used the code bases and pre-trained models of LLaVA, DBNet, and ParSeq.
  2. Abhirama S. Penamakuri is supported by the Prime Minister's Research Fellowship (PMRF), Ministry of Education, Government of India.
  3. This work was partly supported by the IIT Jodhpur Seed Research Grant and the National Language Translation Mission (NLTM): Bhashini project of MeitY, Government of India.
