[ACL 2024 Oral] Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach
This repository is the official implementation of Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach, published as a main (oral) paper at ACL 2024. Our code is based on BLIP.

Prepare the VisDial dataset in the following directory structure:
VisDial
├── train
│ ├── images
│ └── visdial_1.0_train.json
└── val
├── images
└── visdial_1.0_val.json
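As a quick sanity check, the following snippet (ours, not part of the repository) verifies that the layout above is in place:

from pathlib import Path

def check_visdial_layout(root: str) -> None:
    # Expected files and folders, following the tree shown above.
    base = Path(root)
    expected = [
        base / "train" / "images",
        base / "train" / "visdial_1.0_train.json",
        base / "val" / "images",
        base / "val" / "visdial_1.0_val.json",
    ]
    for path in expected:
        status = "ok" if path.exists() else "MISSING"
        print(f"[{status}] {path}")

check_visdial_layout("VisDial")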
Our method, PlugIR, actively utilizes the general instruction-following capability of LLMs in two ways. First, by transforming the dialogue-form context into a caption-style query, we eliminate the need to fine-tune a retrieval model on existing visual dialogue data, thereby enabling the use of any arbitrary black-box retrieval model. Second, we construct the LLM questioner to generate non-redundant questions about the attributes of the target image, based on information about the retrieval candidate images in the current context. By extracting textual information about each cluster of retrieval candidates for the LLM questioner and filtering out the redundant questions it generates, this approach mitigates the noisiness and redundancy of the generated questions; a sketch of these two steps follows.
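The snippet below is a minimal sketch of the candidate-clustering and redundancy-filtering ideas, under our own assumptions: the embedding model, the helper names (cluster_candidates, filter_redundant, embed_fn), and the cosine-similarity threshold are illustrative, not the repository's actual implementation (see generate_dialog.py for that).

import numpy as np
from sklearn.cluster import KMeans

def cluster_candidates(image_embeds: np.ndarray, n_clusters: int = 10) -> list:
    # Group candidate images so that one representative per cluster
    # can be described in text to the LLM questioner.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(image_embeds)
    return [
        int(np.argmin(np.linalg.norm(image_embeds - center, axis=1)))
        for center in km.cluster_centers_  # candidate nearest each centroid
    ]

def filter_redundant(questions, context_embeds: np.ndarray, embed_fn, thres: float = 0.9):
    # Drop generated questions that are near-duplicates of anything
    # already in the dialogue context (cosine similarity >= thres).
    kept = []
    for question in questions:
        q = embed_fn(question)
        sims = context_embeds @ q / (
            np.linalg.norm(context_embeds, axis=1) * np.linalg.norm(q)
        )
        if sims.max() < thres:
            kept.append(question)
    return kept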
To generate dialogues with the LLM questioner, run:
python generate_dialog.py --api_key ${YOUR_OPENAI_API_KEY} --q_n 5 --thres_low 500 --n_clusters 10 --reconstruct --referring --filtering --select
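As far as we can tell from the pipeline described above, the flags map to its components: --reconstruct reformulates the dialogue context into a caption-style query, --referring conditions question generation on the retrieval candidates, --filtering removes redundant questions, and --select picks the final question; --q_n sets the number of candidate questions per round and --n_clusters the number of candidate clusters.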
In addition to existing metrics (Hits@K and Recall@K), we introduce the Best log Rank Integral (BRI) metric. BRI is a novel metric aligned with human judgment, specifically designed to provide a comprehensive and quantifiable evaluation of interactive retrieval systems.
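As a rough illustration only (the exact definition is in the paper), a BRI-style score can be computed by integrating the log of the best rank the target image has achieved up to each dialogue round, so lower values mean the target is found earlier and stays near the top:

import numpy as np

def bri_score(ranks_per_round: np.ndarray) -> float:
    # ranks_per_round: 1-indexed rank of the target image after each
    # round, including the initial caption-only query at round 0.
    best_so_far = np.minimum.accumulate(ranks_per_round)   # "Best"
    log_ranks = np.log(best_so_far)                        # "log Rank"
    # Trapezoidal integration over rounds, normalized by dialogue length.
    return float(np.trapz(log_ranks) / (len(log_ranks) - 1))  # "Integral"

# Example: the target climbs from rank 120 to rank 1 over five rounds.
print(bri_score(np.array([120, 40, 40, 7, 2, 1])))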
To evaluate a zero-shot retrieval model:
model=blip
python eval.py --retriever ${model} --cache-corpus cache/corpus_${model}.pth --data-dir <a directory path containing "unlabeled2017">
To evaluate a fine-tuned retrieval model:
model=blip
python eval.py --retriever ${model} --cache-corpus cache/corpus_finetuned_${model}.pth --data-dir <a directory path containing "unlabeled2017"> --ft-model-path <fine-tuned-model.pth>
To evaluate a retrieval model using pre-generated queries:
model=blip
python eval.py --retriever ${model} --cache-corpus cache/corpus_finetuned_${model}.pth --data-dir <a directory path containing "unlabeled2017"> --queries-path <our_queries.json> --split
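For reference, Hits@K and Recall@K are commonly computed from ranks as in this sketch (ours, not eval.py): Recall@K checks the target's rank at the current round, while Hits@K credits a dialogue once the target has entered the top K at any round so far.

import numpy as np

def recall_at_k(ranks: np.ndarray, k: int) -> float:
    # ranks: 1-indexed rank of the target image for each query.
    return float((ranks <= k).mean())

def hits_at_k(ranks_per_round: np.ndarray, k: int) -> float:
    # ranks_per_round: shape (n_dialogues, n_rounds); a dialogue counts
    # as a hit if the target reached the top k at any round so far.
    best_so_far = np.minimum.accumulate(ranks_per_round, axis=1)
    return float((best_so_far[:, -1] <= k).mean())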
To fine-tune the retrieval model on VisDial:
cd finetune
torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:0 --nnodes=1 --nproc_per_node=4 train.py \
--data-path <path to the VisDial directory> \
--amp
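Here, torchrun launches single-node training on 4 GPUs; adjust --nproc_per_node to match your hardware. The --amp flag presumably enables automatic mixed-precision training.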