[ACL 2024 Oral] Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach
This repository is the official implementation of Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach, published as a main (oral) paper at ACL 2024. Our code is based on BLIP.

Prepare the VisDial dataset in the following directory structure:
VisDial
├── train
│ ├── images
│ └── visdial_1.0_train.json
└── val
├── images
└── visdial_1.0_val.json
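As a quick sanity check, the following snippet (ours, not part of the repository) verifies that the layout above is in place:

from pathlib import Path

def check_visdial_layout(root: str) -> None:
    # Expected files and folders, following the tree shown above.
    base = Path(root)
    expected = [
        base / "train" / "images",
        base / "train" / "visdial_1.0_train.json",
        base / "val" / "images",
        base / "val" / "visdial_1.0_val.json",
    ]
    for path in expected:
        status = "ok" if path.exists() else "MISSING"
        print(f"[{status}] {path}")

check_visdial_layout("VisDial")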
Our method, PlugIR, actively utilizes the general instruction-following capability of LLMs in two ways. First, by transforming the dialogue-form context into a caption-style query, we eliminate the need to fine-tune a retrieval model on existing visual dialogue data, thereby enabling the use of any arbitrary black-box retrieval model. Second, we construct the LLM questioner to generate non-redundant questions about the attributes of the target image, based on information about the retrieval candidate images in the current context. By extracting textual information about each cluster of retrieval candidates for the LLM questioner and filtering out the redundant questions it generates, this approach mitigates the noisiness and redundancy of the generated questions; a sketch of these two steps follows.
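The snippet below is a minimal sketch of the candidate-clustering and redundancy-filtering ideas, under our own assumptions: the embedding model, the helper names (cluster_candidates, filter_redundant, embed_fn), and the cosine-similarity threshold are illustrative, not the repository's actual implementation (see generate_dialog.py for that).

import numpy as np
from sklearn.cluster import KMeans

def cluster_candidates(image_embeds: np.ndarray, n_clusters: int = 10) -> list:
    # Group candidate images so that one representative per cluster
    # can be described in text to the LLM questioner.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(image_embeds)
    return [
        int(np.argmin(np.linalg.norm(image_embeds - center, axis=1)))
        for center in km.cluster_centers_  # candidate nearest each centroid
    ]

def filter_redundant(questions, context_embeds: np.ndarray, embed_fn, thres: float = 0.9):
    # Drop generated questions that are near-duplicates of anything
    # already in the dialogue context (cosine similarity >= thres).
    kept = []
    for question in questions:
        q = embed_fn(question)
        sims = context_embeds @ q / (
            np.linalg.norm(context_embeds, axis=1) * np.linalg.norm(q)
        )
        if sims.max() < thres:
            kept.append(question)
    return kept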
To generate dialogues with the LLM questioner, run:
python generate_dialog.py --api_key ${YOUR_OPENAI_API_KEY} --q_n 5 --thres_low 500 --n_clusters 10 --reconstruct --referring --filtering --select
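As far as we can tell from the pipeline described above, the flags map to its components: --reconstruct reformulates the dialogue context into a caption-style query, --referring conditions question generation on the retrieval candidates, --filtering removes redundant questions, and --select picks the final question; --q_n sets the number of candidate questions per round and --n_clusters the number of candidate clusters.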
In addition to existing metrics (Hits@K and Recall@K), we introduce the Best log Rank Integral (BRI) metric. BRI is a novel metric aligned with human judgment, specifically designed to provide a comprehensive and quantifiable evaluation of interactive retrieval systems.
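As a rough illustration only (the exact definition is in the paper), a BRI-style score can be computed by integrating the log of the best rank the target image has achieved up to each dialogue round, so lower values mean the target is found earlier and stays near the top:

import numpy as np

def bri_score(ranks_per_round: np.ndarray) -> float:
    # ranks_per_round: 1-indexed rank of the target image after each
    # round, including the initial caption-only query at round 0.
    best_so_far = np.minimum.accumulate(ranks_per_round)   # "Best"
    log_ranks = np.log(best_so_far)                        # "log Rank"
    # Trapezoidal integration over rounds, normalized by dialogue length.
    return float(np.trapz(log_ranks) / (len(log_ranks) - 1))  # "Integral"

# Example: the target climbs from rank 120 to rank 1 over five rounds.
print(bri_score(np.array([120, 40, 40, 7, 2, 1])))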
To evaluate a zero-shot retrieval model:
model=blip
python eval.py --retriever ${model} --cache-corpus cache/corpus_${model}.pth --data-dir <a directory path containing "unlabeled2017">
To evaluate a fine-tuned retrieval model:
model=blip
python eval.py --retriever ${model} --cache-corpus cache/corpus_finetuned_${model}.pth --data-dir <a directory path containing "unlabeled2017"> --ft-model-path <fine-tuned-model.pth>
To evaluate a retrieval model using pre-generated queries:
model=blip
python eval.py --retriever ${model} --cache-corpus cache/corpus_finetuned_${model}.pth --data-dir <a directory path containing "unlabeled2017"> --queries-path <our_queries.json> --split
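For reference, Hits@K and Recall@K are commonly computed from ranks as in this sketch (ours, not eval.py): Recall@K checks the target's rank at the current round, while Hits@K credits a dialogue once the target has entered the top K at any round so far.

import numpy as np

def recall_at_k(ranks: np.ndarray, k: int) -> float:
    # ranks: 1-indexed rank of the target image for each query.
    return float((ranks <= k).mean())

def hits_at_k(ranks_per_round: np.ndarray, k: int) -> float:
    # ranks_per_round: shape (n_dialogues, n_rounds); a dialogue counts
    # as a hit if the target reached the top k at any round so far.
    best_so_far = np.minimum.accumulate(ranks_per_round, axis=1)
    return float((best_so_far[:, -1] <= k).mean())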
To fine-tune the retrieval model on VisDial:
cd finetune
torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:0 --nnodes=1 --nproc_per_node=4 train.py \
--data-path <path to the VisDial directory> \
--amp
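Here, torchrun launches single-node training on 4 GPUs; adjust --nproc_per_node to match your hardware. The --amp flag presumably enables automatic mixed-precision training.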