[ACL 2024 Oral] Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach

This repository is the official implementation of Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach, published as a main-conference (oral) paper at ACL 2024. Our code is based on:

  1. ChatIR (Levy et al., NeurIPS 2023)
  2. PyTorch image classification reference
  3. CLIP

Prerequisites

Packages

datasets

    VisDial
    ├── train
    │   ├── images
    │   └── visdial_1.0_train.json
    └── val
        ├── images
        └── visdial_1.0_val.json

Context Reformulation and Context-aware Dialogue Generation

Our method, PlugIR, actively leverages the general instruction-following capability of LLMs in two ways. First, by transforming the dialogue-form context into a caption-style query, it removes the need to fine-tune a retrieval model on existing visual dialogue data, thereby enabling the use of any arbitrary black-box retrieval model. Second, it constructs the LLM questioner to generate non-redundant questions about the attributes of the target image, based on information about the retrieval candidate images in the current context. By extracting textual information about each cluster of retrieval candidates for the LLM questioner and filtering out redundant generated questions, this approach mitigates the noisiness and redundancy of the generated questions.
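
For illustration, the following is a minimal sketch of the context reformulation idea, assuming an OpenAI-style chat completion call; the prompt wording, model name, and helper function are placeholders of ours, not the exact prompts used in generate_dialog.py.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def reformulate_context(caption, dialogue):
    # Hypothetical helper: rewrite (caption + Q/A rounds) as one caption-style retrieval query.
    history = "\n".join(f"Q: {q} A: {a}" for q, a in dialogue)
    prompt = (
        "Rewrite the following image caption and dialogue as a single, "
        "self-contained caption describing the target image.\n"
        f"Caption: {caption}\nDialogue:\n{history}\nReformulated caption:"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model choice for this sketch
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()

# Example usage with a toy dialogue.
query = reformulate_context(
    "a dog on a beach",
    [("What color is the dog?", "brown"), ("Is anyone with the dog?", "a man in a red shirt")],
)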

python generate_dialog.py --api_key ${YOUR_OPENAI_API_KEY} --q_n 5 --thres_low 500 --n_clusters 10 --reconstruct --referring --filtering --select
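
As a rough illustration of the candidate-clustering idea behind the --n_clusters flag above, the sketch below groups retrieval-candidate embeddings with k-means and keeps one representative caption per cluster as the textual information for the LLM questioner; the embedding and caption inputs are placeholders, not the repository's actual data pipeline.

import numpy as np
from sklearn.cluster import KMeans

def representative_captions(candidate_embeddings, candidate_captions, n_clusters=10):
    # Illustrative sketch: cluster candidate image embeddings (N x D array) and keep
    # one caption per cluster as a compact textual summary of the candidate set.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(candidate_embeddings)
    reps = []
    for c in range(n_clusters):
        members = np.where(kmeans.labels_ == c)[0]
        # choose the member closest to the cluster centroid as that cluster's representative
        dists = np.linalg.norm(candidate_embeddings[members] - kmeans.cluster_centers_[c], axis=1)
        reps.append(candidate_captions[members[dists.argmin()]])
    return reps  # textual information handed to the LLM questioner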

Evaluation

In addition to the existing metrics (Hits@K and Recall@K), we introduce the Best log Rank Integral (BRI) metric, which is aligned with human judgment and designed to provide a comprehensive, quantifiable evaluation of interactive retrieval systems.
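
For intuition only, here is a minimal sketch of a BRI-style computation; it assumes BRI averages the log of the best (lowest) rank of the target image achieved up to each dialogue round, which is our simplification. See the paper for the exact definition.

import math

def best_log_rank_integral(ranks_per_round):
    # Illustrative BRI-style score (assumed form; see the paper for the exact definition).
    # ranks_per_round: the target image's rank (1 = best) after each dialogue round.
    # Lower scores are better: the running best rank stays small across rounds.
    score, best_so_far = 0.0, float("inf")
    for rank in ranks_per_round:
        best_so_far = min(best_so_far, rank)   # best rank achieved up to this round
        score += math.log(best_so_far)         # accumulate log of the running best rank
    return score / len(ranks_per_round)        # average over dialogue rounds

# Example: the target climbs from rank 120 to rank 3 over five rounds of dialogue.
print(best_log_rank_integral([120, 45, 45, 7, 3]))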

Zero-shot baseline

model=blip
python eval.py --retriever ${model} --cache-corpus cache/corpus_${model}.pth --data-dir <a directory path containing "unlabeled2017">

Fine-tuning baseline

model=blip
python eval.py --retriever ${model} --cache-corpus cache/corpus_finetuned_${model}.pth --data-dir <a directory path containing "unlabeled2017"> --ft-model-path <fine-tuned-model.pth>

PlugIR (ours)

model=blip
python eval.py --retriever ${model} --cache-corpus cache/corpus_finetuned_${model}.pth --data-dir <a directory path containing "unlabeled2017"> --queries-path <our_queries.json> --split

BLIP Text Encoder Fine-tuning

cd finetune
torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:0 --nnodes=1 --nproc_per_node=4 train.py \
	--data-path <path to the VisDial directory> \
	--amp
