This is the code for ZRIGF: An Innovative Multimodal Framework for Zero-Resource Image-Grounded Dialogue Generation.
If you use any source code included in this repo in your work, please cite the following paper:
@inproceedings{10.1145/3581783.3611810,
  author = {Zhang, Bo and Wang, Jian and Ma, Hui and Xu, Bo and Lin, Hongfei},
  title = {ZRIGF: An Innovative Multimodal Framework for Zero-Resource Image-Grounded Dialogue Generation},
  year = {2023},
  isbn = {9798400701085},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3581783.3611810},
  doi = {10.1145/3581783.3611810},
  booktitle = {Proceedings of the 31st ACM International Conference on Multimedia},
  pages = {5464--5473},
  numpages = {10},
  location = {Ottawa ON, Canada},
  series = {MM '23}
}
- Python 3.10
- PyTorch 2.0
- CUDA 11.8
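A quick way to confirm that your environment matches these versions (a minimal sanity check, not part of the repo):

```python
# Minimal environment sanity check: confirm the Python, PyTorch,
# and CUDA versions listed above.
import sys
import torch

print(sys.version.split()[0])     # expect 3.10.x
print(torch.__version__)          # expect 2.0.x
print(torch.version.cuda)         # expect 11.8
print(torch.cuda.is_available())  # True if a CUDA device is usable
```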
To install the Python dependencies, run:
pip install -r requirements.txt
To install nlg-eval, run:
git clone https://github.com/Maluuba/nlg-eval
cd nlg-eval
pip install -e .
To make the code work, some files in nlg-eval need to be modified:
- nlg-eval/requirements.txt: change gensim~=3.8.3 to gensim>=3.8.3
- nlg-eval/nlgeval/word2vec/evaluate.py: replace line 40 with the following line:
  return vectors[self.m.key_to_index[key]]
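Once patched, nlg-eval can be called from Python; below is a sketch of its standard API (the reference and hypothesis strings are placeholders, and you may need to run nlg-eval --setup once first to download its data files):

```python
# Sketch of the standard nlg-eval Python API; the strings are placeholders.
# Run `nlg-eval --setup` once beforehand to download the required data files.
from nlgeval import NLGEval

nlgeval = NLGEval()  # loads all metrics; pass no_skipthoughts=True, etc., to skip some
references = [["the cat sits on the mat"]]    # each inner list is parallel to hyp_list
hypotheses = ["a cat is sitting on the mat"]  # one generated response per entry
metrics = nlgeval.compute_metrics(ref_list=references, hyp_list=hypotheses)
print(metrics)  # BLEU, METEOR, ROUGE-L, CIDEr, embedding-based metrics, ...
```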
This example uses the COCO (2017) dataset through a custom dataset script, which requires you to manually download COCO before training:
cd data/coco
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/zips/test2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
wget http://images.cocodataset.org/annotations/image_info_test2017.zip
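After downloading, the archives need to be extracted into data/coco; a minimal sketch, assuming the five zip files were downloaded into data/coco as above:

```python
# Minimal sketch: extract the downloaded COCO archives in place.
# Assumes the zip files sit in data/coco, as in the wget commands above.
import zipfile
from pathlib import Path

coco_dir = Path("data/coco")
for archive in sorted(coco_dir.glob("*.zip")):
    print(f"extracting {archive.name} ...")
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(coco_dir)
```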
This example uses Open Images images as candidate images for retrieval. To download the images, refer to here. You can build the image index with the appropriate size (500,000 images in our experiments) as needed; a sketch of assembling the candidate pool follows the directory layout below.
If you already have the Open Images dataset on disk, organize the images as follows:
data
|-- open_images
|-- images
|-- 14928b4f367c217e.jpg
|-- 289d643a8761aa83.jpg
|-- ......
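One way the candidate pool might be assembled from this layout (a hypothetical sketch; the repo's own scripts perform the actual feature extraction and indexing):

```python
# Hypothetical sketch: sample a fixed-size candidate pool from Open Images
# for retrieval (500,000 images in our experiments). The repo's scripts
# handle the actual feature extraction and index construction.
import random
from pathlib import Path

image_dir = Path("data/open_images/images")
all_images = sorted(image_dir.glob("*.jpg"))
index_size = 500_000
random.seed(0)  # reproducible subset
candidates = random.sample(all_images, min(index_size, len(all_images)))
print(f"selected {len(candidates)} candidate images")
```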
Please download the Reddit data from here.
The Image-Chat dataset can be accessed via ParlAI, with -t image_chat.
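For example, the dataset can be fetched and previewed through ParlAI's Python API (a minimal sketch using ParlAI's standard display_data script; the first run triggers the download):

```python
# Minimal sketch: download and preview Image-Chat through ParlAI.
# DisplayData is ParlAI's standard data-inspection script.
from parlai.scripts.display_data import DisplayData

DisplayData.main(task="image_chat", num_examples=5)
```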
Contrastive pre-training:
bash scripts/run_contrastive_train.sh
Extracting the vision features and tokenizing the dialogue corpus:
bash scripts/run_extract_vokenize.sh
Generative pre-training:
bash scripts/run_generative_train.sh