Multimodal Composite Editing and Retrieval

🔥🔥 This is a collection of awesome articles about multimodal composite editing and retrieval🔥🔥

[NEWS.20240909] The related survey paper has been released.

If you find this repository useful, please cite our paper:

@misc{li2024survey,
      title={A Survey of Multimodal Composite Editing and Retrieval}, 
      author={Suyan Li and Fuxiang Huang and Lei Zhang},
      year={2024},
      eprint={2409.05405},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Papers and related codes
Datasets
Experimental Results

Papers and related codes

Image-text composite editing

2024

[WACV, 2024] Text-to-Image Editing by Image Information Removal
Zhongping Zhang, Jian Zheng, Zhiyuan Fang, Bryan A. Plummer
[Paper]

[WACV, 2024] Shape-Guided Diffusion with Inside-Outside Attention
Dong Huk Park, Grace Luo, Clayton Toste, Samaneh Azadi, Xihui Liu, Maka Karalashvili, Anna Rohrbach, Trevor Darrell
[Paper]

2023

[IEEE Access, 2023] Text-Guided Image Manipulation via Generative Adversarial Network With Referring Image Segmentation-Based Guidance
Yuto Watanabe, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
[Paper]

[arXiv, 2023] InstructEdit: Improving Automatic Masks for Diffusion-Based Image Editing with User Instructions
Qian Wang, Biao Zhang, Michael Birsak, Peter Wonka
[Paper] [GitHub]

[ICLR, 2023] DiffEdit: Diffusion-Based Semantic Image Editing with Mask Guidance
Guillaume Couairon, Jakob Verbeek, Holger Schwenk, Matthieu Cord
[Paper] [GitHub]

[CVPR, 2023] SINE: Single Image Editing with Text-to-Image Diffusion Models
Zhixing Zhang, Ligong Han, Arnab Ghosh, Dimitris N Metaxas, Jian Ren
[Paper] [GitHub]

[CVPR, 2023] Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
Narek Tumanyan, Michal Geyer, Shai Bagon, Tali Dekel
[Paper] [GitHub]

[arXiv, 2023] PRedItOR: Text Guided Image Editing with Diffusion Prior
Hareesh Ravi, Sachin Kelkar, Midhun Harikumar, Ajinkya Kale
[Paper]

[TOG, 2023] UniTune: Text-Driven Image Editing by Fine Tuning a Diffusion Model on a Single Image
Dani Valevski, Matan Kalman, Eyal Molad, Eyal Segalis, Yossi Matias, Yaniv Leviathan
[Paper] [GitHub]

[arXiv, 2023] Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models
Jooyoung Choi, Yunjey Choi, Yunji Kim, Junho Kim, Sungroh Yoon
[Paper] [GitHub]

[CVPR, 2023] Imagic: Text-Based Real Image Editing with Diffusion Models
Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, Michal Irani
[Paper] [GitHub]

[ICLR, 2023] Diffusion-Based Image Translation Using Disentangled Style and Content Representation
Gihyun Kwon, Jong Chul Ye
[Paper] [GitHub]

[arXiv, 2023] MDP: A Generalized Framework for Text-Guided Image Editing by Manipulating the Diffusion Path
Qian Wang, Biao Zhang, Michael Birsak, Peter Wonka
[Paper] [GitHub]

[CVPR, 2023] InstructPix2Pix: Learning to Follow Image Editing Instructions
Tim Brooks, Aleksander Holynski, Alexei A. Efros
[Paper] [GitHub]

[ICCV, 2023] Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models
Wenkai Dong, Song Xue, Xiaoyue Duan, Shumin Han
[Paper]

[arXiv, 2023] DeltaSpace: A Semantic-Aligned Feature Space for Flexible Text-Guided Image Editing
Yueming Lyu, Kang Zhao, Bo Peng, Yue Jiang, Yingya Zhang, Jing Dong
[Paper]

[AAAI, 2023] DE-Net: Dynamic Text-Guided Image Editing Adversarial Networks
Ming Tao, Bing-Kun Bao, Hao Tang, Fei Wu, Longhui Wei, Qi Tian
[Paper] [GitHub]

2022

[ACM MM, 2022] LS-GAN: Iterative Language-Based Image Manipulation via Long and Short Term Consistency Reasoning
Gaoxiang Cong, Liang Li, Zhenhuan Liu, Yunbin Tu, Weijun Qin, Shenyuan Zhang, Chengang Yan, Wenyu Wang, Bin Jiang
[Paper]

[arXiv, 2022] FEAT: Face Editing with Attention
Xianxu Hou, Linlin Shen, Or Patashnik, Daniel Cohen-Or, Hui Huang
[Paper]

[ECCV, 2022] VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance
Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, Edward Raff
[Paper] [GitHub]

[ICML, 2022] GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, Mark Chen
[Paper] [GitHub]

[WACV, 2022] StyleMC: Multi-Channel Based Fast Text-Guided Image Generation and Manipulation
Umut Kocasari, Alara Dirik, Mert Tiftikci, Pinar Yanardag
[Paper] [GitHub] [website]

[CVPR, 2022] HairCLIP: Design Your Hair by Text and Reference Image
Tianyi Wei, Dongdong Chen, Wenbo Zhou, Jing Liao, Zhentao Tan, Lu Yuan, Weiming Zhang, Nenghai Yu
[Paper] [GitHub]

[NeurIPS, 2022] One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations
Yiming Zhu, Hongyu Liu, Yibing Song, Ziyang Yuan, Xintong Han, Chun Yuan, Qifeng Chen, Jue Wang
[Paper] [GitHub]

[CVPR, 2022] Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model
Zipeng Xu, Tianwei Lin, Hao Tang, Fu Li, Dongliang He, Nicu Sebe, Radu Timofte, Luc Van Gool, Errui Ding
[Paper] [GitHub]

[CVPR, 2022] Blended Diffusion for Text-Driven Editing of Natural Images
Omri Avrahami, Dani Lischinski, Ohad Fried
[Paper] [GitHub]

[CVPR, 2022] DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation
Gwanghyun Kim, Taesung Kwon, Jong Chul Ye
[Paper] [GitHub]

[ICLR, 2022] SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, Stefano Ermon
[Paper] [GitHub] [Website]

2021

[CVPR, 2021] TediGAN: Text-guided diverse face image generation and manipulation
Weihao Xia, Yujiu Yang, Jing-Hao Xue, Baoyuan Wu
[Paper] [GitHub]

[ICIP, 2021] Segmentation-Aware Text-Guided Image Manipulation
Tomoki Haruyama, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
[Paper] [GitHub]

[IJPR, 2021] FocusGAN: Preserving Background in Text-Guided Image Editing
Liuqing Zhao, Linyan Li, Fuyuan Hu, Zhenping Xia, Rui Yao
[Paper]

[ICCV, 2021] StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, Dani Lischinski
[Paper] [GitHub]

[MM, 2021] Text as Neural Operator: Image Manipulation by Text Instruction
Tianhao Zhang, Hung-Yu Tseng, Lu Jiang, Weilong Yang, Honglak Lee, Irfan Essa
[Paper] [GitHub]

[arXiv, 2021] Paint by Word
Alex Andonian, Sabrina Osmany, Audrey Cui, YeonHwan Park, Ali Jahanian, Antonio Torralba, David Bau
[Paper] [GitHub] [Website]

[CVPR, 2021] Learning by Planning: Language-Guided Global Image Editing
Jing Shi, Ning Xu, Yihang Xu, Trung Bui, Franck Dernoncourt, Chenliang Xu
[Paper] [GitHub]

2020

[ACM MM, 2020] IR-GAN: Image Manipulation with Linguistic Instruction by Increment Reasoning
Zhenhuan Liu, Jincan Deng, Liang Li, Shaofei Cai, Qianqian Xu, Shuhui Wang, Qingming Huang
[Paper] [GitHub]

[CVPR, 2020] ManiGAN: Text-Guided Image Manipulation
Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, Philip HS Torr
[Paper] [GitHub]

[NeurIPS, 2020] Lightweight Generative Adversarial Networks for Text-Guided Image Manipulation
Bowen Li, Xiaojuan Qi, Philip Torr, Thomas Lukasiewicz
[Paper] [GitHub]

[LNCS, 2020] CAFE-GAN: Arbitrary Face Attribute Editing with Complementary Attention Feature
Jeong-gi Kwak, David K. Han, Hanseok Ko
[Paper] [GitHub]

[ECCV, 2020] Open-Edit: Open-Domain Image Manipulation with Open-Vocabulary Instructions
Xihui Liu, Zhe Lin, Jianming Zhang, Handong Zhao, Quan Tran, Xiaogang Wang, Hongsheng Li
[Paper] [GitHub]

[CVPR, 2020] Composed Query Image Retrieval Using Locally Bounded Features
Mehrdad Hosseinzadeh, Yang Wang
[Paper]

2019

[ICASSP, 2019] Bilinear Representation for Language-based Image Editing Using Conditional Generative Adversarial Networks
Xiaofeng Mao, Yuefeng Chen, Yuhong Li, Tao Xiong, Yuan He, Hui Xue
[Paper] [GitHub]

2018

[NeurIPS, 2018] Text-Adaptive Generative Adversarial Networks: Manipulating Images with Natural Language
Seonghyeon Nam, Yunji Kim, Seon Joo Kim
[Paper] [GitHub]

[CVPR, 2018] Language-based image editing with recurrent attentive models
Jianbo Chen, Yelong Shen, Jianfeng Gao, Jingjing Liu, Xiaodong Liu
[Paper]

[arXiv, 2018] Interactive Image Manipulation with Natural Language Instruction Commands
Seitaro Shinagawa, Koichiro Yoshino, Sakriani Sakti, Yu Suzuki, Satoshi Nakamura
[Paper]

2017

[ICCV, 2017] Semantic image synthesis via adversarial learning
Hao Dong, Simiao Yu, Chao Wu, Yike Guo
[Paper] [GitHub]

Image-text composite retrieval

2024

[AAAI, 2024] Dynamic weighted combiner for mixed-modal image retrieval
Fuxiang Huang, Lei Zhang, Xiaowei Fu, Suqi Song
[Paper] [GitHub]

[ICMR, 2024] Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models
Hongyi Zhu, Jia-Hong Huang, Stevan Rudinac, Evangelos Kanoulas
[Paper] [GitHub]

[ACM MM, 2024] Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives
Zhangchi Feng, Richong Zhang, Zhijie Nie
[Paper] [GitHub]

[CVPR, 2024] Language-only Training of Zero-shot Composed Image Retrieval
Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, Sangdoo Yun
[Paper] [GitHub]

[AAAI, 2024] Context-I2W: Mapping Images to Context-dependent Words for Accurate Zero-Shot Composed Image Retrieval
Yuanmin Tang, Jing Yu, Keke Gai, Jiamin Zhuang, Gang Xiong, Yue Hu, Qi Wu
[Paper] [GitHub]

[CVPR, 2024] Knowledge-enhanced dual-stream zero-shot composed image retrieval
Yucheng Suo, Fan Ma, Linchao Zhu, Yi Yang
[Paper]

[WACV, 2024] Bi-directional training for composed image retrieval via text prompt learning
Zheyuan Liu, Weixuan Sun, Yicong Hong, Damien Teney, Stephen Gould
[Paper]

[AAAI, 2024] Data roaming and quality assessment for composed image retrieval
Matan Levy, Rami Ben-Ari, Nir Darshan, Dani Lischinski
[Paper]

[TMLR, 2024] Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder
Zheyuan Liu, Weixuan Sun, Damien Teney, Stephen Gould
[Paper]

[ICML, 2024] MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, Ming-Wei Chang
[Paper]

[ICLR, 2024] Vision-by-Language for Training-Free Compositional Image Retrieval
Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata
[Paper]

[CVPR, 2024] CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion
Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, Sangdoo Yun
[Paper]

[ACM SIGIR, 2024] Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval
Haokun Wen, Xuemeng Song, Xiaolin Chen, Yinwei Wei, Liqiang Nie, Tat-Seng Chua
[Paper]

[IEEE TIP, 2024] Multimodal Composition Example Mining for Composed Query Image Retrieval
Gangjian Zhang, Shikun Li, Shikui Wei, Shiming Ge, Na Cai, Yao Zhao
[Paper]

[IEEE TMM, 2024] Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback
Yahui Xu, Yi Bin, Jiwei Wei, Yang Yang, Guoqing Wang, Heng Tao Shen
[Paper]

2023

[CVPR, 2023] FAME-ViL: Multi-tasking vision-language model for heterogeneous fashion tasks
Xiao Han, Xiatian Zhu, Licheng Yu, Li Zhang, Yi-Zhe Song, Tao Xiang
[Paper] [GitHub]

[ICCV, 2023] FashionNTM: Multi-turn fashion image retrieval via cascaded memory
Anwesan Pal, Sahil Wadhwa, Ayush Jaiswal, Xu Zhang, Yue Wu, Rakesh Chada, Pradeep Natarajan, Henrik I Christensen
[Paper]

[CVPR, 2023] Pic2Word: Mapping pictures to words for zero-shot composed image retrieval
Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, Tomas Pfister
[Paper] [GitHub]

[arXiv, 2023] Pretrain like you inference: Masked tuning improves zero-shot composed image retrieval
Junyang Chen, Hanjiang Lai
[Paper]

[ICCV, 2023] Zero-shot composed image retrieval with textual inversion
Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Alberto Del Bimbo
[Paper] [GitHub]

[ACM TOMM, 2023] Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features
Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto Del Bimbo
[Paper] [GitHub]

[arXiv, 2023] Ranking-aware Uncertainty for Text-guided Image Retrieval
Junyang Chen, Hanjiang Lai
[Paper]

[IEEE TIP, 2023] Composed Image Retrieval via Cross Relation Network With Hierarchical Aggregation Transformer
Qu Yang, Mang Ye, Zhaohui Cai, Kehua Su, Bo Du
[Paper]

[IEEE TMM, 2023] Multi-Modal Transformer With Global-Local Alignment for Composed Query Image Retrieval
Yahui Xu, Yi Bin, Jiwei Wei, Yang Yang, Guoqing Wang, Heng Tao Shen
[Paper]

[ACM MM, 2023] Target-Guided Composed Image Retrieval
Haokun Wen, Xian Zhang, Xuemeng Song, Yinwei Wei, Liqiang Nie
[Paper]

[ICCV, 2023] ProVLA: Compositional Image Search with Progressive Vision-Language Alignment and Multimodal Fusion
Zhizhang Hu, Xinliang Zhu, Son Tran, René Vidal, Arnab Dhua
[Paper]

[CVPR, 2023] Language Guided Local Infiltration for Interactive Image Retrieval
Fuxiang Huang, Lei Zhang
[Paper]

2022

[IEEE TMM, 2022] Adversarial and isotropic gradient augmentation for image retrieval with text feedback
Fuxiang Huang, Lei Zhang, Yuhang Zhou, Xinbo Gao
[Paper]

[ACM TOMM, 2022] Tell, imagine, and search: End-to-end learning for composing text and image to image retrieval
Feifei Zhang, Mingliang Xu, Changsheng Xu
[Paper]

[arXiv, 2022] Image Search with Text Feedback by Additive Attention Compositional Learning
Yuxin Tian, Shawn Newsam, Kofi Boakye
[Paper]

[IEEE TMM, 2022] Heterogeneous feature alignment and fusion in cross-modal augmented space for composed image retrieval
Huaxin Pang, Shikui Wei, Gangjian Zhang, Shiyin Zhang, Shuang Qiu, Yao Zhao
[Paper]

[IEEE TIP, 2022] Composed Image Retrieval via Explicit Erasure and Replenishment With Semantic Alignment
Gangjian Zhang, Shikui Wei, Huaxin Pang, Shuang Qiu, Yao Zhao
[Paper]

[ICLR, 2022] ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity
Ginger Delmas, Rafael S. Rezende, Gabriela Csurka, Diane Larlus
[Paper] [GitHub]

[WACV, 2022] SAC: Semantic attention composition for text-conditioned image retrieval
Surgan Jandial, Pinkesh Badjatiya, Pranit Chawla, Ayush Chopra, Mausoom Sarkar, Balaji Krishnamurthy
[Paper]

[ACM TOMCCAP, 2022] AMC: Adaptive Multi-expert Collaborative Network for Text-guided Image Retrieval
Hongguang Zhu, Yunchao Wei, Yao Zhao, Chunjie Zhang, Shujuan Huang
[Paper] [GitHub]

[CVPR, 2022] FashionVLP: Vision Language Transformer for Fashion Retrieval With Feedback
Sonam Goenka, Zhaoheng Zheng, Ayush Jaiswal, Rakesh Chada, Yue Wu, Varsha Hedau, Pradeep Natarajan
[Paper]

[arXiv, 2022] Composed image retrieval with text feedback via multi-grained uncertainty regularization
Yiyang Chen, Zhedong Zheng, Wei Ji, Leigang Qu, Tat-Seng Chua
[Paper]

[ACM MM, 2022] Comprehensive Relationship Reasoning for Composed Query Based Image Retrieval
Feifei Zhang, Ming Yan, Ji Zhang, Changsheng Xu
[Paper]

[ACM SIGIR, 2022] Progressive learning for image retrieval with hybrid-modality queries
Yida Zhao, Yuqing Song, Qin Jin
[Paper]

[CVPR, 2022] Conditioned and composed image retrieval combining and partially fine-tuning clip-based features
Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto Del Bimbo
[Paper]

[CVPR, 2022] Effective conditioned and composed image retrieval combining clip-based features
Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto Del Bimbo
[Paper]

[ECCV, 2022] “This is my unicorn, Fluffy”: Personalizing frozen vision-language representations
Niv Cohen, Rinon Gal, Eli A. Meirom, Gal Chechik, Yuval Atzmon
[Paper]

[MMAsia, 2022] Hierarchical Composition Learning for Composed Query Image Retrieval
Yahui Xu, Yi Bin, Guoqing Wang, Yang Yang
[Paper]

[IEEE TIP, 2022] Geometry Sensitive Cross-Modal Reasoning for Composed Query Based Image Retrieval
Feifei Zhang, Mingliang Xu, Changsheng Xu
[Paper]

2021

[ACM SIGIR, 2021] Comprehensive linguistic-visual composition network for image retrieval
Haokun Wen, Xuemeng Song, Xin Yang, Yibing Zhan, Liqiang Nie
[Paper]

[AAAI, 2021] Dual compositional learning in interactive image retrieval
Jongseok Kim, Youngjae Yu, Hoeseong Kim, Gunhee Kim
[Paper] [GitHub]

[CVPR, 2021] Leveraging Style and Content features for Text Conditioned Image Retrieval
Pranit Chawla, Surgan Jandial, Pinkesh Badjatiya, Ayush Chopra, Mausoom Sarkar, Balaji Krishnamurthy
[Paper]

[ICCV, 2021] Image retrieval on real-life images with pre-trained vision-and-language models
Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, Stephen Gould
[Paper] [GitHub]

[ACM SIGIR, 2021] Conversational fashion image retrieval via multiturn natural language feedback
Yifei Yuan, Wai Lam
[Paper] [GitHub]

[WACV, 2021] Compositional learning of image-text query for image retrieval
Muhammad Umer Anwaar, Egor Labintcev, Martin Kleinsteuber
[Paper] [GitHub]

[ACM MM, 2021] Cross-modal Joint Prediction and Alignment for Composed Query Image Retrieval
Yuchen Yang, Min Wang, Wengang Zhou, Houqiang Li
[Paper]

[ACM MM, 2021] Image Search with Text Feedback by Deep Hierarchical Attention Mutual Information Maximization
Chunbin Gu, Jiajun Bu, Zhen Zhang, Zhi Yu, Dongfang Ma, Wei Wang
[Paper]

[CVPR, 2021] CoSMo: Content-Style Modulation for Image Retrieval With Text Feedback
Seungmin Lee, Dongwan Kim, Bohyung Han
[Paper]

[arXiv, 2021] RTIC: Residual Learning for Text and Image Composition using Graph Convolutional Network
Minchul Shin, Yoonjae Cho, Byungsoo Ko, Geonmo Gu
[Paper]

[CVPR, 2021] Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback
Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, Rogerio Feris
[Paper]

2020

[ECCV, 2020] Learning joint visual semantic matching embeddings for language-guided retrieval
Yanbei Chen, Loris Bazzani
[Paper]

[arXiv, 2020] CurlingNet: Compositional Learning between Images and Text for Fashion IQ Data
Youngjae Yu, Seunghwan Lee, Yuncheol Choi, Gunhee Kim
[Paper]

[CVPR, 2020] Image search with text feedback by visiolinguistic attention learning
Yanbei Chen, Shaogang Gong, Loris Bazzani
[Paper] [GitHub]

[arXiv, 2020] Modality-Agnostic Attention Fusion for visual search with text feedback
Eric Dodds, Jack Culpepper, Simao Herdade, Yang Zhang, Kofi Boakye
[Paper] [GitHub]

[ACM MM, 2020] Joint Attribute Manipulation and Modality Alignment Learning for Composing Text and Image to Image Retrieval
Feifei Zhang, Mingliang Xu, Qirong Mao, Changsheng Xu
[Paper]

2019

[CVPR, 2019] Composing Text and Image for Image Retrieval - An Empirical Odyssey
Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Fei-Fei Li, James Hays
[Paper]

2018

[CVPR, 2018] Language-based image editing with recurrent attentive models
Jianbo Chen, Yelong Shen, Jianfeng Gao, Jingjing Liu, Xiaodong Liu
[Paper]

[NeurIPS, 2018] Dialog-based interactive image retrieval
Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, Gerald Tesauro, Rogerio Feris
[Paper] [GitHub]

2017

[ICCV, 2017] Automatic spatially-aware fashion concept discovery
Xintong Han, Zuxuan Wu, Phoenix X Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, Larry S Davis
[Paper]

[ICCV, 2017] Be your own prada: Fashion synthesis with structural coherence
Shizhan Zhu, Raquel Urtasun, Sanja Fidler, Dahua Lin, Chen Change Loy
[Paper] [GitHub]

Other multimodal composite retrieval

2024

[CVPR, 2024] Tri-modal motion retrieval by learning a joint embedding space
Kangning Yin, Shihao Zou, Yuxuan Ge, Zheng Tian
[Paper]

[WACV, 2024] Modality-Aware Representation Learning for Zero-shot Sketch-based Image Retrieval
Eunyi Lyou, Doyeon Lee, Jooeun Kim, Joonseok Lee
[Paper] [GitHub]

[CVPR, 2024] Pros: Prompting-to-simulate generalized knowledge for universal cross-domain retrieval
Kaipeng Fang, Jingkuan Song, Lianli Gao, Pengpeng Zeng, Zhi-Qi Cheng, Xiyao Li, Heng Tao Shen
[Paper] [GitHub]

[CVPR, 2024] You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval
Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song
[Paper]

[AAAI, 2024] T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models
Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan
[Paper] [GitHub]

[IEEE/CVF, 2024] TriCoLo: Trimodal contrastive loss for text to shape retrieval
Yue Ruan, Han-Hung Lee, Yiming Zhang, Ke Zhang, Angel X Chang
[Paper] [GitHub]

2023

[CVPR, 2023] SceneTrilogy: On Human Scene-Sketch and its Complementarity with Photo and Text
Pinaki Nath Chowdhury, Ayan Kumar Bhunia, Aneeshan Sain, Subhadeep Koley, Tao Xiang, Yi-Zhe Song
[Paper]

2022

[ECCV, 2022] A sketch is worth a thousand words: Image retrieval with text and sketch
Patsorn Sangkloy, Wittawat Jitkrittum, Diyi Yang, James Hays
[Paper]

[ECCV, 2022] MotionCLIP: Exposing human motion generation to CLIP space
Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, Daniel Cohen-Or
[Paper] [GitHub]

[IEEE J-STARS, 2022] Multimodal Fusion Remote Sensing Image–Audio Retrieval
Rui Yang, Shuang Wang, Yingzhi Sun, Huan Zhang, Yu Liao, Yu Gu, Biao Hou, Licheng Jiao
[Paper]

2021

[CVPR, 2021] Connecting what to say with where to look by modeling human attention traces
Zihang Meng, Licheng Yu, Ning Zhang, Tamara L Berg, Babak Damavandi, Vikas Singh, Amy Bearman
[Paper] [GitHub]

[ICCV, 2021] Telling the what while pointing to the where: Multimodal queries for image retrieval
Soravit Changpinyo, Jordi Pont-Tuset, Vittorio Ferrari, Radu Soricut
[Paper]

2020

[arXiv, 2020] A Feature Analysis for Multimodal News Retrieval
Golsa Tahmasebzadeh, Sherzod Hakimov, Eric Müller-Budack, Ralph Ewerth
[Paper] [GitHub]

2019

[MTA, 2019] Efficient and interactive spatial-semantic image retrieval
Ryosuke Furuta, Naoto Inoue, Toshihiko Yamasaki
[Paper]

[arXiv, 2019] Query by Semantic Sketch
Luca Rossetto, Ralph Gasser, Heiko Schuldt
[Paper]

2017

[IJCNLP, 2017] Draw and tell: Multimodal descriptions outperform verbal-or sketch-only descriptions in an image retrieval task
Ting Han, David Schlangen
[Paper]

[CVPR, 2017] Spatial-Semantic Image Search by Visual Feature Synthesis
Long Mai, Hailin Jin, Zhe Lin, Chen Fang, Jonathan Brandt, Feng Liu
[Paper]

[ACM MM, 2017] Region-based image retrieval revisited
Ryota Hinami, Yusuke Matsui, Shin'ichi Satoh
[Paper]

2014

[Cancer Informatics, 2014] Medical image retrieval: a multimodal approach
Yu Cao, Shawn Steffey, Jianbiao He, Degui Xiao, Cui Tao, Ping Chen, Henning Müller
[Paper]

2013

[SIGIR, 2013] NovaMedSearch: a multimodal search engine for medical case-based retrieval
André Mourão, Flávio Martins
[Paper]

[ICDAR, 2013] Multi-modal Information Integration for Document Retrieval
Ehtesham Hassan, Santanu Chaudhury, M. Gopal
[Paper]

2003

[EURASIP, 2003] Semantic indexing of multimedia content using visual, audio, and text cues
WH Adams, Giridharan Iyengar, Ching-Yung Lin, Milind Ramesh Naphade, Chalapathy Neti, Harriet J Nock, John R Smith
[Paper]

Datasets

Datasets for image-text composite editing

Dataset	Modalities	Scale	Link
Caltech-UCSD Birds (CUB)	Images, Captions	11K images, 11K attributes	Link
Oxford-102 flower	Images, Captions	8K images, 8K attributes	Link
CelebFaces Attributes (CelebA)	Images, Captions	202K images, 8M attributes	Link
DeepFashion (Fashion Synthesis)	Images, Captions	78K images, -	Link
MIT-Adobe 5k	Images, Captions	5K images, 20K texts	Link
MS-COCO	Image, Caption	164K images, 616K texts	Link
ReferIt	Image, Caption	19K images, 130K text	Link
CLEVR	3D images, Questions	100K images, 865K questions	Link
i-CLEVR	3D image, Instruction	10K sequences, 50K instructions	Link
CSS	3D images, 2D images, Instructions	34K images, -	Link
CoDraw	Images, Text instructions	9K images, -	Link
Cityscapes	Images, Captions	25K images, -	Link
Zap-Seq	Image sequences, Captions	8K images, 18K texts	-
DeepFashion-Seq	Image sequences, Captions	4K images, 12K texts	-
FFHQ	Images	70K images	Link
LSUN	Images	1M images	Link
Animal FacesHQ (AFHQ)	Images	15K images	Link
CelebA-HQ	Images	30K images	Link
Animal faces	Images	16K images	Link
Landscapes	Images	4K images	Link

Datasets for image-text composite retrieval

Dataset	Modalities	Scale	Link
Fashion200k	Image, Captions	200K images, 200K text	Link
MIT-States	Image, Captions	53K images, 53K text	Link
Fashion IQ	Image, Captions	77K images, -	Link
CIRR	Image, Captions	21K images, -	Link
CSS	3D images, 2D images, Instructions	34K images, -	Link
Shoes	Images	14K images	Link
Birds-to-Words	Images, Captions	-	Link
SketchyCOCO	Images, Sketches	14K sketches, 14K photos	Link
FSCOCO	Images, Sketches	10K sketches	Link

Datasets for other multimodal composite retrieval

Dataset	Modalities	Scale	Link
HumanML3D	Motions, Captions	14K motion sequences, 44K text	Link
KIT-ML	Motions, Captions	3K motion sequences, 6K text	Link
Text2Shape	Shapes, Captions	6K chairs, 8K tables, 70K text	Link
Flickr30k LocNar	Images, Captions	31K images, 155K texts	Link
Conceptual Captions	Images, Captions	3.3M images, 33M texts	Link
Sydney_IV	RS Images, Audio Captions	613 images, 3K audio descriptions	Link
UCM_IV	Images, Audio Captions	2K images, 10K audio descriptions	Link
RSICD_IV	Image, Audio Captions	11K images, 55K audio descriptions	Link

Experimental Results

Performance comparison on the Fashion-IQ dataset (VAL split)

Methods	Image Encoder	Dress R@10	Dress R@50	Shirt R@10	Shirt R@50	Toptee R@10	Toptee R@50	Average R@10	Average R@50	Avg.
ARTEMIS+LSTM	ResNet-18	25.23	48.64	20.35	43.67	23.36	46.97	22.98	46.43	34.70
ARTEMIS+BiGRU	ResNet-18	24.84	49.00	20.40	43.22	23.63	47.39	22.95	46.54	34.75
JPM(VAL,MSE)	ResNet-18	21.27	43.12	21.88	43.30	25.81	50.27	22.98	45.59	34.29
JPM(VAL,Tri)	ResNet-18	21.38	45.15	22.81	45.18	27.78	51.70	23.99	47.34	35.67
EER	ResNet-50	30.02	55.44	25.32	49.87	33.20	60.34	29.51	55.22	42.36
Ranking-aware	ResNet-50	34.80	60.22	45.01	69.06	47.68	74.85	42.50	68.04	55.27
CRN	ResNet-50	30.20	57.15	29.17	55.03	33.70	63.91	31.02	58.70	44.86
DWC	ResNet-50	32.67	57.96	35.53	60.11	40.13	66.09	36.11	61.39	48.75
DATIR	ResNet-50	21.90	43.80	21.90	43.70	27.20	51.60	23.70	46.40	35.05
CoSMo	ResNet-50	25.64	50.30	24.90	49.18	29.21	57.46	26.58	52.31	39.45
FashionVLP	ResNet-50	32.42	60.29	31.89	58.44	38.51	68.79	34.27	62.51	48.39
CLVC-Net	ResNet-50	29.85	56.47	28.75	54.76	33.50	64.00	30.70	58.41	44.56
SAC w/BERT	ResNet-50	26.52	51.01	28.02	51.86	32.70	61.23	29.08	54.70	41.89
SAC w/ Random Emb.	ResNet-50	26.13	52.10	26.20	50.93	31.16	59.05	27.83	54.03	40.93
DCNet	ResNet-50	28.95	56.07	23.95	47.30	30.44	58.29	27.78	53.89	40.83
AMC	ResNet-50	31.73	59.25	30.67	59.08	36.21	66.60	32.87	61.64	47.25
VAL(Lvv)	ResNet-50	21.12	42.19	21.03	43.44	25.64	49.49	22.60	45.04	33.82
ARTEMIS+LSTM	ResNet-50	27.34	51.71	21.05	44.18	24.91	49.87	24.43	48.59	36.51
ARTEMIS+BiGRU	ResNet-50	27.16	52.40	21.78	43.64	29.20	54.83	26.05	50.29	38.17
VAL(Lvv + Lvs)	ResNet-50	21.47	43.83	21.03	42.75	26.71	51.81	23.07	46.13	34.60
VAL(GloVe)	ResNet-50	22.53	44.00	22.38	44.15	27.53	51.68	24.15	46.61	35.38
AlRet	ResNet-50	30.19	58.80	29.39	55.69	37.66	64.97	32.36	59.76	46.12
RTIC	ResNet-50	19.40	43.51	16.93	38.36	21.58	47.88	19.30	43.25	31.28
RTIC-GCN	ResNet-50	19.79	43.55	16.95	38.67	21.97	49.11	19.57	43.78	31.68
Uncertainty (CLVC-Net)	ResNet-50	30.60	57.46	31.54	58.29	37.37	68.41	33.17	61.39	47.28
Uncertainty (CLIP4CIR)	ResNet-50	32.61	61.34	33.23	62.55	41.40	72.51	35.75	65.47	50.61
CRR	ResNet-101	30.41	57.11	33.67	64.48	30.73	58.02	31.60	59.87	45.74
CIRPLANT	ResNet-152	14.38	34.66	13.64	33.56	16.44	38.34	14.82	35.52	25.17
CIRPLANT w/OSCAR	ResNet-152	17.45	40.41	17.53	38.81	21.64	45.38	18.87	41.53	30.20
ComqueryFormer	Swin	33.86	61.08	35.57	62.19	42.07	69.30	37.17	64.19	50.68
CRN	Swin	30.34	57.61	29.83	55.54	33.91	64.04	31.36	59.06	45.21
CRN	Swin-L	32.67	59.30	30.27	56.97	37.74	65.94	33.56	60.74	47.15
BLIP4CIR1	BLIP-B	43.78	67.38	45.04	67.47	49.62	72.62	46.15	69.15	57.65
CASE	BLIP	47.44	69.36	48.48	70.23	50.18	72.24	48.79	70.68	59.74
BLIP4CIR2	BLIP	40.65	66.34	40.38	64.13	46.86	69.91	42.63	66.79	54.71
BLIP4CIR2+Bi	BLIP	42.09	67.33	41.76	64.28	46.61	70.32	43.49	67.31	55.40
CLIP4CIR3	CLIP	39.46	64.55	44.41	65.26	47.48	70.98	43.78	66.93	55.36
CLIP4CIR	CLIP	33.81	59.40	39.99	60.45	41.41	65.37	38.32	61.74	50.03
AlRet	CLIP-RN50	40.23	65.89	47.15	70.88	51.05	75.78	46.10	70.80	58.50
Combiner	CLIP-RN50	31.63	56.67	36.36	58.00	38.19	62.42	35.39	59.03	47.21
DQU-CIR	CLIP-H	57.63	78.56	62.14	80.38	66.15	85.73	61.97	81.56	71.77
PL4CIR	CLIP-L	38.18	64.50	48.63	71.54	52.32	76.90	46.37	70.98	58.68
TG-CIR	CLIP-B	45.22	69.66	52.60	72.52	56.14	77.10	51.32	73.09	62.21
PL4CIR	CLIP-B	33.22	59.99	46.17	68.79	46.46	73.84	41.98	67.54	54.76
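
For readers reproducing the columns above, the following is a minimal sketch of how Recall@K and the summary columns can be computed. It assumes that "Average R@10" and "Average R@50" are unweighted means over the three categories and that "Avg." is the mean of those two values, which is consistent with the rows in the table but is not stated explicitly in this list. The function names `recall_at_k` and `fashioniq_summary` are hypothetical and are not taken from any of the listed codebases.

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Percentage of queries whose target is among the top-k retrieved items.

    ranked_ids: one ranked list of candidate IDs per query (best match first).
    target_ids: the ground-truth target ID for each query.
    """
    hits = sum(target in ranked[:k] for ranked, target in zip(ranked_ids, target_ids))
    return 100.0 * hits / len(target_ids)


def fashioniq_summary(per_category):
    """Summarize per-category (R@10, R@50) pairs.

    Assumption: averages are unweighted means over categories, and the
    overall score is the mean of the two averaged recalls.
    """
    n = len(per_category)
    avg_r10 = sum(r10 for r10, _ in per_category.values()) / n
    avg_r50 = sum(r50 for _, r50 in per_category.values()) / n
    return avg_r10, avg_r50, (avg_r10 + avg_r50) / 2


# Per-category values copied from the ARTEMIS+LSTM (ResNet-18) row above.
artemis_lstm = {
    "dress": (25.23, 48.64),
    "shirt": (20.35, 43.67),
    "toptee": (23.36, 46.97),
}
avg_r10, avg_r50, avg = fashioniq_summary(artemis_lstm)
print(f"{avg_r10:.2f} {avg_r50:.2f} {avg:.2f}")
```

Small differences in the last digit relative to the table can arise from rounding at different stages, so the sketch is best used as a sanity check rather than as an exact reproduction.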

Performance comparison on the Fashion-IQ dataset (original split)

Methods	Image Encoder	Dress R@10	Dress R@50	Shirt R@10	Shirt R@50	Toptee R@10	Toptee R@50	Average R@10	Average R@50	Avg.
ComposeAE	ResNet-18	10.77	28.29	9.96	25.14	12.74	30.79	-	-	-
TIRG	ResNet-18	14.87	34.66	18.26	37.89	19.08	39.62	17.40	37.39	27.40
MAAF	ResNet-50	23.80	48.60	21.30	44.20	27.90	53.60	24.30	48.80	36.60
Leveraging	ResNet-50	19.33	43.52	14.47	35.47	19.73	44.56	17.84	41.18	29.51
MCR	ResNet-50	26.20	51.20	22.40	46.01	29.70	56.40	26.10	51.20	38.65
MCEM (L_CE)	ResNet-50	30.07	56.13	23.90	47.60	30.90	57.52	28.29	53.75	41.02
MCEM (L_FCE)	ResNet-50	31.50	58.41	25.01	49.73	32.77	61.02	29.76	56.39	43.07
MCEM (L_AFCE)	ResNet-50	33.23	59.16	26.15	50.87	33.83	61.40	31.07	57.14	44.11
AlRet	ResNet-50	27.34	53.42	21.30	43.08	29.07	54.21	25.86	50.17	38.02
MCEM (L_AFCE w/ BERT)	ResNet-50	32.11	59.21	27.28	52.01	33.96	62.30	31.12	57.84	44.48
JVSM	MobileNet-v1	10.70	25.90	12.00	27.10	13.00	26.90	11.90	26.63	19.27
FashionIQ (Dialog Turn 1)	EfficientNet-b	12.45	35.21	11.05	28.99	11.24	30.45	11.58	31.55	21.57
FashionIQ (Dialog Turn 5)	EfficientNet-b	41.35	73.63	33.91	63.42	33.52	63.85	36.26	66.97	51.61
AACL	Swin	29.89	55.85	24.82	48.85	30.88	56.85	28.53	53.85	41.19
ComqueryFormer	Swin	28.85	55.38	25.64	50.22	33.61	60.48	29.37	55.36	42.36
AlRet	CLIP	35.75	60.56	37.02	60.55	42.25	67.52	38.30	62.82	50.56
MCEM (L_AFCE)	CLIP	33.98	59.96	40.15	62.76	43.75	67.70	39.29	63.47	51.38
SPN (TG-CIR)	CLIP	36.84	60.83	41.85	63.89	45.59	68.79	41.43	64.50	52.97
SPN (CLIP4CIR)	CLIP	38.82	62.92	45.83	66.44	48.80	71.29	44.48	66.88	55.68
PL4CIR	CLIP-B	29.00	53.94	35.43	58.88	39.16	64.56	34.53	59.13	46.83
FAME-ViL	CLIP-B	42.19	67.38	47.64	68.79	50.69	73.07	46.84	69.75	58.30
PALAVRA	CLIP-B	17.25	35.94	21.49	37.05	20.55	38.76	19.76	37.25	28.51
MagicLens-B	CLIP-B	21.50	41.30	27.30	48.80	30.20	52.30	26.30	47.40	36.85
SEARLE	CLIP-B	18.54	39.51	24.44	41.61	25.70	46.46	22.89	42.53	32.71
CIReVL	CLIP-B	25.29	46.36	28.36	47.84	31.21	53.85	28.29	49.35	38.82
SEARLE-OTI	CLIP-B	17.85	39.91	25.37	41.32	24.12	45.79	22.44	42.34	32.39
PLI	CLIP-B	25.71	47.81	33.36	53.47	34.87	58.44	31.31	53.24	42.28
PL4CIR	CLIP-L	33.60	58.90	39.45	61.78	43.96	68.33	39.02	63.00	51.01
SEARLE-XL	CLIP-L	20.48	43.13	26.89	45.58	29.32	49.97	25.56	46.23	35.90
SEARLE-XL-OTI	CLIP-L	21.57	44.47	30.37	47.49	30.90	51.76	27.61	47.90	37.76
Context-I2W	CLIP-L	23.10	45.30	29.70	48.60	30.60	52.90	27.80	48.90	38.35
CompoDiff (with SynthTriplets18M)	CLIP-L	32.24	46.27	37.69	49.08	38.12	50.57	36.02	48.64	42.33
CompoDiff (with SynthTriplets18M)	CLIP-L	37.78	49.10	41.31	55.17	44.26	56.41	39.02	51.71	46.85
Pic2Word	CLIP-L	20.00	40.20	26.20	43.60	27.90	47.40	24.70	43.70	34.20
PLI	CLIP-L	28.11	51.12	38.63	58.51	39.42	62.68	35.39	57.44	46.42
KEDs	CLIP-L	21.70	43.80	28.90	48.00	29.90	51.90	26.80	47.90	37.35
CIReVL	CLIP-L	24.79	44.76	29.49	47.40	31.36	53.65	28.55	48.57	38.56
LinCIR	CLIP-L	20.92	42.44	29.10	46.81	28.81	50.18	26.28	46.49	36.39
MagicLens-L	CLIP-L	25.50	46.10	32.70	53.80	34.00	57.70	30.70	52.50	41.60
LinCIR	CLIP-H	29.80	52.11	36.90	57.75	42.07	62.52	36.26	57.46	46.86
DQU-CIR	CLIP-H	51.90	74.37	53.57	73.21	58.48	79.23	54.65	75.60	65.13
LinCIR	CLIP-G	38.08	60.88	46.76	65.11	50.48	71.09	45.11	65.69	55.40
CIReVL	CLIP-G	27.07	49.53	33.71	51.42	35.80	56.14	32.19	52.36	42.28
MagicLens-B	CoCa-B	29.00	48.90	36.50	55.50	40.20	61.90	35.20	55.40	45.30
MagicLens-L	CoCa-L	32.30	52.70	40.50	59.20	41.40	63.00	38.00	58.20	48.10
SPN (BLIP4CIR1)	BLIP	44.52	67.13	45.68	67.96	50.74	73.79	46.98	69.63	58.30
PLI	BLIP-B	28.62	50.78	38.09	57.79	40.92	62.68	35.88	57.08	46.48
SPN (SPRC)	BLIP-2	50.57	74.12	57.70	75.27	60.84	79.96	56.37	76.45	66.41
CurlingNet	-	24.44	47.69	18.59	40.57	25.19	49.66	22.74	45.97	34.36

Performance comparison on the Fashion200k dataset

Methods	Image Encoder	R@1	R@10	R@50
TIRG	ResNet-18	14.10	42.50	63.80
ComposeAE	ResNet-18	22.80	55.30	73.40
HCL	ResNet-18	23.48	54.03	73.71
CoSMo	ResNet-18	23.30	50.40	69.30
JPM(TIRG,MSE)	ResNet-18	19.80	46.50	66.60
JPM(TIRG,Tri)	ResNet-18	17.70	44.70	64.50
ARTEMIS	ResNet-18	21.50	51.10	70.50
GA(TIRG-BERT)	ResNet-18	31.40	54.10	77.60
LGLI	ResNet-18	26.50	58.60	75.60
AlRet	ResNet-18	24.42	53.93	73.25
FashionVLP	ResNet-18	-	49.90	70.50
CLVC-Net	ResNet-50	22.60	53.00	72.20
Uncertainty	ResNet-50	21.80	52.10	70.20
MCR	ResNet-50	49.40	69.40	59.40
CRN	ResNet-50	-	53.10	73.00
EER w/ Random Emb.	ResNet-50	-	51.09	70.23
EER w/ GloVe	ResNet-50	-	50.88	73.40
DWC	ResNet-50	36.49	63.58	79.02
JGAN	ResNet-101	17.34	45.28	65.65
CRR	ResNet-101	24.85	56.41	73.56
GSCMR	ResNet-101	21.57	52.84	70.12
VAL(GloVe)	MobileNet	22.90	50.80	73.30
VAL(Lvv+Lvs)	MobileNet	21.50	53.80	72.70
DATIR	MobileNet	21.50	48.80	71.60
VAL(Lvv)	MobileNet	21.20	49.00	68.80
JVSM	MobileNet-v1	19.00	52.10	70.00
TIS	MobileNet-v1	17.76	47.54	68.02
DCNet	MobileNet-v1	-	46.89	67.56
TIS	Inception-v3	16.25	44.14	65.02
LBF(big)	Faster-RCNN	17.78	48.35	68.50
LBF(small)	Faster-RCNN	16.26	46.90	71.73
ProVLA	Swin	21.70	53.70	74.60
CRN	Swin	-	53.30	73.30
ComqueryFormer	Swin	-	52.20	72.20
AACL	Swin	19.64	58.85	78.86
CRN	Swin-L	-	53.50	74.50
DQU-CIR	CLIP-H	36.80	67.90	87.80

Performance comparison on the MIT-States dataset

Methods	Image Encoder	R@1	R@10	R@50	Average
TIRG	ResNet-18	12.20	31.90	43.10	29.10
ComposeAE	ResNet-18	13.90	35.30	47.90	32.37
HCL	ResNet-18	15.22	35.95	46.71	32.63
GA(TIRG)	ResNet-18	13.60	32.40	43.20	29.70
GA(TIRG-BERT)	ResNet-18	15.40	36.30	47.70	33.20
GA(ComposeAE)	ResNet-18	14.60	37.00	47.90	33.20
LGLI	ResNet-18	14.90	36.40	47.70	33.00
MAAF	ResNet-50	12.70	32.60	44.80	-
MCR	ResNet-50	14.30	35.36	47.12	32.26
CRR	ResNet-101	17.71	37.16	47.83	34.23
JGAN	ResNet-101	14.27	33.21	45.34	30.94
GSCMR	ResNet-101	17.28	-	36.45	-
TIS	Inception-v3	13.13	31.94	43.32	29.46
LBF(big)	Faster-RCNN	14.72	35.30	46.56	32.19
LBF(small)	Faster-RCNN	14.29	34.67	46.06	-
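
The Average column in this table appears to be the arithmetic mean of R@1, R@10 and R@50. This is an assumption inferred from the rows themselves (for example ComposeAE and MCR); the source does not state it. A minimal sketch for checking a row under that assumption, where `average_recall` is a hypothetical helper name:

```python
def average_recall(r_at_1, r_at_10, r_at_50):
    """Mean of the three recall values, rounded to two decimals.

    Assumption: Average = (R@1 + R@10 + R@50) / 3.
    """
    return round((r_at_1 + r_at_10 + r_at_50) / 3, 2)


# (R@1, R@10, R@50) copied from three rows of the MIT-States table.
rows = {
    "ComposeAE": (13.90, 35.30, 47.90),
    "MCR": (14.30, 35.36, 47.12),
    "CRR": (17.71, 37.16, 47.83),
}

averages = {name: average_recall(*values) for name, values in rows.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
```

The same check can be run on any row that reports all three recall values; rows with a missing entry ("-") cannot be averaged this way.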

Performance comparison on the CSS dataset

Methods	Image Encoder	R@1(3D-to-3D)	R@1(2D-to-3D)
TIRG	ResNet-18	73.70	46.60
HCL	ResNet-18	81.59	58.65
GA(TIRG)	ResNet-18	91.20	-
TIRG+JPM(MSE)	ResNet-18	83.80	-
TIRG+JPM(Tri)	ResNet-18	83.20	-
LGLI	ResNet-18	93.30	-
MAAF	ResNet-50	87.80	-
CRR	ResNet-101	85.84	-
JGAN	ResNet-101	76.07	48.85
GSCMR	ResNet-101	81.81	58.74
TIS	Inception-v3	76.64	48.02
LBF(big)	Faster-RCNN	79.20	55.69
LBF(small)	Faster-RCNN	67.26	50.31

Performance comparison on the Shoes dataset

Methods	Image Encoder	R@1	R@10	R@50	Average
ComposeAE	ResNet-18	31.25	60.30	-	-
TIRG	ResNet-50	12.60	45.45	69.39	42.48
VAL(Lvv)	ResNet-50	16.49	49.12	73.53	46.38
VAL(Lvv + Lvs)	ResNet-50	16.98	49.83	73.91	46.91
VAL(GloVe)	ResNet-50	17.18	51.52	75.83	48.18
CoSMo	ResNet-50	16.72	48.36	75.64	46.91
CLVC-Net	ResNet-50	17.64	54.39	79.47	50.50
DCNet	ResNet-50	-	53.82	79.33	-
SAC w/BERT	ResNet-50	18.50	51.73	77.28	49.17
SAC w/Random Emb.	ResNet-50	18.11	52.41	75.42	48.64
ARTEMIS+LSTM	ResNet-50	17.60	51.05	76.85	48.50
ARTEMIS+BiGRU	ResNet-50	18.72	53.11	79.31	50.38
AMC	ResNet-50	19.99	56.89	79.27	52.05
DATIR	ResNet-50	17.20	51.10	75.60	47.97
MCR	ResNet-50	17.85	50.95	77.24	48.68
EER	ResNet-50	20.05	56.02	79.94	52.00
CRN	ResNet-50	17.19	53.88	79.12	50.06
Uncertainty	ResNet-50	18.41	53.63	79.84	50.63
FashionVLP	ResNet-50	-	49.08	77.32	-
DWC	ResNet-50	18.94	55.55	80.19	51.56
MCEM(L_CE)	ResNet-50	15.17	49.33	73.78	46.09
MCEM(L_FCE)	ResNet-50	18.13	54.31	78.65	50.36
MCEM(L_AFCE)	ResNet-50	19.10	55.37	79.57	51.35
AlRet	ResNet-50	18.13	53.98	78.81	50.31
RTIC	ResNet-50	43.66	72.11	-	-
RTIC-GCN	ResNet-50	43.38	72.09	-	-
CRR	ResNet-101	18.41	56.38	79.92	51.57
CRN	Swin	17.32	54.15	79.34	50.27
ProVLA	Swin	19.20	56.20	73.30	49.57
CRN	Swin-L	18.92	54.55	80.04	51.17
AlRet	CLIP	21.02	55.72	80.77	52.50
PL4CIR	CLIP-L	22.88	58.83	84.16	55.29
PL4CIR	CLIP-B	19.53	55.65	80.58	51.92
TG-CIR	CLIP-B	25.89	63.20	85.07	58.05
DQU-CIR	CLIP-H	31.47	69.19	88.52	63.06

Performance comparison on the CIRR dataset

Methods	Image Encoder	R@1	R@5	R@10	R@50
ComposeAE	ResNet-18	-	29.60	59.82	-
MCEM(L_CE)	ResNet-18	14.26	40.46	55.61	85.66
MCEM(L_FCE)	ResNet-18	16.12	43.92	58.87	86.85
MCEM(L_AFCE)	ResNet-18	17.48	46.13	62.17	88.91
Ranking-aware	ResNet-50	32.24	66.63	79.23	96.43
SAC w/BERT	ResNet-50	-	19.56	45.24	-
SAC w/Random Emb.	ResNet-50	-	20.34	44.94	-
ARTEMIS+BiGRU	ResNet-152	16.96	46.10	61.31	87.73
CIRPLANT	ResNet-152	15.18	43.36	60.48	87.64
CIRPLANT w/ OSCAR	ResNet-152	19.55	52.55	68.39	92.38
CASE	ViT	48.00	79.11	87.25	97.57
ComqueryFormer	Swin	25.76	61.76	75.90	95.13
CLIP4CIR	CLIP	38.53	69.98	81.86	95.93
CLIP4CIR3	CLIP	44.82	77.04	86.65	97.90
SPN(TG-CIR)	CLIP	47.28	79.13	87.98	97.54
SPN(CLIP4CIR)	CLIP	45.33	78.07	87.61	98.17
Combiner	CLIP	33.59	65.35	77.35	95.21
MCEM(L_AFCE)	CLIP	39.80	74.24	85.71	97.23
TG-CIR	CLIP-B	45.25	78.29	87.16	97.30
CIReVL	CLIP-B	23.94	52.51	66.00	86.95
SEARLE-OTI	CLIP-B	24.27	53.25	66.10	88.84
SEARLE	CLIP-B	24.00	53.42	66.82	89.78
PLI	CLIP-B	18.80	46.07	60.75	86.41
SEARLE-XL	CLIP-L	24.24	52.48	66.29	88.84
SEARLE-XL-OTI	CLIP-L	24.87	52.31	66.29	88.58
CIReVL	CLIP-L	24.55	52.31	64.92	86.34
Context-I2W	CLIP-L	25.60	55.10	68.50	89.80
Pic2Word	CLIP-L	23.90	51.70	65.30	87.80
CompoDiff(with SynthTriplets18M)	CLIP-L	18.24	53.14	70.82	90.25
LinCIR	CLIP-L	25.04	53.25	66.68	-
PLI	CLIP-L	25.52	54.58	67.59	88.70
KEDs	CLIP-L	26.40	54.80	67.20	89.20
CIReVL	CLIP-G	34.65	64.29	75.06	91.66
LinCIR	CLIP-G	35.25	64.72	76.05	-
CompoDiff(with SynthTriplets18M)	CLIP-G	26.71	55.14	74.52	92.01
LinCIR	CLIP-H	33.83	63.52	75.35	-
DQU-CIR	CLIP-H	46.22	78.17	87.64	97.81
PLI	BLIP	27.23	58.87	71.40	91.25
BLIP4CIR2	BLIP	40.17	71.81	83.18	95.69
BLIP4CIR2+Bi	BLIP	40.15	73.08	83.88	96.27
SPN(BLIP4CIR1)	BLIP	46.43	77.64	87.01	97.06
SPN(SPRC)	BLIP-2	55.06	83.83	90.87	98.29
BLIP4CIR1	BLIP-B	46.83	78.59	88.04	97.08

[NOTE] If you have any questions, please don't hesitate to contact us.
