Multimodal Composite Editing and Retrieval

🔥🔥 This is a collection of awesome articles on multimodal composite editing and retrieval. 🔥🔥

[NEWS.20240909] The related survey paper has been released.

If you find this repository useful, please cite our paper:

@misc{li2024survey,
      title={A Survey of Multimodal Composite Editing and Retrieval},
      author={Suyan Li and Fuxiang Huang and Lei Zhang},
      year={2024},
      eprint={2409.05405},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
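
As background for the lists below: in composite (multimodal) retrieval, the query is a reference image plus a modification text, and the two are fused into one embedding that is matched against a gallery. The sketch below is our own minimal illustration with placeholder encoders and a simple additive fusion; the surveyed methods replace both pieces with learned models (e.g., TIRG- or Combiner-style fusion over CLIP features).

# Minimal composed (image + text) retrieval sketch. The encoders here are
# stand-ins (random vectors), NOT any paper's model; a real system would use
# CLIP-style image/text towers.
import numpy as np

rng = np.random.default_rng(0)
DIM = 512

def encode_image(image) -> np.ndarray:
    # Placeholder for a vision encoder.
    return rng.normal(size=DIM)

def encode_text(text: str) -> np.ndarray:
    # Placeholder for a text encoder.
    return rng.normal(size=DIM)

def compose(img_feat: np.ndarray, txt_feat: np.ndarray) -> np.ndarray:
    # Simplest late-fusion baseline: add the two modalities, then L2-normalize.
    # Most papers above replace this step with a learned fusion module.
    q = img_feat + txt_feat
    return q / np.linalg.norm(q)

# Gallery of candidate images, pre-encoded and normalized.
gallery = rng.normal(size=(1000, DIM))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

# Query: a reference image plus a modification text.
query = compose(encode_image("ref.jpg"), encode_text("the same dress but in red"))

# Rank the gallery by cosine similarity; the top-K indices are the retrieved images.
scores = gallery @ query
topk = np.argsort(-scores)[:10]
print(topk)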

Papers and related code

Image-text composite editing

2024

[WACV, 2024] Text-to-Image Editing by Image Information Removal
Zhongping Zhang, Jian Zheng, Zhiyuan Fang, Bryan A. Plummer
[Paper]

[WACV, 2024] Shape-Guided Diffusion with Inside-Outside Attention
Dong Huk Park, Grace Luo, Clayton Toste, Samaneh Azadi, Xihui Liu, Maka Karalashvili, Anna Rohrbach, Trevor Darrell
[Paper]

2023

[IEEE Access, 2023] Text-Guided Image Manipulation via Generative Adversarial Network With Referring Image Segmentation-Based Guidance
Yuto Watanabe, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
[Paper]

[arXiv, 2023] InstructEdit: Improving Automatic Masks for Diffusion-Based Image Editing with User Instructions
Qian Wang, Biao Zhang, Michael Birsak, Peter Wonka
[Paper] [GitHub]

[ICLR, 2023] DiffEdit: Diffusion-Based Semantic Image Editing with Mask Guidance
Guillaume Couairon, Jakob Verbeek, Holger Schwenk, Matthieu Cord
[Paper] [GitHub]

[CVPR, 2023] SINE: Single Image Editing with Text-to-Image Diffusion Models
Zhixing Zhang, Ligong Han, Arnab Ghosh, Dimitris N Metaxas, Jian Ren
[Paper] [GitHub]

[CVPR, 2023] Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
Narek Tumanyan, Michal Geyer, Shai Bagon, Tali Dekel
[Paper] [GitHub]

[arXiv, 2023] PRedItOR: Text Guided Image Editing with Diffusion Prior
Hareesh Ravi, Sachin Kelkar, Midhun Harikumar, Ajinkya Kale
[Paper]

[TOG, 2023] Unitune: Text-Driven Image Editing by Fine Tuning a Diffusion Model on a Single Image
Dani Valevski, Matan Kalman, Eyal Molad, Eyal Segalis, Yossi Matias, Yaniv Leviathan
[Paper] [GitHub]

[arXiv, 2023] Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models
Jooyoung Choi, Yunjey Choi, Yunji Kim, Junho Kim, Sungroh Yoon
[Paper] [GitHub]

[CVPR, 2023] Imagic: Text-Based Real Image Editing with Diffusion Models
Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, Michal Irani
[Paper] [GitHub]

[ICLR, 2023] Diffusion-Based Image Translation Using Disentangled Style and Content Representation
Gihyun Kwon, Jong Chul Ye
[Paper] [GitHub]

[arXiv, 2023] MDP: A Generalized Framework for Text-Guided Image Editing by Manipulating the Diffusion Path
Qian Wang, Biao Zhang, Michael Birsak, Peter Wonka
[Paper] [GitHub]

[CVPR, 2023] InstructPix2Pix: Learning to Follow Image Editing Instructions
Tim Brooks, Aleksander Holynski, Alexei A. Efros
[Paper] [GitHub]

[ICCV, 2023] Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models
Wenkai Dong, Song Xue, Xiaoyue Duan, Shumin Han
[Paper]

[arXiv, 2023] DeltaSpace: A Semantic-Aligned Feature Space for Flexible Text-Guided Image Editing
Yueming Lyu, Kang Zhao, Bo Peng, Yue Jiang, Yingya Zhang, Jing Dong
[Paper]

[AAAI, 2023] DE-Net: Dynamic Text-Guided Image Editing Adversarial Networks
Ming Tao, Bing-Kun Bao, Hao Tang, Fei Wu, Longhui Wei, Qi Tian
[Paper] [GitHub]

2022

[ACM MM, 2022] LS-GAN: Iterative Language-Based Image Manipulation via Long and Short Term Consistency Reasoning
Gaoxiang Cong, Liang Li, Zhenhuan Liu, Yunbin Tu, Weijun Qin, Shenyuan Zhang, Chenggang Yan, Wenyu Wang, Bin Jiang
[Paper]

[arXiv, 2022] FEAT: Face Editing with Attention
Xianxu Hou, Linlin Shen, Or Patashnik, Daniel Cohen-Or, Hui Huang
[Paper]

[ECCV, 2022] VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance
Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, Edward Raff
[Paper] [GitHub]

[ICML, 2022] GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, Mark Chen
[Paper] [GitHub]

[WACV, 2022] StyleMC: Multi-Channel Based Fast Text-Guided Image Generation and Manipulation
Umut Kocasari, Alara Dirik, Mert Tiftikci, Pinar Yanardag
[Paper] [GitHub] [website]

[CVPR, 2022] HairCLIP: Design Your Hair by Text and Reference Image
Tianyi Wei, Dongdong Chen, Wenbo Zhou, Jing Liao, Zhentao Tan, Lu Yuan, Weiming Zhang, Nenghai Yu
[Paper] [GitHub]

[NeurIPS, 2022] One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations
Yiming Zhu, Hongyu Liu, Yibing Song, Ziyang Yuan, Xintong Han, Chun Yuan, Qifeng Chen, Jue Wang
[Paper] [GitHub]

[CVPR, 2022] Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model
Zipeng Xu, Tianwei Lin, Hao Tang, Fu Li, Dongliang He, Nicu Sebe, Radu Timofte, Luc Van Gool, Errui Ding
[Paper] [GitHub]

[CVPR, 2022] Blended Diffusion for Text-Driven Editing of Natural Images
Omri Avrahami, Dani Lischinski, Ohad Fried
[Paper] [GitHub]

[CVPR, 2022] DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation
Gwanghyun Kim, Taesung Kwon, Jong Chul Ye
[Paper] [GitHub]

[ICLR, 2022] SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, Stefano Ermon
[Paper] [GitHub] [Website]

2021

[CVPR, 2021] TediGAN: Text-guided diverse face image generation and manipulation
Weihao Xia, Yujiu Yang, Jing-Hao Xue, Baoyuan Wu
[Paper] [GitHub]

[ICIP, 2021] Segmentation-Aware Text-Guided Image Manipulation
Tomoki Haruyama, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
[Paper] [GitHub]

[IJPR, 2021] FocusGAN: Preserving Background in Text-Guided Image Editing
Liuqing Zhao, Linyan Li, Fuyuan Hu, Zhenping Xia, Rui Yao
[Paper]

[ICCV, 2021] StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, Dani Lischinski
[Paper] [GitHub]

[ACM MM, 2021] Text as Neural Operator: Image Manipulation by Text Instruction
Tianhao Zhang, Hung-Yu Tseng, Lu Jiang, Weilong Yang, Honglak Lee, Irfan Essa
[Paper] [GitHub]

[arXiv, 2021] Paint by Word
Alex Andonian, Sabrina Osmany, Audrey Cui, YeonHwan Park, Ali Jahanian, Antonio Torralba, David Bau
[Paper] [GitHub] [Website]

[CVPR, 2021] Learning by Planning: Language-Guided Global Image Editing
Jing Shi, Ning Xu, Yihang Xu, Trung Bui, Franck Dernoncourt, Chenliang Xu
[Paper] [GitHub]

2020

[ACM MM, 2020] IR-GAN: Image Manipulation with Linguistic Instruction by Increment Reasoning
Zhenhuan Liu, Jincan Deng, Liang Li, Shaofei Cai, Qianqian Xu, Shuhui Wang, Qingming Huang
[Paper] [GitHub]

[CVPR, 2020] ManiGAN: Text-Guided Image Manipulation
Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, Philip HS Torr
[Paper] [GitHub]

[NeurIPS, 2020] Lightweight Generative Adversarial Networks for Text-Guided Image Manipulation
Bowen Li, Xiaojuan Qi, Philip Torr, Thomas Lukasiewicz
[Paper] [GitHub]

[ECCV, 2020] CAFE-GAN: Arbitrary Face Attribute Editing with Complementary Attention Feature
Jeong-gi Kwak, David K. Han, Hanseok Ko
[Paper] [GitHub]

[ECCV, 2020] Open-Edit: Open-Domain Image Manipulation with Open-Vocabulary Instructions
Xihui Liu, Zhe Lin, Jianming Zhang, Handong Zhao, Quan Tran, Xiaogang Wang, Hongsheng Li
[Paper] [GitHub]

[CVPR, 2020] Composed Query Image Retrieval Using Locally Bounded Features
Mehrdad Hosseinzadeh, Yang Wang
[Paper]

2019

[ICASSP, 2019] Bilinear Representation for Language-based Image Editing Using Conditional Generative Adversarial Networks
Xiaofeng Mao, Yuefeng Chen, Yuhong Li, Tao Xiong, Yuan He, Hui Xue
[Paper] [GitHub]

2018

[NeurIPS, 2018] Text-Adaptive Generative Adversarial Networks: Manipulating Images with Natural Language
Seonghyeon Nam, Yunji Kim, Seon Joo Kim
[Paper] [GitHub]

[CVPR, 2018] Language-based image editing with recurrent attentive models
Jianbo Chen, Yelong Shen, Jianfeng Gao, Jingjing Liu, Xiaodong Liu
[Paper]

[arXiv, 2018] Interactive Image Manipulation with Natural Language Instruction Commands
Seitaro Shinagawa, Koichiro Yoshino, Sakriani Sakti, Yu Suzuki, Satoshi Nakamura
[Paper]

2017

[ICCV, 2017] Semantic image synthesis via adversarial learning
Hao Dong, Simiao Yu, Chao Wu, Yike Guo
[Paper] [GitHub]

Image-text composite retrieval

2024

[AAAI, 2024] Dynamic weighted combiner for mixed-modal image retrieval
Fuxiang Huang, Lei Zhang, Xiaowei Fu, Suqi Song
[Paper] [GitHub]

[ICMR, 2024] Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models
Hongyi Zhu, Jia-Hong Huang, Stevan Rudinac, Evangelos Kanoulas
[Paper] [GitHub]

[ACM MM, 2024] Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives
Zhangchi Feng, Richong Zhang, Zhijie Nie
[Paper] [GitHub]

[CVPR, 2024] Language-only Training of Zero-shot Composed Image Retrieval
Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, Sangdoo Yun
[Paper] [GitHub]

[AAAI, 2024] Context-I2W: Mapping Images to Context-dependent Words for Accurate Zero-Shot Composed Image Retrieval
Yuanmin Tang, Jing Yu, Keke Gai, Jiamin Zhuang, Gang Xiong, Yue Hu, Qi Wu
[Paper] [GitHub]

[CVPR, 2024] Knowledge-enhanced dual-stream zero-shot composed image retrieval
Yucheng Suo, Fan Ma, Linchao Zhu, Yi Yang
[Paper]

[WACV, 2024] Bi-directional training for composed image retrieval via text prompt learning
Zheyuan Liu, Weixuan Sun, Yicong Hong, Damien Teney, Stephen Gould
[Paper]

[AAAI, 2024] Data roaming and quality assessment for composed image retrieval
Matan Levy, Rami Ben-Ari, Nir Darshan, Dani Lischinski
[Paper]

[TMLR, 2024] Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder
Zheyuan Liu, Weixuan Sun, Damien Teney, Stephen Gould
[Paper]

[ICML, 2024] MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, Ming-Wei Chang
[Paper]

[ICLR, 2024] Vision-by-Language for Training-Free Compositional Image Retrieval
Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata
[Paper]

[CVPR, 2024] CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion
Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, Sangdoo Yun
[Paper]

[ACM SIGIR, 2024] Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval
Haokun Wen, Xuemeng Song, Xiaolin Chen, Yinwei Wei, Liqiang Nie, Tat-Seng Chua
[Paper]

[IEEE TIP, 2024] Multimodal Composition Example Mining for Composed Query Image Retrieval
Gangjian Zhang, Shikun Li, Shikui Wei, Shiming Ge, Na Cai, Yao Zhao
[Paper]

[IEEE TMM, 2024] Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback
Yahui Xu, Yi Bin, Jiwei Wei, Yang Yang, Guoqing Wang, Heng Tao Shen
[Paper]

2023

[CVPR, 2023] Fame-vil: Multi-tasking vision-language model for heterogeneous fashion tasks
Xiao Han, Xiatian Zhu, Licheng Yu, Li Zhang, Yi-Zhe Song, Tao Xiang
[Paper] [GitHub]

[ICCV, 2023] FashionNTM: Multi-turn fashion image retrieval via cascaded memory
Anwesan Pal, Sahil Wadhwa, Ayush Jaiswal, Xu Zhang, Yue Wu, Rakesh Chada, Pradeep Natarajan, Henrik I Christensen
[Paper]

[CVPR, 2023] Pic2word: Mapping pictures to words for zero-shot composed image retrieval
Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, Tomas Pfister
[Paper] [GitHub]

[arXiv, 2023] Pretrain like you inference: Masked tuning improves zero-shot composed image retrieval
Junyang Chen, Hanjiang Lai
[Paper]

[ICCV, 2023] Zero-shot composed image retrieval with textual inversion
Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Alberto Del Bimbo
[Paper] [GitHub]

[ACM TOMM, 2023] Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features
Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto Del Bimbo
[Paper] [GitHub]

[arXiv, 2023] Ranking-aware Uncertainty for Text-guided Image Retrieval
Junyang Chen, Hanjiang Lai
[Paper]

[IEEE TIP, 2023] Composed Image Retrieval via Cross Relation Network With Hierarchical Aggregation Transformer
Qu Yang, Mang Ye, Zhaohui Cai, Kehua Su, Bo Du
[Paper]

[IEEE TMM, 2023] Multi-Modal Transformer With Global-Local Alignment for Composed Query Image Retrieval
Yahui Xu, Yi Bin, Jiwei Wei, Yang Yang, Guoqing Wang, Heng Tao Shen
[Paper]

[ACM MM, 2023] Target-Guided Composed Image Retrieval
Haokun Wen, Xian Zhang, Xuemeng Song, Yinwei Wei, Liqiang Nie
[Paper]

[ICCV, 2023] ProVLA: Compositional Image Search with Progressive Vision-Language Alignment and Multimodal Fusion
Zhizhang Hu, Xinliang Zhu, Son Tran, René Vidal, Arnab Dhua
[Paper]

[CVPR, 2023] Language Guided Local Infiltration for Interactive Image Retrieval
Fuxiang Huang, Lei Zhang
[Paper]

2022

[IEEE TMM, 2022] Adversarial and isotropic gradient augmentation for image retrieval with text feedback
Fuxiang Huang, Lei Zhang, Yuhang Zhou, Xinbo Gao
[Paper]

[ACM TOMM, 2022] Tell, imagine, and search: End-to-end learning for composing text and image to image retrieval
Feifei Zhang, Mingliang Xu, Changsheng Xu
[Paper]

[arXiv, 2022] Image Search with Text Feedback by Additive Attention Compositional Learning
Yuxin Tian, Shawn Newsam, Kofi Boakye
[Paper]

[IEEE TMM, 2022] Heterogeneous feature alignment and fusion in cross-modal augmented space for composed image retrieval
Huaxin Pang, Shikui Wei, Gangjian Zhang, Shiyin Zhang, Shuang Qiu, Yao Zhao
[Paper]

[IEEE TIP, 2022] Composed Image Retrieval via Explicit Erasure and Replenishment With Semantic Alignment
Gangjian Zhang, Shikui Wei, Huaxin Pang, Shuang Qiu, Yao Zhao
[Paper]

[ICLR, 2022] ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity
Ginger Delmas, Rafael S. Rezende, Gabriela Csurka, Diane Larlus
[Paper] [GitHub]

[WACV, 2022] SAC: Semantic attention composition for text-conditioned image retrieval
Surgan Jandial, Pinkesh Badjatiya, Pranit Chawla, Ayush Chopra, Mausoom Sarkar, Balaji Krishnamurthy
[Paper]

[ACM TOMCCAP, 2022] AMC: Adaptive Multi-expert Collaborative Network for Text-guided Image Retrieval
Hongguang Zhu, Yunchao Wei, Yao Zhao, Chunjie Zhang, Shujuan Huang
[Paper] [GitHub]

[CVPR, 2022] FashionVLP: Vision Language Transformer for Fashion Retrieval With Feedback
Sonam Goenka, Zhaoheng Zheng, Ayush Jaiswal, Rakesh Chada, Yue Wu, Varsha Hedau, Pradeep Natarajan
[Paper]

[arXiv, 2022] Composed image retrieval with text feedback via multi-grained uncertainty regularization
Yiyang Chen, Zhedong Zheng, Wei Ji, Leigang Qu, Tat-Seng Chua
[Paper]

[ACM MM, 2022] Comprehensive Relationship Reasoning for Composed Query Based Image Retrieval
Feifei Zhang, Ming Yan, Ji Zhang, Changsheng Xu
[Paper]

[ACM SIGIR, 2022] Progressive learning for image retrieval with hybrid-modality queries
Yida Zhao, Yuqing Song, Qin Jin
[Paper]

[CVPR, 2022] Conditioned and composed image retrieval combining and partially fine-tuning clip-based features
Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto Del Bimbo
[Paper]

[CVPR, 2022] Effective conditioned and composed image retrieval combining clip-based features
Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto Del Bimbo
[Paper]

[ECCV, 2022] “This is my unicorn, Fluffy”: Personalizing frozen vision-language representations
Niv Cohen, Rinon Gal, Eli A. Meirom, Gal Chechik, Yuval Atzmon
[Paper]

[MMAsia, 2022] Hierarchical Composition Learning for Composed Query Image Retrieval
Yahui Xu, Yi Bin, Guoqing Wang, Yang Yang
[Paper]

[IEEE TIP, 2022] Geometry Sensitive Cross-Modal Reasoning for Composed Query Based Image Retrieval
Feifei Zhang, Mingliang Xu, Changsheng Xu
[Paper]

2021

[ACM SIGIR, 2021] Comprehensive linguistic-visual composition network for image retrieval
Haokun Wen, Xuemeng Song, Xin Yang, Yibing Zhan, Liqiang Nie
[Paper]

[AAAI, 2021] Dual compositional learning in interactive image retrieval
Jongseok Kim, Youngjae Yu, Hoeseong Kim, Gunhee Kim
[Paper] [GitHub]

[CVPR, 2021] Leveraging Style and Content features for Text Conditioned Image Retrieval
Pranit Chawla, Surgan Jandial, Pinkesh Badjatiya, Ayush Chopra, Mausoom Sarkar, Balaji Krishnamurthy
[Paper]

[ICCV, 2021] Image retrieval on real-life images with pre-trained vision-and-language models
Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, Stephen Gould
[Paper] [GitHub]

[ACM SIGIR, 2021] Conversational fashion image retrieval via multiturn natural language feedback
Yifei Yuan, Wai Lam
[Paper] [GitHub]

[WACV, 2021] Compositional learning of image-text query for image retrieval
Muhammad Umer Anwaar, Egor Labintcev, Martin Kleinsteuber
[Paper] [GitHub]

[ACM MM, 2021] Cross-modal Joint Prediction and Alignment for Composed Query Image Retrieval
Yuchen Yang, Min Wang, Wengang Zhou, Houqiang Li
[Paper]

[ACM MM, 2021] Image Search with Text Feedback by Deep Hierarchical Attention Mutual Information Maximization
Chunbin Gu, Jiajun Bu, Zhen Zhang, Zhi Yu, Dongfang Ma, Wei Wang
[Paper]

[CVPR, 2021] CoSMo: Content-Style Modulation for Image Retrieval With Text Feedback
Seungmin Lee, Dongwan Kim, Bohyung Han
[Paper]

[arXiv, 2021] RTIC: Residual Learning for Text and Image Composition using Graph Convolutional Network
Minchul Shin, Yoonjae Cho, Byungsoo Ko, Geonmo Gu
[Paper]

[CVPR, 2021] Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback
Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, Rogerio Feris
[Paper]

2020

[ECCV, 2020] Learning joint visual semantic matching embeddings for language-guided retrieval
Yanbei Chen, Loris Bazzani
[Paper]

[arXiv, 2020] CurlingNet: Compositional Learning between Images and Text for Fashion IQ Data
Youngjae Yu, Seunghwan Lee, Yuncheol Choi, Gunhee Kim
[Paper]

[CVPR, 2020] Image search with text feedback by visiolinguistic attention learning
Yanbei Chen, Shaogang Gong, Loris Bazzani
[Paper] [GitHub]

[arXiv, 2020] Modality-Agnostic Attention Fusion for visual search with text feedback
Eric Dodds, Jack Culpepper, Simao Herdade, Yang Zhang, Kofi Boakye
[Paper] [GitHub]

[ACM MM, 2020] Joint Attribute Manipulation and Modality Alignment Learning for Composing Text and Image to Image Retrieval
Feifei Zhang, Mingliang Xu, Qirong Mao, Changsheng Xu
[Paper]

2019

[CVPR, 2019] Composing Text and Image for Image Retrieval - An Empirical Odyssey
Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Fei-Fei Li, James Hays
[Paper]

2018

[NeurIPS, 2018] Dialog-based interactive image retrieval
Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, Gerald Tesauro, Rogerio Feris
[Paper] [GitHub]

2017

[ICCV, 2017] Automatic spatially-aware fashion concept discovery
Xintong Han, Zuxuan Wu, Phoenix X Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, Larry S Davis
[Paper]

[ICCV, 2017] Be your own prada: Fashion synthesis with structural coherence
Shizhan Zhu, Raquel Urtasun, Sanja Fidler, Dahua Lin, Chen Change Loy
[Paper] [GitHub]

Other multimodal composite retrieval

2024

[CVPR, 2024] Tri-modal motion retrieval by learning a joint embedding space
Kangning Yin, Shihao Zou, Yuxuan Ge, Zheng Tian
[Paper]

[WACV, 2024] Modality-Aware Representation Learning for Zero-shot Sketch-based Image Retrieval
Eunyi Lyou, Doyeon Lee, Jooeun Kim, Joonseok Lee
[Paper] [GitHub]

[CVPR, 2024] Pros: Prompting-to-simulate generalized knowledge for universal cross-domain retrieval
Kaipeng Fang, Jingkuan Song, Lianli Gao, Pengpeng Zeng, Zhi-Qi Cheng, Xiyao Li, Heng Tao Shen
[Paper] [GitHub]

[CVPR, 2024] You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval
Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song
[Paper]

[AAAI, 2024] T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models
Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan
[Paper] [GitHub]

[WACV, 2024] TriCoLo: Trimodal contrastive loss for text to shape retrieval
Yue Ruan, Han-Hung Lee, Yiming Zhang, Ke Zhang, Angel X Chang
[Paper] [GitHub]

2023

[CVPR, 2023] SceneTrilogy: On Human Scene-Sketch and its Complementarity with Photo and Text
Pinaki Nath Chowdhury, Ayan Kumar Bhunia, Aneeshan Sain, Subhadeep Koley, Tao Xiang, Yi-Zhe Song
[Paper]

2022

[ECCV, 2022] A sketch is worth a thousand words: Image retrieval with text and sketch
Patsorn Sangkloy, Wittawat Jitkrittum, Diyi Yang, James Hays
[Paper]

[ECCV, 2022] Motionclip: Exposing human motion generation to clip space
Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, Daniel Cohen-Or
[Paper] [GitHub]

[IEEE J-STARS, 2022] Multimodal Fusion Remote Sensing Image–Audio Retrieval
Rui Yang, Shuang Wang, Yingzhi Sun, Huan Zhang, Yu Liao, Yu Gu, Biao Hou, Licheng Jiao
[Paper]

2021

[CVPR, 2021] Connecting what to say with where to look by modeling human attention traces
Zihang Meng, Licheng Yu, Ning Zhang, Tamara L Berg, Babak Damavandi, Vikas Singh, Amy Bearman
[Paper] [GitHub]

[ICCV, 2021] Telling the what while pointing to the where: Multimodal queries for image retrieval
Soravit Changpinyo, Jordi Pont-Tuset, Vittorio Ferrari, Radu Soricut
[Paper]

2020

[arXiv, 2020] A Feature Analysis for Multimodal News Retrieval
Golsa Tahmasebzadeh, Sherzod Hakimov, Eric Müller-Budack, Ralph Ewerth
[Paper] [GitHub]

2019

[MTA, 2019] Efficient and interactive spatial-semantic image retrieval
Ryosuke Furuta, Naoto Inoue, Toshihiko Yamasaki
[Paper]

[arXiv, 2019] Query by Semantic Sketch
Luca Rossetto, Ralph Gasser, Heiko Schuldt
[Paper]

2017

[IJCNLP, 2017] Draw and tell: Multimodal descriptions outperform verbal- or sketch-only descriptions in an image retrieval task
Ting Han, David Schlangen
[Paper]

[CVPR, 2017] Spatial-Semantic Image Search by Visual Feature Synthesis
Long Mai, Hailin Jin, Zhe Lin, Chen Fang, Jonathan Brandt, Feng Liu
[Paper]

[ACM MM, 2017] Region-based image retrieval revisited
Ryota Hinami, Yusuke Matsui, Shin'ichi Satoh
[Paper]

2014

[Cancer Informatics, 2014] Medical image retrieval: a multimodal approach
Yu Cao, Shawn Steffey, Jianbiao He, Degui Xiao, Cui Tao, Ping Chen, Henning Müller
[Paper]

2013

[SIGIR, 2013] NovaMedSearch: a multimodal search engine for medical case-based retrieval
André Mourão, Flávio Martins
[Paper]

[ICDAR, 2013] Multi-modal Information Integration for Document Retrieval
Ehtesham Hassan, Santanu Chaudhury, M. Gopal
[Paper]

2003

[EURASIP, 2003] Semantic indexing of multimedia content using visual, audio, and text cues
WH Adams, Giridharan Iyengar, Ching-Yung Lin, Milind Ramesh Naphade, Chalapathy Neti, Harriet J Nock, John R Smith
[Paper]

Datasets

Datasets for image-text composite editing

Dataset Modalities Scale Link
Caltech-UCSD Birds (CUB) Images, Captions 11K images, 11K attributes Link
Oxford-102 flower Images, Captions 8K images, 8K attributes Link
CelebFaces Attributes (CelebA) Images, Captions 202K images, 8M attributes Link
DeepFashion (Fashion Synthesis) Images, Captions 78K images, - Link
MIT-Adobe 5k Images, Captions 5K images, 20K texts Link
MS-COCO Image, Caption 164K images, 616K texts Link
ReferIt Image, Caption 19K images, 130K text Link
CLEVR 3D images, Questions 100K images, 865K questions Link
i-CLEVR 3D image, Instruction 10K sequences, 50K instructions Link
CSS 3D images, 2D images, Instructions 34K images, - Link
CoDraw images, text instructions 9K images, - Link
Cityscapes images, Captions 25K images, - Link
Zap-Seq image sequences, Captions 8K images, 18K texts -
DeepFashion-Seq image sequences, Captions 4K images, 12K texts -
FFHQ Images 70K images Link
LSUN Images 1M images Link
Animal FacesHQ (AFHQ) Images 15K images Link
CelebA-HQ Images 30K images Link
Animal faces Images 16K images Link
Landscapes Images 4K images Link

Datasets for image-text composite retrieval

Dataset Modalities Scale Link
Fashion200k Image, Captions 200K images, 200K text Link
MIT-States Image, Captions 53K images, 53K text Link
Fashion IQ Image, Captions 77K images, - Link
CIRR Image, Captions 21K images, - Link
CSS 3D images, 2D images, Instructions 34K images, - Link
Shoes Images 14K images Link
Birds-to-Words Images, Captions - Link
SketchyCOCO Images, Sketches 14K sketches, 14K photos Link
FSCOCO Images, Sketches 10K sketches Link
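
Most image-text composite retrieval datasets above (e.g., Fashion IQ, CIRR) annotate triplets of reference image, modification text, and target image. Below is a minimal loader sketch, assuming a Fashion-IQ-style JSON schema with "candidate", "target", and "captions" fields; verify the field names against the release you download.

# Sketch of loading Fashion-IQ-style triplets (reference, modification text, target).
# The assumed schema ("candidate"/"target"/"captions") follows the public Fashion IQ
# caption files; other datasets use different field names.
import json
from dataclasses import dataclass

@dataclass
class Triplet:
    reference: str   # file stem of the reference (candidate) image
    target: str      # file stem of the ground-truth target image
    text: str        # modification text (Fashion IQ provides two captions per pair)

def load_triplets(path: str) -> list[Triplet]:
    with open(path) as f:
        entries = json.load(f)
    return [
        Triplet(
            reference=e["candidate"],
            target=e["target"],
            text=" and ".join(e["captions"]),  # common practice: join the captions
        )
        for e in entries
    ]

# Example (hypothetical path):
# triplets = load_triplets("captions/cap.dress.train.json")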

Datasets for other multimodal composite retrieval

Dataset Modalities Scale Link
HumanML3D Motions, Captions 14K motion sequences, 44K text Link
KIT-ML Motions, Captions 3K motion sequences, 6K text Link
Text2Shape Shapes, Captions 6K chairs, 8K tables, 70K text Link
Flickr30k LocNar Images, Captions 31K images, 155K texts Link
Conceptual Captions Images, Captions 3.3M images, 33M texts Link
Sydney_IV RS Images, Audio Captions 613 images, 3K audio descriptions Link
UCM_IV Images, Audio Captions 2K images, 10K audio descriptions Link
RSICD_IV Image, Audio Captions 11K images, 55K audio descriptions Link

Experimental Results
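
The tables below report Recall@K (R@K): the percentage of queries whose ground-truth target image appears among the top-K retrieved results; the "Average" and "Avg." columns are means over the reported R@K values. A minimal sketch of the metric (a toy illustration of ours, not code from any surveyed paper):

# Minimal Recall@K sketch.
import numpy as np

def recall_at_k(scores: np.ndarray, targets: np.ndarray, k: int) -> float:
    """scores: (num_queries, gallery_size) similarity matrix;
    targets: (num_queries,) index of the ground-truth gallery image per query."""
    # Indices of the top-k gallery items for each query, highest score first.
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk == targets[:, None]).any(axis=1)
    return float(hits.mean()) * 100.0  # reported as a percentage, as in the tables

# Toy example: 3 queries, 5 gallery images.
scores = np.array([[0.9, 0.1, 0.3, 0.2, 0.4],
                   [0.2, 0.8, 0.1, 0.7, 0.3],
                   [0.1, 0.2, 0.3, 0.4, 0.5]])
targets = np.array([0, 3, 1])
print(recall_at_k(scores, targets, k=1))  # ~33.3: only query 0 ranks its target first
print(recall_at_k(scores, targets, k=2))  # ~66.7: queries 0 and 1 hit within top-2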

Performance comparison on the Fashion-IQ dataset (VAL split)

Methods Image Encoder Dress R@10 Dress R@50 Shirt R@10 Shirt R@50 Toptee R@10 Toptee R@50 Average R@10 Average R@50 Avg.
ARTEMIS+LSTM ResNet-18 25.23 48.64 20.35 43.67 23.36 46.97 22.98 46.43 34.70
ARTEMIS+BiGRU ResNet-18 24.84 49.00 20.40 43.22 23.63 47.39 22.95 46.54 34.75
JPM(VAL,MSE) ResNet-18 21.27 43.12 21.88 43.30 25.81 50.27 22.98 45.59 34.29
JPM(VAL,Tri) ResNet-18 21.38 45.15 22.81 45.18 27.78 51.70 23.99 47.34 35.67
EER ResNet-50 30.02 55.44 25.32 49.87 33.20 60.34 29.51 55.22 42.36
Ranking-aware ResNet-50 34.80 60.22 45.01 69.06 47.68 74.85 42.50 68.04 55.27
CRN ResNet-50 30.20 57.15 29.17 55.03 33.70 63.91 31.02 58.70 44.86
DWC ResNet-50 32.67 57.96 35.53 60.11 40.13 66.09 36.11 61.39 48.75
DATIR ResNet-50 21.90 43.80 21.90 43.70 27.20 51.60 23.70 46.40 35.05
CoSMo ResNet-50 25.64 50.30 24.90 49.18 29.21 57.46 26.58 52.31 39.45
FashionVLP ResNet-50 32.42 60.29 31.89 58.44 38.51 68.79 34.27 62.51 48.39
CLVC-Net ResNet-50 29.85 56.47 28.75 54.76 33.50 64.00 30.70 58.41 44.56
SAC w/BERT ResNet-50 26.52 51.01 28.02 51.86 32.70 61.23 29.08 54.70 41.89
SAC w/ Random Emb. ResNet-50 26.13 52.10 26.20 50.93 31.16 59.05 27.83 54.03 40.93
DCNet ResNet-50 28.95 56.07 23.95 47.30 30.44 58.29 27.78 53.89 40.83
AMC ResNet-50 31.73 59.25 30.67 59.08 36.21 66.60 32.87 61.64 47.25
VAL(Lvv) ResNet-50 21.12 42.19 21.03 43.44 25.64 49.49 22.60 45.04 33.82
ARTEMIS+LSTM ResNet-50 27.34 51.71 21.05 44.18 24.91 49.87 24.43 48.59 36.51
ARTEMIS+BiGRU ResNet-50 27.16 52.40 21.78 43.64 29.20 54.83 26.05 50.29 38.17
VAL(Lvv + Lvs) ResNet-50 21.47 43.83 21.03 42.75 26.71 51.81 23.07 46.13 34.60
VAL(GloVe) ResNet-50 22.53 44.00 22.38 44.15 27.53 51.68 24.15 46.61 35.38
AlRet ResNet-50 30.19 58.80 29.39 55.69 37.66 64.97 32.36 59.76 46.12
RTIC ResNet-50 19.40 43.51 16.93 38.36 21.58 47.88 19.30 43.25 31.28
RTIC-GCN ResNet-50 19.79 43.55 16.95 38.67 21.97 49.11 19.57 43.78 31.68
Uncertainty (CLVC-Net) ResNet-50 30.60 57.46 31.54 58.29 37.37 68.41 33.17 61.39 47.28
Uncertainty (CLIP4CIR) ResNet-50 32.61 61.34 33.23 62.55 41.40 72.51 35.75 65.47 50.61
CRR ResNet-101 30.41 57.11 33.67 64.48 30.73 58.02 31.60 59.87 45.74
CIRPLANT ResNet-152 14.38 34.66 13.64 33.56 16.44 38.34 14.82 35.52 25.17
CIRPLANT w/OSCAR ResNet-152 17.45 40.41 17.53 38.81 21.64 45.38 18.87 41.53 30.20
ComqueryFormer Swin 33.86 61.08 35.57 62.19 42.07 69.30 37.17 64.19 50.68
CRN Swin 30.34 57.61 29.83 55.54 33.91 64.04 31.36 59.06 45.21
CRN Swin-L 32.67 59.30 30.27 56.97 37.74 65.94 33.56 60.74 47.15
BLIP4CIR1 BLIP-B 43.78 67.38 45.04 67.47 49.62 72.62 46.15 69.15 57.65
CASE BLIP 47.44 69.36 48.48 70.23 50.18 72.24 48.79 70.68 59.74
BLIP4CIR2 BLIP 40.65 66.34 40.38 64.13 46.86 69.91 42.63 66.79 54.71
BLIP4CIR2+Bi BLIP 42.09 67.33 41.76 64.28 46.61 70.32 43.49 67.31 55.40
CLIP4CIR3 CLIP 39.46 64.55 44.41 65.26 47.48 70.98 43.78 66.93 55.36
CLIP4CIR CLIP 33.81 59.40 39.99 60.45 41.41 65.37 38.32 61.74 50.03
AlRet CLIP-RN50 40.23 65.89 47.15 70.88 51.05 75.78 46.10 70.80 58.50
Combiner CLIP-RN50 31.63 56.67 36.36 58.00 38.19 62.42 35.39 59.03 47.21
DQU-CIR CLIP-H 57.63 78.56 62.14 80.38 66.15 85.73 61.97 81.56 71.77
PL4CIR CLIP-L 38.18 64.50 48.63 71.54 52.32 76.90 46.37 70.98 58.68
TG-CIR CLIP-B 45.22 69.66 52.60 72.52 56.14 77.10 51.32 73.09 62.21
PL4CIR CLIP-B 33.22 59.99 46.17 68.79 46.46 73.84 41.98 67.54 54.76

Performance comparison on the Fashion-IQ dataset (original split)

Methods Image Encoder Dress R@10 Dress R@50 Shirt R@10 Shirt R@50 Toptee R@10 Toptee R@50 Average R@10 Average R@50 Avg.
ComposeAE ResNet-18 10.77 28.29 9.96 25.14 12.74 30.79 - - -
TIRG ResNet-18 14.87 34.66 18.26 37.89 19.08 39.62 17.40 37.39 27.40
MAAF ResNet-50 23.80 48.60 21.30 44.20 27.90 53.60 24.30 48.80 36.60
Leveraging ResNet-50 19.33 43.52 14.47 35.47 19.73 44.56 17.84 41.18 29.51
MCR ResNet-50 26.20 51.20 22.40 46.01 29.70 56.40 26.10 51.20 38.65
MCEM (L_CE) ResNet-50 30.07 56.13 23.90 47.60 30.90 57.52 28.29 53.75 41.02
MCEM (L_FCE) ResNet-50 31.50 58.41 25.01 49.73 32.77 61.02 29.76 56.39 43.07
MCEM (L_AFCE) ResNet-50 33.23 59.16 26.15 50.87 33.83 61.40 31.07 57.14 44.11
AlRet ResNet-50 27.34 53.42 21.30 43.08 29.07 54.21 25.86 50.17 38.02
MCEM (L_AFCE) w/ BERT ResNet-50 32.11 59.21 27.28 52.01 33.96 62.30 31.12 57.84 44.48
JVSM MobileNet-v1 10.70 25.90 12.00 27.10 13.00 26.90 11.90 26.63 19.27
FashionIQ (Dialog Turn 1) EfficientNet-b 12.45 35.21 11.05 28.99 11.24 30.45 11.58 31.55 21.57
FashionIQ (Dialog Turn 5) EfficientNet-b 41.35 73.63 33.91 63.42 33.52 63.85 36.26 66.97 51.61
AACL Swin 29.89 55.85 24.82 48.85 30.88 56.85 28.53 53.85 41.19
ComqueryFormer Swin 28.85 55.38 25.64 50.22 33.61 60.48 29.37 55.36 42.36
AlRet CLIP 35.75 60.56 37.02 60.55 42.25 67.52 38.30 62.82 50.56
MCEM (L_AFCE) CLIP 33.98 59.96 40.15 62.76 43.75 67.70 39.29 63.47 51.38
SPN (TG-CIR) CLIP 36.84 60.83 41.85 63.89 45.59 68.79 41.43 64.50 52.97
SPN (CLIP4CIR) CLIP 38.82 62.92 45.83 66.44 48.80 71.29 44.48 66.88 55.68
PL4CIR CLIP-B 29.00 53.94 35.43 58.88 39.16 64.56 34.53 59.13 46.83
FAME-ViL CLIP-B 42.19 67.38 47.64 68.79 50.69 73.07 46.84 69.75 58.30
PALAVRA CLIP-B 17.25 35.94 21.49 37.05 20.55 38.76 19.76 37.25 28.51
MagicLens-B CLIP-B 21.50 41.30 27.30 48.80 30.20 52.30 26.30 47.40 36.85
SEARLE CLIP-B 18.54 39.51 24.44 41.61 25.70 46.46 22.89 42.53 32.71
CIReVL CLIP-B 25.29 46.36 28.36 47.84 31.21 53.85 28.29 49.35 38.82
SEARLE-OTI CLIP-B 17.85 39.91 25.37 41.32 24.12 45.79 22.44 42.34 32.39
PLI CLIP-B 25.71 47.81 33.36 53.47 34.87 58.44 31.31 53.24 42.28
PL4CIR CLIP-L 33.60 58.90 39.45 61.78 43.96 68.33 39.02 63.00 51.01
SEARLE-XL CLIP-L 20.48 43.13 26.89 45.58 29.32 49.97 25.56 46.23 35.90
SEARLE-XL-OTI CLIP-L 21.57 44.47 30.37 47.49 30.90 51.76 27.61 47.90 37.76
Context-I2W CLIP-L 23.10 45.30 29.70 48.60 30.60 52.90 27.80 48.90 38.35
CompoDiff (with SynthTriplets18M) CLIP-L 32.24 46.27 37.69 49.08 38.12 50.57 36.02 48.64 42.33
CompoDiff (with SynthTriplets18M) CLIP-L 37.78 49.10 41.31 55.17 44.26 56.41 39.02 51.71 46.85
Pic2Word CLIP-L 20.00 40.20 26.20 43.60 27.90 47.40 24.70 43.70 34.20
PLI CLIP-L 28.11 51.12 38.63 58.51 39.42 62.68 35.39 57.44 46.42
KEDs CLIP-L 21.70 43.80 28.90 48.00 29.90 51.90 26.80 47.90 37.35
CIReVL CLIP-L 24.79 44.76 29.49 47.40 31.36 53.65 28.55 48.57 38.56
LinCIR CLIP-L 20.92 42.44 29.10 46.81 28.81 50.18 26.28 46.49 36.39
MagicLens-L CLIP-L 25.50 46.10 32.70 53.80 34.00 57.70 30.70 52.50 41.60
LinCIR CLIP-H 29.80 52.11 36.90 57.75 42.07 62.52 36.26 57.46 46.86
DQU-CIR CLIP-H 51.90 74.37 53.57 73.21 58.48 79.23 54.65 75.60 65.13
LinCIR CLIP-G 38.08 60.88 46.76 65.11 50.48 71.09 45.11 65.69 55.40
CIReVL CLIP-G 27.07 49.53 33.71 51.42 35.80 56.14 32.19 52.36 42.28
MagicLens-B CoCa-B 29.00 48.90 36.50 55.50 40.20 61.90 35.20 55.40 45.30
MagicLens-L CoCa-L 32.30 52.70 40.50 59.20 41.40 63.00 38.00 58.20 48.10
SPN (BLIP4CIR1) BLIP 44.52 67.13 45.68 67.96 50.74 73.79 46.98 69.63 58.30
PLI BLIP-B 28.62 50.78 38.09 57.79 40.92 62.68 35.88 57.08 46.48
SPN (SPRC) BLIP-2 50.57 74.12 57.70 75.27 60.84 79.96 56.37 76.45 66.41
CurlingNet - 24.44 47.69 18.59 40.57 25.19 49.66 22.74 45.97 34.36

Performance comparison on the Fashion200k dataset

Methods Image Encoder R@1 R@10 R@50
TIRG ResNet-18 14.10 42.50 63.80
ComposeAE ResNet-18 22.80 55.30 73.40
HCL ResNet-18 23.48 54.03 73.71
CoSMo ResNet-18 23.30 50.40 69.30
JPM(TIRG,MSE) ResNet-18 19.80 46.50 66.60
JPM(TIRG,Tri) ResNet-18 17.70 44.70 64.50
ARTEMIS ResNet-18 21.50 51.10 70.50
GA(TIRG-BERT) ResNet-18 31.40 54.10 77.60
LGLI ResNet-18 26.50 58.60 75.60
AlRet ResNet-18 24.42 53.93 73.25
FashionVLP ResNet-18 - 49.90 70.50
CLVC-Net ResNet-50 22.60 53.00 72.20
Uncertainty ResNet-50 21.80 52.10 70.20
MCR ResNet-50 49.40 69.40 59.40
CRN ResNet-50 - 53.10 73.00
EER w/ Random Emb. ResNet-50 - 51.09 70.23
EER w/ GloVe ResNet-50 - 50.88 73.40
DWC ResNet-50 36.49 63.58 79.02
JGAN ResNet-101 17.34 45.28 65.65
CRR ResNet-101 24.85 56.41 73.56
GSCMR ResNet-101 21.57 52.84 70.12
VAL(GloVe) MobileNet 22.90 50.80 73.30
VAL(Lvv+Lvs) MobileNet 21.50 53.80 72.70
DATIR MobileNet 21.50 48.80 71.60
VAL(Lvv) MobileNet 21.20 49.00 68.80
JVSM MobileNet-v1 19.00 52.10 70.00
TIS MobileNet-v1 17.76 47.54 68.02
DCNet MobileNet-v1 - 46.89 67.56
TIS Inception-v3 16.25 44.14 65.02
LBF(big) Faster-RCNN 17.78 48.35 68.50
LBF(small) Faster-RCNN 16.26 46.90 71.73
ProVLA Swin 21.70 53.70 74.60
CRN Swin - 53.30 73.30
ComqueryFormer Swin - 52.20 72.20
AACL Swin 19.64 58.85 78.86
CRN Swin-L - 53.50 74.50
DQU-CIR CLIP-H 36.80 67.90 87.80

Performance comparison on the MIT-States dataset

Methods Image Encoder R@1 R@10 R@50 Average
TIRG ResNet-18 12.20 31.90 43.10 29.10
ComposeAE ResNet-18 13.90 35.30 47.90 32.37
HCL ResNet-18 15.22 35.95 46.71 32.63
GA(TIRG) ResNet-18 13.60 32.40 43.20 29.70
GA(TIRG-BERT) ResNet-18 15.40 36.30 47.70 33.20
GA(ComposeAE) ResNet-18 14.60 37.00 47.90 33.20
LGLI ResNet-18 14.90 36.40 47.70 33.00
MAAF ResNet-50 12.70 32.60 44.80 -
MCR ResNet-50 14.30 35.36 47.12 32.26
CRR ResNet-101 17.71 37.16 47.83 34.23
JGAN ResNet-101 14.27 33.21 45.34 29.10
GSCMR ResNet-101 17.28 - 36.45 -
TIS Inception-v3 13.13 31.94 43.32 29.46
LBF(big) Faster-RCNN 14.72 35.30 46.56 96.58
LBF(small) Faster-RCNN 14.29 - 34.67 46.06

Performance comparison on the CSS dataset

Methods Image Encoder R@1 (3D-to-3D) R@1 (2D-to-3D)
TIRG ResNet-18 73.70 46.60
HCL ResNet-18 81.59 58.65
GA(TIRG) ResNet-18 91.20 -
TIRG+JPM(MSE) ResNet-18 83.80 -
TIRG+JPM(Tri) ResNet-18 83.20 -
LGLI ResNet-18 93.30 -
MAAF ResNet-50 87.80 -
CRR ResNet-101 85.84 -
JGAN ResNet-101 76.07 48.85
GSCMR ResNet-101 81.81 58.74
TIS Inception-v3 76.64 48.02
LBF(big) Faster-RCNN 79.20 55.69
LBF(small) Faster-RCNN 67.26 50.31

Performance comparison on the Shoes dataset

Methods Image Encoder R@1 R@10 R@50 Average
ComposeAE ResNet-18 31.25 60.30 - -
TIRG ResNet-50 12.60 45.45 69.39 42.48
VAL(Lvv) ResNet-50 16.49 49.12 73.53 46.38
VAL(Lvv + Lvs) ResNet-50 16.98 49.83 73.91 46.91
VAL(GloVe) ResNet-50 17.18 51.52 75.83 48.18
CoSMo ResNet-50 16.72 48.36 75.64 46.91
CLVC-Net ResNet-50 17.64 54.39 79.47 50.50
DCNet ResNet-50 - 53.82 79.33 -
SAC w/BERT ResNet-50 18.50 51.73 77.28 49.17
SAC w/Random Emb. ResNet-50 18.11 52.41 75.42 48.64
ARTEMIS+LSTM ResNet-50 17.60 51.05 76.85 48.50
ARTEMIS+BiGRU ResNet-50 18.72 53.11 79.31 50.38
AMC ResNet-50 19.99 56.89 79.27 52.05
DATIR ResNet-50 17.20 51.10 75.60 47.97
MCR ResNet-50 17.85 50.95 77.24 48.68
EER ResNet-50 20.05 56.02 79.94 52.00
CRN ResNet-50 17.19 53.88 79.12 50.06
Uncertainty ResNet-50 18.41 53.63 79.84 50.63
FashionVLP ResNet-50 - 49.08 77.32 -
DWC ResNet-50 18.94 55.55 80.19 51.56
MCEM (L_CE) ResNet-50 15.17 49.33 73.78 46.09
MCEM (L_FCE) ResNet-50 18.13 54.31 78.65 50.36
MCEM (L_AFCE) ResNet-50 19.10 55.37 79.57 51.35
AlRet ResNet-50 18.13 53.98 78.81 50.31
RTIC ResNet-50 43.66 72.11 - -
RTIC-GCN ResNet-50 43.38 72.09 - -
CRR ResNet-101 18.41 56.38 79.92 51.57
CRN Swin 17.32 54.15 79.34 50.27
ProVLA Swin 19.20 56.20 73.30 49.57
CRN Swin-L 18.92 54.55 80.04 51.17
AlRet CLIP 21.02 55.72 80.77 52.50
PL4CIR CLIP-L 22.88 58.83 84.16 55.29
PL4CIR CLIP-B 19.53 55.65 80.58 51.92
TG-CIR CLIP-B 25.89 63.20 85.07 58.05
DQU-CIR CLIP-H 31.47 69.19 88.52 63.06

Performance comparison on the CIRR dataset

Methods Image Encoder R@1 R@5 R@10 R@50
ComposeAE ResNet-18 - 29.60 59.82 -
MCEM (L_CE) ResNet-18 14.26 40.46 55.61 85.66
MCEM (L_FCE) ResNet-18 16.12 43.92 58.87 86.85
MCEM (L_AFCE) ResNet-18 17.48 46.13 62.17 88.91
Ranking-aware ResNet-50 32.24 66.63 79.23 96.43
SAC w/BERT ResNet-50 - 19.56 45.24 -
SAC w/Random Emb. ResNet-50 - 20.34 44.94 -
ARTEMIS+BiGRU ResNet-152 16.96 46.10 61.31 87.73
CIRPLANT ResNet-152 15.18 43.36 60.48 87.64
CIRPLANT w/ OSCAR ResNet-152 19.55 52.55 68.39 92.38
CASE ViT 48.00 79.11 87.25 97.57
ComqueryFormer Swin 25.76 61.76 75.90 95.13
CLIP4CIR CLIP 38.53 69.98 81.86 95.93
CLIP4CIR3 CLIP 44.82 77.04 86.65 97.90
SPN (TG-CIR) CLIP 47.28 79.13 87.98 97.54
SPN (CLIP4CIR) CLIP 45.33 78.07 87.61 98.17
Combiner CLIP 33.59 65.35 77.35 95.21
MCEM (L_AFCE) CLIP 39.80 74.24 85.71 97.23
TG-CIR CLIP-B 45.25 78.29 87.16 97.30
CIReVL CLIP-B 23.94 52.51 66.00 86.95
SEARLE-OTI CLIP-B 24.27 53.25 66.10 88.84
SEARLE CLIP-B 24.00 53.42 66.82 89.78
PLI CLIP-B 18.80 46.07 60.75 86.41
SEARLE-XL CLIP-L 24.24 52.48 66.29 88.84
SEARLE-XL-OTI CLIP-L 24.87 52.31 66.29 88.58
CIReVL CLIP-L 24.55 52.31 64.92 86.34
Context-I2W CLIP-L 25.60 55.10 68.50 89.80
Pic2Word CLIP-L 23.90 51.70 65.30 87.80
CompoDiff (with SynthTriplets18M) CLIP-L 18.24 53.14 70.82 90.25
LinCIR CLIP-L 25.04 53.25 66.68 -
PLI CLIP-L 25.52 54.58 67.59 88.70
KEDs CLIP-L 26.40 54.80 67.20 89.20
CIReVL CLIP-G 34.65 64.29 75.06 91.66
LinCIR CLIP-G 35.25 64.72 76.05 -
CompoDiff (with SynthTriplets18M) CLIP-G 26.71 55.14 74.52 92.01
LinCIR CLIP-H 33.83 63.52 75.35 -
DQU-CIR CLIP-H 46.22 78.17 87.64 97.81
PLI BLIP 27.23 58.87 71.40 91.25
BLIP4CIR2 BLIP 40.17 71.81 83.18 95.69
BLIP4CIR2+Bi BLIP 40.15 73.08 83.88 96.27
SPN (BLIP4CIR1) BLIP 46.43 77.64 87.01 97.06
SPN (SPRC) BLIP-2 55.06 83.83 90.87 98.29
BLIP4CIR1 BLIP-B 46.83 78.59 88.04 97.08

[NOTE] If you have any questions, please don't hesitate to contact us.
