
whwu95/GPT4Vis



This work examines an essential, must-know baseline in light of the latest advancements in Generative Artificial Intelligence (GenAI): the use of GPT-4 for visual understanding. We focus on evaluating GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks. To ensure a comprehensive evaluation, we conduct experiments across three modalities (images, videos, and point clouds) spanning a total of 16 popular academic benchmarks.

📣 I also have other cross-modal projects that may interest you ✨.

Revisiting Classifier: Transferring Vision-Language Models for Video Recognition
Wenhao Wu, Zhun Sun, Wanli Ouyang
Conference Journal github

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang
Conference github

Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang
Accepted by CVPR 2023 as 🌟Highlight🌟 | Conference github

News

  • [Mar 7, 2024] Due to the recent removal of RPD (requests per day) limits on the GPT-4V API, we've updated our predictions for all datasets using standard single testing (one sample per request). Check out the GPT4V Results, Ground Truth and Datasets we've shared for you! As a heads-up, 😭 running all tests once costs around 💰$4,000+💰.
  • [Nov 28, 2023] We released our report on arXiv.
  • [Nov 27, 2023] Our prompts have been released. Thanks for your star 😝.

Overview

An overview of the 16 popular benchmark datasets evaluated, comprising images, videos, and point clouds.

Zero-shot visual recognition leveraging GPT-4's linguistic and visual capabilities.

Generated Descriptions from GPT-4

  • We have pre-generated descriptive sentences for all the categories across the datasets, which you can find in the GPT_generated_prompts folder. Enjoy exploring!

  • We've also provided an example script, generate_prompt.py, to help you generate descriptions using GPT-4. Please refer to the config folder for detailed information on all datasets used in our project. Happy coding!

  • Execute the following command to generate descriptions with GPT-4; a minimal sketch of the script's core logic is shown after the command.

    # To run the script for a specific dataset, update the following line in generate_prompt.py
    # with the name of the dataset you're working with:
    # dataset_name = ["Dataset Name Here"]   # e.g., dtd
    python generate_prompt.py
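
For reference, here is a minimal, illustrative sketch of how such a description-generation loop might look with the openai Python client. The model name, prompt wording, and category list below are placeholders, not the repo's exact settings; generate_prompt.py and the config folder remain the authoritative reference.

    # Illustrative sketch only; not the repo's exact code. Model name, prompt
    # wording, and category names are placeholders.
    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    dataset_name = "dtd"                               # e.g., dtd
    categories = ["banded", "blotchy", "braided"]      # placeholder category names

    descriptions = {}
    for category in categories:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": f"Describe the visual appearance of a '{category}' texture in one sentence.",
            }],
        )
        descriptions[category] = response.choices[0].message.content

    with open(f"{dataset_name}_descriptions.json", "w") as f:
        json.dump(descriptions, f, indent=2)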

GPT-4V(ision) for Visual Recognition

  • We share an example script that demonstrates how to use the GPT-4V API for zero-shot predictions on the DTD dataset. Please refer to the GPT4V_ZS.py file for a step-by-step guide on implementing this; a minimal sketch of a single request is also shown after the command below. We hope it helps you get started with ease!

    # GPT4V zero-shot recognition script. 
    # dataset_name = ["Dataset Name Here"]   # e.g., dtd
    python GPT4V_ZS.py
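
For orientation, the following is a minimal sketch of what a single GPT-4V zero-shot request can look like: the image is base64-encoded and sent together with the candidate class names. The model name, prompt, and file paths are placeholder assumptions; GPT4V_ZS.py is the authoritative implementation.

    # Illustrative sketch of one GPT-4V zero-shot request; not the repo's exact code.
    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    image_path = "images/banded_0001.jpg"              # placeholder image path
    class_names = ["banded", "blotchy", "braided"]     # placeholder DTD categories

    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",                  # assumed GPT-4V model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Which of these texture categories best matches the image? "
                         + ", ".join(class_names)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=50,
    )
    print(response.choices[0].message.content)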
  • All results are available in the GPT4V_ZS_Results folder! In addition, we've provided the Datasets link along with their corresponding ground truths (annotations folder) to help readers replicate the results. Note: For certain datasets, we may have removed prefixes from the sample IDs. For instance, in the case of ImageNet, "ILSVRC2012_val_00031094.JPEG" was modified to "00031094.JPEG".

Dataset           GPT-4V Acc. (%)
DTD               57.7
EuroSAT           46.8
SUN397            59.2
RAF-DB            68.7
Caltech101        93.7
ImageNet-1K       63.1
FGVC-Aircraft     56.6
Flower102         69.1
Stanford Cars     62.7
Food101           86.2
Oxford Pets       90.8
UCF-101           83.7
HMDB-51           58.8
Kinetics-400      58.8
ModelNet-10       66.9

The per-dataset "Label" links point to the corresponding ground-truth files in the annotations folder.
  • With the provided prediction and annotation files, you can reproduce our top-1/top-5 accuracy results with the calculate_acc.py script; a minimal sketch of the computation is shown after the command below.

    # pred_json_path = 'GPT4V_ZS_Results/imagenet.json'
    # gt_json_path = 'annotations/imagenet_gt.json'
    python calculate_acc.py
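
The sketch below illustrates one way such a computation can work, assuming the prediction JSON maps each sample ID to a ranked list of predicted class names and the ground-truth JSON maps each sample ID to a single class name; the actual file formats are defined by calculate_acc.py and the annotations folder.

    # Illustrative top-1/top-5 accuracy computation; JSON formats are assumptions,
    # not the repo's exact schema.
    import json

    pred_json_path = "GPT4V_ZS_Results/imagenet.json"
    gt_json_path = "annotations/imagenet_gt.json"

    with open(pred_json_path) as f:
        preds = json.load(f)   # assumed: {sample_id: [ranked class names]}
    with open(gt_json_path) as f:
        gts = json.load(f)     # assumed: {sample_id: class name}

    top1 = top5 = 0
    for sample_id, gt_label in gts.items():
        ranked = preds.get(sample_id, [])
        top1 += int(bool(ranked) and ranked[0] == gt_label)
        top5 += int(gt_label in ranked[:5])

    n = len(gts)
    print(f"Top-1: {100.0 * top1 / n:.1f}%  Top-5: {100.0 * top5 / n:.1f}%")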

Requirements

For guidance on setting up and running the GPT-4 API, we recommend checking out the official OpenAI Quickstart documentation available at: OpenAI Quickstart Guide.
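
Once the openai package is installed and OPENAI_API_KEY is set (as described in the Quickstart), a quick sanity check such as the following can confirm the API is reachable; the snippet is illustrative and not part of the repo.

    # Illustrative API sanity check; not part of the repo.
    from openai import OpenAI

    client = OpenAI()  # requires the OPENAI_API_KEY environment variable
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Reply with 'ok' if you can read this."}],
    )
    print(reply.choices[0].message.content)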

📌 BibTeX & Citation

If you use our code in your research or wish to refer to the results, please star 🌟 this repo and use the following BibTeX 📑 entry.

@article{GPT4Vis,
  title={GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?},
  author={Wu, Wenhao and Yao, Huanjin and Zhang, Mengxi and Song, Yuxin and Ouyang, Wanli and Wang, Jingdong},
  journal={arXiv preprint arXiv:2311.15732},
  year={2023}
}

🎗️ Acknowledgement

This evaluation builds on the following excellent works:

  • CLIP: Learning Transferable Visual Models From Natural Language Supervision
  • GPT-4
  • Text4Vis: Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective

We extend our sincere gratitude to these contributors.

👫 Contact

For any questions, please feel free to file an issue.