A generative text-to-image model generates an image from a text prompt.
This repository is the final project for the course EECM30064 Deep Learning.
Stable Diffusion - Image to Prompts is a competition on Kaggle.
The goal of this competition is to reverse the typical direction of a generative text-to-image model: instead of generating an image from a text prompt, we want to build a model that predicts the text prompt given a generated image, and then make predictions on a dataset containing a wide variety of (prompt, image) pairs.
A sample image from the competition dataset and its corresponding prompt are shown below.
| Image | Prompt |
|---|---|
| *(sample image)* | ultrasaurus holding a black bean taco in the woods, near an identical cheneosaurus |
Our method ensembles the CLIP Interrogator, the OFA model, and the ViT model.
The ensemble weights for the three models are listed below; a sketch of how the weighted blend might be computed follows the list.
- Vision Transformer (ViT) model: 74.88%
- CLIP Interrogator: 21.12%
- OFA model fine-tuned for image captioning: 4.00%
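
As a rough sketch of how such a weighted ensemble can be combined (the function and array names below are illustrative assumptions, not the repository's actual code; the CLIP Interrogator and OFA captions are assumed to have already been encoded into the same embedding space as the ViT output):

```python
import numpy as np

def ensemble_embeddings(vit_emb, clip_emb, ofa_emb,
                        weights=(0.7488, 0.2112, 0.04)):
    """Blend per-model prompt-embedding predictions with fixed weights.

    Each input is assumed to be an array of shape (n_images, embed_dim),
    holding one predicted prompt embedding per test image.
    """
    blended = (weights[0] * np.asarray(vit_emb)
               + weights[1] * np.asarray(clip_emb)
               + weights[2] * np.asarray(ofa_emb))
    # Re-normalize each row so cosine similarity stays well defined.
    return blended / np.linalg.norm(blended, axis=1, keepdims=True)
```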
Following the competition setup, we build a model to predict the prompts that were used to generate the target images.
Prompts for this challenge were generated using a variety of (undisclosed) methods, and range from fairly simple to fairly complex, with multiple objects and modifiers.
Images were generated from the prompts using Stable Diffusion 2.0.
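
For context, submissions in this competition are scored by the mean cosine similarity between sentence-transformer embeddings of the predicted and actual prompts. A minimal sketch of that scoring, assuming the sentence-transformers package and the all-MiniLM-L6-v2 encoder the competition uses:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Encoder used by the competition to embed prompts into 384-dim vectors.
st_model = SentenceTransformer("all-MiniLM-L6-v2")

def prompt_similarity(predicted: str, actual: str) -> float:
    """Cosine similarity between the embeddings of two prompts."""
    emb = st_model.encode([predicted, actual], normalize_embeddings=True)
    return float(np.dot(emb[0], emb[1]))

# Example: compare a guessed prompt against the sample prompt above.
print(prompt_similarity(
    "a dinosaur holding a taco in a forest",
    "ultrasaurus holding a black bean taco in the woods, "
    "near an identical cheneosaurus",
))
```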