Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GenAI Arena - publication pushed back #1988

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 12 additions & 0 deletions _blog.yml
Original file line number Diff line number Diff line change
Expand Up @@ -3810,3 +3810,15 @@
- privacy
- research
- FHE

- local: arena-genai
title: "Introducing GenAI-Arena: Image Generation Rankings with Human Preferences"
thumbnail: /blog/assets/arenas-on-the-hub/thumbnail.png
author: tianleliphoebe
guest: true
date: Apr 18, 2024
tags:
- leaderboard
- arena
- collaboration
- research
101 changes: 101 additions & 0 deletions arena-genai.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@
---
title: "Introducing GenAI-Arena: Image Generation Rankings with Human Preferences"
thumbnail: /blog/assets/arenas-on-the-hub/thumbnail.png
authors:
- user: tianleliphoebe
guest: true
org: TIGER-Lab
- user: yuanshengni
guest: true
org: TIGER-Lab
- user: DongfuJiang
guest: true
org: TIGER-Lab
- user: wenhu
guest: true
org: TIGER-Lab
- user: clefourrier
---

# Introducing GenAI-Arena: Image Generation Rankings with Human Preferences

Image generation and manipulation are now used for many use cases, from creating stunning artwork to aiding in medical imaging. However, navigating through the multitude of available models and assessing their performance can be a daunting task, as most traditional metrics offer valuable but very specific insights on precise aspects of image generation, and do not assess overall performance.

That's why we (researchers at the Text and Image GEnerative Research (TIGER) Lab) introduced the GenAI-Arena, a tool to help anyone easily compare image generation models side by side, and provide model rankings back to the communtiy.

<script type="module" src="https://gradio.s3-us-west-2.amazonaws.com/4.24/gradio.js"> </script>
<gradio-app theme_mode="light" space="TIGER-Lab/GenAI-Arena"></gradio-app>

## Why create the GenAI-Arena?

Evaluation of image generation models typically involves scouring through research papers to find relevant metrics, running experiments locally, and comparing results manually. However, much like in text (with the [Chatbot Arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard)) and speech (with the [TextToSpeech Arena](https://huggingface.co/spaces/TTS-AGI/TTS-Arena)), using an arena system allows both to streamline this process, and to make sure that model rankings are aligned with human preferences, instead of just relying on automated metrics that can only offer a partial view of model capabilities.

We therefore provide a dynamic side-by-side comparison arena, empowering the community to effortlessly generate images, make comparisons, and vote for their preferred model. Our platform is built upon [ImagenHub](https://github.com/TIGER-AI-Lab/ImagenHub), a comprehensive library designed to support inference with various generation and editing models.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
We therefore provide a dynamic side-by-side comparison arena, empowering the community to effortlessly generate images, make comparisons, and vote for their preferred model. Our platform is built upon [ImagenHub](https://github.com/TIGER-AI-Lab/ImagenHub), a comprehensive library designed to support inference with various generation and editing models.
Our arena provides a side-by-side comparison of images generated by two models selected randomly from a pool. The community can effortlessly generate images, make comparisons, and vote for their preferred model. Our platform is built upon [ImagenHub](https://github.com/TIGER-AI-Lab/ImagenHub), a comprehensive library designed to support inference with various generation and editing models.

(Note: voting did not work for me in the current Space. Clicking the small square button in the top-right area of the image gave an error "Share failed". Is there another way to do it?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible the authors are adding new models atm, I'll check


## A Holistic Ranking Leaderboard Using Elo

In the spirit of transparency and accountability, we complemented this arena with a leaderboard, using the Elo rating system to rank models.

Widely used in competitive games and sports, the Elo rating system calculates the relative skill levels of players by considering the difference in their ratings. Similarly, in our arena, each model is assigned an Elo rating based on its performance in pairwise battles against other models. While traditional metrics like SSIM, PSNR, LPIPS, FID offer valuable insights into specific aspects of image generation quality, they may not provide a comprehensive assessment of model performance. In contrast, the Elo rating system offers a dynamic and adaptive approach to evaluating models, reflecting their relative performance in direct comparisons. By considering both the strengths and limitations of each approach, researchers and practitioners can gain a more holistic understanding of model capabilities and make informed decisions when selecting models for specific tasks.

## Insights and Results

As the GenAI-Arena community continues to grow, so does the wealth of insights derived from user interactions.

### Image Generation Models

GenAI-Arena focuses on the performance of the open-source cutting-edge diffusion-based generation models.

Currently, Playground v2 is leading the board with the highest Arena Elo score of 1097. Developed by the Playground team, it uses the same architecture as SDXL. Its successor, Playground v2.5, falls slightly behind with an Elo score of 1088, despite offering enhanced color contrast and alignment with human preferences. This may be the consequence of having accrued less votes, as it's been available for a relatively short time. However, its wide confidence interval indicates it has significant potential for growth and preference within the community.

In the realm of Accelerated Latent Diffusion Models, StableCascade emerges as the frontrunner. It surpasses even some of the non-accelerated counterparts. The superior performance can be attributed to the utilization of Würstchen architecture, which significantly compresses the latent space, demonstrating the advantages of efficiency and compactness.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In the realm of Accelerated Latent Diffusion Models, StableCascade emerges as the frontrunner. It surpasses even some of the non-accelerated counterparts. The superior performance can be attributed to the utilization of Würstchen architecture, which significantly compresses the latent space, demonstrating the advantages of efficiency and compactness.
In the set of Accelerated Latent Diffusion Models, StableCascade emerges as the frontrunner. It surpasses even some of the non-accelerated counterparts. The superior performance can be attributed to the utilization of the Würstchen architecture, which significantly compresses the latent space, demonstrating the advantages of efficiency and compactness.

(Suggestion for the future: maybe include the number of inference steps in the leaderboard table so people can have a proxy for speed).


Notably, PixArtAlpha distinguishes itself as the sole model built on a Transformer backbone, achieving good performance with an Elo score of 1034, despite having the smallest parameter size of 0.6B. This underscores the model's efficiency and the potential of Transformer-based architectures in the field of image generation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for the number of parameters.


A broader observation from the collected statistics is the apparent correlation between parameter size and model performance. Generally, models with larger parameter sizes tend to perform better, as seen with the top-ranking models, which have parameter sizes of 2.6B or higher.


| Rank | Model Type | Model | Parameter Size | Description | Arena Elo | 95% CI | Vote |
| ---- | ---------------------------------- | --------------------------------------------------------------------------------------- |----------------| ---------------------------------------------------------------------------------------- | --------- | ------ | ---- |
| 1 | Latent Diffusion Model | [Playground v2](https://huggingface.co/playgroundai/playground-v2-1024px-aesthetic) | 2.6B | same architecture as SDXL trained by Playground team | 1097 | 26/-21 | 645 |
| 2 | Latent Diffusion Model | [Playground v2.5](https://huggingface.co/playgroundai/playground-v2.5-1024px-aesthetic) | 2.6B | a successor to Playground v2 with enhanced color contrast and human preference alignment | 1088 | 57/-52 | 148 |
| 3 | Accelerated Latent Diffusion Model | [StableCascade](https://huggingface.co/stabilityai/stable-cascade) | 5.1B | Würstchen architecture at a much smaller latent space | 1056 | 32/-32 | 226 |
| 4 | Latent Diffusion Model | [PixArtAlpha](https://huggingface.co/PixArt-alpha/PixArt-XL-2-1024-MS) | 0.6B | Transformer Latent Diffusion Model | 1034 | 26/-18 | 702 |
| 5 | Accelerated Latent Diffusion Model | [SDXLLightning](https://huggingface.co/ByteDance/SDXL-Lightning) | 2.6B | SDXL with Progressive and Adversarial Diffusion Distillation | 1034 | 35/-37 | 199 |
| 6 | Latent Diffusion Model | [SDXL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) | 2.6B | Base + Refiner latent diffusion model | 1002 | 23/-18 | 717 |
| 7 | Accelerated Latent Diffusion Model | [SDXLTurbo](https://huggingface.co/stabilityai/sdxl-turbo) | 2.6B | SDXL with Adversarial Diffusion Distillation | 955 | 23/-21 | 736 |
| 8 | Latent Diffusion Model | [OpenJourney](https://huggingface.co/prompthero/openjourney) | 0.8B | Latent diffusion model fine-tuned on midjourney images | 888 | 24/-25 | 729 |
| 9 | Accelerated Latent Diffusion Model | [LCM](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7) | 0.8B | Latent Consistency Model | 846 | 22/-24 | 806 |

Overall, these insights from the votes of GenAI-Arena community highlight the dynamic evolution of image-generation models and the diverse strategies employed to improve image generation quality and efficiency.

### Image Editing Models

Our GenAI-Arena also supports a variety of image editing models that allow users to input a source image along with a target instruction. These models generate an edited image that matches the specified instruction.

Among the range of editing models available, attention-guided models such as Prompt-to-Prompt and Plug-and-Play (PNP) stand out for their superior performance. These models harness advanced attention control mechanisms to finely edit images in response to user prompts, achieving high consistency with user expectation.

Instruction-following forward-pass models also excel, offering an impressive balance between image quality and operational efficiency. These models can handle edition without image inversion due to they train with paired annotated data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Instruction-following forward-pass models also excel, offering an impressive balance between image quality and operational efficiency. These models can handle edition without image inversion due to they train with paired annotated data.
Instruction-following forward-pass models also excel, offering an impressive balance between image quality and operational efficiency. These models can handle edition without image inversion due to being trained with paired annotated data.

Perhaps links to the models and/or techniques could be helpful for people that want to learn more. Image inversion, for instance, is never defined. No need to explain or cite papers, just a few links could be enough.


On the other hand, latent-space manipulation models currently lag behind in performance when compared to other model types. This category faces challenges in consistently meeting user expectations, highlighting an area needing for further research and development to enhance their effectiveness.


| Rank | Model Type | 🤖 Model | Parameter Size | Description | Arena Elo | 95% CI | Votes |
| ---- | ---------------------------------------- |-------------------------------------------------------------------|----------------| -------------------------------------------------------------------------------------------------- | --------- | -------- | ----- |
| 1 | Attention-guided Model | [Prompt-to-prompt](https://prompt-to-prompt.github.io/) | 0.8B | Prompt-based cross-attention control editing model. | 1188 | 202/-113 | 24 |
| 2 | Attention-guided Model | [PNP](https://github.com/MichalGeyer/plug-and-play) | 0.8B | Feature and self-attention injection image editing model. | 1134 | 156/-100 | 22 |
| 3 | Instruction-following forward-pass model | [InstructPix2Pix](https://www.timothybrooks.com/instruct-pix2pix) | 0.8B | Instruction-based image editing model trained with automatically synthesized data. | 1087 | 94/-107 | 29 |
| 4 | Instruction-following forward-pass model | [MagicBrush](https://osu-nlp-group.github.io/MagicBrush) | 0.8B | Same structure as InstructPix2Pix but fine-tuned on manually-annotated instruction-guided dataset. | 1085 | 106/-102 | 37 |
| 5 | Attention-guided Model | [Pix2PixZero](https://pix2pixzero.github.io/) | 0.8B | Cross-attention guided editing model. | 984 | 122/-102 | 21 |
| 6 | Latent-space manipulation model | [CycleDiffusion](https://github.com/ChenWu98/cycle-diffusion) | 0.8B | Unpaired image translation model with reconstructable encoder for stochastic DPMs. | 848 | 84/-129 | 24 |
| 7 | Latent-space manipulation model | [SDEdit](https://sde-image-editing.github.io/) | 0.8B | An image synthesis and editing framework with SDEs | 675 | 157/-227 | 15 |

We hope that continuously analyzing and understanding the performance dynamics of these models will foster advancements in AI-driven image editing technology and help users make decision about the best models for them.

## How to Get Involved and Join the Arena

Participating in GenAI-Arena is very simple. Visit our [Space](https://huggingface.co/spaces/TIGER-Lab/GenAI-Arena), explore the available models, and start generating and comparing images. Don’t forget to cast your vote for the model that impresses you the most!

To further engage with us and the community, join discussions on our leaderboard and share your experiences on social media. Whether you’re a seasoned researcher, a budding AI enthusiast, or simply curious about the capabilities of image generation models, we'd love to collaborate and discuss with you!

To witness how your latest model fares against other state-of-the-art competitors, consider contributing it to our [github](https://github.com/TIGER-AI-Lab/ImagenHub) via a PR. This will enable our GenAI-Arena pipeline to incorporate your model, allowing it to be showcased on the leaderboard following community voting.
Loading