LLMScore

In this work, we propose LLMScore, a new framework that offers evaluation scores with multi-granularity compositionality. LLMScore leverages the large language models (LLMs) to evaluate text-to-image models. Please check out our paper "LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation". Our work is accepted at NeurIPS 2023!

Overview

The two images are generated using Stable-Diffusion-2 based on the text prompt sampled from the Concept Conjunction dataset. Baseline section shows the scores from the existing model-based evaluation metrics, Human section is the rating score from the human evaluation, LLMScore section is our proposed metric. The right column also shows the rationale generated by LLMScore.

Comparison of Text-Image Matching, Sentence Matching, and our LLM-based Instruction-Following Matching pipeline for text-to-image synthesis evaluation. Our LLMScore automatically provides accurate scores and reasonable rationales for text-to-image synthesis based on text prompts, and visual descriptions following various evaluation instructions.

Installation

Please follow install page to set up the environments and models.

Text-to-Image Synthesis Evaluation

Get score with rationale for evaluating the alignment between image and text prompt.

python llm_score.py --image sample/sample.png --text_prompt "a red car and a white sheep"

Try different LLMs by setting LLM_ID as one of ["gpt-4", "gpt-3.5-turbo", "vicuna"]:

python llm_score.py --image sample/sample.png --text_prompt "a red car and a white sheep" --llm_id LLM_ID

Notice that to use Vicuna, follow Part Install and Part Model Weights in FastChat_README to install fastchat and to obtain the Vicuna weights. To enable OpenAI-compastible APIs used in our repo, follow commands from Guideline to launch the controller, model worker and RESTful API server as below:

python3 -m fastchat.serve.controller
python3 -m fastchat.serve.model_worker --model-name 'vicuna-7b-v1.1' --model-path /path/to/vicuna/weights
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000

LLMScore with Rationale

Human Correlation

The rank correlation (Kendall's tau) is aggregated across the compositional prompt dataset (Concept Conjunction, Attribute Binding Contrast) on the left two columns (CompBench) and the general prompt dataset (MSCOCO, DrawBench, PaintSkills) on the right two columns (GeneralBench).

Citation

If you found this repository useful, please consider cite our paper:

@misc{lu2023llmscore,
      title={LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation}, 
      author={Yujie Lu and Xianjun Yang and Xiujun Li and Xin Eric Wang and William Yang Wang},
      year={2023},
      eprint={2305.11116},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

This repo benefits from BLIP-2, GRIT, GPT-4. Thank for their awesome works!

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
assets		assets
llm_descriptor		llm_descriptor
llm_evaluator		llm_evaluator
sample		sample
submodule		submodule
.gitignore		.gitignore
.gitmodules		.gitmodules
INSTALL.md		INSTALL.md
README.md		README.md
llm_score.py		llm_score.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLMScore

Overview

Installation

Text-to-Image Synthesis Evaluation

LLMScore with Rationale

Human Correlation

Citation

Acknowledgement

About

Releases

Packages

Languages

YujieLu10/LLMScore

Folders and files

Latest commit

History

Repository files navigation

LLMScore

Overview

Installation

Text-to-Image Synthesis Evaluation

LLMScore with Rationale

Human Correlation

Citation

Acknowledgement

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages