Is ChatGPT a Good NLG Evaluator? A Preliminary Study

This repository contains the data for our report "Is ChatGPT a Good NLG Evaluator? A Preliminary Study"

1. ChatGPT Evaluation Results

please refer to data/*.json

2. Reproduce Results

SummEval

from correlations import sample_level_correlation_summeval, dataset_level_correlation_summeval

for aspect in ['coherence', 'relevance', 'consistency', 'fluency']:
    sample_level_correlation_summeval(aspect)

for aspect in ['coherence', 'relevance', 'consistency', 'fluency']:
    dataset_level_correlation_summeval(aspect)

OpenMEVA

from correlations import sample_level_correlation_openmeva, dataset_level_correlation_openmeva
sample_level_correlation_openmeva()
dataset_level_correlation_openmeva()

BAGEL

from correlations import dataset_level_correlation_bagel

for aspect in ['informativeness', 'naturalness', 'quality']:
    dataset_level_correlation_bagel(aspect)

3. Prompts

SummEval

Coherence:

prompt_summeval_coherence = """Score the following news summarization given the corresponding news with respect to coherence with one to five stars, where one star means "incoherence" and five stars means "perfect coherence". Note that coherence measures the quality of all sentences collectively, to the fit together and sound naturally. Consider the quality of the summary as a whole.

News: %s
Summary: %s
Stars:
""" % (article, generated_summ)

Relevance:

prompt_summeval_relevance = """Score the following news summarization given the corresponding news with respect to relevance with one to five stars, where one star means "irrelevance" and five stars means "perfect relevance". Note that relevance measures how well the summary captures the key points of the article. Consider whether all and only the important aspects are contained in the summary.

News: %s
Summary: %s
Stars:
""" % (article, generated_summ)

Consistency:

prompt_summeval_consistency = """Score the following news summarization given the corresponding news with respect to consistency with one to five stars, where one star means "inconsistency" and five stars means "perfect consistency". Note that consistency measures whether the facts in the summary are consistent with the facts in the original article. Consider whether the summary does reproduce all facts accurately and does not make up untrue information.

News: %s
Summary: %s
Stars:
""" % (article, generated_summ)

Fluency:

prompt_summeval_fluency = """Score the following news summarization given the corresponding news with respect to fluency with one to five stars, where one star means "disfluency" and five stars means "perfect fluency". Note that fluency measures the quality of individual sentences, are they well-written and grammatically correct. Consider the quality of individual sentences.

News: %s
Summary: %s
Stars:
""" % (article, generated_summ)

OpenMEVA

prompt_openmeva = """Score the following storyline given the beginning of the story with one to five stars.
Where one star means "Nonsense",
two stars mean "The storyline has some connections with the beginning, but is not understandable",
three stars mean "The storyline has some connections with the beginning and is understandable",
four stars mean "The storyline is consistent with the beginning and possibly involves a few grammar mistakes",
and five stars mean "Perfect storyline and grammar".

The beginning of the story: %s
Storyline: %s
Stars:
""" % (story_beginning, generated_storyline)

BAGEL

Informativeness:

prompt_bagel_informativeness = """Score the following natural text given the corresponding reference with respect to informativeness with one to five stars, where one star means "uninformative" and five stars means "perfect informativeness". Note that informativeness is defined as whether it contains all the information in the reference.

The reference: %s
The natural text: %s

Stars:
""" % (reference, sys_summ)

Naturalness:

prompt_bagel_naturalness = """Score the following natural text given the corresponding structured information with one to five stars, where one star means "unnaturalness" and five stars means "perfect naturalness".

The structured information: %s   
The natural text: %s

Stars:
""" % (src, sys_summ)

Quality

prompt_bagel_quality = """Score the following natural text given the corresponding structured information with one to five stars, where one star means "nonsense or no core meaning preserved" and five stars means "perfect core meaning and grammar".

The structured information: %s
The natural text: %s

Stars:
""" % (src, sys_summ)

Bib

Please cite our work if you find it useful.

TBD

Acknowledgements

Part of this code is inspired by BARTScore and OpenMEVA:

The results of baseline metrics in data/bagel.json and data/summeval are provided by BARTScore
The results of baseline metrics in data/openmeva.json are calculated by the standard implementation of OpenMEVA and BARTScore

Prompts are inspired by Large Language Models Are State-of-the-Art Evaluators of Translation Quality

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Is ChatGPT a Good NLG Evaluator? A Preliminary Study

1. ChatGPT Evaluation Results

2. Reproduce Results

SummEval

OpenMEVA

BAGEL

3. Prompts

SummEval

OpenMEVA

BAGEL

Bib

Acknowledgements

Files

README.md

Latest commit

History

README.md

File metadata and controls

Is ChatGPT a Good NLG Evaluator? A Preliminary Study

1. ChatGPT Evaluation Results

2. Reproduce Results

SummEval

OpenMEVA

BAGEL

3. Prompts

SummEval

OpenMEVA

BAGEL

Bib

Acknowledgements