
The reported scores of GENIE are not fair #57

Open
BaohaoLiao opened this issue Mar 13, 2023 · 1 comment

BaohaoLiao commented Mar 13, 2023

Hi @qiweizhen,

I have a question about your evaluation.

From your paper: "In the inference process, we randomly sample 10 Gaussian noises for iteration denoising, and use the highest score as the final generated result." I also checked your evaluation script https://github.com/microsoft/ProphetNet/blob/master/GENIE/integration/eval_split.py.

For each source sentence, you generate 10 hypotheses, compute the ROUGE score between each hypothesis and the target sentence, and take the hypothesis with the best score as the final generation. Doing this for every source sentence, you collect all of the best-scoring hypotheses into the final generation file.
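Concretely, the selection step looks roughly like this (a minimal sketch on my side, not the actual eval_split.py code; the rouge_score package and the helper name are just illustrative):

```python
# Minimal sketch of the oracle selection described above (illustrative only,
# not the actual eval_split.py code). Uses the rouge_score package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def pick_best_hypothesis(hypotheses, target):
    """Return the hypothesis with the highest ROUGE-L F1 against the target."""
    return max(hypotheses, key=lambda h: scorer.score(target, h)["rougeL"].fmeasure)

# Toy example: in the real setup there would be 10 sampled hypotheses per source.
target = "the cat sat on the mat"
hypotheses = ["a cat sat on a mat", "the dog ran away", "the cat sat on the mat today"]

# The hypothesis chosen *using the reference* becomes the "final" output.
final_output = pick_best_hypothesis(hypotheses, target)
print(final_output)
```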

My question is: is this a fair or standard way to evaluate generation? At inference time, the target sentences should be unseen, so we cannot use them as a hint when selecting the generated output.

lzh0525 commented Mar 20, 2023

Thank you for your interest in our work.

You are right that the results in the main table are not an entirely fair comparison, which is also mentioned in Section 4.5. Strictly speaking, there is currently no fully fair and rigorous way to compare AR and diffusion models. However, these experiments do show the potential of diffusion models to generate results comparable to AR models, and they reflect the general trends.

In fact, we recognize this problem and propose a fairer evaluation method in the paper, using an LLM to evaluate 10 samples generated by the AR model and 10 samples generated by GENIE. As shown in Table 4 and Table 5, the overall quality of the diffusion model is slightly lower than that of the AR model, but the diffusion model can generate more diverse samples, which is also very important in practical applications of text generation.
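One way to make "more diverse" concrete (just an illustrative proxy on my side, not the LLM-based evaluation reported in Tables 4 and 5) is a distinct-n style measure over the set of sampled outputs:

```python
# Illustrative only: distinct-n as a simple diversity proxy for a set of samples
# (not the evaluation method used in the paper).
from itertools import chain

def distinct_n(samples, n=2):
    """Fraction of unique n-grams across all generated samples."""
    ngrams = list(chain.from_iterable(
        zip(*(s.split()[i:] for i in range(n))) for s in samples
    ))
    return len(set(ngrams)) / max(len(ngrams), 1)

samples = ["the cat sat on the mat", "a dog ran in the park", "the cat sat on the mat"]
print(distinct_n(samples, n=2))  # higher value => more diverse bigrams across samples
```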
