Hi @qiweizhen,

I have a question about your evaluation.

From your paper: "In the inference process, we randomly sample 10 Gaussian noises for iteration denoising, and use the highest score as the final generated result." I also checked your file https://github.com/microsoft/ProphetNet/blob/master/GENIE/integration/eval_split.py.

For each source sentence, you generate 10 hypotheses. You then compute the ROUGE score between each hypothesis and the target sentence, and take the hypothesis with the best score as the final generation. You do this for every source sentence and combine all of the best-scoring hypotheses into the final generation file.

My question is: is this a fair or standard way to generate? At inference time the target sentences are blind; we cannot use them as a hint for generation.
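To make the setup concrete, here is a minimal sketch of that selection step, assuming the `rouge-score` package and ROUGE-L F1 as the score; the function and variable names are hypothetical and this is not the actual code from eval_split.py:

```python
# Minimal sketch of the oracle selection described above (hypothetical names;
# not the actual code from eval_split.py).
from rouge_score import rouge_scorer

def oracle_best_hypothesis(hypotheses, target):
    """Return the sampled hypothesis with the highest ROUGE-L F1 vs. the target.

    Note that this consults the reference ("target") to choose among samples,
    which is exactly the point being questioned here.
    """
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    best_hyp, best_f1 = None, -1.0
    for hyp in hypotheses:
        f1 = scorer.score(target, hyp)["rougeL"].fmeasure
        if f1 > best_f1:
            best_hyp, best_f1 = hyp, f1
    return best_hyp

# One oracle-selected hypothesis per source sentence is then written out to
# form the final generation file, e.g.:
# generations = [oracle_best_hypothesis(hyps, tgt)
#                for hyps, tgt in zip(all_hypotheses, all_targets)]
```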
The results in the main table are indeed not entirely fair to compare, which we also mention in Section 4.5. Strictly speaking, there is currently no fully fair and rigorous way to compare AR and diffusion models. However, these experiments can reflect the potential of diffusion models to reach generation quality comparable to AR models, and they also reflect general trends.

In fact, we recognize this problem and propose a fairer evaluation method in the paper: using an LLM to evaluate 10 samples generated by the AR model and 10 samples generated by GENIE. As the results in Table 4 and Table 5 show, the overall quality of the diffusion model is slightly lower than that of the AR model, but the diffusion model can generate more diverse samples, which is also very important in practical applications of text generation.
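As an illustrative aside (this is not the LLM-based evaluation from the paper), a simple reference-free proxy for the diversity claim is distinct-n computed over the 10 samples per source; a minimal sketch, with hypothetical names:

```python
# Hedged sketch: distinct-n as a simple, reference-free diversity proxy over a
# set of samples. This is NOT the paper's LLM-based evaluation, only an
# illustrative automatic measure.
def distinct_n(samples, n=2):
    """Fraction of unique n-grams across all samples (higher = more diverse)."""
    ngrams = []
    for s in samples:
        tokens = s.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)

# e.g., compare the 10 AR samples against the 10 GENIE samples for one source:
# distinct_n(ar_samples, n=2) vs. distinct_n(genie_samples, n=2)
```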