Hi, I'm glad to have found such a good benchmark for summarization metrics.
I have a few questions after going through the code and the paper.
As far as I understand, human_annotations.json contains scores that summarize human_annotations_sentence.json.
(As a sanity check of my understanding:) For each sentence, the major error type is taken as that sentence's final label, which in turn affects the scores (e.g. NoE for factuality, LinkE for both LinkE and Discourse Errors). So if I see one NoE sentence and two LinkE sentences for an article, it should be scored as {Factuality: 0.333, Discourse_Error: 0.333}, with Semantic_Frame_Errors and Content Verifiability Errors both being 1.0, meaning the summary sentences are free of those error types but contain a Discourse Error, which brings Factuality down to 1/3.
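To be concrete, here is a minimal sketch of the aggregation as I currently understand it (the grouping of fine-grained labels into categories is my own assumption from reading the paper, not taken from your scripts):

```python
# My assumed grouping of fine-grained labels into error categories
# (based on my reading of the paper, not on your code).
CATEGORY = {
    "RelE": "Semantic_Frame_Errors",
    "EntE": "Semantic_Frame_Errors",
    "CircE": "Semantic_Frame_Errors",
    "CorefE": "Discourse_Errors",
    "LinkE": "Discourse_Errors",
    "OutE": "Content_Verifiability_Errors",
    "GramE": "Content_Verifiability_Errors",
}

def aggregate(sentence_labels):
    """sentence_labels: the final (major) label per summary sentence, e.g. ["NoE", "LinkE", "LinkE"]."""
    n = len(sentence_labels)
    # Factuality = fraction of sentences labeled NoE (no error).
    scores = {"Factuality": sum(lab == "NoE" for lab in sentence_labels) / n}
    for cat in set(CATEGORY.values()):
        # Each category score = fraction of sentences *free* of that error category.
        scores[cat] = sum(CATEGORY.get(lab) != cat for lab in sentence_labels) / n
    return scores

# One NoE sentence and two LinkE sentences:
# Factuality = 1/3, Discourse_Errors = 1/3,
# Semantic_Frame_Errors = 1.0, Content_Verifiability_Errors = 1.0
print(aggregate(["NoE", "LinkE", "LinkE"]))
```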
What I don't understand is how the 'Flip' scores are determined. At first I assumed they would be 1 - ErrorType (e.g. Flip Discourse Error = 1 - Discourse Error), but I still can't find any consistent way to derive those scores from the labels. I also looked for the piece of code that generates human_annotations.json, but nothing shows explicitly how the Flip scores are computed from the original ones. I understand the motivation for using flip scores in the ablation study, but I'm not sure how they are generated.
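For reference, this is the naive flip I tried (just my guess; it does not reproduce the values in human_annotations.json):

```python
# My naive guess: Flip_X = 1 - X, reusing the scores from the sketch above.
scores = aggregate(["NoE", "LinkE", "LinkE"])
flip_discourse_error = 1.0 - scores["Discourse_Errors"]  # -> 0.666..., which doesn't match the released values
```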
Thanks again for the great piece of work. If you could kindly explain this to me, it would be a great help. =]