Conversation
parlai/core/metrics.py
Outdated
```python
true_pos_score = sum(weighted_common.values())
if true_pos_score == 0:
    return 0
precision = true_pos_score / sum(weights[w] for w in pred_items)
```
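The snippet above covers only the weighted precision. For context, a self-contained sketch of the full weighted F1 might look like the following (the `weights` dict mapping token to rarity weight is assumed here, not taken from the PR):

```python
from collections import Counter

def weighted_f1(pred_items, gold_items, weights):
    """Sketch of an F1 where each matched token contributes its rarity
    weight instead of a flat count of 1. `weights` maps token -> weight
    in [0, 1] (hypothetical; the PR derives these from corpus counts)."""
    # Token overlap between prediction and gold, multiplicities included.
    common = Counter(pred_items) & Counter(gold_items)
    # Credit each shared token by its weight rather than by 1.
    weighted_common = {w: n * weights.get(w, 0.0) for w, n in common.items()}
    true_pos_score = sum(weighted_common.values())
    if true_pos_score == 0:
        return 0.0
    precision = true_pos_score / sum(weights.get(w, 0.0) for w in pred_items)
    recall = true_pos_score / sum(weights.get(w, 0.0) for w in gold_items)
    return 2 * precision * recall / (precision + recall)
```

Note that a prediction made entirely of zero-weight (common) tokens scores 0, which is the anti-gaming property the patch description aims for.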
Oh, this is really interesting. I had originally imagined the weights to just be 1 if above the threshold and 0 otherwise; I wonder if you could make that a configurable option for the metric? This is a cool way of doing it too.
Glad you like it! If you look at my plots of _rarity_weight in the PR description, you'll see that it rises fairly sharply after the chosen threshold. I was afraid that this might tend toward the simpler version (0 above and 1 below the threshold) in most cases, at which point the weighting would complicate the metric without adding much value. So I just opted for the simpler version for now.
But if there were a way of calculating the rarity weight which rose to 1.0 more gradually, we could try another version of this metric that doesn't have a "cutoff" and only has the weighting. That would feel much more elegant and would probably be more robust to different word distributions.
The issue I ran into while trying to find a function like that, though, is that the top few words of the distribution are so common relative to even the rest of the top 50 words that the function has to push the majority of the range (say, 0.0 to 0.99) down to near zero while stretching the last bit (0.99 to 1.0) into a gradual slope from near-zero to 1.0. Otherwise, fairly common words still receive a high weight.
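One candidate family with roughly that shape (purely illustrative, not something from this PR) is a high power of the cumulative frequency mass. If cdf is the fraction of corpus tokens covered by this word and all more frequent words, common words sit near 0 and rare words near 1, and a large exponent keeps most of [0, 0.99] near zero while the tail ramps up smoothly:

```python
def rarity_weight(cdf: float, k: float = 300.0) -> float:
    """Hypothetical smooth rarity weight with no hard cutoff.

    cdf: cumulative token-frequency mass covered by this word and all
    more-frequent words (common words ~0.0, rare words ~1.0).
    k: exponent controlling how sharply the weight concentrates near 1.0
    (300 is an arbitrary illustrative choice, not tuned on any corpus).
    """
    # cdf**k squashes most of the range toward zero: e.g. 0.5**300 is
    # astronomically small, while 0.999**300 is still about 0.74.
    return cdf ** k
```

The exponent k plays the role of the threshold but without a discontinuity, so it could in principle be exposed as the configurable option mentioned above.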
Given that this is a pretty narrow metric, I'd feel a bit better if it were moved to the teacher itself; custom_evaluation in particular would be a good fit.
We use this in another teacher in parlai_internal (which we plan to move to public soon). But I moved it to wizard for now.
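For readers unfamiliar with the hook being discussed: a teacher-local metric plugs in roughly like the standalone mimic below. It borrows ParlAI's custom_evaluation naming but does not import ParlAI, so the real signature and metric objects may differ from what's shown; the simple set-overlap F1 is also just a placeholder for the weighted version above.

```python
class RareWordF1Teacher:
    """Standalone mimic of a ParlAI-style teacher hook; class and method
    names follow ParlAI conventions but everything here is illustrative."""

    def __init__(self, rare_words):
        self.rare_words = set(rare_words)
        self.metrics = {}  # stand-in for ParlAI's metrics aggregator

    def custom_evaluation(self, teacher_action, labels, model_response):
        # Keep only tokens deemed rare; common tokens earn no credit.
        pred = [w for w in model_response.split() if w in self.rare_words]
        gold = [w for w in labels[0].split() if w in self.rare_words]
        if not gold:
            return  # nothing rare to score on this example
        # Set-level overlap (ignores multiplicity, for brevity).
        overlap = len(set(pred) & set(gold))
        precision = overlap / len(pred) if pred else 0.0
        recall = overlap / len(gold)
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        self.metrics.setdefault('rare_word_f1', []).append(f1)
```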
ya this looks quite right to me
Patch description
F1 can be gamed easily (either by humans or the model) by predicting common tokens irrespective of semantics.
To mitigate this, this PR introduces "Rare Word F1", which only gives credit for matching words that are infrequent relative to some reference corpus.
This is less susceptible to the adversarial scenario of a model that predicts the same thing over and over again, since it shouldn't be possible to find a set of words that is both rare and shows up often in the labels.
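As a rough illustration of that idea (the cutoff value, whitespace tokenization, and function name below are placeholders, not the PR's actual implementation): build the "common" set from a reference corpus until it covers a chosen fraction of all tokens, then compute F1 over only the remaining rare tokens.

```python
from collections import Counter

def rare_word_f1(pred, gold, corpus_tokens, top_p=0.5):
    """Minimal sketch of a rare-word F1: tokens inside the most frequent
    top_p mass of a reference corpus are excluded, so spamming common
    words earns no credit."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    covered, common = 0, set()
    # Mark words as "common" until they cover top_p of all corpus tokens.
    for word, c in counts.most_common():
        if covered / total >= top_p:
            break
        common.add(word)
        covered += c
    pred_rare = [w for w in pred.split() if w not in common]
    gold_rare = [w for w in gold.split() if w not in common]
    overlap = sum((Counter(pred_rare) & Counter(gold_rare)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_rare)
    recall = overlap / len(gold_rare)
    return 2 * precision * recall / (precision + recall)
```

A degenerate model that always emits the corpus's most frequent tokens scores 0 here, which captures the robustness argument above.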