Evaluation Part 2 #136
-
Hi!

Should we evaluate the baseline qrels from part 1 with compute_metrics_plain() as well? I am asking because the data differs from a re-ranking task (which is the point, I guess?) and we get an MRR@10 of 0.95 with the re-ranking metrics. Or should we come up with our own evaluation metrics?

If we just do the evaluation with core_metrics_plain(), I think there are several issues. Mainly, documents that have the same relevance grade according to the baseline (or our own aggregation), e.g. 3, are sorted in arbitrary order. Let's say we have 3 documents per query with grades 3, 3, 2. Then switching the positions of the first two documents would result in a pretty different metric, MRR for example.
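To make the tie-break effect concrete, here is a minimal sketch; the reciprocal_rank helper, the doc ids, and the gold labels are made up for illustration and are not the course's core_metrics_plain(). Two documents tied at the same aggregated grade can land in either order, and if their gold relevance differs, the reciprocal rank changes with the tie-break:

```python
def reciprocal_rank(ranking, relevant, cutoff=10):
    """RR@cutoff: 1 / rank of the first relevant document, else 0."""
    for rank, doc_id in enumerate(ranking[:cutoff], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Aggregated grades from part 1: d1 and d2 are tied at grade 3,
# so sorting by grade leaves their relative order arbitrary.
grades = {"d1": 3, "d2": 3, "d3": 2}
# Suppose the gold judgments say only d2 is relevant.
relevant = {"d2"}

ranking_a = ["d1", "d2", "d3"]  # one tie-break
ranking_b = ["d2", "d1", "d3"]  # the other tie-break
print(reciprocal_rank(ranking_a, relevant))  # 0.5
print(reciprocal_rank(ranking_b, relevant))  # 1.0
```

Kind Regards,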
Replies: 2 comments
-
Dear Sebastian,

Using core_metrics_plain() is enough. Yes, the arbitrary sorting will result in worse scores, but for our purposes this is good enough.
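As an aside, if you want reproducible numbers despite the ties, a deterministic tie-break (e.g. by document id) at least makes the arbitrary order stable across runs. A minimal sketch with a hypothetical grades dict; this is not something core_metrics_plain() does for you:

```python
# Hypothetical aggregated grades; d1 and d2 are tied at grade 3.
grades = {"d1": 3, "d2": 3, "d3": 2}

# Sort by grade descending, then doc id ascending: tied documents now
# keep a fixed (if still arbitrary) order on every run.
ranking = sorted(grades, key=lambda d: (-grades[d], d))
print(ranking)  # ['d1', 'd2', 'd3']
```

Best,
Pia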
-
Alright, thanks a lot for the reply!

Kind Regards,