-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ranking: add tiebreakers to BM25 #914
Conversation
This adds repo freshness and file order as tiebreakers to the final bm25 score, just like we have for Zoekt's default scoring. During testing I found that it is a lot less likely for the tiebreakers to have an effect with BM25 because the score depends on qualites of the document, such as the relative length and number of matches, which usually differ even with the quality of the match is similar. Test plan: - Score tests still pass - manual testing: see screenshots
Nice ! Did you run our end-to-end evals to check this improves (or at least doesn't worsen) performance? Also could you explain what the example shows? The queries are different between BM25 and Default, so not quite sure how to compare them. |
@@ -361,10 +349,26 @@ func (d *indexData) scoreFilesUsingBM25(fileMatch *zoekt.FileMatch, doc uint32, | |||
sumTF += f | |||
score += tfScore(k, b, L, f) | |||
} | |||
// 2 digits of precision |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we using 2 digits here simply so we can add the tiebreaker? Or does it aid in "collapsing" similar BM25 scores together, to allow tiebreakers to actually become relevant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We caught up about this over Zoom. We don't believe this will actually collapse BM25 scores, it's just so we can apply the tiebreaker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 2 digits are just to avoid an overlap with the tiebreaker. In my experiments and based on the evaluations, 2 digits seemed sufficient, but we can simply increase precision if we see that's necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice ! Did you run our end-to-end evals to check this improves (or at least doesn't worsen) performance?
@stefanhengl told me that he reran evals and didn't see any different in performance.
Over Zoom, we discussed whether we should even have this, because it kicks in so infrequently. I am still supportive of adding it, because I think it creates a simpler mental model for Zoekt scoring: we always have a "query/ file match" component, plus the same tiebreakers. Otherwise, BM25 would be an 'exception' compared to default scoring.
Relates to SPLF-838
This adds repo freshness and file order as tiebreakers to the final bm25 score, just like we have for Zoekt's default scoring.
During testing I found that it is a lot less likely for the tiebreakers to have an effect with BM25 because the score depends on qualities of the document, such as the relative length and number of matches, which usually differ even if the quality of the match is similar.
Note: I updated the format of the debug string a bit to make it (hopefully) more readable (see screenshot)
Test plan: