SearchCollection Tie-Breaking #1925

Open
manveertamber opened this issue Jul 13, 2022 · 0 comments
manveertamber commented Jul 13, 2022

While building a regression for QA with the DPR Wikipedia 100-word splits corpus, I found that top-k accuracy can differ in the fourth decimal place depending on the format of the ids used in the corpus before indexing and searching. Using numeric ids gives slightly different scores than using ids of the form "doc_id#segment_id", e.g. id:"10" vs. id:"9#1".

| id format | Natural Questions Test: top_20_accuracy |
|---|---|
| numbered | 0.6294 |
| doc_id#segment_id | 0.6296 |

This seems to be caused by how ties are broken in SearchCollection: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/search/SearchCollection.java#L146. With the example ids above, "10" < "2" < "9#1" lexicographically, but the same document could be assigned either "10" or "9#1", so its position among equally scored hits depends on the id format.
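To make the effect concrete, here is a minimal Python sketch. It is not Anserini's code. It assumes hits are sorted by descending score with ties broken by ascending lexicographic docid, which is what the behaviour above suggests. The passage labels `X` and `Y`, the score 12.5, and the id "1#1" are hypothetical values chosen for illustration.

```python
def rank(hits, k):
    """Return the top-k (docid, score) pairs.

    Assumed ordering: descending score, ties broken by ascending
    lexicographic docid (string comparison, not numeric).
    """
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))[:k]


# Two passages with exactly the same retrieval score (hypothetical values).
scores = {"X": 12.5, "Y": 12.5}

# The same two passages under the two id formats from this issue.
# Passage X is "10" in the numbered scheme and "9#1" in the segmented one.
numbered_ids = {"X": "10", "Y": "2"}
segmented_ids = {"X": "9#1", "Y": "1#1"}  # "1#1" is a made-up id for Y


def top_k_passages(id_map, k):
    """Rank the passages using the ids from id_map, then map ids back to labels."""
    hits = [(id_map[passage], score) for passage, score in scores.items()]
    label_of = {docid: passage for passage, docid in id_map.items()}
    return [label_of[docid] for docid, _ in rank(hits, k)]


top_numbered = top_k_passages(numbered_ids, 1)
top_segmented = top_k_passages(segmented_ids, 1)
print(top_numbered, top_segmented)
```

With a cutoff of 1, the numbered scheme keeps passage X (because "10" < "2" as strings) while the segmented scheme keeps passage Y (because "1#1" < "9#1"). If only one of the two passages contains the answer, top-k accuracy changes even though the scores are identical.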
