SearchCollection Tie-Breaking #1925

manveertamber · 2022-07-13T00:53:48Z

While building a regression for QA with the DPR Wikipedia 100-word splits corpus, I found that Top-K accuracy can differ in the fourth decimal place depending on the format of the ids used in the corpus before indexing and searching. Using numbered ids gives slightly different scores than using ids of the form "doc_id#segment_id", e.g. id:"10" vs. id:"9#1".

| id format | Natural Questions Test: top_20_accuracy |
| --- | --- |
| numbered | 0.6294 |
| doc_id#segment_id | 0.6296 |

This seems to be caused by how ties are broken in SearchCollection: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/search/SearchCollection.java#L146. With the example ids above, "10" < "2" < "9#1" lexicographically, but the same document could be assigned either "10" or "9#1", so its position among equally scored hits depends on the id format.
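To illustrate the suspected effect, here is a minimal Python sketch. It is not Anserini's code: it assumes ties are broken by comparing ids as strings in ascending order, and the passage names `X` and `Y`, the score `7.5`, and the id `"1#1"` are made up for the example. Two passages tie on score, and which one survives the cutoff depends only on the id scheme.

```python
def top_k(hits, k):
    """Rank (docid, score) pairs by descending score, then ascending docid string."""
    ranked = sorted(hits, key=lambda hit: (-hit[1], hit[0]))
    return [docid for docid, _ in ranked[:k]]


# Hypothetical data: two passages with exactly the same retrieval score.
scores = {"X": 7.5, "Y": 7.5}

# The same two passages under the two id schemes (assumed mappings).
numbered_ids = {"X": "10", "Y": "2"}
segment_ids = {"X": "9#1", "Y": "1#1"}


def retrieved_passages(id_map, k):
    """Return the names of the passages that make the top-k under a given id scheme."""
    hits = [(id_map[name], score) for name, score in scores.items()]
    name_by_id = {docid: name for name, docid in id_map.items()}
    return [name_by_id[docid] for docid in top_k(hits, k)]


# "10" < "2" as strings, so X wins the tie under numbered ids.
print(retrieved_passages(numbered_ids, 1))
# "1#1" < "9#1" as strings, so Y wins the tie under doc_id#segment_id ids.
print(retrieved_passages(segment_ids, 1))
```

If only one of the tied passages contains the answer, a question can count as a hit under one id scheme and a miss under the other, which would be enough to move top_20_accuracy by a small amount.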
