While building a regression for QA with the DPR Wikipedia 100-word splits corpus, I found that Top-K accuracy can differ in the fourth decimal place depending on the format of the id used in the corpus before indexing and searching. Using a numeric id gives slightly different scores than using an id of the form "doc_id#segment_id", e.g. id:"10" vs. id:"9#1".
This seems to be caused by how ties are broken in SearchCollection: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/search/SearchCollection.java#L146. With the example ids above, "10" < "2" < "9#1" lexicographically, but the same passage could be assigned either "10" or "9#1", so passages with tied scores can end up in a different order depending on the id format.
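To make the effect concrete, here is a minimal sketch in Python. It is not Anserini's code. It assumes ties are broken by ascending lexicographic docid, and the `top_k` helper, the scores, and the id "2#0" are made up for illustration. The same two passages have tied scores under both id schemes, and the passage with id "10" / "9#1" is taken to be the one containing the answer.

```python
def top_k(hits, k):
    """Rank (docid, score) pairs by descending score, breaking ties by
    ascending lexicographic docid (an assumed tie-breaking rule), and
    return the top-k docids."""
    ranked = sorted(hits, key=lambda hit: (-hit[1], hit[0]))
    return [docid for docid, _ in ranked[:k]]


# The same two passages under two id schemes; scores are tied on purpose.
# Hypothetical mapping: "10" and "9#1" name the passage with the answer,
# "2" and "2#0" name the other passage.
numeric_hits = [("2", 12.5), ("10", 12.5)]
segment_hits = [("2#0", 12.5), ("9#1", 12.5)]

top1_numeric = top_k(numeric_hits, 1)  # "10" sorts before "2"
top1_segment = top_k(segment_hits, 1)  # "2#0" sorts before "9#1"

print(top1_numeric, top1_segment)
```

With numeric ids the answer passage is ranked first, while with "doc_id#segment_id" ids the other passage is, so Top-1 accuracy for this query flips even though the scores are identical. A handful of such ties across a query set would be enough to move the fourth decimal place.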