@mpetri, @amallia, and I have come across a weird bug where an input JsonVectorCollection will have its weights broken by long terms, possibly impacting downstream ranking.
The specific bug is because of a series of design choices.
Anserini "clones" a term with a given weight value `weight` times (pseudo-document generation) to offload the actual indexing to Lucene (without tinkering with internals). Inside Lucene, the default maximum term length is 255 chars (see https://lucene.apache.org/core/8_0_0/core/constant-values.html#org.apache.lucene.analysis.standard.StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH).
So, getting down to the messy bits.
Assume you have a term coming into your vector with 256 characters and a weight of 200.
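What happens is that the term is split after the 255th character, leaving the final character dangling as its own term. Because every clone of the long term is split the same way, that leftover term is emitted 200 times too, and this can mess up the underlying impacts.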
A toy example:
This will result in an index with "X" having an impact of 400 (!!!!!) instead of 200.
Clearly this then flows on to downstream indexing/querying tasks.
One solution we found was overriding the default value of 255 in the constructor for the `WhitespaceAnalyzer` (see https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexCollection.java#L768). We set it to the max permissible value of `1048576`, which solves the problem.
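For anyone who wants to see the effect without building an index, here is a minimal sketch in plain Java that mimics the two steps described above. It is not Anserini or Lucene code. The class `LongTermImpactDemo`, its methods, and the input vector (a term "X" with weight 200, plus a 256-character term ending in "X" with weight 200) are all hypothetical and chosen only for illustration. The sketch also assumes that the tokenizer cuts an over-long token into consecutive chunks of at most the maximum length.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for pseudo-document generation plus length-capped whitespace tokenization. */
final class LongTermImpactDemo {

    /** Builds the pseudo-document: each term is repeated `weight` times, separated by spaces. */
    static String pseudoDocument(Map<String, Integer> vector) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : vector.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                sb.append(e.getKey()).append(' ');
            }
        }
        return sb.toString();
    }

    /** Splits on whitespace; ASSUMPTION: over-long tokens are cut into chunks of at most maxTokenLength. */
    static List<String> tokenize(String text, int maxTokenLength) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            for (int start = 0; start < raw.length(); start += maxTokenLength) {
                tokens.add(raw.substring(start, Math.min(raw.length(), start + maxTokenLength)));
            }
        }
        return tokens;
    }

    /** Counts term frequencies in the tokenized pseudo-document; these play the role of impacts. */
    static Map<String, Integer> impacts(Map<String, Integer> vector, int maxTokenLength) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String token : tokenize(pseudoDocument(vector), maxTokenLength)) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    /** Hypothetical input: "X" with weight 200 and a 256-char term (255 x 'a' + "X") with weight 200. */
    static Map<String, Integer> toyVector() {
        StringBuilder longTerm = new StringBuilder();
        for (int i = 0; i < 255; i++) {
            longTerm.append('a');
        }
        longTerm.append('X');
        Map<String, Integer> vector = new LinkedHashMap<>();
        vector.put("X", 200);
        vector.put(longTerm.toString(), 200);
        return vector;
    }

    public static void main(String[] args) {
        // With the 255-char limit, every clone of the long term sheds a stray "X".
        System.out.println(impacts(toyVector(), 255).get("X"));     // prints 400
        // With the larger limit, no term is split and "X" keeps its weight.
        System.out.println(impacts(toyVector(), 1048576).get("X")); // prints 200
    }
}
```

In this sketch the 200 stray "X" fragments add to the 200 genuine occurrences, which matches the doubling described above. Raising the limit means no term is split, so the counts equal the input weights.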