Add ability to parse raw text into docvectors on-the-fly for impact indexes #2122

lintool · 2023-05-23T17:27:02Z

As part of #1984 we added the ability to re-create docvectors on-the-fly, so that we don't need to store docvectors in the index (we do need to store the raw text instead, but that is smaller).

This feature hasn't been exposed for impact indexes. We should do it so models like uniCOIL and SPLADE can benefit.
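
To make the idea concrete, here is a minimal conceptual sketch, not Anserini's actual code. It assumes the stored text for a pretokenized impact document is a whitespace-delimited token stream in which a term appears once per unit of its quantized weight, so a docvector can be rebuilt by counting tokens. The function name `rebuild_docvector` and the sample text are hypothetical.

```python
from collections import Counter


def rebuild_docvector(pretokenized_text: str) -> dict[str, int]:
    """Rebuild a term -> frequency map from stored, pretokenized text.

    Assumption: tokens are separated by whitespace and need no further
    analysis (this mirrors what -pretokenized implies, but is only a sketch).
    """
    return dict(Counter(pretokenized_text.split()))


# Hypothetical stored text for one document.
text = "neural neural neural ranking ranking model"
print(rebuild_docvector(text))
```

The trade-off is extra work at query time (re-tokenizing and counting) in exchange for not storing the docvectors in the index.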

lintool · 2023-05-24T01:55:28Z

Building three types of indexes, using uniCOIL as an example:

# "base"
target/appassembler/bin/IndexCollection \
  -collection JsonVectorCollection \
  -input /mnt/collections/msmarco/msmarco-passage-unicoil \
  -index indexes/lucene-index.msmarco-passage-unicoil/ \
  -generator DefaultLuceneDocumentGenerator \
  -threads 16 -impact -pretokenized -optimize &

# Store docvectors
target/appassembler/bin/IndexCollection \
  -collection JsonVectorCollection \
  -input /mnt/collections/msmarco/msmarco-passage-unicoil \
  -index indexes/lucene-index.msmarco-passage-unicoil.docvectors/ \
  -generator DefaultLuceneDocumentGenerator \
  -threads 16 -impact -pretokenized -storeDocvectors -optimize &

# Store raw text
target/appassembler/bin/IndexCollection \
  -collection JsonVectorCollection \
  -input /mnt/collections/msmarco/msmarco-passage-unicoil \
  -index indexes/lucene-index.msmarco-passage-unicoil.text/ \
  -generator DefaultLuceneDocumentGenerator \
  -threads 16 -impact -pretokenized -storeRaw -optimize &

Storing only the raw text saves a lot of space compared to storing docvectors:

$ du -h indexes/ | grep unicoil
1.3G	indexes/lucene-index.msmarco-passage-unicoil
47G	indexes/lucene-index.msmarco-passage-unicoil.docvectors
8.0G	indexes/lucene-index.msmarco-passage-unicoil.text
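
For a rough sense of the savings, the sizes above can be compared directly. This is just arithmetic on the `du` numbers; the variable names are for illustration only.

```python
# Index sizes in GB, copied from the du output above.
sizes_gb = {"base": 1.3, "docvectors": 47.0, "text": 8.0}

# Extra space each option adds on top of the base impact index.
extra_gb = {
    name: size - sizes_gb["base"]
    for name, size in sizes_gb.items()
    if name != "base"
}

# How many times larger the docvectors index is than the raw-text index.
ratio = sizes_gb["docvectors"] / sizes_gb["text"]

for name, extra in extra_gb.items():
    print(f"{name}: +{extra:.1f}G over base")
print(f"docvectors index is {ratio:.1f}x the size of the text index")
```

So the docvectors index is roughly 5.9x the size of the raw-text index on this collection.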

lintool · 2023-05-24T02:37:18Z

@AileenLin to help you out - this is currently what's not working on the Pyserini end:

python -m pyserini.search.lucene \
  --threads 16 --batch-size 128 \
  --index ../anserini/indexes/lucene-index.msmarco-passage-unicoil.docvectors \
  --topics dl19-passage-unicoil \
  --output runs/run.dl19-rocchio.txt \
  --hits 1000 --impact --rocchio

Ultimately, I want to make this work.

AileenLin · 2023-05-29T02:40:30Z

Do you mean this error? AttributeError: 'LuceneImpactSearcher' object has no attribute 'set_rocchio'

I have tested Anserini with the following commands, and the results matched the benchmark:

target/appassembler/bin/SearchCollection \
  -index indexes/lucene-index.msmarco-passage-unicoil.docvectors/ \
  -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.unicoil.tsv.gz \
  -collection JsonVectorCollection \
  -topicreader TsvInt \
  -output runs/run.msmarco-passage-unicoil.rocchio.topics.msmarco-passage.dev-subset.unicoil.txt \
  -impact -pretokenized -rocchio

target/appassembler/bin/SearchCollection \
  -index indexes/lucene-index.msmarco-passage-unicoil.text/ \
  -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.unicoil.tsv.gz \
  -collection JsonVectorCollection \
  -topicreader TsvInt \
  -output runs/run.msmarco-passage-unicoil.rocchio.topics.msmarco-passage.dev-subset.unicoil.txt \
  -impact -pretokenized -rocchio

lintool · 2023-05-29T02:45:31Z

Yup, we need to expose the feature in the Java class, and then wire the connections to Python.

AileenLin · 2023-05-29T02:46:24Z

got it

lintool · 2023-08-11T23:56:10Z

Ref: #2164 #2165

castorini/anserini#2122 Add ability to parse raw text into docvectors on-the-fly for impact indexes
castorini/anserini#2165 Misalignment in SearchCollection and SimpleImpactSearcher implementation - so some changes in 2cr

lintool · 2023-08-30T02:19:55Z

This has been pushed out in v0.22.0. All done!

lintool assigned AileenLin May 23, 2023

AileenLin added a commit to AileenLin/anserini that referenced this issue Jun 3, 2023

castorini#2122

b6a14e9

AileenLin mentioned this issue Jul 16, 2023

Add ability to parse raw text into docvectors on-the-fly for impact indexes #2122 #2148

Merged

lintool pushed a commit that referenced this issue Aug 8, 2023

Add ability to parse raw text into docvectors on-the-fly for impact i…

9cdcf0e

…ndexes #2122 (#2148)

lintool closed this as completed Aug 30, 2023
