Add ability to parse raw text into docvectors on-the-fly for impact indexes #2122

Closed
lintool opened this issue May 23, 2023 · 7 comments

@lintool
Member

lintool commented May 23, 2023

As part of #1984 we added the ability to re-create docvectors on-the-fly, so that docvectors no longer need to be stored in the index (the raw text must be stored instead, but it is smaller).

This feature hasn't been exposed for impact indexes. We should do it so models like uniCOIL and SPLADE can benefit.
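
To illustrate the idea of re-creating a docvector from stored text, here is a minimal sketch. It is not Anserini's implementation. It assumes the stored text is a whitespace-separated token stream in which each term is repeated as many times as its quantized impact weight; the function name `rebuild_docvector` and the example passage are hypothetical.

```python
from collections import Counter


def rebuild_docvector(stored_text: str) -> dict[str, int]:
    """Recover a term -> weight map by counting whitespace-separated tokens.

    Assumption: each term appears in the text once per unit of impact weight.
    """
    return dict(Counter(stored_text.split()))


# Hypothetical pretokenized passage: "bank" has weight 3, "river" 2, "##s" 1.
example_text = "bank bank bank river river ##s"
docvector = rebuild_docvector(example_text)
print(docvector)  # {'bank': 3, 'river': 2, '##s': 1}
```

Under this assumption the counts can be recomputed whenever they are needed (for example, for relevance feedback), which is why the index only has to keep the text.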

@lintool
Member Author

lintool commented May 24, 2023

Building three types of indexes, using uniCOIL as an example:

# "base"
target/appassembler/bin/IndexCollection \
  -collection JsonVectorCollection \
  -input /mnt/collections/msmarco/msmarco-passage-unicoil \
  -index indexes/lucene-index.msmarco-passage-unicoil/ \
  -generator DefaultLuceneDocumentGenerator \
  -threads 16 -impact -pretokenized -optimize &

# Store docvectors
target/appassembler/bin/IndexCollection \
  -collection JsonVectorCollection \
  -input /mnt/collections/msmarco/msmarco-passage-unicoil \
  -index indexes/lucene-index.msmarco-passage-unicoil.docvectors/ \
  -generator DefaultLuceneDocumentGenerator \
  -threads 16 -impact -pretokenized -storeDocvectors -optimize &

# Store raw text
target/appassembler/bin/IndexCollection \
  -collection JsonVectorCollection \
  -input /mnt/collections/msmarco/msmarco-passage-unicoil \
  -index indexes/lucene-index.msmarco-passage-unicoil.text/ \
  -generator DefaultLuceneDocumentGenerator \
  -threads 16 -impact -pretokenized -storeRaw -optimize &

Storing only the raw text saves a lot of space compared with storing docvectors:

$ du -h indexes/ | grep unicoil
1.3G	indexes/lucene-index.msmarco-passage-unicoil
47G	indexes/lucene-index.msmarco-passage-unicoil.docvectors
8.0G	indexes/lucene-index.msmarco-passage-unicoil.text
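
The arithmetic behind that comparison can be worked out directly from the `du -h` figures above. This is only a sketch: the sizes are copied by hand from the listing, and the variable names are illustrative.

```python
# Sizes as reported by `du -h` above (the "G" column), copied by hand.
sizes_gb = {"base": 1.3, "docvectors": 47.0, "text": 8.0}

saved_gb = sizes_gb["docvectors"] - sizes_gb["text"]
ratio = sizes_gb["docvectors"] / sizes_gb["text"]
overhead_gb = sizes_gb["text"] - sizes_gb["base"]

print(f"raw text saves {saved_gb:.1f}G versus docvectors")
print(f"docvectors index is {ratio:.1f}x the raw-text index")
print(f"raw text adds {overhead_gb:.1f}G over the base index")
```

So the raw-text index is roughly one sixth the size of the docvectors index, while still being several times larger than the base index.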

@lintool
Member Author

lintool commented May 24, 2023

@AileenLin to help you out - this is currently what's not working on the Pyserini end:

python -m pyserini.search.lucene \
  --threads 16 --batch-size 128 \
  --index ../anserini/indexes/lucene-index.msmarco-passage-unicoil.docvectors \
  --topics dl19-passage-unicoil \
  --output runs/run.dl19-rocchio.txt \
  --hits 1000 --impact --rocchio

Ultimately, I want to make this work.

@AileenLin
Member

Do you mean this error? AttributeError: 'LuceneImpactSearcher' object has no attribute 'set_rocchio'

I have tested Anserini with the following commands, and the results matched the benchmark:

target/appassembler/bin/SearchCollection \
  -index indexes/lucene-index.msmarco-passage-unicoil.docvectors/ \
  -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.unicoil.tsv.gz \
  -collection JsonVectorCollection \
  -topicreader TsvInt \
  -output runs/run.msmarco-passage-unicoil.rocchio.topics.msmarco-passage.dev-subset.unicoil.txt \
  -impact -pretokenized -rocchio

target/appassembler/bin/SearchCollection \
  -index indexes/lucene-index.msmarco-passage-unicoil.text/ \
  -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.unicoil.tsv.gz \
  -collection JsonVectorCollection \
  -topicreader TsvInt \
  -output runs/run.msmarco-passage-unicoil.rocchio.topics.msmarco-passage.dev-subset.unicoil.txt \
  -impact -pretokenized -rocchio

@lintool
Member Author

lintool commented May 29, 2023

Yup, we need to expose the feature in the Java class, and then wire the connections to Python.
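
A rough sketch of what that two-step wiring looks like in general, using stand-in classes rather than the real Anserini or Pyserini code. Everything here is hypothetical except the method name `set_rocchio`, which is taken from the error message above; the actual signatures may differ.

```python
class StandInJavaSearcher:
    """Hypothetical stand-in for the Java-side impact searcher."""

    def __init__(self):
        self.rocchio_enabled = False

    def set_rocchio(self, enabled: bool) -> None:
        # Step 1: the backend exposes a switch for Rocchio feedback.
        self.rocchio_enabled = enabled


class ImpactSearcherWrapper:
    """Hypothetical Python wrapper that forwards calls to the backend."""

    def __init__(self, backend):
        self._backend = backend

    def set_rocchio(self) -> None:
        # Step 2: the Python side delegates, failing clearly if the
        # backend has not exposed the feature yet.
        if not hasattr(self._backend, "set_rocchio"):
            raise NotImplementedError("backend does not expose Rocchio feedback")
        self._backend.set_rocchio(True)

    def is_using_rocchio(self) -> bool:
        return getattr(self._backend, "rocchio_enabled", False)


searcher = ImpactSearcherWrapper(StandInJavaSearcher())
searcher.set_rocchio()
print(searcher.is_using_rocchio())  # True
```

The point of the sketch is only the ordering: the wrapper method cannot work until the backend method exists.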

@AileenLin
Member

got it

@lintool
Member Author

lintool commented Aug 11, 2023

Ref: #2164 #2165

lintool pushed a commit to castorini/pyserini that referenced this issue Aug 21, 2023
castorini/anserini#2122
Add ability to parse raw text into docvectors on-the-fly for impact indexes

castorini/anserini#2165
Misalignment in SearchCollection and SimpleImpactSearcher implementation - so some changes in 2cr
@lintool
Member Author

lintool commented Aug 30, 2023

This has been pushed out in v0.22.0. All done!

@lintool lintool closed this as completed Aug 30, 2023