Anserini: #2122 #1571

AileenLin · 2023-07-18T22:13:05Z

No description provided.

lintool · 2023-07-19T11:19:43Z

pyserini/encode/_splade.py

@@ -33,3 +36,13 @@ def _output_to_weight_dicts(self, batch_aggregated_logits):
            d = {self.reverse_voc[k]: float(v) for k, v in zip(list(col), list(weights))}
            to_return.append(d)
        return to_return
+
+    def _get_encoded_query_token_wight_dicts(self, tok_weights):


Hi @AileenLin - if we implement the quantization and API changes we discussed on the Java end, we wouldn't need this on the Python end, right?
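For context, the quantization step under discussion turns floating-point token weights into integer impacts. A minimal sketch, assuming a simple linear scheme: the function name `quantize_weights` and the `weight_range` / `quant_range` parameters and their defaults are hypothetical and are not taken from this PR or from the Anserini code.

```python
def quantize_weights(tok_weights, weight_range=5.0, quant_range=256):
    """Map float token weights to integer impacts (illustrative only).

    weight_range and quant_range are assumed values for this sketch,
    not parameters confirmed by the PR.
    """
    quantized = {}
    for token, weight in tok_weights.items():
        # Linear scaling followed by rounding to the nearest integer.
        impact = round(weight / weight_range * quant_range)
        # Tokens whose impact rounds to zero contribute nothing, so drop them.
        if impact > 0:
            quantized[token] = impact
    return quantized


example = quantize_weights({"lucene": 2.5, "index": 0.004})
```

If this step runs inside the Java query encoder, the Python side only needs to pass the raw query text through, which is the simplification the comment above is asking about.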

lintool · 2023-08-08T20:12:32Z

Hi @AileenLin - with the Anserini changes, something like this should work, right?

$ python -m pyserini.search.lucene \
>   --threads 16 --batch-size 128 \
>   --index msmarco-v1-passage-unicoil \
>   --topics dl19-passage-unicoil \
>   --output run.msmarco-v1-passage.unicoil.dl19.txt \
>   --hits 1000 --impact --rocchio

I'm getting:

Initializing msmarco-v1-passage-unicoil...
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tuna1/scratch/jimmylin/pyserini/pyserini/search/lucene/__main__.py", line 217, in <module>
    searcher.set_rocchio()
  File "/tuna1/scratch/jimmylin/pyserini/pyserini/search/lucene/_impact_searcher.py", line 318, in set_rocchio
    elif self.prebuilt_index_name in ['msmarco-v1-passage', 'msmarco-v1-doc', 'msmarco-v1-doc-segmented']:
AttributeError: 'LuceneImpactSearcher' object has no attribute 'prebuilt_index_name'

Something I'm missing?
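One plausible reading of the traceback is that `prebuilt_index_name` is only assigned on some construction paths of the impact searcher, so `set_rocchio` fails when it reads the attribute. A minimal sketch of that failure mode and a defensive fix follows. `ImpactSearcherSketch` and `is_listed_index` are hypothetical stand-ins, not Pyserini's actual class or methods; only the attribute name and the list of index names come from the traceback.

```python
class ImpactSearcherSketch:
    """Hypothetical stand-in for a searcher class; not Pyserini's real API."""

    # Index names copied from the line shown in the traceback.
    LISTED = ['msmarco-v1-passage', 'msmarco-v1-doc', 'msmarco-v1-doc-segmented']

    def __init__(self, index_dir, prebuilt_index_name=None):
        self.index_dir = index_dir
        # Always define the attribute, even for local indexes, so later
        # membership checks cannot raise AttributeError.
        self.prebuilt_index_name = prebuilt_index_name

    @classmethod
    def from_prebuilt_index(cls, name):
        # A real implementation would download and resolve the index;
        # the path below is a placeholder for illustration.
        return cls(index_dir=f"/tmp/indexes/{name}", prebuilt_index_name=name)

    def is_listed_index(self):
        # Safe for every instance because __init__ always sets the attribute.
        return self.prebuilt_index_name in self.LISTED


prebuilt = ImpactSearcherSketch.from_prebuilt_index('msmarco-v1-passage')
local = ImpactSearcherSketch('/path/to/local/index')
```

Initializing the attribute to `None` in the constructor keeps the check valid for both prebuilt and local indexes.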

lintool · 2023-08-12T00:09:24Z

Results so far:

python -m pyserini.search.lucene \
  --threads 16 --batch-size 128 \
  --index msmarco-v1-passage-splade-pp-ed \
  --topics dl19-passage \
  --onnx-encoder SpladePlusPlusEnsembleDistil \
  --output runs/run.msmarco-v1-passage.splade-pp-ed-onnx.dl19.txt \
  --hits 1000 --impact

python -m pyserini.eval.trec_eval -c -l 2 -m map dl19-passage runs/run.msmarco-v1-passage.splade-pp-ed-onnx.dl19.txt
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 dl19-passage runs/run.msmarco-v1-passage.splade-pp-ed-onnx.dl19.txt
python -m pyserini.eval.trec_eval -c -l 2 -m recall.1000 dl19-passage runs/run.msmarco-v1-passage.splade-pp-ed-onnx.dl19.txt

map                   	all	0.5050
ndcg_cut_10           	all	0.7308
recall_1000           	all	0.8728

# Note: these scores differ from this row:
#   SPLADE++ EnsembleDistil: query inference with ONNX	0.5054	0.7320	0.8724
# but match the results here: https://github.com/castorini/anserini/blob/master/docs/regressions-dl19-passage-splade-pp-ed-onnx.md

python -m pyserini.search.lucene \
  --threads 16 --batch-size 128 \
  --index msmarco-v1-passage-splade-pp-ed-docvectors \
  --topics dl19-passage \
  --onnx-encoder SpladePlusPlusEnsembleDistil \
  --output runs/run.msmarco-v1-passage.splade-pp-ed-onnx.rocchio1.dl19.txt \
  --hits 1000 --impact --rocchio

python -m pyserini.eval.trec_eval -c -l 2 -m map dl19-passage runs/run.msmarco-v1-passage.splade-pp-ed-onnx.rocchio1.dl19.txt
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 dl19-passage runs/run.msmarco-v1-passage.splade-pp-ed-onnx.rocchio1.dl19.txt
python -m pyserini.eval.trec_eval -c -l 2 -m recall.1000 dl19-passage runs/run.msmarco-v1-passage.splade-pp-ed-onnx.rocchio1.dl19.txt

# matches https://github.com/castorini/anserini/blob/master/docs/regressions-dl19-passage-splade-pp-ed-onnx.md
map                   	all	0.5140
ndcg_cut_10           	all	0.7119
recall_1000           	all	0.8799

python -m pyserini.search.lucene \
  --threads 16 --batch-size 128 \
  --index msmarco-v1-passage-splade-pp-ed-text \
  --topics dl19-passage \
  --onnx-encoder SpladePlusPlusEnsembleDistil \
  --output runs/run.msmarco-v1-passage.splade-pp-ed-onnx.rocchio2.dl19.txt \
  --hits 1000 --impact --rocchio

python -m pyserini.eval.trec_eval -c -l 2 -m map dl19-passage runs/run.msmarco-v1-passage.splade-pp-ed-onnx.rocchio2.dl19.txt
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 dl19-passage runs/run.msmarco-v1-passage.splade-pp-ed-onnx.rocchio2.dl19.txt
python -m pyserini.eval.trec_eval -c -l 2 -m recall.1000 dl19-passage runs/run.msmarco-v1-passage.splade-pp-ed-onnx.rocchio2.dl19.txt

# matches https://github.com/castorini/anserini/blob/master/docs/regressions-dl19-passage-splade-pp-ed-onnx.md
map                   	all	0.5140
ndcg_cut_10           	all	0.7119
recall_1000           	all	0.8799
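Comparisons like the ones above can be automated. A minimal sketch, assuming the three-column `metric all value` layout shown in the output: `parse_trec_eval` and `differing_metrics` are hypothetical helper names, not part of Pyserini, and the tolerance is an arbitrary choice for four-decimal scores.

```python
def parse_trec_eval(output):
    """Parse 'metric  all  value' lines into a {metric: float} dict."""
    scores = {}
    for line in output.splitlines():
        parts = line.split()
        # Keep only summary rows of the form: <metric> all <value>
        if len(parts) == 3 and parts[1] == "all":
            scores[parts[0]] = float(parts[2])
    return scores


def differing_metrics(actual, expected, tol=5e-5):
    """Return the metrics that are missing or differ by more than tol."""
    return sorted(
        m for m in expected
        if m not in actual or abs(actual[m] - expected[m]) > tol
    )


# Scores from the first (non-Rocchio) run in this comment.
observed = parse_trec_eval(
    "map                   \tall\t0.5050\n"
    "ndcg_cut_10           \tall\t0.7308\n"
    "recall_1000           \tall\t0.8728\n"
)
# Reference row quoted in the note for that run.
reference = {"map": 0.5054, "ndcg_cut_10": 0.7320, "recall_1000": 0.8724}
mismatches = differing_metrics(observed, reference)
```

With these inputs all three metrics are flagged, which is the discrepancy the note on the first run describes.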

Anserini: Add ability to parse raw text into docvectors on-the-fly for impact indexes #2122

0b5bed7

AileenLin requested a review from lintool July 18, 2023 22:13

lintool reviewed Jul 19, 2023

AileenLin added 3 commits July 27, 2023 16:22

simplify impact search for onnx query encoder.

824d4fc

refactor java side function name

2489616

refactor function name

71b9393

AileenLin added 3 commits August 9, 2023 11:04

fix prebuild index in impact_searcher

7d01777

fix impact result error, add support of indexes with raw texts

9c410b2

fix broken test

eaf398c

fix type issue and add quantization in slim encoder

9b097a5

lintool approved these changes Aug 21, 2023

lintool merged commit b713a51 into castorini:master Aug 21, 2023

lintool mentioned this pull request Aug 21, 2023

Update regressions due to fixes introduced in Pyserini #1571 #1596

Merged

lintool added a commit that referenced this pull request Aug 21, 2023

Update regressions due to fixes introduced in Pyserini #1571 (#1596)

c071c7a
