Anserini: #2122 #1571

Merged
merged 8 commits into from
Aug 21, 2023


AileenLin
Member

No description provided.

@AileenLin AileenLin requested a review from lintool July 18, 2023 22:13
@@ -33,3 +36,13 @@ def _output_to_weight_dicts(self, batch_aggregated_logits):
d = {self.reverse_voc[k]: float(v) for k, v in zip(list(col), list(weights))}
to_return.append(d)
return to_return

def _get_encoded_query_token_wight_dicts(self, tok_weights):

Hi @AileenLin - if we implement the quantization and API changes we discussed on the Java end, we wouldn't need this on the Python end, right?
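
For context on the quantization mentioned above, here is a minimal Python sketch of turning float token weights into integer impact scores. The function name `quantize_weights` and the default `weight_range` / `quant_range` values are assumptions made for illustration; the thread does not show the actual implementation on either the Java or the Python end.

```python
from typing import Dict


def quantize_weights(tok_weights: Dict[str, float],
                     weight_range: float = 5.0,   # assumed upper bound on raw weights
                     quant_range: int = 256       # assumed number of integer levels
                     ) -> Dict[str, int]:
    """Map float token weights to integer impacts; drop tokens that round to 0."""
    quantized = {}
    for token, weight in tok_weights.items():
        # Scale the weight into [0, quant_range] and round to the nearest integer.
        impact = int(round(weight / weight_range * quant_range))
        if impact > 0:
            quantized[token] = impact
    return quantized


# Hypothetical token weights, not taken from any real encoder output.
example = quantize_weights({"lucene": 1.25, "search": 0.004, "index": 2.5})
```

If this step lives in one place (for example, on the Java end as suggested above), the other side can pass raw weights through without a helper of its own.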

@lintool
Member

lintool commented Aug 8, 2023

Hi @AileenLin - with the Anserini changes, something like this should work, right?

$ python -m pyserini.search.lucene \
>   --threads 16 --batch-size 128 \
>   --index msmarco-v1-passage-unicoil \
>   --topics dl19-passage-unicoil \
>   --output run.msmarco-v1-passage.unicoil.dl19.txt \
>   --hits 1000 --impact --rocchio

I'm getting:

Initializing msmarco-v1-passage-unicoil...
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tuna1/scratch/jimmylin/pyserini/pyserini/search/lucene/__main__.py", line 217, in <module>
    searcher.set_rocchio()
  File "/tuna1/scratch/jimmylin/pyserini/pyserini/search/lucene/_impact_searcher.py", line 318, in set_rocchio
    elif self.prebuilt_index_name in ['msmarco-v1-passage', 'msmarco-v1-doc', 'msmarco-v1-doc-segmented']:
AttributeError: 'LuceneImpactSearcher' object has no attribute 'prebuilt_index_name'

Something I'm missing?
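
The `AttributeError` above indicates that `prebuilt_index_name` was never assigned on this searcher instance before `set_rocchio()` read it. Below is a minimal sketch of one way to avoid this class of error, using a hypothetical `SearcherSketch` class. It is not Pyserini's actual `LuceneImpactSearcher`, and it is not necessarily the fix adopted in this PR.

```python
from typing import Optional

# Index names copied from the condition shown in the traceback above.
NAMES_CHECKED_IN_SET_ROCCHIO = [
    'msmarco-v1-passage', 'msmarco-v1-doc', 'msmarco-v1-doc-segmented'
]


class SearcherSketch:
    """Hypothetical stand-in for a searcher class (illustration only)."""

    def __init__(self, index_dir: str, prebuilt_index_name: Optional[str] = None):
        self.index_dir = index_dir
        # Always assign the attribute, even when no prebuilt index is used,
        # so that later membership checks cannot raise AttributeError.
        self.prebuilt_index_name = prebuilt_index_name

    def is_listed_prebuilt_index(self) -> bool:
        # None is simply "not in the list", so this is safe for local indexes.
        return self.prebuilt_index_name in NAMES_CHECKED_IN_SET_ROCCHIO


# Hypothetical paths, used only to exercise both cases.
local = SearcherSketch('indexes/my-local-index')
prebuilt = SearcherSketch('indexes/cache-dir', prebuilt_index_name='msmarco-v1-passage')
```

Initializing the attribute in the constructor keeps every code path that builds a searcher consistent, whether or not it goes through a prebuilt-index factory.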

@lintool
Member

lintool commented Aug 12, 2023

Results so far:

python -m pyserini.search.lucene \
  --threads 16 --batch-size 128 \
  --index msmarco-v1-passage-splade-pp-ed \
  --topics dl19-passage \
  --onnx-encoder SpladePlusPlusEnsembleDistil \
  --output runs/run.msmarco-v1-passage.splade-pp-ed-onnx.dl19.txt \
  --hits 1000 --impact

python -m pyserini.eval.trec_eval -c -l 2 -m map dl19-passage runs/run.msmarco-v1-passage.splade-pp-ed-onnx.dl19.txt
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 dl19-passage runs/run.msmarco-v1-passage.splade-pp-ed-onnx.dl19.txt
python -m pyserini.eval.trec_eval -c -l 2 -m recall.1000 dl19-passage runs/run.msmarco-v1-passage.splade-pp-ed-onnx.dl19.txt

map                   	all	0.5050
ndcg_cut_10           	all	0.7308
recall_1000           	all	0.8728

# Note this is different from
# SPLADE++ EnsembleDistil: query inference with ONNX	0.5054	0.7320	0.8724	
# but the same as here: https://github.com/castorini/anserini/blob/master/docs/regressions-dl19-passage-splade-pp-ed-onnx.md

python -m pyserini.search.lucene \
  --threads 16 --batch-size 128 \
  --index msmarco-v1-passage-splade-pp-ed-docvectors \
  --topics dl19-passage \
  --onnx-encoder SpladePlusPlusEnsembleDistil \
  --output runs/run.msmarco-v1-passage.splade-pp-ed-onnx.rocchio1.dl19.txt \
  --hits 1000 --impact --rocchio

python -m pyserini.eval.trec_eval -c -l 2 -m map dl19-passage runs/run.msmarco-v1-passage.splade-pp-ed-onnx.rocchio1.dl19.txt
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 dl19-passage runs/run.msmarco-v1-passage.splade-pp-ed-onnx.rocchio1.dl19.txt
python -m pyserini.eval.trec_eval -c -l 2 -m recall.1000 dl19-passage runs/run.msmarco-v1-passage.splade-pp-ed-onnx.rocchio1.dl19.txt

# matches https://github.com/castorini/anserini/blob/master/docs/regressions-dl19-passage-splade-pp-ed-onnx.md
map                   	all	0.5140
ndcg_cut_10           	all	0.7119
recall_1000           	all	0.8799

python -m pyserini.search.lucene \
  --threads 16 --batch-size 128 \
  --index msmarco-v1-passage-splade-pp-ed-text \
  --topics dl19-passage \
  --onnx-encoder SpladePlusPlusEnsembleDistil \
  --output runs/run.msmarco-v1-passage.splade-pp-ed-onnx.rocchio2.dl19.txt \
  --hits 1000 --impact --rocchio

python -m pyserini.eval.trec_eval -c -l 2 -m map dl19-passage runs/run.msmarco-v1-passage.splade-pp-ed-onnx.rocchio2.dl19.txt
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 dl19-passage runs/run.msmarco-v1-passage.splade-pp-ed-onnx.rocchio2.dl19.txt
python -m pyserini.eval.trec_eval -c -l 2 -m recall.1000 dl19-passage runs/run.msmarco-v1-passage.splade-pp-ed-onnx.rocchio2.dl19.txt

# matches https://github.com/castorini/anserini/blob/master/docs/regressions-dl19-passage-splade-pp-ed-onnx.md
map                   	all	0.5140
ndcg_cut_10           	all	0.7119
recall_1000           	all	0.8799
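
To make the small discrepancy noted for the first run easier to see, here is a rough Python sketch that parses `trec_eval`-style output lines and subtracts the reference row quoted above. The helper name `parse_trec_eval` is hypothetical, and the sketch assumes the three reference numbers are ordered as MAP, nDCG@10, and recall@1000, which the thread does not state explicitly.

```python
from typing import Dict


def parse_trec_eval(output: str) -> Dict[str, float]:
    """Parse lines of the form '<metric> all <value>' into a dict."""
    scores = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] == "all":
            try:
                scores[parts[0]] = float(parts[2])
            except ValueError:
                continue  # skip lines whose last field is not a number
    return scores


# Scores from the first run reported above.
observed = parse_trec_eval(
    "map                   \tall\t0.5050\n"
    "ndcg_cut_10           \tall\t0.7308\n"
    "recall_1000           \tall\t0.8728\n"
)

# Reference row quoted in the note above (metric order is an assumption).
reference = {"map": 0.5054, "ndcg_cut_10": 0.7320, "recall_1000": 0.8724}

# Signed difference per metric, rounded to the four digits trec_eval prints.
diffs = {metric: round(observed[metric] - reference[metric], 4) for metric in reference}
```

Under that ordering assumption, every difference is within about 0.001 in absolute value.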
