Yuqingxie/add beir word piece (#1880)
* add beir word piece tests and documents
Co-authored-by: Jimmy Lin <jimmylin@uwaterloo.ca>
amyxie361 authored May 21, 2022
1 parent 457978d commit d457c88
Showing 75 changed files with 4,250 additions and 0 deletions.
69 changes: 69 additions & 0 deletions docs/regressions-beir-v1.0.0-arguana-wp.md
# Anserini Regressions: BEIR (v1.0.0) &mdash; ArguAna

This page documents BM25 regression experiments for [BEIR (v1.0.0) &mdash; ArguAna](http://beir.ai/).
These experiments index the corpus in a "flat" manner by concatenating the "title" and "text" fields into the "contents" field.
All documents and queries are pre-tokenized with the `bert-base-uncased` tokenizer.
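
For illustration, word-piece pre-tokenization of a document might look like the sketch below, using the Hugging Face `transformers` library. This is an assumed workflow, not necessarily the exact preprocessing script used by the regression pipeline, and the sample document is hypothetical:

```
# Illustrative sketch only: pre-tokenize "title" + "text" into word
# pieces with bert-base-uncased (assumed workflow, not necessarily the
# actual Anserini preprocessing script).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

doc = {"title": "Example title", "text": "Example body text."}  # hypothetical
contents = f"{doc['title']} {doc['text']}"  # "flat" concatenation

# tokenize() yields word pieces, e.g. ['example', 'title', 'example', ...]
word_pieces = tokenizer.tokenize(contents)

# Joining on spaces lets the -pretokenized option (which skips further
# analysis and splits on whitespace) recover the exact tokens later.
print(" ".join(word_pieces))
```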

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/beir-v1.0.0-arguana-wp.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/beir-v1.0.0-arguana-wp.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression beir-v1.0.0-arguana-wp
```

## Indexing

Typical indexing command:

```
target/appassembler/bin/IndexCollection \
-collection BeirFlatCollection \
-input /path/to/beir-v1.0.0-arguana-wp \
-index indexes/lucene-index.beir-v1.0.0-arguana-wp/ \
-generator DefaultLuceneDocumentGenerator \
-threads 1 -storePositions -storeDocvectors -storeRaw -pretokenized \
>& logs/log.beir-v1.0.0-arguana-wp &
```

For additional details, see the explanation of [common indexing options](common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.beir-v1.0.0-arguana-wp/ \
-topics src/main/resources/topics-and-qrels/topics.beir-v1.0.0-arguana.test.wp.tsv.gz \
-topicreader TsvString \
-output runs/run.beir-v1.0.0-arguana-wp.bm25.topics.beir-v1.0.0-arguana.test.wp.txt \
-bm25 -removeQuery -pretokenized &
```
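
Because the queries are pre-tokenized as well, the topics file is a gzipped TSV pairing each query id with its space-separated word pieces, as the `TsvString` topic reader suggests. A minimal sketch for inspecting a few rows, assuming that two-column layout:

```
# Minimal sketch, assuming a two-column layout:
# query id <TAB> space-separated word pieces.
import gzip

path = "src/main/resources/topics-and-qrels/topics.beir-v1.0.0-arguana.test.wp.tsv.gz"
with gzip.open(path, "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        qid, query = line.rstrip("\n").split("\t", 1)
        print(qid, query.split()[:8])  # first few word pieces
        if i == 2:
            break
```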

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana-wp.bm25.topics.beir-v1.0.0-arguana.test.wp.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100 src/main/resources/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana-wp.bm25.topics.beir-v1.0.0-arguana.test.wp.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana-wp.bm25.topics.beir-v1.0.0-arguana.test.wp.txt
```
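
If you prefer to script the three invocations, a small wrapper along the following lines works; this sketch assumes `trec_eval`'s standard three-column output (metric, `all`, value):

```
# Sketch: run the three trec_eval calls above in a loop and parse
# each score from the "metric <TAB> all <TAB> value" output line.
import subprocess

trec_eval = "tools/eval/trec_eval.9.0.4/trec_eval"
qrels = "src/main/resources/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt"
run = "runs/run.beir-v1.0.0-arguana-wp.bm25.topics.beir-v1.0.0-arguana.test.wp.txt"

for metric in ["ndcg_cut.10", "recall.100", "recall.1000"]:
    out = subprocess.run([trec_eval, "-c", "-m", metric, qrels, run],
                         capture_output=True, text=True, check=True).stdout
    print(metric, float(out.split()[-1]))
```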

## Effectiveness

With the above commands, you should be able to reproduce the following results:

| nDCG@10 | BM25 |
|:-------------------------------------------------------------------------------------------------------------|-----------|
| BEIR (v1.0.0): ArguAna | 0.3639 |


| R@100 | BM25 |
|:-------------------------------------------------------------------------------------------------------------|-----------|
| BEIR (v1.0.0): ArguAna | 0.8791 |


| R@1000 | BM25 |
|:-------------------------------------------------------------------------------------------------------------|-----------|
| BEIR (v1.0.0): ArguAna | 0.9602 |
69 changes: 69 additions & 0 deletions docs/regressions-beir-v1.0.0-climate-fever-wp.md
# Anserini Regressions: BEIR (v1.0.0) &mdash; Climate-FEVER

This page documents BM25 regression experiments for [BEIR (v1.0.0) &mdash; Climate-FEVER](http://beir.ai/).
These experiments index the corpus in a "flat" manner by concatenating the "title" and "text" fields into the "contents" field.
All documents and queries are pre-tokenized with the `bert-base-uncased` tokenizer.

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/beir-v1.0.0-climate-fever-wp.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/beir-v1.0.0-climate-fever-wp.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression beir-v1.0.0-climate-fever-wp
```

## Indexing

Typical indexing command:

```
target/appassembler/bin/IndexCollection \
-collection BeirFlatCollection \
-input /path/to/beir-v1.0.0-climate-fever-wp \
-index indexes/lucene-index.beir-v1.0.0-climate-fever-wp/ \
-generator DefaultLuceneDocumentGenerator \
-threads 1 -storePositions -storeDocvectors -storeRaw -pretokenized \
>& logs/log.beir-v1.0.0-climate-fever-wp &
```

For additional details, see the explanation of [common indexing options](common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.beir-v1.0.0-climate-fever-wp/ \
-topics src/main/resources/topics-and-qrels/topics.beir-v1.0.0-climate-fever.test.wp.tsv.gz \
-topicreader TsvString \
-output runs/run.beir-v1.0.0-climate-fever-wp.bm25.topics.beir-v1.0.0-climate-fever.test.wp.txt \
-bm25 -removeQuery -pretokenized &
```

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.beir-v1.0.0-climate-fever.test.txt runs/run.beir-v1.0.0-climate-fever-wp.bm25.topics.beir-v1.0.0-climate-fever.test.wp.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100 src/main/resources/topics-and-qrels/qrels.beir-v1.0.0-climate-fever.test.txt runs/run.beir-v1.0.0-climate-fever-wp.bm25.topics.beir-v1.0.0-climate-fever.test.wp.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.beir-v1.0.0-climate-fever.test.txt runs/run.beir-v1.0.0-climate-fever-wp.bm25.topics.beir-v1.0.0-climate-fever.test.wp.txt
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

| nDCG@10 | BM25 |
|:-------------------------------------------------------------------------------------------------------------|-----------|
| BEIR (v1.0.0): Climate-FEVER | 0.1576 |


| R@100 | BM25 |
|:-------------------------------------------------------------------------------------------------------------|-----------|
| BEIR (v1.0.0): Climate-FEVER | 0.4077 |


| R@1000 | BM25 |
|:-------------------------------------------------------------------------------------------------------------|-----------|
| BEIR (v1.0.0): Climate-FEVER | 0.5984 |
69 changes: 69 additions & 0 deletions docs/regressions-beir-v1.0.0-cqadupstack-android-wp.md
# Anserini Regressions: BEIR (v1.0.0) &mdash; CQADupStack-android

This page documents BM25 regression experiments for [BEIR (v1.0.0) &mdash; CQADupStack-android](http://beir.ai/).
These experiments index the corpus in a "flat" manner by concatenating the "title" and "text" fields into the "contents" field.
All documents and queries are pre-tokenized with the `bert-base-uncased` tokenizer.

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/beir-v1.0.0-cqadupstack-android-wp.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/beir-v1.0.0-cqadupstack-android-wp.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression beir-v1.0.0-cqadupstack-android-wp
```

## Indexing

Typical indexing command:

```
target/appassembler/bin/IndexCollection \
-collection BeirFlatCollection \
-input /path/to/beir-v1.0.0-cqadupstack-android-wp \
-index indexes/lucene-index.beir-v1.0.0-cqadupstack-android-wp/ \
-generator DefaultLuceneDocumentGenerator \
-threads 1 -storePositions -storeDocvectors -storeRaw -pretokenized \
>& logs/log.beir-v1.0.0-cqadupstack-android-wp &
```

For additional details, see the explanation of [common indexing options](common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.beir-v1.0.0-cqadupstack-android-wp/ \
-topics src/main/resources/topics-and-qrels/topics.beir-v1.0.0-cqadupstack-android.test.wp.tsv.gz \
-topicreader TsvString \
-output runs/run.beir-v1.0.0-cqadupstack-android-wp.bm25.topics.beir-v1.0.0-cqadupstack-android.test.wp.txt \
-bm25 -removeQuery -pretokenized &
```

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.beir-v1.0.0-cqadupstack-android.test.txt runs/run.beir-v1.0.0-cqadupstack-android-wp.bm25.topics.beir-v1.0.0-cqadupstack-android.test.wp.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100 src/main/resources/topics-and-qrels/qrels.beir-v1.0.0-cqadupstack-android.test.txt runs/run.beir-v1.0.0-cqadupstack-android-wp.bm25.topics.beir-v1.0.0-cqadupstack-android.test.wp.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.beir-v1.0.0-cqadupstack-android.test.txt runs/run.beir-v1.0.0-cqadupstack-android-wp.bm25.topics.beir-v1.0.0-cqadupstack-android.test.wp.txt
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

| nDCG@10 | BM25 |
|:-------------------------------------------------------------------------------------------------------------|-----------|
| BEIR (v1.0.0): CQADupStack-android | 0.3694 |


| R@100 | BM25 |
|:-------------------------------------------------------------------------------------------------------------|-----------|
| BEIR (v1.0.0): CQADupStack-android | 0.6394 |


| R@1000 | BM25 |
|:-------------------------------------------------------------------------------------------------------------|-----------|
| BEIR (v1.0.0): CQADupStack-android | 0.8447 |
69 changes: 69 additions & 0 deletions docs/regressions-beir-v1.0.0-cqadupstack-english-wp.md
# Anserini Regressions: BEIR (v1.0.0) &mdash; CQADupStack-english

This page documents BM25 regression experiments for [BEIR (v1.0.0) &mdash; CQADupStack-english](http://beir.ai/).
These experiments index the corpus in a "flat" manner by concatenating the "title" and "text" fields into the "contents" field.
All documents and queries are pre-tokenized with the `bert-base-uncased` tokenizer.

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/beir-v1.0.0-cqadupstack-english-wp.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/beir-v1.0.0-cqadupstack-english-wp.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression beir-v1.0.0-cqadupstack-english-wp
```

## Indexing

Typical indexing command:

```
target/appassembler/bin/IndexCollection \
-collection BeirFlatCollection \
-input /path/to/beir-v1.0.0-cqadupstack-english-wp \
-index indexes/lucene-index.beir-v1.0.0-cqadupstack-english-wp/ \
-generator DefaultLuceneDocumentGenerator \
-threads 1 -storePositions -storeDocvectors -storeRaw -pretokenized \
>& logs/log.beir-v1.0.0-cqadupstack-english-wp &
```

For additional details, see the explanation of [common indexing options](common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.beir-v1.0.0-cqadupstack-english-wp/ \
-topics src/main/resources/topics-and-qrels/topics.beir-v1.0.0-cqadupstack-english.test.wp.tsv.gz \
-topicreader TsvString \
-output runs/run.beir-v1.0.0-cqadupstack-english-wp.bm25.topics.beir-v1.0.0-cqadupstack-english.test.wp.txt \
-bm25 -removeQuery -pretokenized &
```

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.beir-v1.0.0-cqadupstack-english.test.txt runs/run.beir-v1.0.0-cqadupstack-english-wp.bm25.topics.beir-v1.0.0-cqadupstack-english.test.wp.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100 src/main/resources/topics-and-qrels/qrels.beir-v1.0.0-cqadupstack-english.test.txt runs/run.beir-v1.0.0-cqadupstack-english-wp.bm25.topics.beir-v1.0.0-cqadupstack-english.test.wp.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.beir-v1.0.0-cqadupstack-english.test.txt runs/run.beir-v1.0.0-cqadupstack-english-wp.bm25.topics.beir-v1.0.0-cqadupstack-english.test.wp.txt
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

| nDCG@10 | BM25 |
|:-------------------------------------------------------------------------------------------------------------|-----------|
| BEIR (v1.0.0): CQADupStack-english | 0.3457 |


| R@100 | BM25 |
|:-------------------------------------------------------------------------------------------------------------|-----------|
| BEIR (v1.0.0): CQADupStack-english | 0.5544 |


| R@1000 | BM25 |
|:-------------------------------------------------------------------------------------------------------------|-----------|
| BEIR (v1.0.0): CQADupStack-english | 0.7243 |
69 changes: 69 additions & 0 deletions docs/regressions-beir-v1.0.0-cqadupstack-gaming-wp.md
# Anserini Regressions: BEIR (v1.0.0) &mdash; CQADupStack-gaming

This page documents BM25 regression experiments for [BEIR (v1.0.0) &mdash; CQADupStack-gaming](http://beir.ai/).
These experiments index the corpus in a "flat" manner by concatenating the "title" and "text" fields into the "contents" field.
All documents and queries are pre-tokenized with the `bert-base-uncased` tokenizer.

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/beir-v1.0.0-cqadupstack-gaming-wp.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/beir-v1.0.0-cqadupstack-gaming-wp.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression beir-v1.0.0-cqadupstack-gaming-wp
```

## Indexing

Typical indexing command:

```
target/appassembler/bin/IndexCollection \
-collection BeirFlatCollection \
-input /path/to/beir-v1.0.0-cqadupstack-gaming-wp \
-index indexes/lucene-index.beir-v1.0.0-cqadupstack-gaming-wp/ \
-generator DefaultLuceneDocumentGenerator \
-threads 1 -storePositions -storeDocvectors -storeRaw -pretokenized \
>& logs/log.beir-v1.0.0-cqadupstack-gaming-wp &
```

For additional details, see the explanation of [common indexing options](common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.beir-v1.0.0-cqadupstack-gaming-wp/ \
-topics src/main/resources/topics-and-qrels/topics.beir-v1.0.0-cqadupstack-gaming.test.wp.tsv.gz \
-topicreader TsvString \
-output runs/run.beir-v1.0.0-cqadupstack-gaming-wp.bm25.topics.beir-v1.0.0-cqadupstack-gaming.test.wp.txt \
-bm25 -removeQuery -pretokenized &
```

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -c -m ndcg_cut.10 src/main/resources/topics-and-qrels/qrels.beir-v1.0.0-cqadupstack-gaming.test.txt runs/run.beir-v1.0.0-cqadupstack-gaming-wp.bm25.topics.beir-v1.0.0-cqadupstack-gaming.test.wp.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100 src/main/resources/topics-and-qrels/qrels.beir-v1.0.0-cqadupstack-gaming.test.txt runs/run.beir-v1.0.0-cqadupstack-gaming-wp.bm25.topics.beir-v1.0.0-cqadupstack-gaming.test.wp.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.beir-v1.0.0-cqadupstack-gaming.test.txt runs/run.beir-v1.0.0-cqadupstack-gaming-wp.bm25.topics.beir-v1.0.0-cqadupstack-gaming.test.wp.txt
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

| nDCG@10 | BM25 |
|:-------------------------------------------------------------------------------------------------------------|-----------|
| BEIR (v1.0.0): CQADupStack-gaming | 0.4701 |


| R@100 | BM25 |
|:-------------------------------------------------------------------------------------------------------------|-----------|
| BEIR (v1.0.0): CQADupStack-gaming | 0.7438 |


| R@1000 | BM25 |
|:-------------------------------------------------------------------------------------------------------------|-----------|
| BEIR (v1.0.0): CQADupStack-gaming | 0.8810 |