Multilingual regression fixes (#902)
With this patch, I've verified that all new regressions work on tuna.
lintool authored Nov 27, 2019
1 parent dd47b91 commit b9264da
Showing 13 changed files with 116 additions and 61 deletions.
5 changes: 5 additions & 0 deletions README.md
@@ -74,6 +74,11 @@ Note that these regressions capture the "out of the box" experience, based on [_
+ [Regressions for the MS MARCO Passage Task with Doc2query expansion](docs/regressions-msmarco-passage-doc2query.md)
+ [Regressions for the MS MARCO Document Task](docs/regressions-msmarco-doc.md)
+ [Regressions for NTCIR-8 ACLIA (IR4QA subtask, Chinese monolingual)](docs/regressions-ntcir8-zh.md)
+ [Regressions for CLEF2006 Monolingual French](docs/regressions-clef06-fr.md)
+ [Regressions for TREC2002 Monolingual Arabic](docs/regressions-trec02-ar.md)
+ [Regressions for FIRE 2012 Monolingual Bengali](docs/regressions-fire12-bn.md)
+ [Regressions for FIRE 2012 Monolingual Hindi](docs/regressions-fire12-hi.md)
+ [Regressions for FIRE 2012 Monolingual English](docs/regressions-fire12-en.md)

Other experiments:

11 changes: 6 additions & 5 deletions docs/regressions-clef06-fr.md
@@ -11,10 +11,9 @@ Note that this page is automatically generated from [this template](../src/main/
Typical indexing command:

```
nohup sh target/appassembler/bin/IndexCollection -collection JsonCollection \
-generator LuceneDocumentGenerator -threads 16 -input /path/to/clef06-fr -index \
lucene-index.clef06-fr.pos+docvectors+rawdocs -storePositions -storeDocvectors \
-storeRawDocs -language fr >& log.clef06-fr.pos+docvectors+rawdocs &
nohup sh target/appassembler/bin/IndexCollection -collection JsonCollection -input /path/to/clef06-fr \
-index lucene-index.clef06-fr.pos+docvectors+rawdocs -generator LuceneDocumentGenerator -threads 16 \
-storePositions -storeDocvectors -storeRawDocs -language fr >& log.clef06-fr.pos+docvectors+rawdocs &
```

The directory `/path/to/clef06-fr/` should be a directory containing the collection (the format is jsonline format).
Expand All @@ -29,7 +28,9 @@ The regression experiments here evaluate on the 49 questions.
After indexing has completed, you should be able to perform retrieval as follows:

```
nohup target/appassembler/bin/SearchCollection -topicreader TsvString -index lucene-index.clef06-fr.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.clef06fr.mono.fr.txt -output run.clef06-fr.bm25.topics.clef06fr.mono.fr.txt -language fr -bm25 &
nohup target/appassembler/bin/SearchCollection -index lucene-index.clef06-fr.pos+docvectors+rawdocs \
-topicreader TsvString -topics src/main/resources/topics-and-qrels/topics.clef06fr.mono.fr.txt \
-language fr -bm25 -output run.clef06-fr.bm25.topics.clef06fr.mono.fr.txt &
```

26 changes: 16 additions & 10 deletions docs/regressions-fire12-bn.md
@@ -3,18 +3,17 @@
This page documents regression experiments for [FIRE 2012 Ad-hoc retrieval (Monolingual Bengali topic)](http://isical.ac.in/~fire/2012/adhoc.html).
The document collection can be found in [FIRE 2012 data page](http://fire.irsi.res.in/fire/static/data).

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/fire-bn.yaml).
The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/fire12-bn.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/fire12-bn.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

## Indexing

Typical indexing command:

```
nohup sh target/appassembler/bin/IndexCollection -collection TrecCollection \
-generator LuceneDocumentGenerator -threads 16 -input /path/to/fire12-bn -index \
lucene-index.fire12-hi.pos+docvectors+rawdocs -storePositions -storeDocvectors \
-storeRawDocs -language bn >& log.fire12-bn.pos+docvectors+rawdocs &
nohup sh target/appassembler/bin/IndexCollection -collection TrecCollection -input /path/to/fire12-bn \
-index lucene-index.fire12-bn.pos+docvectors+rawdocs -generator LuceneDocumentGenerator -threads 16 \
-storePositions -storeDocvectors -storeRawDocs -language bn >& log.fire12-bn.pos+docvectors+rawdocs &
```

The directory `/path/to/fire12-bn/` should be a directory containing the collection, containing `bn_ABP` and `bn_BDNews24` directories.
Expand All @@ -29,14 +28,16 @@ The regression experiments here evaluate on the 50 questions.
After indexing has completed, you should be able to perform retrieval as follows:

```
nohup target/appassembler/bin/SearchCollection -topicreader TsvString -index lucene-index.fire12-bn.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.fire12bn.176-225.txt -output run.fire12-bn.bm25.topics.fire12bn.176-225.txt -language bn -bm25 &
nohup target/appassembler/bin/SearchCollection -index lucene-index.fire12-bn.pos+docvectors+rawdocs \
-topicreader TsvString -topics src/main/resources/topics-and-qrels/topics.fire12bn.176-225.txt \
-language bn -bm25 -output run.fire12-bn.bm25.topics.fire12bn.176-225.txt &
```

Evaluation can be performed using `trec_eval`:

```
eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.fire12bn.176-225.txt run.fire12-bn.bm25.topics.fire12bn.176-225.txt
eval/trec_eval.9.0.4/trec_eval -m map -m P.20 -m ndcg_cut.20 src/main/resources/topics-and-qrels/qrels.fire12bn.176-225.txt run.fire12-bn.bm25.topics.fire12bn.176-225.txt
```

Expand All @@ -46,11 +47,16 @@ With the above commands, you should be able to replicate the following results:

MAP | BM25 |
:---------------------------------------|-----------|
[FIRE2012 (Bengali monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.2881 |
[FIRE 2012 (Bengali monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.2881 |


P30 | BM25 |
P20 | BM25 |
:---------------------------------------|-----------|
[FIRE2012 (Bengali monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.3360 |
[FIRE 2012 (Bengali monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.3740 |


NDCG20 | BM25 |
:---------------------------------------|-----------|
[FIRE 2012 (Bengali monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.4261 |


26 changes: 16 additions & 10 deletions docs/regressions-fire12-en.md
@@ -3,18 +3,17 @@
This page documents regression experiments for [FIRE 2012 Ad-hoc retrieval (Monolingual English topic)](http://isical.ac.in/~fire/2012/adhoc.html).
The document collection can be found in [FIRE 2012 data page](http://fire.irsi.res.in/fire/static/data).

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/fire-en.yaml).
The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/fire12-en.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/fire12-en.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

## Indexing

Typical indexing command:

```
nohup sh target/appassembler/bin/IndexCollection -collection TrecCollection \
-generator LuceneDocumentGenerator -threads 16 -input /path/to/fire12-en -index \
lucene-index.fire12-en.pos+docvectors+rawdocs -storePositions -storeDocvectors \
-storeRawDocs -language en >& log.fire12-en.pos+docvectors+rawdocs &
nohup sh target/appassembler/bin/IndexCollection -collection TrecCollection -input /path/to/fire12-en \
-index lucene-index.fire12-en.pos+docvectors+rawdocs -generator LuceneDocumentGenerator -threads 16 \
-storePositions -storeDocvectors -storeRawDocs -language en >& log.fire12-en.pos+docvectors+rawdocs &
```

The directory `/path/to/fire12-en/` should be a directory containing the collection, containing `en_BDNews24` and `en_TheTelegraph_2001-2010` directories.
Expand All @@ -29,14 +28,16 @@ The regression experiments here evaluate on the 50 questions.
After indexing has completed, you should be able to perform retrieval as follows:

```
nohup target/appassembler/bin/SearchCollection -topicreader TsvString -index lucene-index.fire12-en.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.fire12en.176-225.txt -output run.fire12-en.bm25.topics.fire12en.176-225.txt -language en -bm25 &
nohup target/appassembler/bin/SearchCollection -index lucene-index.fire12-en.pos+docvectors+rawdocs \
-topicreader TsvString -topics src/main/resources/topics-and-qrels/topics.fire12en.176-225.txt \
-language en -bm25 -output run.fire12-en.bm25.topics.fire12en.176-225.txt &
```

Evaluation can be performed using `trec_eval`:

```
eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.fire12en.176-225.txt run.fire12-en.bm25.topics.fire12en.176-225.txt
eval/trec_eval.9.0.4/trec_eval -m map -m P.20 -m ndcg_cut.20 src/main/resources/topics-and-qrels/qrels.fire12en.176-225.txt run.fire12-en.bm25.topics.fire12en.176-225.txt
```

Expand All @@ -46,11 +47,16 @@ With the above commands, you should be able to replicate the following results:

MAP | BM25 |
:---------------------------------------|-----------|
[FIRE2012 (English monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.3867 |
[FIRE 2012 (English monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.3713 |


P30 | BM25 |
P20 | BM25 |
:---------------------------------------|-----------|
[FIRE2012 (English monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.3920 |
[FIRE 2012 (English monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.4970 |


NDCG20 | BM25 |
:---------------------------------------|-----------|
[FIRE 2012 (English monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.5420 |


26 changes: 16 additions & 10 deletions docs/regressions-fire12-hi.md
@@ -3,18 +3,17 @@
This page documents regression experiments for [FIRE 2012 Ad-hoc retrieval (Monolingual Hindi topic)](http://isical.ac.in/~fire/2012/adhoc.html).
The document collection can be found in [FIRE 2012 data page](http://fire.irsi.res.in/fire/static/data).

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/fire-hi.yaml).
The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/fire12-hi.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/fire12-hi.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

## Indexing

Typical indexing command:

```
nohup sh target/appassembler/bin/IndexCollection -collection TrecCollection \
-generator LuceneDocumentGenerator -threads 16 -input /path/to/fire12-hi -index \
lucene-index.fire12-hi.pos+docvectors+rawdocs -storePositions -storeDocvectors \
-storeRawDocs -language hi >& log.fire12-hi.pos+docvectors+rawdocs &
nohup sh target/appassembler/bin/IndexCollection -collection TrecCollection -input /path/to/fire12-hi \
-index lucene-index.fire12-hi.pos+docvectors+rawdocs -generator LuceneDocumentGenerator -threads 16 \
-storePositions -storeDocvectors -storeRawDocs -language hi >& log.fire12-hi.pos+docvectors+rawdocs &
```

The directory `/path/to/fire12-hi/` should be a directory containing the collection, containing `hi_AmarUjala` and `hi_NavbharatTimes` directories.
Expand All @@ -29,14 +28,16 @@ The regression experiments here evaluate on the 50 questions.
After indexing has completed, you should be able to perform retrieval as follows:

```
nohup target/appassembler/bin/SearchCollection -topicreader TsvString -index lucene-index.fire12-hi.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.fire12hi.176-225.txt -output run.fire12-hi.bm25.topics.fire12hi.176-225.txt -language hi -bm25 &
nohup target/appassembler/bin/SearchCollection -index lucene-index.fire12-hi.pos+docvectors+rawdocs \
-topicreader TsvString -topics src/main/resources/topics-and-qrels/topics.fire12hi.176-225.txt \
-language hi -bm25 -output run.fire12-hi.bm25.topics.fire12hi.176-225.txt &
```

Evaluation can be performed using `trec_eval`:

```
eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.fire12hi.176-225.txt run.fire12-hi.bm25.topics.fire12hi.176-225.txt
eval/trec_eval.9.0.4/trec_eval -m map -m P.20 -m ndcg_cut.20 src/main/resources/topics-and-qrels/qrels.fire12hi.176-225.txt run.fire12-hi.bm25.topics.fire12hi.176-225.txt
```

Expand All @@ -46,11 +47,16 @@ With the above commands, you should be able to replicate the following results:

MAP | BM25 |
:---------------------------------------|-----------|
[FIRE2012 (Hindi monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.3867 |
[FIRE 2012 (Hindi monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.3867 |


P30 | BM25 |
P20 | BM25 |
:---------------------------------------|-----------|
[FIRE2012 (Hindi monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.3920 |
[FIRE 2012 (Hindi monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.4470 |


NDCG20 | BM25 |
:---------------------------------------|-----------|
[FIRE 2012 (Hindi monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.5310 |


11 changes: 6 additions & 5 deletions docs/regressions-trec02-ar.md
@@ -11,10 +11,9 @@ Note that this page is automatically generated from [this template](../src/main/
Typical indexing command:

```
nohup sh target/appassembler/bin/IndexCollection -collection JsonCollection \
-generator LuceneDocumentGenerator -threads 16 -input /path/to/trec02-ar -index \
lucene-index.trec02-ar.pos+docvectors+rawdocs -storePositions -storeDocvectors \
-storeRawDocs -language ar >& log.trec02-ar.pos+docvectors+rawdocs &
nohup sh target/appassembler/bin/IndexCollection -collection JsonCollection -input /path/to/trec02-ar \
-index lucene-index.trec02-ar.pos+docvectors+rawdocs -generator LuceneDocumentGenerator -threads 16 \
-storePositions -storeDocvectors -storeRawDocs -language ar >& log.trec02-ar.pos+docvectors+rawdocs &
```

The directory `/path/to/trec02-ar/` should be a directory containing the collection, 2337 gzipped files from LDC2007T38.
Expand All @@ -29,7 +28,9 @@ The regression experiments here evaluate on the 50 questions.
After indexing has completed, you should be able to perform retrieval as follows:

```
nohup target/appassembler/bin/SearchCollection -topicreader TsvString -index lucene-index.trec02-ar.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.trec02ar.mono.ar.txt -output run.trec02-ar.bm25.topics.trec02ar.mono.ar.txt -language ar -bm25 &
nohup target/appassembler/bin/SearchCollection -index lucene-index.trec02-ar.pos+docvectors+rawdocs \
-topicreader TsvString -topics src/main/resources/topics-and-qrels/topics.trec02ar.mono.ar.txt \
-language ar -bm25 -output run.trec02-ar.bm25.topics.trec02ar.mono.ar.txt &
```

2 changes: 1 addition & 1 deletion src/main/resources/docgen/templates/fire12-bn.template
@@ -3,7 +3,7 @@
This page documents regression experiments for [FIRE 2012 Ad-hoc retrieval (Monolingual Bengali topic)](http://isical.ac.in/~fire/2012/adhoc.html).
The document collection can be found in [FIRE 2012 data page](http://fire.irsi.res.in/fire/static/data).

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/fire-bn.yaml).
The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/fire12-bn.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/fire12-bn.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

## Indexing
2 changes: 1 addition & 1 deletion src/main/resources/docgen/templates/fire12-en.template
@@ -3,7 +3,7 @@
This page documents regression experiments for [FIRE 2012 Ad-hoc retrieval (Monolingual English topic)](http://isical.ac.in/~fire/2012/adhoc.html).
The document collection can be found in [FIRE 2012 data page](http://fire.irsi.res.in/fire/static/data).

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/fire-en.yaml).
The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/fire12-en.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/fire12-en.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

## Indexing
2 changes: 1 addition & 1 deletion src/main/resources/docgen/templates/fire12-hi.template
@@ -3,7 +3,7 @@
This page documents regression experiments for [FIRE 2012 Ad-hoc retrieval (Monolingual Hindi topic)](http://isical.ac.in/~fire/2012/adhoc.html).
The document collection can be found in [FIRE 2012 data page](http://fire.irsi.res.in/fire/static/data).

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/fire-hi.yaml).
The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/fire12-hi.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/fire12-hi.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

## Indexing
22 changes: 16 additions & 6 deletions src/main/resources/regression/fire12-bn.yaml
@@ -1,5 +1,5 @@
---
name: fire12-hi
name: fire12-bn
index_command: target/appassembler/bin/IndexCollection
index_utils_command: target/appassembler/bin/IndexUtils
search_command: target/appassembler/bin/SearchCollection
@@ -29,10 +29,18 @@ evals:
can_combine: true
- command: eval/trec_eval.9.0.4/trec_eval
params:
- -m P.30
- -m P.20
separator: "\t"
parse_index: 2
metric: p30
metric: p20
metric_precision: 4
can_combine: true
- command: eval/trec_eval.9.0.4/trec_eval
params:
- -m ndcg_cut.20
separator: "\t"
parse_index: 2
metric: ndcg20
metric_precision: 4
can_combine: true
input_roots:
Expand All @@ -46,7 +54,7 @@ index_stats:
documents (non-empty): 500122
total terms: 143972612
topics:
- name: "[FIRE2012 (Bengali monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)"
- name: "[FIRE 2012 (Bengali monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)"
path: topics.fire12bn.176-225.txt
qrel: qrels.fire12bn.176-225.txt
models:
Expand All @@ -57,5 +65,7 @@ models:
results:
map:
- 0.2881
p30:
- 0.3360
p20:
- 0.3740
ndcg20:
- 0.4261
22 changes: 16 additions & 6 deletions src/main/resources/regression/fire12-en.yaml
@@ -29,10 +29,18 @@ evals:
can_combine: true
- command: eval/trec_eval.9.0.4/trec_eval
params:
- -m P.30
- -m P.20
separator: "\t"
parse_index: 2
metric: p30
metric: p20
metric_precision: 4
can_combine: true
- command: eval/trec_eval.9.0.4/trec_eval
params:
- -m ndcg_cut.20
separator: "\t"
parse_index: 2
metric: ndcg20
metric_precision: 4
can_combine: true
input_roots:
- /tuna1/ # on tuna
- /store/ # on orca
- /scratch2/ # on damiano
input: collections/fire/hindi/en.docs.2011
input: collections/fire/english/en.docs.2011
index_path: indexes/lucene-index.fire12-en.pos+docvectors+rawdocs
index_stats:
documents: 392577
documents (non-empty): 392577
total terms: 115311163
topics:
- name: "[FIRE2012 (English monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)"
- name: "[FIRE 2012 (English monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)"
path: topics.fire12en.176-225.txt
qrel: qrels.fire12en.176-225.txt
models:
Expand All @@ -57,5 +65,7 @@ models:
results:
map:
- 0.3713
p30:
- 0.4560
p20:
- 0.4970
ndcg20:
- 0.5420
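
The `evals` entries in the YAML files above pair each `trec_eval` invocation with `separator`, `parse_index`, and `metric_precision` fields. The following is a minimal sketch of how such fields could be used to pull a score out of `trec_eval` output and check it against the `results` block. It is not Anserini's regression script: the helper names `parse_metric` and `matches_expected` are hypothetical, and the sample line assumes `trec_eval`'s usual tab-separated "measure, query, value" layout.

```python
# Hypothetical helpers for illustration; not part of Anserini.

def parse_metric(trec_eval_output: str, separator: str = "\t",
                 parse_index: int = 2, metric_precision: int = 4) -> float:
    """Extract the score from the first line of trec_eval output."""
    line = trec_eval_output.strip().splitlines()[0]
    fields = line.split(separator)
    # With separator "\t" and parse_index 2, this picks the value field.
    return round(float(fields[parse_index]), metric_precision)


def matches_expected(observed: float, expected: float,
                     metric_precision: int = 4) -> bool:
    """Compare two scores after rounding both to the configured precision."""
    return round(observed, metric_precision) == round(expected, metric_precision)


# Assumed sample line; the value is the P20 figure recorded for fire12-en.
sample = "P_20                  \tall\t0.4970\n"
score = parse_metric(sample)
print(score, matches_expected(score, 0.4970))
```

Rounding both sides to `metric_precision` digits keeps the comparison aligned with the four-decimal values stored under `results`.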