+ Add analyzer for different languages.
+ Add documents and regression test for TREC2002 Arabic, CLEF2006 French, FIRE2012 English, Bengali and Hindi.
1 parent fb9ecf4, commit 4116188
Showing 41 changed files with 179,933 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
# Anserini: Regressions for [CLEF2006 Monolingual French](http://www.clef-initiative.eu/edition/clef2006) | ||
|
||
This page documents regression experiments for [CLEF2006 monolingual French topics](http://www.clef-initiative.eu/edition/clef2006). | ||
A description of the document collection can be found on the [CLEF corpus page](http://www.clef-initiative.eu/dataset/corpus). | ||
|
||
The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/clef06-fr.yaml). | ||
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/clef06-fr.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. | ||
|
||
## Indexing | ||
|
||
Typical indexing command: | ||
|
||
``` | ||
nohup sh target/appassembler/bin/IndexCollection -collection JsonCollection \ | ||
-generator LuceneDocumentGenerator -threads 16 -input /path/to/clef06-fr -index \ | ||
lucene-index.clef06-fr.pos+docvectors+rawdocs -storePositions -storeDocvectors \ | ||
-storeRawDocs -language fr >& log.clef06-fr.pos+docvectors+rawdocs & | ||
``` | ||
|
||
The directory `/path/to/clef06-fr/` should contain the collection in JSON Lines format (one JSON document per line). | ||
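
Before indexing, it can help to confirm that the files really are valid JSON Lines. The sketch below is illustrative only: it assumes each document is a JSON object with `id` and `contents` fields, which is a layout assumed here rather than stated on this page, and `check_jsonl` is a hypothetical helper, not part of Anserini.

```python
import json


def check_jsonl(lines, required=("id", "contents")):
    """Return (line_number, problem) pairs for a stream of JSON Lines text.

    The required field names are an assumption for illustration.
    """
    problems = []
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            doc = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((n, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(doc, dict):
            problems.append((n, "not a JSON object"))
            continue
        for field in required:
            if field not in doc:
                problems.append((n, f"missing field '{field}'"))
    return problems


# Hand-written sample lines, not taken from the real collection.
sample = [
    '{"id": "doc1", "contents": "Le texte du document."}',
    '{"id": "doc2"}',
    "not json",
]
for line_number, problem in check_jsonl(sample):
    print(line_number, problem)
```

To check a real file, pass an open file handle instead of `sample`, since iterating over a text file yields one line at a time.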
|
||
For additional details, see explanation of [common indexing options](common-indexing-options.md). | ||
|
||
## Retrieval | ||
|
||
Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/). | ||
The regression experiments here evaluate on 49 topics. | ||
|
||
After indexing has completed, you should be able to perform retrieval as follows: | ||
|
||
``` | ||
nohup target/appassembler/bin/SearchCollection -topicreader TsvString -index lucene-index.clef06-fr.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.clef06fr.mono.fr.txt -output run.clef06-fr.bm25.topics.clef06fr.mono.fr.txt -language fr -bm25 & | ||
``` | ||
|
||
Evaluation can be performed using `trec_eval`: | ||
|
||
``` | ||
eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.clef06fr.txt run.clef06-fr.bm25.topics.clef06fr.mono.fr.txt | ||
``` | ||
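
For readers unfamiliar with the two metrics requested above, the following sketch shows what `map` (mean average precision) and `P.30` (precision at rank 30) measure. It is a simplified illustration on hand-written toy data, not a replacement for `trec_eval`: it assumes the standard four-column qrels and six-column run formats, and it keeps input order for tied scores, whereas `trec_eval` applies its own tie-breaking rule.

```python
from collections import defaultdict


def read_qrels(lines):
    """Parse qrels lines of the form 'topic iteration docid relevance'."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue
        topic, _, docid, rel = parts
        if int(rel) > 0:
            relevant[topic].add(docid)
    return relevant


def read_run(lines):
    """Parse run lines of the form 'topic Q0 docid rank score tag'."""
    scored = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue
        topic, _, docid, _, score, _ = parts
        scored[topic].append((float(score), docid))
    # Rank by descending score; the stable sort keeps input order for ties.
    return {t: [d for _, d in sorted(s, key=lambda x: -x[0])]
            for t, s in scored.items()}


def average_precision(ranking, relevant):
    hits, total = 0, 0.0
    for rank, docid in enumerate(ranking, start=1):
        if docid in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def precision_at(ranking, relevant, k=30):
    return sum(1 for d in ranking[:k] if d in relevant) / k


def evaluate(qrels, run, k=30):
    """Average AP and P@k over topics that have at least one relevant doc."""
    topics = [t for t in run if qrels.get(t)]
    if not topics:
        return 0.0, 0.0
    ap = sum(average_precision(run[t], qrels[t]) for t in topics)
    pk = sum(precision_at(run[t], qrels[t], k) for t in topics)
    return ap / len(topics), pk / len(topics)


# Toy data for one topic (hypothetical document ids).
qrels = read_qrels(["1 0 d1 1", "1 0 d3 1", "1 0 d4 0"])
run = read_run(["1 Q0 d1 1 3.0 demo", "1 Q0 d2 2 2.0 demo", "1 Q0 d3 3 1.0 demo"])
mean_ap, p_at_2 = evaluate(qrels, run, k=2)
print(f"MAP={mean_ap:.4f} P@2={p_at_2:.4f}")  # MAP=0.8333 P@2=0.5000
```

In the toy example the relevant documents appear at ranks 1 and 3, so average precision is (1/1 + 2/3) / 2, and one of the top two results is relevant.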
|
||
## Effectiveness | ||
|
||
With the above commands, you should be able to replicate the following results: | ||
|
||
MAP | BM25 | | ||
:---------------------------------------|-----------| | ||
[CLEF2006 (French monolingual)](http://www.clef-initiative.eu/edition/clef2006)| 0.3111 | | ||
|
||
|
||
P30 | BM25 | | ||
:---------------------------------------|-----------| | ||
[CLEF2006 (French monolingual)](http://www.clef-initiative.eu/edition/clef2006)| 0.2735 | | ||
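
A regression check amounts to comparing the scores printed by `trec_eval` against the values in the tables above. The sketch below is one way to do that, under stated assumptions: it expects `trec_eval`'s usual three-column output (`metric`, `all`, `value`), assumes the metric names are printed as `map` and `P_30`, and uses a hypothetical tolerance. The sample output string is hand-written from the table values rather than captured from a real run.

```python
def parse_trec_eval(output):
    """Map metric name to float for lines shaped like 'metric all value'."""
    scores = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] == "all":
            try:
                scores[parts[0]] = float(parts[2])
            except ValueError:
                pass  # skip non-numeric summary lines
    return scores


def check_regression(scores, expected, tol=1e-4):
    """Return {metric: True/False}; a missing metric counts as a failure."""
    return {
        metric: abs(scores.get(metric, float("nan")) - value) <= tol
        for metric, value in expected.items()
    }


# Hand-written stand-in for real trec_eval output.
sample_output = "map                   \tall\t0.3111\nP_30                  \tall\t0.2735\n"
expected = {"map": 0.3111, "P_30": 0.2735}  # values from the tables above
print(check_regression(parse_trec_eval(sample_output), expected))
```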
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
# Anserini: Regressions for [FIRE 2012 Monolingual Bengali](http://isical.ac.in/~fire/2012/adhoc.html) | ||
|
||
This page documents regression experiments for [FIRE 2012 Ad-hoc retrieval (Monolingual Bengali topic)](http://isical.ac.in/~fire/2012/adhoc.html). | ||
The document collection can be found on the [FIRE 2012 data page](http://fire.irsi.res.in/fire/static/data). | ||
|
||
The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/fire-bn.yaml). | ||
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/fire12-bn.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. | ||
|
||
## Indexing | ||
|
||
Typical indexing command: | ||
|
||
``` | ||
nohup sh target/appassembler/bin/IndexCollection -collection TrecCollection \ | ||
-generator LuceneDocumentGenerator -threads 16 -input /path/to/fire12-bn -index \ | ||
lucene-index.fire12-bn.pos+docvectors+rawdocs -storePositions -storeDocvectors \ | ||
-storeRawDocs -language bn >& log.fire12-bn.pos+docvectors+rawdocs & | ||
``` | ||
|
||
The directory `/path/to/fire12-bn/` should contain the collection, that is, the `bn_ABP` and `bn_BDNews24` directories. | ||
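
A quick way to catch a wrong `-input` path before starting a long indexing job is to check that the expected subdirectories exist. This is a minimal sketch; `missing_subdirs` is a hypothetical helper, and the demonstration runs against a temporary directory rather than a real copy of the collection.

```python
import tempfile
from pathlib import Path


def missing_subdirs(root, expected):
    """Return the names in `expected` that are not directories under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


# Demonstration on a throwaway directory that has only one of the two subdirectories.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "bn_ABP").mkdir()
    print(missing_subdirs(tmp, ["bn_ABP", "bn_BDNews24"]))  # ['bn_BDNews24']
```

The same helper applies to the other FIRE 2012 collections by passing their directory names instead.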
|
||
For additional details, see explanation of [common indexing options](common-indexing-options.md). | ||
|
||
## Retrieval | ||
|
||
Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/). | ||
The regression experiments here evaluate on 50 topics. | ||
|
||
After indexing has completed, you should be able to perform retrieval as follows: | ||
|
||
``` | ||
nohup target/appassembler/bin/SearchCollection -topicreader TsvString -index lucene-index.fire12-bn.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.fire12bn.176-225.txt -output run.fire12-bn.bm25.topics.fire12bn.176-225.txt -language bn -bm25 & | ||
``` | ||
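
The `-topicreader TsvString` option suggests that the topics file is tab-separated. The sketch below shows how such a file could be inspected, assuming one topic per line in the form `id<TAB>query`; that layout is an assumption made here for illustration, and the sample queries are placeholders rather than real FIRE topics.

```python
def read_tsv_topics(lines):
    """Read 'id<TAB>query' lines into a dict (assumed layout)."""
    topics = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        topic_id, sep, query = line.partition("\t")
        if not sep:
            raise ValueError(f"no tab separator in line: {line!r}")
        topics[topic_id.strip()] = query.strip()
    return topics


# Placeholder topics, not taken from the real topics file.
sample_topics = ["176\tfirst hypothetical query\n", "177\tsecond hypothetical query\n"]
topics = read_tsv_topics(sample_topics)
print(len(topics), "topics:", sorted(topics))
```

Counting the entries this way is a simple check that the file contains as many topics as the page states.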
|
||
Evaluation can be performed using `trec_eval`: | ||
|
||
``` | ||
eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.fire12bn.176-225.txt run.fire12-bn.bm25.topics.fire12bn.176-225.txt | ||
``` | ||
|
||
## Effectiveness | ||
|
||
With the above commands, you should be able to replicate the following results: | ||
|
||
MAP | BM25 | | ||
:---------------------------------------|-----------| | ||
[FIRE2012 (Bengali monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.2881 | | ||
|
||
|
||
P30 | BM25 | | ||
:---------------------------------------|-----------| | ||
[FIRE2012 (Bengali monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.3360 | | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
# Anserini: Regressions for [FIRE 2012 Monolingual English](http://isical.ac.in/~fire/2012/adhoc.html) | ||
|
||
This page documents regression experiments for [FIRE 2012 Ad-hoc retrieval (Monolingual English topic)](http://isical.ac.in/~fire/2012/adhoc.html). | ||
The document collection can be found on the [FIRE 2012 data page](http://fire.irsi.res.in/fire/static/data). | ||
|
||
The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/fire-en.yaml). | ||
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/fire12-en.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. | ||
|
||
## Indexing | ||
|
||
Typical indexing command: | ||
|
||
``` | ||
nohup sh target/appassembler/bin/IndexCollection -collection TrecCollection \ | ||
-generator LuceneDocumentGenerator -threads 16 -input /path/to/fire12-en -index \ | ||
lucene-index.fire12-en.pos+docvectors+rawdocs -storePositions -storeDocvectors \ | ||
-storeRawDocs -language en >& log.fire12-en.pos+docvectors+rawdocs & | ||
``` | ||
|
||
The directory `/path/to/fire12-en/` should contain the collection, that is, the `en_BDNews24` and `en_TheTelegraph_2001-2010` directories. | ||
|
||
For additional details, see explanation of [common indexing options](common-indexing-options.md). | ||
|
||
## Retrieval | ||
|
||
Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/). | ||
The regression experiments here evaluate on 50 topics. | ||
|
||
After indexing has completed, you should be able to perform retrieval as follows: | ||
|
||
``` | ||
nohup target/appassembler/bin/SearchCollection -topicreader TsvString -index lucene-index.fire12-en.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.fire12en.176-225.txt -output run.fire12-en.bm25.topics.fire12en.176-225.txt -language en -bm25 & | ||
``` | ||
|
||
Evaluation can be performed using `trec_eval`: | ||
|
||
``` | ||
eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.fire12en.176-225.txt run.fire12-en.bm25.topics.fire12en.176-225.txt | ||
``` | ||
|
||
## Effectiveness | ||
|
||
With the above commands, you should be able to replicate the following results: | ||
|
||
MAP | BM25 | | ||
:---------------------------------------|-----------| | ||
[FIRE2012 (English monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.3867 | | ||
|
||
|
||
P30 | BM25 | | ||
:---------------------------------------|-----------| | ||
[FIRE2012 (English monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.3920 | | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
# Anserini: Regressions for [FIRE 2012 Monolingual Hindi](http://isical.ac.in/~fire/2012/adhoc.html) | ||
|
||
This page documents regression experiments for [FIRE 2012 Ad-hoc retrieval (Monolingual Hindi topic)](http://isical.ac.in/~fire/2012/adhoc.html). | ||
The document collection can be found on the [FIRE 2012 data page](http://fire.irsi.res.in/fire/static/data). | ||
|
||
The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/fire-hi.yaml). | ||
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/fire12-hi.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. | ||
|
||
## Indexing | ||
|
||
Typical indexing command: | ||
|
||
``` | ||
nohup sh target/appassembler/bin/IndexCollection -collection TrecCollection \ | ||
-generator LuceneDocumentGenerator -threads 16 -input /path/to/fire12-hi -index \ | ||
lucene-index.fire12-hi.pos+docvectors+rawdocs -storePositions -storeDocvectors \ | ||
-storeRawDocs -language hi >& log.fire12-hi.pos+docvectors+rawdocs & | ||
``` | ||
|
||
The directory `/path/to/fire12-hi/` should contain the collection, that is, the `hi_AmarUjala` and `hi_NavbharatTimes` directories. | ||
|
||
For additional details, see explanation of [common indexing options](common-indexing-options.md). | ||
|
||
## Retrieval | ||
|
||
Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/). | ||
The regression experiments here evaluate on 50 topics. | ||
|
||
After indexing has completed, you should be able to perform retrieval as follows: | ||
|
||
``` | ||
nohup target/appassembler/bin/SearchCollection -topicreader TsvString -index lucene-index.fire12-hi.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.fire12hi.176-225.txt -output run.fire12-hi.bm25.topics.fire12hi.176-225.txt -language hi -bm25 & | ||
``` | ||
|
||
Evaluation can be performed using `trec_eval`: | ||
|
||
``` | ||
eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.fire12hi.176-225.txt run.fire12-hi.bm25.topics.fire12hi.176-225.txt | ||
``` | ||
|
||
## Effectiveness | ||
|
||
With the above commands, you should be able to replicate the following results: | ||
|
||
MAP | BM25 | | ||
:---------------------------------------|-----------| | ||
[FIRE2012 (Hindi monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.3867 | | ||
|
||
|
||
P30 | BM25 | | ||
:---------------------------------------|-----------| | ||
[FIRE2012 (Hindi monolingual)](http://isical.ac.in/~fire/2012/adhoc.html)| 0.3920 | | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
# Anserini: Regressions for [TREC2002 Monolingual Arabic](https://trec.nist.gov/pubs/trec11/t11_proceedings.html) | ||
|
||
This page documents regression experiments for [TREC2002 Arabic monolingual topics](https://trec.nist.gov/pubs/trec11/t11_proceedings.html). | ||
A description of the document collection can be found on the [TREC data page](https://trec.nist.gov/data/docs_noneng.html): Agence France Presse (AFP) Arabic newswire, from [LDC2001T55 (Arabic Newswire Part 1)](https://catalog.ldc.upenn.edu/LDC2001T55). | ||
|
||
The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/trec02-ar.yaml). | ||
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/trec02-ar.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. | ||
|
||
## Indexing | ||
|
||
Typical indexing command: | ||
|
||
``` | ||
nohup sh target/appassembler/bin/IndexCollection -collection JsonCollection \ | ||
-generator LuceneDocumentGenerator -threads 16 -input /path/to/trec02-ar -index \ | ||
lucene-index.trec02-ar.pos+docvectors+rawdocs -storePositions -storeDocvectors \ | ||
-storeRawDocs -language ar >& log.trec02-ar.pos+docvectors+rawdocs & | ||
``` | ||
|
||
The directory `/path/to/trec02-ar/` should contain the collection: 2337 gzipped files from LDC2007T38. | ||
|
||
For additional details, see explanation of [common indexing options](common-indexing-options.md). | ||
|
||
## Retrieval | ||
|
||
Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/). | ||
The regression experiments here evaluate on 50 topics. | ||
|
||
After indexing has completed, you should be able to perform retrieval as follows: | ||
|
||
``` | ||
nohup target/appassembler/bin/SearchCollection -topicreader TsvString -index lucene-index.trec02-ar.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.trec02ar.mono.ar.txt -output run.trec02-ar.bm25.topics.trec02ar.mono.ar.txt -language ar -bm25 & | ||
``` | ||
|
||
Evaluation can be performed using `trec_eval`: | ||
|
||
``` | ||
eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.trec02ar.txt run.trec02-ar.bm25.topics.trec02ar.mono.ar.txt | ||
``` | ||
|
||
## Effectiveness | ||
|
||
With the above commands, you should be able to replicate the following results: | ||
|
||
MAP | BM25 | | ||
:---------------------------------------|-----------| | ||
[TREC2002 (Arabic monolingual)](../src/main/resources/topics-and-qrels/topics.trec02ar.mono.ar.txt)| 0.2932 | | ||
|
||
|
||
P30 | BM25 | | ||
:---------------------------------------|-----------| | ||
[TREC2002 (Arabic monolingual)](../src/main/resources/topics-and-qrels/topics.trec02ar.mono.ar.txt)| 0.3313 | | ||
|
||
|