Added formal regression for MS MARCO passage task #692

Merged · 3 commits · Jun 8, 2019
3 changes: 2 additions & 1 deletion README.md
@@ -57,10 +57,11 @@ Note that these regressions capture the "out of the box" experience, based on [_
+ [Experiments on Tweets2013 (MB13 & MB14)](docs/regressions-mb13.md)
+ [Experiments on Complex Answer Retrieval v1.5 (CAR17)](docs/regressions-car17v1.5.md)
+ [Experiments on Complex Answer Retrieval v2.0 (CAR17)](docs/regressions-car17v2.0.md)
++ [Experiments on MS MARCO Passage Task](docs/regressions-msmarco-passage.md)

Other experiments:

-+ [Experiments on MS MARCO](docs/experiments-msmarco.md)
++ [Experiments on MS MARCO](docs/experiments-msmarco-passage.md)
+ [Experiments on AI2 Open Research](docs/experiments-openresearch.md)
+ [Experiments from JDIQ 2018 article](docs/experiments-jdiq2018.md)
+ [Experiments from SIGIR Forum 2018 article](docs/experiments-forum2018.md)
58 changes: 29 additions & 29 deletions docs/experiments-msmarco.md → docs/experiments-msmarco-passage.md
@@ -1,15 +1,15 @@
-# Anserini: Experiments on [MS MARCO](http://www.msmarco.org/)
+# Anserini: Experiments on [MS MARCO (Passage)](https://github.com/microsoft/MSMARCO-Passage-Ranking)

## Data Prep

+We're going to use `msmarco-passage/` as the working directory.
First, we need to download and extract the MS MARCO dataset:

```
-DATA_DIR=./msmarco_data
-mkdir ${DATA_DIR}
+mkdir msmarco-passage

-wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P ${DATA_DIR}
-tar -xvf ${DATA_DIR}/collectionandqueries.tar.gz -C ${DATA_DIR}
+wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P msmarco-passage
+tar -xvf msmarco-passage/collectionandqueries.tar.gz -C msmarco-passage
```

To confirm, `collectionandqueries.tar.gz` should have MD5 checksum of `31644046b18952c1386cd4564ba2ae69`.
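A quick way to verify (a few lines of Python; the path assumes the archive was downloaded into `msmarco-passage/` as above):

```
import hashlib

# Stream the file in chunks so the large archive isn't read into memory at once.
md5 = hashlib.md5()
with open('msmarco-passage/collectionandqueries.tar.gz', 'rb') as f:
    for chunk in iter(lambda: f.read(1 << 20), b''):
        md5.update(chunk)

assert md5.hexdigest() == '31644046b18952c1386cd4564ba2ae69'
```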
@@ -18,63 +18,63 @@ Next, we need to convert the MS MARCO tsv collection into Anserini's jsonl files

```
python ./src/main/python/msmarco/convert_collection_to_jsonl.py \
- --collection_path=${DATA_DIR}/collection.tsv --output_folder=${DATA_DIR}/collection_jsonl
+ --collection_path msmarco-passage/collection.tsv --output_folder msmarco-passage/collection_jsonl
```

-The above script should generate 9 jsonl files in `${DATA_DIR}/collection_jsonl`, each with 1M lines (except for the last one, which should have 841,823 lines).
+The above script should generate 9 jsonl files in `msmarco-passage/collection_jsonl`, each with 1M lines (except for the last one, which should have 841,823 lines).
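For the curious, the conversion itself is mechanical; here is a minimal sketch of the idea (not the repository's actual `convert_collection_to_jsonl.py`, though the `id`/`contents` field names are what Anserini's `JsonCollection` expects):

```
import json
import os

# Each line of collection.tsv is "docid<TAB>passage text"; Anserini's
# JsonCollection expects one JSON object per line with "id" and "contents".
def convert(tsv_path, output_folder, docs_per_file=1_000_000):
    os.makedirs(output_folder, exist_ok=True)
    out, file_index = None, 0
    with open(tsv_path) as f:
        for i, line in enumerate(f):
            if i % docs_per_file == 0:
                if out:
                    out.close()
                # File naming here is illustrative, not the script's actual scheme.
                out = open(os.path.join(output_folder, f'docs{file_index:02d}.json'), 'w')
                file_index += 1
            doc_id, text = line.rstrip('\n').split('\t', 1)
            out.write(json.dumps({'id': doc_id, 'contents': text}) + '\n')
    if out:
        out.close()

convert('msmarco-passage/collection.tsv', 'msmarco-passage/collection_jsonl')
```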

We can now index these docs as a `JsonCollection` using Anserini:

```
sh ./target/appassembler/bin/IndexCollection -collection JsonCollection \
- -generator LuceneDocumentGenerator -threads 9 -input ${DATA_DIR}/collection_jsonl \
- -index ${DATA_DIR}/lucene-index-msmarco -optimize -storePositions -storeDocvectors -storeRawDocs
+ -generator LuceneDocumentGenerator -threads 9 -input msmarco-passage/collection_jsonl \
+ -index msmarco-passage/lucene-index-msmarco -storePositions -storeDocvectors -storeRawDocs
```

The output message should be something like this:

```
-2019-04-20 11:52:34,935 INFO [main] index.IndexCollection (IndexCollection.java:647) - Total 8,841,823 documents indexed in 00:05:04
+2019-06-08 08:53:47,351 INFO [main] index.IndexCollection (IndexCollection.java:632) - Total 8,841,823 documents indexed in 00:01:31
```

-Your speed may vary... with a modern desktop machine with an SSD, indexing takes around a minute.
+Your speed may vary... with a modern desktop machine with an SSD, indexing takes less than two minutes.

## Retrieving and Evaluating the Dev set

Since the dev set contains more than 100k queries, retrieving all of them would take a long time. To speed this up, we use only the queries that appear in the qrels file:

```
-python ./src/main/python/msmarco/filter_queries.py --qrels=${DATA_DIR}/qrels.dev.small.tsv \
- --queries=${DATA_DIR}/queries.dev.tsv --output_queries=${DATA_DIR}/queries.dev.small.tsv
+python ./src/main/python/msmarco/filter_queries.py --qrels msmarco-passage/qrels.dev.small.tsv \
+ --queries msmarco-passage/queries.dev.tsv --output_queries msmarco-passage/queries.dev.small.tsv
```

The output queries file should contain 6980 lines.
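The filtering itself is straightforward; a sketch of the idea (the actual `filter_queries.py` may differ in details):

```
import csv

# Collect the qids that appear in the qrels file (column 0),
# then keep only the queries whose qid is in that set.
with open('msmarco-passage/qrels.dev.small.tsv') as f:
    qids = {row[0] for row in csv.reader(f, delimiter='\t')}

with open('msmarco-passage/queries.dev.tsv') as fin, \
     open('msmarco-passage/queries.dev.small.tsv', 'w') as fout:
    for line in fin:
        if line.split('\t')[0] in qids:
            fout.write(line)
```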

-We can now retrieve this smaller set of queries.
+We can now retrieve this smaller set of queries:

```
-python ./src/main/python/msmarco/retrieve.py --index ${DATA_DIR}/lucene-index-msmarco \
- --qid_queries ${DATA_DIR}/queries.dev.small.tsv --output ${DATA_DIR}/run.dev.small.tsv --hits 1000
+python ./src/main/python/msmarco/retrieve.py --hits 1000 --index msmarco-passage/lucene-index-msmarco \
+ --qid_queries msmarco-passage/queries.dev.small.tsv --output msmarco-passage/run.dev.small.tsv
```

Note that by default, the above script uses BM25 with tuned parameters `k1=0.82`, `b=0.72` (more details below).
The option `-hits` specifies the number of documents per query to be retrieved.
Thus, the output file should have approximately 6980 * 1000 = 6.9M lines.
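As a quick sanity check (this assumes the script writes one `qid<TAB>docid<TAB>rank` triple per line, i.e., the MS MARCO run format):

```
from collections import Counter

# Count retrieved documents per query.
with open('msmarco-passage/run.dev.small.tsv') as f:
    hits_per_query = Counter(line.split('\t')[0] for line in f)

print(len(hits_per_query))           # expect 6980 queries
print(max(hits_per_query.values()))  # expect at most 1000 hits per query
```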

Retrieval speed will vary by machine:
-On a modern desktop with an SSD, we can get ~0.04 s/query (taking about five minutes).
-On a slower machine with mechanical disks, the entire process might take as long as a couple of hours.
-Alternatively, we can run the same script implemented in Java to remove Python overhead, which ends up being ~4x faster.
+On a modern desktop with an SSD, we can get ~0.06 s/query (taking about seven minutes).
+Alternatively, we can run the same script implemented in Java (which for some reason seems to be slower, see [issue 604](https://github.com/castorini/anserini/issues/604)):

```
-./target/appassembler/bin/SearchMsmarco -index ${DATA_DIR}/lucene-index-msmarco \
- -qid_queries ${DATA_DIR}/queries.dev.small.tsv -output ${DATA_DIR}/run.dev.small.tsv -hits 1000
+./target/appassembler/bin/SearchMsmarco -hits 1000 -index msmarco-passage/lucene-index-msmarco \
+ -qid_queries msmarco-passage/queries.dev.small.tsv -output msmarco-passage/run.dev.small.tsv
```

The option `-hits` specifies the number of documents per query to be retrieved.
Thus, the output file should have approximately 6980 * 1000 = 6.9M lines.

Finally, we can evaluate the retrieved documents using the official MS MARCO evaluation script:

```
python ./src/main/python/msmarco/msmarco_eval.py \
- ${DATA_DIR}/qrels.dev.small.tsv ${DATA_DIR}/run.dev.small.tsv
+ msmarco-passage/qrels.dev.small.tsv msmarco-passage/run.dev.small.tsv
```

And the output should be like this:
@@ -91,17 +91,17 @@ For that we first need to convert runs and qrels files to the TREC format:

```
python ./src/main/python/msmarco/convert_msmarco_to_trec_run.py \
- --input_run ${DATA_DIR}/run.dev.small.tsv --output_run ${DATA_DIR}/run.dev.small.trec
+ --input_run msmarco-passage/run.dev.small.tsv --output_run msmarco-passage/run.dev.small.trec

python ./src/main/python/msmarco/convert_msmarco_to_trec_qrels.py \
- --input_qrels ${DATA_DIR}/qrels.dev.small.tsv --output_qrels ${DATA_DIR}/qrels.dev.small.trec
+ --input_qrels msmarco-passage/qrels.dev.small.tsv --output_qrels msmarco-passage/qrels.dev.small.trec
```
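The run conversion is also where the "fake" scores mentioned on the regression page come from: the MS MARCO run format carries only ranks, so a synthetic score must be invented for the TREC format. A sketch of the idea (the actual script's run tag and score formula are assumptions):

```
# MS MARCO run format: qid<TAB>docid<TAB>rank
# TREC run format:     qid Q0 docid rank score tag
with open('msmarco-passage/run.dev.small.tsv') as fin, \
     open('msmarco-passage/run.dev.small.trec', 'w') as fout:
    for line in fin:
        qid, docid, rank = line.strip().split('\t')
        # No real score is available, so derive one from the rank such
        # that higher-ranked documents get higher scores.
        fake_score = 1.0 / int(rank)
        fout.write(f'{qid} Q0 {docid} {rank} {fake_score} anserini\n')
```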

And run the `trec_eval` tool:

```
./eval/trec_eval.9.0.4/trec_eval -mrecall.1000 -mmap \
- ${DATA_DIR}/qrels.dev.small.trec ${DATA_DIR}/run.dev.small.trec
+ msmarco-passage/qrels.dev.small.trec msmarco-passage/run.dev.small.trec
```

The output should be:
@@ -115,7 +115,7 @@ Average precision and recall@1000 are the two metrics we care about the most.

## BM25 Tuning

-Note that this figure differs slightly from the value reported in [Document Expansion by Query Prediction](https://arxiv.org/abs/1904.08375), which uses the Anserini default of `k1=0.9`, `b=0.4`.
+Note that this figure differs slightly from the value reported in [Document Expansion by Query Prediction](https://arxiv.org/abs/1904.08375), which uses the Anserini (system-wide) default of `k1=0.9`, `b=0.4`.

Tuning was accomplished with the `tune_bm25.py` script, using the queries found [here](https://github.com/castorini/Anserini-data/tree/master/MSMARCO).
There are five different sets of 10k samples (from the `shuf` command).
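Conceptually, the tuning is a grid search over `k1` and `b`: index once, then for each parameter pair retrieve on a held-out query sample and keep the best-scoring setting. A hedged sketch (the `--k1`/`--b` flags on `retrieve.py` and the exact output format of `msmarco_eval.py` are assumptions here; the real `tune_bm25.py` may sweep a different grid):

```
import itertools
import subprocess

# Hypothetical helper: run retrieval with the given BM25 parameters and
# return MRR@10 on a query sample; stands in for the real pipeline.
def evaluate(k1, b):
    subprocess.run(['python', './src/main/python/msmarco/retrieve.py',
                    '--index', 'msmarco-passage/lucene-index-msmarco',
                    '--qid_queries', 'queries.sample.tsv',
                    '--output', 'run.sample.tsv',
                    '--hits', '1000', '--k1', str(k1), '--b', str(b)],
                   check=True)
    out = subprocess.run(['python', './src/main/python/msmarco/msmarco_eval.py',
                          'qrels.sample.tsv', 'run.sample.tsv'],
                         capture_output=True, text=True, check=True).stdout
    # Parse "MRR @10: <value>" from the script's output (assumed format).
    return float(out.split('MRR @10:')[1].split()[0])

best = max(itertools.product([0.6, 0.7, 0.82, 0.9, 1.0],
                             [0.4, 0.6, 0.72, 0.8, 1.0]),
           key=lambda p: evaluate(*p))
print('best (k1, b):', best)
```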
70 changes: 70 additions & 0 deletions docs/regressions-msmarco-passage.md
@@ -0,0 +1,70 @@
# Anserini: Experiments on [MS MARCO (Passage)](https://github.com/microsoft/MSMARCO-Passage-Ranking)

This page documents regression experiments for the MS MARCO Passage Ranking Task, which is integrated into Anserini's regression testing framework.
For more complete instructions on how to run end-to-end experiments, refer to a [separate page](experiments-msmarco-passage.md).

## Indexing

Typical indexing command:

```
nohup sh target/appassembler/bin/IndexCollection -collection JsonCollection \
-generator LuceneDocumentGenerator -threads 9 -input /path/to/msmarco-passage \
-index lucene-index.msmarco-passage.pos+docvectors+rawdocs -storePositions \
-storeDocvectors -storeRawDocs >& log.msmarco-passage.pos+docvectors+rawdocs &
```

The directory `/path/to/msmarco-passage/` should contain the `jsonl` files converted from the official passage collection, which is distributed in `tsv` format.
[This page](experiments-msmarco-passage.md) explains how to perform this conversion.

For additional details, see explanation of [common indexing options](common-indexing-options.md).

## Retrieval

Topics and qrels are stored in `src/main/resources/topics-and-qrels/`.
The regression experiments here evaluate on the 6980 dev set questions; see [this page](experiments-msmarco-passage.md) for more details.

After indexing has completed, you should be able to perform retrieval as follows:

```
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default.topics.msmarco-passage.dev-subset.txt -bm25 &

nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -rm3 &

nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.72 &

nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.72 -rm3 &

```
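The four runs above correspond to the four conditions reported below: default BM25, default BM25 with RM3, tuned BM25 (`k1=0.82`, `b=0.72`), and tuned BM25 with RM3.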

Evaluation can be performed using `trec_eval`:

```
eval/trec_eval.9.0.4/trec_eval -m map -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt run.msmarco-passage.bm25-default.topics.msmarco-passage.dev-subset.txt

eval/trec_eval.9.0.4/trec_eval -m map -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt run.msmarco-passage.bm25-default+rm3.topics.msmarco-passage.dev-subset.txt

eval/trec_eval.9.0.4/trec_eval -m map -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt run.msmarco-passage.bm25-tuned.topics.msmarco-passage.dev-subset.txt

eval/trec_eval.9.0.4/trec_eval -m map -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt run.msmarco-passage.bm25-tuned+rm3.topics.msmarco-passage.dev-subset.txt

```

## Effectiveness

With the above commands, you should be able to replicate the following results:

MAP | BM25 (Default)| +RM3 | BM25 (Tuned)| +RM3 |
:---------------------------------------|-----------|-----------|-----------|-----------|
[MS MARCO Passage Ranking: Dev Queries](https://github.com/microsoft/MSMARCO-Passage-Ranking)| 0.1924 | 0.1661 | 0.1956 | 0.1766 |

R@1000 | BM25 (Default)| +RM3 | BM25 (Tuned)| +RM3 |
:---------------------------------------|-----------|-----------|-----------|-----------|
[MS MARCO Passage Ranking: Dev Queries](https://github.com/microsoft/MSMARCO-Passage-Ranking)| 0.8526 | 0.8606 | 0.8578 | 0.8701 |



The setting "default" refers the default BM25 settings of `k1=0.9`, `b=0.4`, while "tuned" refers to the tuned setting of `k1=0.82`, `b=0.72`.
See [this page](experiments-msmarco.md) for more details.
Note that these results are slightly different from the above referenced page because those experiments make up "fake" scores when converting runs from MS MARCO format into TREC format for evaluation by `trec_eval`.
2 changes: 1 addition & 1 deletion src/main/java/io/anserini/search/SearchMsmarco.java
@@ -67,7 +67,7 @@ public static void main(String[] args) throws Exception {
SimpleSearcher.Result[] hits = searcher.search(query, retrieveArgs.hits);

if (lineNumber % 10 == 0) {
-        double timePerQuery = (double) (System.nanoTime() - startTime) / (lineNumber + 1) / 10e9;
+        double timePerQuery = (double) (System.nanoTime() - startTime) / (lineNumber + 1) / 1e9;
System.out.format("Retrieving query " + lineNumber + " (%.3f s/query)\n", timePerQuery);
}

61 changes: 61 additions & 0 deletions src/main/java/io/anserini/search/topicreader/TsvTopicReader.java
@@ -0,0 +1,61 @@
/**
* Anserini: A Lucene toolkit for replicable information retrieval research
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package io.anserini.search.topicreader;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/**
* Topic reader for queries in tsv format, such as the MS MARCO queries.
*
* <pre>
* 174249 does xpress bet charge to deposit money in your account
* 320792 how much is a cost to run disneyland
* 1090270 botulinum definition
* 1101279 do physicians pay for insurance from their salaries?
* 201376 here there be dragons comic
* 54544 blood diseases that are sexually transmitted
* ...
* </pre>
*/
public class TsvTopicReader extends TopicReader {
  public TsvTopicReader(Path topicFile) {
    super(topicFile);
  }

  @Override
  public SortedMap<Integer, Map<String, String>> read(BufferedReader reader) throws IOException {
    SortedMap<Integer, Map<String, String>> map = new TreeMap<>();

    String line;
    while ((line = reader.readLine()) != null) {
      line = line.trim();
      if (line.isEmpty()) {
        continue;  // tolerate blank lines, e.g., a trailing newline
      }
      String[] arr = line.split("\\t");

      Map<String, String> fields = new HashMap<>();
      fields.put("title", arr[1].trim());  // the query text goes in the "title" field
      map.put(Integer.valueOf(arr[0]), fields);  // keyed on the integer qid
    }

    return map;
  }
}
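This reader backs the `-topicreader Tsv` option used in the retrieval commands of the regression documentation above.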
44 changes: 44 additions & 0 deletions src/main/resources/docgen/templates/msmarco-passage.template
@@ -0,0 +1,44 @@
# Anserini: Experiments on [MS MARCO (Passage)](https://github.com/microsoft/MSMARCO-Passage-Ranking)

This page documents regression experiments for the MS MARCO Passage Ranking Task, which is integrated into Anserini's regression testing framework.
For more complete instructions on how to run end-to-end experiments, refer to a [separate page](experiments-msmarco-passage.md).

## Indexing

Typical indexing command:

```
${index_cmds}
```

The directory `/path/to/msmarco-passage/` should contain the `jsonl` files converted from the official passage collection, which is distributed in `tsv` format.
[This page](experiments-msmarco-passage.md) explains how to perform this conversion.

For additional details, see explanation of [common indexing options](common-indexing-options.md).

## Retrieval

Topics and qrels are stored in `src/main/resources/topics-and-qrels/`.
The regression experiments here evaluate on the 6980 dev set questions; see [this page](experiments-msmarco-passage.md) for more details.

After indexing has completed, you should be able to perform retrieval as follows:

```
${ranking_cmds}
```

Evaluation can be performed using `trec_eval`:

```
${eval_cmds}
```

## Effectiveness

With the above commands, you should be able to replicate the following results:

${effectiveness}

The setting "default" refers the default BM25 settings of `k1=0.9`, `b=0.4`, while "tuned" refers to the tuned setting of `k1=0.82`, `b=0.72`.
See [this page](experiments-msmarco.md) for more details.
Note that these results are slightly different from the above referenced page because those experiments make up "fake" scores when converting runs from MS MARCO format into TREC format for evaluation by `trec_eval`.