Added formal regression for MS MARCO passage task #692

Merged · 3 commits · Jun 8, 2019
3 changes: 2 additions & 1 deletion README.md
@@ -57,10 +57,11 @@ Note that these regressions capture the "out of the box" experience, based on [_
+ [Experiments on Tweets2013 (MB13 & MB14)](docs/regressions-mb13.md)
+ [Experiments on Complex Answer Retrieval v1.5 (CAR17)](docs/regressions-car17v1.5.md)
+ [Experiments on Complex Answer Retrieval v2.0 (CAR17)](docs/regressions-car17v2.0.md)
++ [Experiments on MS MARCO Passage Task](docs/regressions-msmarco-passage.md)

Other experiments:

-+ [Experiments on MS MARCO](docs/experiments-msmarco.md)
++ [Experiments on MS MARCO](docs/experiments-msmarco-passage.md)
+ [Experiments on AI2 Open Research](docs/experiments-openresearch.md)
+ [Experiments from JDIQ 2018 article](docs/experiments-jdiq2018.md)
+ [Experiments from SIGIR Forum 2018 article](docs/experiments-forum2018.md)
58 changes: 29 additions & 29 deletions docs/experiments-msmarco.md → docs/experiments-msmarco-passage.md
@@ -1,15 +1,15 @@
-# Anserini: Experiments on [MS MARCO](http://www.msmarco.org/)
+# Anserini: Experiments on [MS MARCO (Passage)](https://github.com/microsoft/MSMARCO-Passage-Ranking)

## Data Prep

+We're going to use `msmarco-passage/` as the working directory.
First, we need to download and extract the MS MARCO dataset:

```
-DATA_DIR=./msmarco_data
-mkdir ${DATA_DIR}
+mkdir msmarco-passage

-wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P ${DATA_DIR}
-tar -xvf ${DATA_DIR}/collectionandqueries.tar.gz -C ${DATA_DIR}
+wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P msmarco-passage
+tar -xvf msmarco-passage/collectionandqueries.tar.gz -C msmarco-passage
```

To confirm, `collectionandqueries.tar.gz` should have MD5 checksum of `31644046b18952c1386cd4564ba2ae69`.
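A quick way to verify (a few lines of Python; the path assumes the archive was downloaded into `msmarco-passage/` as above):

```
import hashlib

# Stream the file in chunks so the large archive isn't read into memory at once.
md5 = hashlib.md5()
with open('msmarco-passage/collectionandqueries.tar.gz', 'rb') as f:
    for chunk in iter(lambda: f.read(1 << 20), b''):
        md5.update(chunk)

assert md5.hexdigest() == '31644046b18952c1386cd4564ba2ae69'
```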
@@ -18,63 +18,63 @@ Next, we need to convert the MS MARCO tsv collection into Anserini's jsonl files

```
python ./src/main/python/msmarco/convert_collection_to_jsonl.py \
- --collection_path=${DATA_DIR}/collection.tsv --output_folder=${DATA_DIR}/collection_jsonl
+ --collection_path msmarco-passage/collection.tsv --output_folder msmarco-passage/collection_jsonl
```

-The above script should generate 9 jsonl files in `${DATA_DIR}/collection_jsonl`, each with 1M lines (except for the last one, which should have 841,823 lines).
+The above script should generate 9 jsonl files in `msmarco-passage/collection_jsonl`, each with 1M lines (except for the last one, which should have 841,823 lines).
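For the curious, the conversion itself is mechanical; here is a minimal sketch of the idea (not the repository's actual `convert_collection_to_jsonl.py`, though the `id`/`contents` field names are what Anserini's `JsonCollection` expects):

```
import json
import os

# Each line of collection.tsv is "docid<TAB>passage text"; Anserini's
# JsonCollection expects one JSON object per line with "id" and "contents".
def convert(tsv_path, output_folder, docs_per_file=1_000_000):
    os.makedirs(output_folder, exist_ok=True)
    out, file_index = None, 0
    with open(tsv_path) as f:
        for i, line in enumerate(f):
            if i % docs_per_file == 0:
                if out:
                    out.close()
                # File naming here is illustrative, not the script's actual scheme.
                out = open(os.path.join(output_folder, f'docs{file_index:02d}.json'), 'w')
                file_index += 1
            doc_id, text = line.rstrip('\n').split('\t', 1)
            out.write(json.dumps({'id': doc_id, 'contents': text}) + '\n')
    if out:
        out.close()

convert('msmarco-passage/collection.tsv', 'msmarco-passage/collection_jsonl')
```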

We can now index these docs as a `JsonCollection` using Anserini:

```
sh ./target/appassembler/bin/IndexCollection -collection JsonCollection \
- -generator LuceneDocumentGenerator -threads 9 -input ${DATA_DIR}/collection_jsonl \
- -index ${DATA_DIR}/lucene-index-msmarco -optimize -storePositions -storeDocvectors -storeRawDocs
+ -generator LuceneDocumentGenerator -threads 9 -input msmarco-passage/collection_jsonl \
+ -index msmarco-passage/lucene-index-msmarco -storePositions -storeDocvectors -storeRawDocs
```

The output message should be something like this:

```
-2019-04-20 11:52:34,935 INFO [main] index.IndexCollection (IndexCollection.java:647) - Total 8,841,823 documents indexed in 00:05:04
+2019-06-08 08:53:47,351 INFO [main] index.IndexCollection (IndexCollection.java:632) - Total 8,841,823 documents indexed in 00:01:31
```

-Your speed may vary... with a modern desktop machine with an SSD, indexing takes around a minute.
+Your speed may vary... with a modern desktop machine with an SSD, indexing takes less than two minutes.

## Retrieving and Evaluating the Dev set

Since the dev set contains more than 100k queries, retrieving all of them would take a long time. To speed this up, we use only the queries that appear in the qrels file:

```
-python ./src/main/python/msmarco/filter_queries.py --qrels=${DATA_DIR}/qrels.dev.small.tsv \
- --queries=${DATA_DIR}/queries.dev.tsv --output_queries=${DATA_DIR}/queries.dev.small.tsv
+python ./src/main/python/msmarco/filter_queries.py --qrels msmarco-passage/qrels.dev.small.tsv \
+ --queries msmarco-passage/queries.dev.tsv --output_queries msmarco-passage/queries.dev.small.tsv
```

The output queries file should contain 6980 lines.
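The filtering itself is straightforward; a sketch of the idea (the actual `filter_queries.py` may differ in details):

```
import csv

# Collect the qids that appear in the qrels file (column 0),
# then keep only the queries whose qid is in that set.
with open('msmarco-passage/qrels.dev.small.tsv') as f:
    qids = {row[0] for row in csv.reader(f, delimiter='\t')}

with open('msmarco-passage/queries.dev.tsv') as fin, \
     open('msmarco-passage/queries.dev.small.tsv', 'w') as fout:
    for line in fin:
        if line.split('\t')[0] in qids:
            fout.write(line)
```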

-We can now retrieve this smaller set of queries.
+We can now retrieve this smaller set of queries:

```
-python ./src/main/python/msmarco/retrieve.py --index ${DATA_DIR}/lucene-index-msmarco \
- --qid_queries ${DATA_DIR}/queries.dev.small.tsv --output ${DATA_DIR}/run.dev.small.tsv --hits 1000
+python ./src/main/python/msmarco/retrieve.py --hits 1000 --index msmarco-passage/lucene-index-msmarco \
+ --qid_queries msmarco-passage/queries.dev.small.tsv --output msmarco-passage/run.dev.small.tsv
```

Note that by default, the above script uses BM25 with tuned parameters `k1=0.82`, `b=0.72` (more details below).
The option `-hits` specifies the number of documents per query to be retrieved.
Thus, the output file should have approximately 6980 * 1000 = 6.9M lines.
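As a quick sanity check (this assumes the script writes one `qid<TAB>docid<TAB>rank` triple per line, i.e., the MS MARCO run format):

```
from collections import Counter

# Count retrieved documents per query.
with open('msmarco-passage/run.dev.small.tsv') as f:
    hits_per_query = Counter(line.split('\t')[0] for line in f)

print(len(hits_per_query))           # expect 6980 queries
print(max(hits_per_query.values()))  # expect at most 1000 hits per query
```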

Retrieval speed will vary by machine:
-On a modern desktop with an SSD, we can get ~0.04 s/query (taking about five minutes).
-On a slower machine with mechanical disks, the entire process might take as long as a couple of hours.
-Alternatively, we can run the same script implemented in Java to remove Python overhead, which ends up being ~4x faster.
+On a modern desktop with an SSD, we can get ~0.06 s/query (taking about seven minutes).
+Alternatively, we can run the same script implemented in Java (which for some reason seems to be slower, see [issue 604](https://github.com/castorini/anserini/issues/604)):

```
-./target/appassembler/bin/SearchMsmarco -index ${DATA_DIR}/lucene-index-msmarco \
- -qid_queries ${DATA_DIR}/queries.dev.small.tsv -output ${DATA_DIR}/run.dev.small.tsv -hits 1000
+./target/appassembler/bin/SearchMsmarco -hits 1000 -index msmarco-passage/lucene-index-msmarco \
+ -qid_queries msmarco-passage/queries.dev.small.tsv -output msmarco-passage/run.dev.small.tsv
```

The option `-hits` specifies the number of documents per query to be retrieved.
Thus, the output file should have approximately 6980 * 1000 = 6.9M lines.

Finally, we can evaluate the retrieved documents using the official MS MARCO evaluation script:

```
python ./src/main/python/msmarco/msmarco_eval.py \
- ${DATA_DIR}/qrels.dev.small.tsv ${DATA_DIR}/run.dev.small.tsv
+ msmarco-passage/qrels.dev.small.tsv msmarco-passage/run.dev.small.tsv
```

And the output should be like this:
@@ -91,17 +91,17 @@ For that we first need to convert runs and qrels files to the TREC format:

```
python ./src/main/python/msmarco/convert_msmarco_to_trec_run.py \
- --input_run ${DATA_DIR}/run.dev.small.tsv --output_run ${DATA_DIR}/run.dev.small.trec
+ --input_run msmarco-passage/run.dev.small.tsv --output_run msmarco-passage/run.dev.small.trec

python ./src/main/python/msmarco/convert_msmarco_to_trec_qrels.py \
- --input_qrels ${DATA_DIR}/qrels.dev.small.tsv --output_qrels ${DATA_DIR}/qrels.dev.small.trec
+ --input_qrels msmarco-passage/qrels.dev.small.tsv --output_qrels msmarco-passage/qrels.dev.small.trec
```
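The run conversion is also where the "fake" scores mentioned on the regression page come from: the MS MARCO run format carries only ranks, so a synthetic score must be invented for the TREC format. A sketch of the idea (the actual script's run tag and score formula are assumptions):

```
# MS MARCO run format: qid<TAB>docid<TAB>rank
# TREC run format:     qid Q0 docid rank score tag
with open('msmarco-passage/run.dev.small.tsv') as fin, \
     open('msmarco-passage/run.dev.small.trec', 'w') as fout:
    for line in fin:
        qid, docid, rank = line.strip().split('\t')
        # No real score is available, so derive one from the rank such
        # that higher-ranked documents get higher scores.
        fake_score = 1.0 / int(rank)
        fout.write(f'{qid} Q0 {docid} {rank} {fake_score} anserini\n')
```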

And run the `trec_eval` tool:

```
./eval/trec_eval.9.0.4/trec_eval -mrecall.1000 -mmap \
- ${DATA_DIR}/qrels.dev.small.trec ${DATA_DIR}/run.dev.small.trec
+ msmarco-passage/qrels.dev.small.trec msmarco-passage/run.dev.small.trec
```

The output should be:
@@ -115,7 +115,7 @@ Average precision and recall@1000 are the two metrics we care about the most.

## BM25 Tuning

-Note that this figure differs slightly from the value reported in [Document Expansion by Query Prediction](https://arxiv.org/abs/1904.08375), which uses the Anserini default of `k1=0.9`, `b=0.4`.
+Note that this figure differs slightly from the value reported in [Document Expansion by Query Prediction](https://arxiv.org/abs/1904.08375), which uses the Anserini (system-wide) default of `k1=0.9`, `b=0.4`.

Tuning was accomplished with the `tune_bm25.py` script, using the queries found [here](https://github.com/castorini/Anserini-data/tree/master/MSMARCO).
There are five different sets of 10k samples (from the `shuf` command).
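Conceptually, the tuning is a grid search over `k1` and `b`: index once, then for each parameter pair retrieve on a held-out query sample and keep the best-scoring setting. A hedged sketch (the `--k1`/`--b` flags on `retrieve.py` and the exact output format of `msmarco_eval.py` are assumptions here; the real `tune_bm25.py` may sweep a different grid):

```
import itertools
import subprocess

# Hypothetical helper: run retrieval with the given BM25 parameters and
# return MRR@10 on a query sample; stands in for the real pipeline.
def evaluate(k1, b):
    subprocess.run(['python', './src/main/python/msmarco/retrieve.py',
                    '--index', 'msmarco-passage/lucene-index-msmarco',
                    '--qid_queries', 'queries.sample.tsv',
                    '--output', 'run.sample.tsv',
                    '--hits', '1000', '--k1', str(k1), '--b', str(b)],
                   check=True)
    out = subprocess.run(['python', './src/main/python/msmarco/msmarco_eval.py',
                          'qrels.sample.tsv', 'run.sample.tsv'],
                         capture_output=True, text=True, check=True).stdout
    # Parse "MRR @10: <value>" from the script's output (assumed format).
    return float(out.split('MRR @10:')[1].split()[0])

best = max(itertools.product([0.6, 0.7, 0.82, 0.9, 1.0],
                             [0.4, 0.6, 0.72, 0.8, 1.0]),
           key=lambda p: evaluate(*p))
print('best (k1, b):', best)
```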
70 changes: 70 additions & 0 deletions docs/regressions-msmarco-passage.md
@@ -0,0 +1,70 @@
# Anserini: Experiments on [MS MARCO (Passage)](https://github.com/microsoft/MSMARCO-Passage-Ranking)

This page documents regression experiments for the MS MARCO Passage Ranking Task, which is integrated into Anserini's regression testing framework.
For more complete instructions on how to run end-to-end experiments, refer to a [separate page](experiments-msmarco-passage.md).

## Indexing

Typical indexing command:

```
nohup sh target/appassembler/bin/IndexCollection -collection JsonCollection \
-generator LuceneDocumentGenerator -threads 9 -input /path/to/msmarco-passage \
-index lucene-index.msmarco-passage.pos+docvectors+rawdocs -storePositions \
-storeDocvectors -storeRawDocs >& log.msmarco-passage.pos+docvectors+rawdocs &
```

The directory `/path/to/msmarco-passage/` should contain the `jsonl` files converted from the official passage collection, which is distributed in `tsv` format.
[This page](experiments-msmarco-passage.md) explains how to perform this conversion.

For additional details, see explanation of [common indexing options](common-indexing-options.md).

## Retrieval

Topics and qrels are stored in `src/main/resources/topics-and-qrels/`.
The regression experiments here evaluate on the 6980 dev set questions; see [this page](experiments-msmarco-passage.md) for more details.

After indexing has completed, you should be able to perform retrieval as follows:

```
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default.topics.msmarco-passage.dev-subset.txt -bm25 &

nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-default+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -rm3 &

nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.72 &

nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index lucene-index.msmarco-passage.pos+docvectors+rawdocs -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output run.msmarco-passage.bm25-tuned+rm3.topics.msmarco-passage.dev-subset.txt -bm25 -k1 0.82 -b 0.72 -rm3 &

```
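The four runs above correspond to the four conditions reported below: default BM25, default BM25 with RM3, tuned BM25 (`k1=0.82`, `b=0.72`), and tuned BM25 with RM3.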

Evaluation can be performed using `trec_eval`:

```
eval/trec_eval.9.0.4/trec_eval -m map -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt run.msmarco-passage.bm25-default.topics.msmarco-passage.dev-subset.txt

eval/trec_eval.9.0.4/trec_eval -m map -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt run.msmarco-passage.bm25-default+rm3.topics.msmarco-passage.dev-subset.txt

eval/trec_eval.9.0.4/trec_eval -m map -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt run.msmarco-passage.bm25-tuned.topics.msmarco-passage.dev-subset.txt

eval/trec_eval.9.0.4/trec_eval -m map -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt run.msmarco-passage.bm25-tuned+rm3.topics.msmarco-passage.dev-subset.txt

```

## Effectiveness

With the above commands, you should be able to replicate the following results:

MAP | BM25 (Default)| +RM3 | BM25 (Tuned)| +RM3 |
:---------------------------------------|-----------|-----------|-----------|-----------|
[MS MARCO Passage Ranking: Dev Queries](https://github.com/microsoft/MSMARCO-Passage-Ranking)| 0.1924 | 0.1661 | 0.1956 | 0.1766 |

R@1000 | BM25 (Default)| +RM3 | BM25 (Tuned)| +RM3 |
:---------------------------------------|-----------|-----------|-----------|-----------|
[MS MARCO Passage Ranking: Dev Queries](https://github.com/microsoft/MSMARCO-Passage-Ranking)| 0.8526 | 0.8606 | 0.8578 | 0.8701 |



The setting "default" refers the default BM25 settings of `k1=0.9`, `b=0.4`, while "tuned" refers to the tuned setting of `k1=0.82`, `b=0.72`.
See [this page](experiments-msmarco.md) for more details.
Note that these results are slightly different from the above referenced page because those experiments make up "fake" scores when converting runs from MS MARCO format into TREC format for evaluation by `trec_eval`.
2 changes: 1 addition & 1 deletion src/main/java/io/anserini/search/SearchMsmarco.java
@@ -67,7 +67,7 @@ public static void main(String[] args) throws Exception {
SimpleSearcher.Result[] hits = searcher.search(query, retrieveArgs.hits);

if (lineNumber % 10 == 0) {
-        double timePerQuery = (double) (System.nanoTime() - startTime) / (lineNumber + 1) / 10e9;
+        double timePerQuery = (double) (System.nanoTime() - startTime) / (lineNumber + 1) / 1e9;
System.out.format("Retrieving query " + lineNumber + " (%.3f s/query)\n", timePerQuery);
}

61 changes: 61 additions & 0 deletions src/main/java/io/anserini/search/topicreader/TsvTopicReader.java
@@ -0,0 +1,61 @@
/**
* Anserini: A Lucene toolkit for replicable information retrieval research
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package io.anserini.search.topicreader;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/**
* Topic reader for queries in tsv format, such as the MS MARCO queries.
*
* <pre>
* 174249 does xpress bet charge to deposit money in your account
* 320792 how much is a cost to run disneyland
* 1090270 botulinum definition
* 1101279 do physicians pay for insurance from their salaries?
* 201376 here there be dragons comic
* 54544 blood diseases that are sexually transmitted
* ...
* </pre>
*/
public class TsvTopicReader extends TopicReader {
  public TsvTopicReader(Path topicFile) {
    super(topicFile);
  }

  @Override
  public SortedMap<Integer, Map<String, String>> read(BufferedReader reader) throws IOException {
    SortedMap<Integer, Map<String, String>> map = new TreeMap<>();

    String line;
    while ((line = reader.readLine()) != null) {
      line = line.trim();
      if (line.isEmpty()) {
        continue;  // tolerate blank lines, e.g., a trailing newline
      }
      String[] arr = line.split("\\t");

      Map<String, String> fields = new HashMap<>();
      fields.put("title", arr[1].trim());  // the query text goes in the "title" field
      map.put(Integer.valueOf(arr[0]), fields);  // keyed on the integer qid
    }

    return map;
  }
}
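This reader backs the `-topicreader Tsv` option used in the retrieval commands of the regression documentation above.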
44 changes: 44 additions & 0 deletions src/main/resources/docgen/templates/msmarco-passage.template
@@ -0,0 +1,44 @@
# Anserini: Experiments on [MS MARCO (Passage)](https://github.com/microsoft/MSMARCO-Passage-Ranking)

This page documents regression experiments for the MS MARCO Passage Ranking Task, which is integrated into Anserini's regression testing framework.
For more complete instructions on how to run end-to-end experiments, refer to a [separate page](experiments-msmarco-passage.md).

## Indexing

Typical indexing command:

```
${index_cmds}
```

The directory `/path/to/msmarco-passage/` should contain the `jsonl` files converted from the official passage collection, which is distributed in `tsv` format.
[This page](experiments-msmarco-passage.md) explains how to perform this conversion.

For additional details, see explanation of [common indexing options](common-indexing-options.md).

## Retrieval

Topics and qrels are stored in `src/main/resources/topics-and-qrels/`.
The regression experiments here evaluate on the 6980 dev set questions; see [this page](experiments-msmarco-passage.md) for more details.

After indexing has completed, you should be able to perform retrieval as follows:

```
${ranking_cmds}
```

Evaluation can be performed using `trec_eval`:

```
${eval_cmds}
```

## Effectiveness

With the above commands, you should be able to replicate the following results:

${effectiveness}

The setting "default" refers the default BM25 settings of `k1=0.9`, `b=0.4`, while "tuned" refers to the tuned setting of `k1=0.82`, `b=0.72`.
See [this page](experiments-msmarco.md) for more details.
Note that these results are slightly different from the above referenced page because those experiments make up "fake" scores when converting runs from MS MARCO format into TREC format for evaluation by `trec_eval`.