Add documentation for the MS MARCO v2 passage and document corpus (#64)
* init augment collection

* update requirements

* start repo

* update passage v2

* update document v2

* quick cleanup

* rephrase long sentence
ronakice authored Nov 5, 2021
1 parent c6e2a23 commit 19e3b58
Showing 3 changed files with 242 additions and 1 deletion.
138 changes: 138 additions & 0 deletions README.md
@@ -541,3 +541,141 @@ for ITER in {00..32}; do
--gin_param="utils.tpu_mesh_shape.tpu_topology = 'v3-8'"
done
```
## MS MARCO V2 Passage Expansion
Here we provide instructions on how to reproduce our docTTTTTquery results for the MS MARCO V2 passage ranking task with the Anserini IR toolkit, using predicted queries.
We open-source the [predicted queries](https://huggingface.co/datasets/castorini/msmarco_v2_passage_doc2query-t5_expansions/viewer/default/train) using the [🤗 Datasets library](https://github.com/huggingface/datasets).
Note that this is a very large dataset, so we ran the docTTTTTquery inference step across multiple TPUs.
In fact, the corpus is significantly larger than MS MARCO v1, which is why we generate only 20 queries per passage.
Also, we use a different docTTTTTquery model, trained on the MS MARCO v2 passage ranking dataset.
We use the [metadata-augmented passage corpus](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-v2.md#passage-collection-augmented) which was shown to have better effectiveness.
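To get a quick feel for the expansions before running the full pipeline, you can stream a few records straight from the Hub. This is just a sketch: the `id` and `predicted_queries` field names follow the `augment_corpus.py` script added in this commit, and streaming assumes a recent enough version of 🤗 Datasets.
```python
from datasets import load_dataset

# Stream the expansions so we can peek at a record without downloading the whole dataset.
expansions = load_dataset(
    "castorini/msmarco_v2_passage_doc2query-t5_expansions",
    split="train",
    streaming=True,
)

record = next(iter(expansions))
print(record["id"])                     # passage id in the augmented corpus
print(record["predicted_queries"][:3])  # first few of the 20 predicted queries
```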
First, we download the expanded queries dataset and expand this corpus using `NUM_QUERIES` queries per passage:
```bash
export NUM_QUERIES=20
python3 msmarco-v2/augment_corpus.py --hgf_d2q_dataset castorini/msmarco_v2_passage_doc2query-t5_expansions \
--original_psg_path collections/msmarco_v2_passage_augmented \
--output_psg_path collections/msmarco_v2_passage_augmented_d2q-t5_${NUM_QUERIES} \
--num_workers 70 \
--num_queries ${NUM_QUERIES} \
--task passage \
--cache_dir /path/to/cache/dir
```
The dataset is downloaded and processed in the cache directory, after which the corpus itself is expanded, so make sure you have enough storage space (around 300 GB for this entire task).
If the dataset is not already cached, the script takes about 18 hours; if it is already cached, expect it to finish in about 10 hours.
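Before indexing, it can be worth spot-checking one of the shards the script writes. The snippet below is a sketch: the shard name and the `pid`/`passage` fields follow `augment_corpus.py` in this commit.
```python
import json

# Peek at the first record of one augmented shard written by augment_corpus.py.
shard = "collections/msmarco_v2_passage_augmented_d2q-t5_20/dt5q_aug_psg0.json"
with open(shard) as f:
    doc = json.loads(f.readline())

print(doc["pid"])
# The appended doc2query-T5 predictions sit at the end of the "passage" field.
print(doc["passage"][-300:])
```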
Upon completion, index the expanded passages with Anserini:
```bash
sh target/appassembler/bin/IndexCollection -collection MsMarcoV2PassageCollection \
-generator DefaultLuceneDocumentGenerator -threads 70 \
-input collections/msmarco_v2_passage_augmented_d2q-t5_${NUM_QUERIES} \
-index indexes/msmarco-v2-passage-augmented-d2q-t5-${NUM_QUERIES} \
-optimize
```
Note that this index does not store any "extras" (positions, document vectors, raw documents, etc.) because we don't need any of these for BM25 retrieval.
Finally, we can perform runs on the dev queries (both sets):
```bash
target/appassembler/bin/SearchCollection -index indexes/msmarco-v2-passage-augmented-d2q-t5-${NUM_QUERIES} \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev.txt \
-output runs/run.msmarco-v2-passage-augmented-d2q-t5-${NUM_QUERIES}.dev.txt -bm25 -hits 1000
target/appassembler/bin/SearchCollection -index indexes/msmarco-v2-passage-augmented-d2q-t5-${NUM_QUERIES} \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev2.txt \
-output runs/run.msmarco-v2-passage-augmented-d2q-t5-${NUM_QUERIES}.dev2.txt -bm25 -hits 1000
```
Evaluation:
```bash
$ tools/eval/trec_eval.9.0.4/trec_eval -c -M 100 -m map -m recip_rank src/main/resources/topics-and-qrels/qrels.msmarco-v2-passage.dev.txt runs/run.msmarco-v2-passage-augmented-d2q-t5-${NUM_QUERIES}.dev.txt
map all 0.1160
recip_rank all 0.1172
$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100,1000 src/main/resources/topics-and-qrels/qrels.msmarco-v2-passage.dev.txt runs/run.msmarco-v2-passage-augmented-d2q-t5-${NUM_QUERIES}.dev.txt
recall_100 all 0.5039
recall_1000 all 0.7647
$ tools/eval/trec_eval.9.0.4/trec_eval -c -M 100 -m map -m recip_rank src/main/resources/topics-and-qrels/qrels.msmarco-v2-passage.dev2.txt runs/run.msmarco-v2-passage-augmented-d2q-t5-${NUM_QUERIES}.dev2.txt
map all 0.1158
recip_rank all 0.1170
$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100,1000 src/main/resources/topics-and-qrels/qrels.msmarco-v2-passage.dev2.txt runs/run.msmarco-v2-passage-augmented-d2q-t5-${NUM_QUERIES}.dev2.txt
recall_100 all 0.5158
recall_1000 all 0.7659
```
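If `trec_eval` is not at hand, the MRR@100 figure can also be sanity-checked with a few lines of Python. This is only a rough sketch over the dev run above (it averages over queries present in the run); `trec_eval` remains the reference.
```python
from collections import defaultdict

# Relevant passages per query (qrels format: qid 0 pid rel).
qrels = defaultdict(set)
with open("src/main/resources/topics-and-qrels/qrels.msmarco-v2-passage.dev.txt") as f:
    for line in f:
        qid, _, pid, rel = line.split()
        if int(rel) > 0:
            qrels[qid].add(pid)

# Run format: qid Q0 pid rank score tag.
run = defaultdict(list)
with open("runs/run.msmarco-v2-passage-augmented-d2q-t5-20.dev.txt") as f:
    for line in f:
        qid, _, pid, rank, _, _ = line.split()
        run[qid].append((int(rank), pid))

rr_total = 0.0
for qid, hits in run.items():
    for rank, pid in sorted(hits)[:100]:
        if pid in qrels[qid]:
            rr_total += 1.0 / rank
            break

print(f"MRR@100: {rr_total / len(run):.4f}")
```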
## MS MARCO V2 (Segmented) Document Expansion
This guide provides instructions on how to reproduce our docTTTTTquery results for the MS MARCO V2 document ranking task with the Anserini IR toolkit, using predicted queries.
We open-source the [predicted queries](https://huggingface.co/datasets/castorini/msmarco_v2_doc_segmented_doc2query-t5_expansions/viewer/default/train) using the [🤗 Datasets library](https://github.com/huggingface/datasets).
Note that this is a very large dataset, so we ran the docTTTTTquery inference step across multiple TPUs.
Also, we use a different docTTTTTquery model trained on the MS MARCO v2 passage ranking dataset.
We use the [segmented document corpus](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-v2.md#document-collection-segmented) which was shown to have better effectiveness.
First, we download the expanded queries dataset and expand this corpus using `NUM_QUERIES` queries per document segment:
```bash
export NUM_QUERIES=10
python3 msmarco-v2/augment_corpus.py --hgf_d2q_dataset castorini/msmarco_v2_doc_segmented_doc2query-t5_expansions \
--original_psg_path collections/msmarco_v2_doc_segmented \
--output_psg_path collections/msmarco_v2_doc_segmented_d2q-t5_${NUM_QUERIES} \
--num_workers 60 \
--num_queries ${NUM_QUERIES} \
--task segment \
--cache_dir /path/to/cache/dir
```
The dataset is downloaded and processed in the cache directory, after which the corpus itself is expanded, so make sure you have enough storage space (around 300 GB for this entire task).
If the dataset is not already cached, the script takes about 18 hours; if it is already cached, expect it to finish in about 10 hours.
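As in the passage case, each worker writes a `dt5q_aug_psg<i>.json` shard, but with `--task segment` the queries are appended to the `segment` field and records are keyed by `docid`. A quick spot-check, again only a sketch with field names taken from `augment_corpus.py`:
```python
import json

# Peek at one augmented document-segment shard.
shard = "collections/msmarco_v2_doc_segmented_d2q-t5_10/dt5q_aug_psg0.json"
with open(shard) as f:
    seg = json.loads(f.readline())

print(seg["docid"])           # segment id: the parent document id plus a "#<n>" suffix
print(seg["segment"][-300:])  # tail of the segment text, i.e., the appended queries
```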
Upon completion, index the expanded document segments with Anserini:
```bash
sh target/appassembler/bin/IndexCollection -collection MsMarcoV2DocCollection \
-generator DefaultLuceneDocumentGenerator -threads 60 \
-input collections/msmarco_v2_doc_segmented_d2q-t5_${NUM_QUERIES} \
-index indexes/msmarco-v2-doc-segmented-d2q-t5-${NUM_QUERIES} \
-optimize
```
Note that this index does not store any "extras" (positions, document vectors, raw documents, etc.) because we don't need any of these for BM25 retrieval.
Finally, we can perform runs on the dev queries (both sets):
```bash
target/appassembler/bin/SearchCollection -index indexes/msmarco-v2-doc-segmented-d2q-t5-${NUM_QUERIES} \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev.txt \
-output runs/run.msmarco-v2-doc-segmented-d2q-t5-${NUM_QUERIES}.dev.txt \
-bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000
target/appassembler/bin/SearchCollection -index indexes/msmarco-v2-doc-segmented-d2q-t5-${NUM_QUERIES} \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev2.txt \
-output runs/run.msmarco-v2-doc-segmented-d2q-t5-${NUM_QUERIES}.dev2.txt \
-bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000
```
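The `-selectMaxPassage` options implement MaxP-style document retrieval: each segment id is the parent document id plus a `#`-delimited suffix, and only the best-scoring segment per document is kept (up to 1000 per query). As a tiny illustration of how a segment id in the run maps back to its document (the specific id below is made up):
```python
# Hypothetical segment id of the form "<docid>#<segment index>".
segment_id = "msmarco_doc_00_0#3"
doc_id = segment_id.split("#", 1)[0]
print(doc_id)  # msmarco_doc_00_0
```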
Evaluation:
```bash
$ tools/eval/trec_eval.9.0.4/trec_eval -c -M 100 -m map -m recip_rank src/main/resources/topics-and-qrels/qrels.msmarco-v2-doc.dev.txt runs/run.msmarco-v2-doc-segmented-d2q-t5-${NUM_QUERIES}.dev.txt
map all 0.2203
recip_rank all 0.2226
$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100,1000 src/main/resources/topics-and-qrels/qrels.msmarco-v2-doc.dev.txt runs/run.msmarco-v2-doc-segmented-d2q-t5-${NUM_QUERIES}.dev.txt
recall_100 all 0.7297
recall_1000 all 0.8982
$ tools/eval/trec_eval.9.0.4/trec_eval -c -M 100 -m map -m recip_rank src/main/resources/topics-and-qrels/qrels.msmarco-v2-doc.dev2.txt runs/run.msmarco-v2-doc-segmented-d2q-t5-${NUM_QUERIES}.dev2.txt
map all 0.2205
recip_rank all 0.2234
$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100,1000 src/main/resources/topics-and-qrels/qrels.msmarco-v2-doc.dev2.txt runs/run.msmarco-v2-doc-segmented-d2q-t5-${NUM_QUERIES}.dev2.txt
recall_100 all 0.7316
recall_1000 all 0.8952
```
102 changes: 102 additions & 0 deletions msmarco-v2/augment_corpus.py
@@ -0,0 +1,102 @@
#
# Pyserini: Reproducible IR research with sparse and dense representations
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import argparse
from datasets import load_dataset
import os
import gzip
import json
from tqdm import tqdm
import glob
from multiprocessing import Pool, Manager


def load_docs(docid_to_doc, f_ins, text_key="passage"):
    """Load the original (gzipped JSONL) corpus shards into the shared docid -> doc dict."""
    print("Loading docs")
    counter = 0
    # The passage corpus keys records by "pid"; the segmented document corpus uses "docid".
    if text_key == "passage":
        id_key = "pid"
    else:
        id_key = "docid"
    for f_in in f_ins:
        with gzip.open(f_in, 'rt', encoding='utf8') as in_fh:
            for json_string in tqdm(in_fh):
                input_dict = json.loads(json_string)
                docid_to_doc[input_dict[id_key]] = input_dict
                counter += 1
    print(f'{counter} docs loaded. Done!')


def augment_corpus_with_doc2query_t5(dataset, f_out, start, end, num_queries, text_key="passage"):
    """Append the first `num_queries` predicted queries to each record in dataset[start:end]."""
    print('Output docs...')
    output = open(f_out, 'w')
    counter = 0
    for i in tqdm(range(start, end)):
        docid = dataset[i]["id"]
        # docid_to_doc is the module-level Manager dict populated by load_docs; it is visible
        # here because the worker processes are forked from the main process.
        output_dict = docid_to_doc[docid]
        concatenated_queries = " ".join(dataset[i]["predicted_queries"][:num_queries])
        output_dict[text_key] = f"{output_dict[text_key]} {concatenated_queries}"
        counter += 1
        output.write(json.dumps(output_dict) + '\n')
    output.close()
    print(f'{counter} lines output. Done!')


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Concatenate MS MARCO v2 corpus with predicted queries')
    parser.add_argument('--hgf_d2q_dataset', required=True,
                        choices=['castorini/msmarco_v2_passage_doc2query-t5_expansions',
                                 'castorini/msmarco_v2_doc_segmented_doc2query-t5_expansions'])
    parser.add_argument('--original_psg_path', required=True, help='Input corpus path')
    parser.add_argument('--output_psg_path', required=True, help='Output file for d2q-t5 augmented corpus.')
    parser.add_argument('--num_workers', default=1, type=int, help='Number of workers used.')
    parser.add_argument('--num_queries', default=20, type=int, help='Number of expansions used.')
    parser.add_argument('--task', default="passage", type=str, choices=['passage', 'segment'],
                        help='One of passage or segment.')
    parser.add_argument('--cache_dir', default=".", type=str, help='Path to cache the hgf dataset')
    args = parser.parse_args()

    psg_files = glob.glob(os.path.join(args.original_psg_path, '*.gz'))
    os.makedirs(args.output_psg_path, exist_ok=True)

    manager = Manager()
    docid_to_doc = manager.dict()

    dataset = load_dataset(args.hgf_d2q_dataset, split="train", cache_dir=args.cache_dir)

    # Load the original corpus shards in parallel (ceil division so no trailing files are dropped).
    pool = Pool(args.num_workers)
    num_files_per_worker = -(-len(psg_files) // args.num_workers)
    for i in range(args.num_workers):
        pool.apply_async(load_docs,
                         (docid_to_doc,
                          psg_files[i * num_files_per_worker: (i + 1) * num_files_per_worker],
                          args.task))
    pool.close()
    pool.join()
    # Every record in the original corpus should have a corresponding set of predicted queries.
    assert len(docid_to_doc) == len(dataset)
    print('Total passages loaded: {}'.format(len(docid_to_doc)))

    # Shard the expansion work by example index; each worker writes its own output file.
    pool = Pool(args.num_workers)
    num_examples_per_worker = (len(docid_to_doc) // args.num_workers) + 1
    for i in range(args.num_workers):
        f_out = os.path.join(args.output_psg_path, 'dt5q_aug_psg' + str(i) + '.json')
        pool.apply_async(augment_corpus_with_doc2query_t5,
                         (dataset, f_out,
                          i * num_examples_per_worker,
                          min(len(docid_to_doc), (i + 1) * num_examples_per_worker),
                          args.num_queries, args.task))
    pool.close()
    pool.join()

    print('Done!')
3 changes: 2 additions & 1 deletion requirements.txt
@@ -1,4 +1,5 @@
 sentencepiece==0.1.95
 spacy==2.1.6
 tensorflow==2.4.1
-transformers==4.4.2
+transformers>=4.6.0
+datasets
