Add documentation for the MS MARCO v2 passage and document corpus (#64)
* init augment collection

* update requirements

* start repo

* update passage v2

* update document v2

* quick cleanup

* rephrase long sentence
ronakice authored Nov 5, 2021
1 parent c6e2a23 commit 19e3b58
Showing 3 changed files with 242 additions and 1 deletion.
138 changes: 138 additions & 0 deletions README.md
@@ -541,3 +541,141 @@ for ITER in {00..32}; do
--gin_param="utils.tpu_mesh_shape.tpu_topology = 'v3-8'"
done
```
## MS MARCO V2 Passage Expansion
Here we provide instructions on how to reproduce our docTTTTTquery results for the MS MARCO V2 passage ranking task with the Anserini IR toolkit, using predicted queries.
We open-source the [predicted queries](https://huggingface.co/datasets/castorini/msmarco_v2_passage_doc2query-t5_expansions/viewer/default/train) using the [🤗 Datasets library](https://github.com/huggingface/datasets).
Note that this is a very large dataset, so we ran the docTTTTTquery inference step across multiple TPUs.
In fact, the corpus is significantly larger than MS MARCO v1, which is why we generate only 20 queries per passage.
Also, we use a different docTTTTTquery model, trained on the MS MARCO v2 passage ranking dataset.
We use the [metadata-augmented passage corpus](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-v2.md#passage-collection-augmented) which was shown to have better effectiveness.
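To get a quick feel for the expansions before running the full pipeline, you can stream a few records straight from the Hub. This is just a sketch: the `id` and `predicted_queries` field names follow the `augment_corpus.py` script added in this commit, and streaming assumes a recent enough version of 🤗 Datasets.
```python
from datasets import load_dataset

# Stream the expansions so we can peek at a record without downloading the whole dataset.
expansions = load_dataset(
    "castorini/msmarco_v2_passage_doc2query-t5_expansions",
    split="train",
    streaming=True,
)

record = next(iter(expansions))
print(record["id"])                     # passage id in the augmented corpus
print(record["predicted_queries"][:3])  # first few of the 20 predicted queries
```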
First, we download the expanded queries dataset and expand this corpus using `NUM_QUERIES` queries per passage:
```bash
export NUM_QUERIES=20
python3 msmarco-v2/augment_corpus.py --hgf_d2q_dataset castorini/msmarco_v2_passage_doc2query-t5_expansions \
--original_psg_path collections/msmarco_v2_passage_augmented \
--output_psg_path collections/msmarco_v2_passage_augmented_d2q-t5_${NUM_QUERIES} \
--num_workers 70 \
--num_queries ${NUM_QUERIES} \
--task passage \
--cache_dir /path/to/cache/dir
```
The dataset is downloaded and processed in the cache directory, after which the corpus itself is expanded, so make sure you have enough storage space (around 300 GB for this entire task).
If the dataset is not already cached, the script takes about 18 hours; if it is already cached, expect it to finish in about 10 hours.
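Before indexing, it can be worth spot-checking one of the shards the script writes. The snippet below is a sketch: the shard name and the `pid`/`passage` fields follow `augment_corpus.py` in this commit.
```python
import json

# Peek at the first record of one augmented shard written by augment_corpus.py.
shard = "collections/msmarco_v2_passage_augmented_d2q-t5_20/dt5q_aug_psg0.json"
with open(shard) as f:
    doc = json.loads(f.readline())

print(doc["pid"])
# The appended doc2query-T5 predictions sit at the end of the "passage" field.
print(doc["passage"][-300:])
```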
Upon completion, index the expanded passages with Anserini:
```bash
sh target/appassembler/bin/IndexCollection -collection MsMarcoV2PassageCollection \
-generator DefaultLuceneDocumentGenerator -threads 70 \
-input collections/msmarco_v2_passage_augmented_d2q-t5_${NUM_QUERIES} \
-index indexes/msmarco-v2-passage-augmented-d2q-t5-${NUM_QUERIES} \
-optimize
```
Note that this index does not store any "extras" (positions, document vectors, raw documents, etc.) because we don't need any of these for BM25 retrieval.
Finally, we can perform runs on the dev queries (both sets):
```bash
target/appassembler/bin/SearchCollection -index indexes/msmarco-v2-passage-augmented-d2q-t5-${NUM_QUERIES} \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev.txt \
-output runs/run.msmarco-v2-passage-augmented-d2q-t5-${NUM_QUERIES}.dev.txt -bm25 -hits 1000
target/appassembler/bin/SearchCollection -index indexes/msmarco-v2-passage-augmented-d2q-t5-${NUM_QUERIES} \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-passage.dev2.txt \
-output runs/run.msmarco-v2-passage-augmented-d2q-t5-${NUM_QUERIES}.dev2.txt -bm25 -hits 1000
```
Evaluation:
```bash
$ tools/eval/trec_eval.9.0.4/trec_eval -c -M 100 -m map -m recip_rank src/main/resources/topics-and-qrels/qrels.msmarco-v2-passage.dev.txt runs/run.msmarco-v2-passage-augmented-d2q-t5-${NUM_QUERIES}.dev.txt
map all 0.1160
recip_rank all 0.1172
$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100,1000 src/main/resources/topics-and-qrels/qrels.msmarco-v2-passage.dev.txt runs/run.msmarco-v2-passage-augmented-d2q-t5-${NUM_QUERIES}.dev.txt
recall_100 all 0.5039
recall_1000 all 0.7647
$ tools/eval/trec_eval.9.0.4/trec_eval -c -M 100 -m map -m recip_rank src/main/resources/topics-and-qrels/qrels.msmarco-v2-passage.dev2.txt runs/run.msmarco-v2-passage-augmented-d2q-t5-${NUM_QUERIES}.dev2.txt
map all 0.1158
recip_rank all 0.1170
$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100,1000 src/main/resources/topics-and-qrels/qrels.msmarco-v2-passage.dev2.txt runs/run.msmarco-v2-passage-augmented-d2q-t5-${NUM_QUERIES}.dev2.txt
recall_100 all 0.5158
recall_1000 all 0.7659
```
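If `trec_eval` is not at hand, the MRR@100 figure can also be sanity-checked with a few lines of Python. This is only a rough sketch over the dev run above (it averages over queries present in the run); `trec_eval` remains the reference.
```python
from collections import defaultdict

# Relevant passages per query (qrels format: qid 0 pid rel).
qrels = defaultdict(set)
with open("src/main/resources/topics-and-qrels/qrels.msmarco-v2-passage.dev.txt") as f:
    for line in f:
        qid, _, pid, rel = line.split()
        if int(rel) > 0:
            qrels[qid].add(pid)

# Run format: qid Q0 pid rank score tag.
run = defaultdict(list)
with open("runs/run.msmarco-v2-passage-augmented-d2q-t5-20.dev.txt") as f:
    for line in f:
        qid, _, pid, rank, _, _ = line.split()
        run[qid].append((int(rank), pid))

rr_total = 0.0
for qid, hits in run.items():
    for rank, pid in sorted(hits)[:100]:
        if pid in qrels[qid]:
            rr_total += 1.0 / rank
            break

print(f"MRR@100: {rr_total / len(run):.4f}")
```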
## MS MARCO V2 (Segmented) Document Expansion
This guide provides instructions on how to reproduce our docTTTTTquery results for the MS MARCO V2 document ranking task with the Anserini IR toolkit, using predicted queries.
We open-source the [predicted queries](https://huggingface.co/datasets/castorini/msmarco_v2_doc_segmented_doc2query-t5_expansions/viewer/default/train) using the [🤗 Datasets library](https://github.com/huggingface/datasets).
Note that this is a very large dataset, so we ran the docTTTTTquery inference step across multiple TPUs.
Also, we use a different docTTTTTquery model trained on the MS MARCO v2 passage ranking dataset.
We use the [segmented document corpus](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-v2.md#document-collection-segmented) which was shown to have better effectiveness.
First, we download the expanded queries dataset and expand this corpus using `NUM_QUERIES` queries per document segment:
```bash
export NUM_QUERIES=10
python3 msmarco-v2/augment_corpus.py --hgf_d2q_dataset castorini/msmarco_v2_doc_segmented_doc2query-t5_expansions \
--original_psg_path collections/msmarco_v2_doc_segmented \
--output_psg_path collections/msmarco_v2_doc_segmented_d2q-t5_${NUM_QUERIES} \
--num_workers 60 \
--num_queries ${NUM_QUERIES} \
--task segment \
--cache_dir /path/to/cache/dir
```
The dataset is downloaded and processed in the cache directory, after which the corpus itself is expanded, so make sure you have enough storage space (around 300 GB for this entire task).
If the dataset is not already cached, the script takes about 18 hours; if it is already cached, expect it to finish in about 10 hours.
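As in the passage case, each worker writes a `dt5q_aug_psg<i>.json` shard, but with `--task segment` the queries are appended to the `segment` field and records are keyed by `docid`. A quick spot-check, again only a sketch with field names taken from `augment_corpus.py`:
```python
import json

# Peek at one augmented document-segment shard.
shard = "collections/msmarco_v2_doc_segmented_d2q-t5_10/dt5q_aug_psg0.json"
with open(shard) as f:
    seg = json.loads(f.readline())

print(seg["docid"])           # segment id: the parent document id plus a "#<n>" suffix
print(seg["segment"][-300:])  # tail of the segment text, i.e., the appended queries
```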
Upon completion, index the expanded document segments with Anserini:
```bash
sh target/appassembler/bin/IndexCollection -collection MsMarcoV2DocCollection \
-generator DefaultLuceneDocumentGenerator -threads 60 \
-input collections/msmarco_v2_doc_segmented_d2q-t5_${NUM_QUERIES} \
-index indexes/msmarco-v2-doc-segmented-d2q-t5-${NUM_QUERIES} \
-optimize
```
Note that this index does not store any "extras" (positions, document vectors, raw documents, etc.) because we don't need any of these for BM25 retrieval.
Finally, we can perform runs on the dev queries (both sets):
```bash
target/appassembler/bin/SearchCollection -index indexes/msmarco-v2-doc-segmented-d2q-t5-${NUM_QUERIES} \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev.txt \
-output runs/run.msmarco-v2-doc-segmented-d2q-t5-${NUM_QUERIES}.dev.txt \
-bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000
target/appassembler/bin/SearchCollection -index indexes/msmarco-v2-doc-segmented-d2q-t5-${NUM_QUERIES} \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-v2-doc.dev2.txt \
-output runs/run.msmarco-v2-doc-segmented-d2q-t5-${NUM_QUERIES}.dev2.txt \
-bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000
```
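The `-selectMaxPassage` options implement MaxP-style document retrieval: each segment id is the parent document id plus a `#`-delimited suffix, and only the best-scoring segment per document is kept (up to 1000 per query). As a tiny illustration of how a segment id in the run maps back to its document (the specific id below is made up):
```python
# Hypothetical segment id of the form "<docid>#<segment index>".
segment_id = "msmarco_doc_00_0#3"
doc_id = segment_id.split("#", 1)[0]
print(doc_id)  # msmarco_doc_00_0
```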
Evaluation:
```bash
$ tools/eval/trec_eval.9.0.4/trec_eval -c -M 100 -m map -m recip_rank src/main/resources/topics-and-qrels/qrels.msmarco-v2-doc.dev.txt runs/run.msmarco-v2-doc-segmented-d2q-t5-${NUM_QUERIES}.dev.txt
map all 0.2203
recip_rank all 0.2226
$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100,1000 src/main/resources/topics-and-qrels/qrels.msmarco-v2-doc.dev.txt runs/run.msmarco-v2-doc-segmented-d2q-t5-${NUM_QUERIES}.dev.txt
recall_100 all 0.7297
recall_1000 all 0.8982
$ tools/eval/trec_eval.9.0.4/trec_eval -c -M 100 -m map -m recip_rank src/main/resources/topics-and-qrels/qrels.msmarco-v2-doc.dev2.txt runs/run.msmarco-v2-doc-segmented-d2q-t5-${NUM_QUERIES}.dev2.txt
map all 0.2205
recip_rank all 0.2234
$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100,1000 src/main/resources/topics-and-qrels/qrels.msmarco-v2-doc.dev2.txt runs/run.msmarco-v2-doc-segmented-d2q-t5-${NUM_QUERIES}.dev2.txt
recall_100 all 0.7316
recall_1000 all 0.8952
```
102 changes: 102 additions & 0 deletions msmarco-v2/augment_corpus.py
@@ -0,0 +1,102 @@
#
# Pyserini: Reproducible IR research with sparse and dense representations
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import argparse
from datasets import load_dataset
import os
import gzip
import json
from tqdm import tqdm
import glob
from multiprocessing import Pool, Manager


def load_docs(docid_to_doc, f_ins, text_key="passage"):
    """Load the original (gzipped JSONL) corpus shards into the shared docid -> doc dict."""
    print("Loading docs")
    counter = 0
    # The passage corpus keys records by "pid"; the segmented document corpus uses "docid".
    if text_key == "passage":
        id_key = "pid"
    else:
        id_key = "docid"
    for f_in in f_ins:
        with gzip.open(f_in, 'rt', encoding='utf8') as in_fh:
            for json_string in tqdm(in_fh):
                input_dict = json.loads(json_string)
                docid_to_doc[input_dict[id_key]] = input_dict
                counter += 1
    print(f'{counter} docs loaded. Done!')


def augment_corpus_with_doc2query_t5(dataset, f_out, start, end, num_queries, text_key="passage"):
    """Append the first `num_queries` predicted queries to each record in dataset[start:end]."""
    print('Output docs...')
    output = open(f_out, 'w')
    counter = 0
    for i in tqdm(range(start, end)):
        docid = dataset[i]["id"]
        # docid_to_doc is the module-level Manager dict populated by load_docs; it is visible
        # here because the worker processes are forked from the main process.
        output_dict = docid_to_doc[docid]
        concatenated_queries = " ".join(dataset[i]["predicted_queries"][:num_queries])
        output_dict[text_key] = f"{output_dict[text_key]} {concatenated_queries}"
        counter += 1
        output.write(json.dumps(output_dict) + '\n')
    output.close()
    print(f'{counter} lines output. Done!')


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Concatenate MS MARCO v2 corpus with predicted queries')
    parser.add_argument('--hgf_d2q_dataset', required=True,
                        choices=['castorini/msmarco_v2_passage_doc2query-t5_expansions',
                                 'castorini/msmarco_v2_doc_segmented_doc2query-t5_expansions'])
    parser.add_argument('--original_psg_path', required=True, help='Input corpus path')
    parser.add_argument('--output_psg_path', required=True, help='Output file for d2q-t5 augmented corpus.')
    parser.add_argument('--num_workers', default=1, type=int, help='Number of workers used.')
    parser.add_argument('--num_queries', default=20, type=int, help='Number of expansions used.')
    parser.add_argument('--task', default="passage", type=str, choices=['passage', 'segment'],
                        help='One of passage or segment.')
    parser.add_argument('--cache_dir', default=".", type=str, help='Path to cache the hgf dataset')
    args = parser.parse_args()

    psg_files = glob.glob(os.path.join(args.original_psg_path, '*.gz'))
    os.makedirs(args.output_psg_path, exist_ok=True)

    manager = Manager()
    docid_to_doc = manager.dict()

    dataset = load_dataset(args.hgf_d2q_dataset, split="train", cache_dir=args.cache_dir)

    # Load the original corpus shards in parallel (ceil division so no trailing files are dropped).
    pool = Pool(args.num_workers)
    num_files_per_worker = -(-len(psg_files) // args.num_workers)
    for i in range(args.num_workers):
        pool.apply_async(load_docs,
                         (docid_to_doc,
                          psg_files[i * num_files_per_worker: (i + 1) * num_files_per_worker],
                          args.task))
    pool.close()
    pool.join()
    # Every record in the original corpus should have a corresponding set of predicted queries.
    assert len(docid_to_doc) == len(dataset)
    print('Total passages loaded: {}'.format(len(docid_to_doc)))

    # Shard the expansion work by example index; each worker writes its own output file.
    pool = Pool(args.num_workers)
    num_examples_per_worker = (len(docid_to_doc) // args.num_workers) + 1
    for i in range(args.num_workers):
        f_out = os.path.join(args.output_psg_path, 'dt5q_aug_psg' + str(i) + '.json')
        pool.apply_async(augment_corpus_with_doc2query_t5,
                         (dataset, f_out,
                          i * num_examples_per_worker,
                          min(len(docid_to_doc), (i + 1) * num_examples_per_worker),
                          args.num_queries, args.task))
    pool.close()
    pool.join()

    print('Done!')
3 changes: 2 additions & 1 deletion requirements.txt
@@ -1,4 +1,5 @@
 sentencepiece==0.1.95
 spacy==2.1.6
 tensorflow==2.4.1
-transformers==4.4.2
+transformers>=4.6.0
+datasets
