Passage reranking for custom collection #1209

ghost · 2020-05-19T14:44:26Z

Hi,

It might be good to complete the custom collection doc with light instructions on passage reranking, after seeing this issue.

pygaggle has been a well-bundled and transparent resource for CORD-19, and should be useful for other text ranking tasks in the future. Below is a bottom-up snapshot of what it provides. It is free-form, however, and therefore less formal.

Rerank with monoBERT

Optionally, you can rerank the above retrieval results. We provide a minimal working example, rerank_custom_collection.py, for this.

The example follows pygaggle and duoBERT (up to monoBERT in the figure below, figure source here), minus the part that evaluates against ground truth.

It calls a reranker to score (query, passage) pairs. The reranker is a transformer model pre-trained on a general passage retrieval task such as MS MARCO.
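
As a rough illustration of the reranking step (this is not the code in rerank_custom_collection.py): each retrieved passage is paired with the query, scored, and the candidate list is re-sorted by score. In the sketch below, overlap_score is a hypothetical stand-in for the monoBERT scorer so that the example runs without a model, and all function names are made up for illustration.

```python
from typing import Callable, List, Tuple


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a neural scorer: fraction of query terms found in the passage."""
    query_terms = set(query.lower().split())
    passage_terms = set(passage.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & passage_terms) / len(query_terms)


def rerank(
    query: str,
    candidates: List[Tuple[str, str]],
    score_fn: Callable[[str, str], float] = overlap_score,
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and return (docid, score) sorted best first."""
    scored = [(docid, score_fn(query, text)) for docid, text in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Candidates as (docid, passage_text), e.g. from the initial retrieval step.
candidates = [
    ("d1", "weather report for tomorrow"),
    ("d2", "incubation period of the virus"),
    ("d3", "the virus spreads through droplets"),
]
reranked = rerank("virus incubation period", candidates)
```

With a real reranker, only score_fn changes; the pairing and sorting stay the same.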

Prepare Input Files from the QuickStart Step

The initial retrieval result file, named above as [OUTPUT_PATH].
The query file, named above as [QUERY_FILE_PATH].
A mapping file from passage id to raw passage text, named [PASSAGE_ID2TEXT_PATH].
- This mapping file does not exist above. Write a simple script to convert the initial collection to this mapping file.
- Each line has the format of docid[\t]passage_raw_text[\n]. No header. We don't differentiate between a document and a passage in this use case, so docid refers to the passage id.
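
A minimal sketch of such a conversion script, assuming the collection is a JSON Lines file where each record has id and contents fields (an assumption; adjust the field names to your own collection). The function name write_id2text is hypothetical. Tabs and newlines inside a passage are replaced by spaces so that each passage stays on one line.

```python
import json


def write_id2text(collection_path: str, output_path: str) -> int:
    """Write docid<TAB>passage_text lines; return the number of passages written."""
    count = 0
    with open(collection_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Assumed field names: "id" and "contents".
            # Collapse whitespace so tabs/newlines cannot break the one-line layout.
            text = " ".join(str(record["contents"]).split())
            dst.write(f"{record['id']}\t{text}\n")
            count += 1
    return count
```

For example, write_id2text("collection.jsonl", "passage_id2text.tsv") would produce the file to pass as [PASSAGE_ID2TEXT_PATH]; both file names here are placeholders.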

Install Requirements

Download requirements.txt from pygaggle, then run pip install -r requirements.txt.

Download the Pre-trained Reranker

Download BERT_Base_trained_on_MSMARCO.zip (roughly 1.1 GB) from nyu-dl/dl4marco-bert.

Unzip and save to a path called [BERT_BASE_PSG_RETRIEVAL_MODEL_PATH].

Note: Different transformers versions tend to read pre-trained model file names slightly differently. If you get an error such as file not found, you might need to tweak the file names a bit. For example, rename [BERT_BASE_PSG_RETRIEVAL_MODEL_PATH] to lowercase, bert_config.json to config.json, and model.ckpt-100000.index to model.ckpt.index.
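
The file renames in the note can be scripted. The sketch below is a hypothetical helper (not part of pygaggle or transformers) that applies only the two file renames given as examples above. Whether your installed version needs these renames, or others, depends on the error message you get.

```python
from pathlib import Path
from typing import List

# Renames taken from the note above; extend this mapping as your error messages require.
RENAMES = {
    "bert_config.json": "config.json",
    "model.ckpt-100000.index": "model.ckpt.index",
}


def rename_model_files(model_dir: str) -> List[str]:
    """Rename known files inside model_dir; return the new names that were created."""
    renamed = []
    for old_name, new_name in RENAMES.items():
        old_path = Path(model_dir) / old_name
        new_path = Path(model_dir) / new_name
        # Only rename when the source exists and the target would not be overwritten.
        if old_path.exists() and not new_path.exists():
            old_path.rename(new_path)
            renamed.append(new_name)
    return renamed
```

Running it a second time is a no-op, since the old names no longer exist.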

Run Reranker with rerank_custom_collection.py

python rerank_custom_collection.py --search_output_file [OUTPUT_PATH] \
            --qid2query_file [QUERY_FILE_PATH] \
            --passage_text_file [PASSAGE_ID2TEXT_PATH] \
            --model_name_or_path [BERT_BASE_PSG_RETRIEVAL_MODEL_PATH] \
            --device [your_device_setting] --output_path [RERANKER_OUTPUT_PATH]

[RERANKER_OUTPUT_PATH] is the rerank output file.

Each line has the format of qid[\t]query_text[\t]docid[\t]passage_text[\t]score[\n].
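
To screen the reranked output, the file can be loaded and grouped by query. A minimal sketch, assuming the five tab-separated fields described above and no header line; the function name load_top_k and the choice to keep the k best passages per query are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def load_top_k(rerank_output_path: str, k: int = 3) -> Dict[str, List[Tuple[str, float, str]]]:
    """Return {qid: [(docid, score, passage_text), ...]} with the k best-scored passages per query."""
    by_query = defaultdict(list)
    with open(rerank_output_path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 5:
                continue  # skip malformed lines
            qid, _query_text, docid, passage_text, score = fields
            by_query[qid].append((docid, float(score), passage_text))
    # Sort each query's passages by score, highest first, and keep the top k.
    return {
        qid: sorted(rows, key=lambda row: row[1], reverse=True)[:k]
        for qid, rows in by_query.items()
    }
```

For example, load_top_k("rerank_output.tsv", k=5) would return the five highest-scored passages per query, where the file name stands in for [RERANKER_OUTPUT_PATH].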

Screen results again, and iterate on this workflow!

lintool · 2020-05-20T00:01:18Z

hey @egzhbdt cool, thanks for this!

After chatting with the team, we think this doc might be best in pygaggle itself? Maybe you could start a docs/ directory for us and drop this in there? Please send a PR. We could have mutual pointers between docs in anserini and pygaggle?

lintool · 2020-05-20T10:35:02Z

ref: castorini/pygaggle#21

lintool · 2020-06-18T00:32:42Z

With the recent addition of bindings for the indexer in Python, I think this is resolved.

Fatima-200159617 · 2020-07-11T07:03:14Z

Hi,
I am trying to find rerank_custom_collection.py. It seems the page is not available. Please advise.

lintool · 2020-07-11T10:37:22Z

@ghost ?

Fatima-200159617 · 2020-07-13T11:54:32Z

I would appreciate an update on my request above.

Fatima-200159617 · 2020-07-16T05:36:23Z

Hi @Fatima-200159617, apologies for the confusion, and thanks for this. It was rerank_custom_collection.py.

Great thanks.

lintool closed this as completed Jun 18, 2020

crystina-z pushed a commit to crystina-z/anserini that referenced this issue Oct 28, 2022

add readme for unicoil with elasticsearch (castorini#1209)

90e80c7
