Passage reranking for custom collection #1209

Closed
ghost opened this issue May 19, 2020 · 7 comments

Comments

ghost commented May 19, 2020

Hi,

It might be good to complete the custom collection doc with light instructions on passage reranking, after seeing this issue.

pygaggle has been a well-bundled and transparent resource for CORD-19, and could serve other text ranking tasks in the future. Below is a bottom-up snapshot of what it provides. It is free-style, however, and therefore less formalized.



Rerank with monoBERT

Optionally, you could rerank the above retrieval results. We provide a minimal working example, rerank_custom_collection.py, for this.

The example follows pygaggle and duoBERT (up to monoBERT in the figure below, figure source here), minus the evaluation against ground truth.

It calls a reranker to score (query, passage) pairs. The reranker is a transformer model pre-trained on a general passage retrieval task such as MS MARCO.
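
To make the scoring step concrete, here is a minimal sketch of pointwise reranking: score every (query, passage) pair independently, then sort by score. The names `rerank` and `overlap_score` are hypothetical, and `overlap_score` is only a toy stand-in for the transformer model (it is not how monoBERT scores); it is there so the sketch runs without downloading anything.

```python
from typing import Callable, List, Tuple


def rerank(query: str,
           passages: List[Tuple[str, str]],
           score_fn: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Score each (query, passage) pair and return (docid, score), best first."""
    scored = [(docid, score_fn(query, text)) for docid, text in passages]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in scorer: fraction of query terms that appear in the passage.

    A real reranker would replace this with the model's relevance score.
    """
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0


candidates = [("d1", "weather report for today"),
              ("d2", "results of a covid vaccine trial")]
ranking = rerank("covid vaccine", candidates, overlap_score)
```

Swapping `overlap_score` for a function that calls the pre-trained model leaves the rest of the loop unchanged.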

[Figure: monoBERT]

Prepare Input Files from the QuickStart Step

  • The initial retrieval result file, named above as [OUTPUT_PATH].
  • The query file, named above as [QUERY_FILE_PATH].
  • A mapping file from passage id to raw passage text, named as [PASSAGE_ID2TEXT_PATH].
    • This mapping file does not exist above. Write a simple script to convert the initial collection to this mapping file.
    • Each line has the format of docid[\t]passage_raw_text[\n]. No header. We don't differentiate a document and a passage in this use case, so docid refers to the passage id.
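
A minimal sketch of such a conversion script follows. It assumes the initial collection is a JSON-lines file where each line is an object with "id" and "contents" fields; that layout is an assumption, so adjust the field names to match your collection. The function name `write_passage_id2text` is hypothetical.

```python
import json


def write_passage_id2text(collection_path: str, output_path: str) -> int:
    """Write docid<TAB>passage_raw_text lines from a JSON-lines collection.

    Assumes each input line is a JSON object with "id" and "contents" fields.
    Returns the number of passages written.
    """
    count = 0
    with open(collection_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Collapse tabs and newlines so each passage stays on one line.
            text = " ".join(str(record["contents"]).split())
            dst.write(f"{record['id']}\t{text}\n")
            count += 1
    return count
```

Collapsing whitespace matters here because a tab or newline inside the passage text would break the one-passage-per-line format.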

Install Requirements

Download the requirements.txt from pygaggle, then do pip install -r requirements.txt.

Download the Pre-trained Reranker

Download BERT_Base_trained_on_MSMARCO.zip (roughly 1.1 GB) from nyu-dl/dl4marco-bert.

Unzip and save to a path called [BERT_BASE_PSG_RETRIEVAL_MODEL_PATH].

Note: Different versions of transformers tend to read pre-trained model file names slightly differently. You might need to rename a few files if you hit an error such as file not found. For example, rename [BERT_BASE_PSG_RETRIEVAL_MODEL_PATH] to lowercase, bert_config.json to config.json, and model.ckpt-100000.index to model.ckpt.index.
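
The two file renames in the note can be scripted. This is a minimal sketch covering only the file names given above; the helper name `rename_checkpoint_files` is hypothetical, and whether your transformers version needs these renames at all is something to check against the error message you get.

```python
from pathlib import Path

# Renames taken from the note above: old file name -> new file name.
RENAMES = {
    "bert_config.json": "config.json",
    "model.ckpt-100000.index": "model.ckpt.index",
}


def rename_checkpoint_files(model_dir: str) -> list:
    """Rename the files listed in RENAMES inside model_dir, if present.

    Files that are missing, or whose target name already exists, are left
    alone. Returns the new names of the files that were actually renamed.
    """
    renamed = []
    for old_name, new_name in RENAMES.items():
        old_path = Path(model_dir) / old_name
        new_path = Path(model_dir) / new_name
        if old_path.exists() and not new_path.exists():
            old_path.rename(new_path)
            renamed.append(new_name)
    return renamed
```

Any other files in the unzipped directory are untouched, so running it twice is harmless.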

Run Reranker with rerank_custom_collection.py

python rerank_custom_collection.py --search_output_file [OUTPUT_PATH] \
            --qid2query_file [QUERY_FILE_PATH] \
            --passage_text_file [PASSAGE_ID2TEXT_PATH] \
            --model_name_or_path [BERT_BASE_PSG_RETRIEVAL_MODEL_PATH] \
            --device [your_device_setting] --output_path [RERANKER_OUTPUT_PATH]

[RERANKER_OUTPUT_PATH] is the rerank output file.

Each line has the format of qid[\t]query_text[\t]docid[\t]passage_text[\t]score[\n].
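
For screening, the output file can be loaded and grouped by query. The sketch below assumes only the five-column tab-separated format described above and that a higher score means more relevant; the function name `top_k_per_query` is hypothetical.

```python
from collections import defaultdict


def top_k_per_query(rerank_output_path: str, k: int = 3) -> dict:
    """Group reranker output by qid and keep the k highest-scoring passages.

    Expects lines of qid<TAB>query_text<TAB>docid<TAB>passage_text<TAB>score.
    Returns {qid: [(docid, score, passage_text), ...]} sorted best first.
    """
    by_query = defaultdict(list)
    with open(rerank_output_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            qid, _query_text, docid, passage_text, score = line.split("\t")
            by_query[qid].append((docid, float(score), passage_text))
    return {
        qid: sorted(rows, key=lambda row: row[1], reverse=True)[:k]
        for qid, rows in by_query.items()
    }
```

Calling `top_k_per_query("[RERANKER_OUTPUT_PATH]", k=5)` with the real path gives a small dictionary that is easy to print and inspect per query.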

Screen results again, and iterate on this workflow!

lintool commented May 20, 2020

hey @egzhbdt cool, thanks for this!

After chatting with the team, we think this doc might be best in pygaggle itself? Maybe you could start a docs/ directory for us and drop this in there? Please send PR. We could have mutual pointers between docs in anserini and pygaggle?

lintool commented May 20, 2020

ref: castorini/pygaggle#21

lintool commented Jun 18, 2020

With the recent addition of bindings for the indexer in Python, I think this is resolved.

lintool closed this as completed Jun 18, 2020
@Fatima-200159617

Hi,
I am trying to find rerank_custom_collection.py. It seems the page is not available. Please advise.

lintool commented Jul 11, 2020

@ghost ?

@Fatima-200159617

I would appreciate it if you could give an update on my request above.

@Fatima-200159617

Hi @Fatima-200159617, apologies for the confusion and thanks for this. It was rerank_custom_collection.py.

Great thanks.

crystina-z pushed a commit to crystina-z/anserini that referenced this issue Oct 28, 2022