Passage reranking for custom collection #1209
hey @egzhbdt cool, thanks for this! After chatting with the team, we think this doc might be best in pygaggle itself? Maybe you could start a …
With the recent addition of bindings for the indexer in Python, I think this is resolved.
Hi @ghost? I would appreciate it if you could give an update on my request above.
Great, thanks.
Hi,
After seeing this issue, it might be good to complete the custom-collection doc with light instructions on passage reranking.
pygaggle has been a well-bundled and transparent resource for CORD-19, and for other text-ranking tasks going forward. Below is a bottom-up snapshot of what it provides. It is, however, free-style and less formalized.
Rerank with monoBERT
Optionally, you can rerank the above retrieval results. We provide a minimum working example, `rerank_custom_collection.py`, for this. The example follows pygaggle and duoBERT (up to monoBERT in the figure below; figure source here), minus the step that evaluates against ground truth. It calls a reranker to score (query, passage) pairs. The reranker is a transformer model pre-trained on a general passage-retrieval task such as MS MARCO.
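To illustrate the interface described above, the sketch below runs the reranking loop over a query's candidate passages. `overlap_score` and `rerank` are our illustrative names, not part of pygaggle or `rerank_custom_collection.py`, and the token-overlap scorer is only a toy stand-in for the monoBERT (query, passage) scorer:

```python
# Sketch of the reranking loop. `overlap_score` is a toy stand-in for
# the monoBERT scorer; in practice the score comes from the pre-trained
# transformer model applied to each (query, passage) pair.
def overlap_score(query, passage):
    """Fraction of query terms that also appear in the passage."""
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / max(len(q_terms), 1)

def rerank(query, candidates):
    """Re-score candidate passages {docid: text}, highest score first."""
    scored = [(docid, overlap_score(query, text)) for docid, text in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

candidates = {
    "p1": "symptoms of the common cold",
    "p2": "incubation period of the virus is five days",
}
print(rerank("virus incubation period", candidates))
# → [('p2', 1.0), ('p1', 0.0)]
```

Swapping `overlap_score` for a real model score leaves the loop unchanged, which is why the example script only needs the (query, passage) pairs as input.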
Prepare Input Files from the QuickStart Step

From the QuickStart step, prepare:

- `[OUTPUT_PATH]`
- `[QUERY_FILE_PATH]`
- `[PASSAGE_ID2TEXT_PATH]`: each line has the format `docid[\t]passage_raw_text[\n]`, with no header. We don't differentiate between a document and a passage in this use case, so `docid`
refers to the passage id.

Install Requirements
Download the `requirements.txt` from pygaggle, then run `pip install -r requirements.txt`.
Download the Pre-trained Reranker
Download `BERT_Base_trained_on_MSMARCO.zip` (roughly 1.1 GB) from nyu-dl/dl4marco-bert. Unzip and save it to a path called `[BERT_BASE_PSG_RETRIEVAL_MODEL_PATH]`.

Note: different transformers versions tend to read pre-trained model file names slightly differently, so you might need to tweak file names a bit when you hit errors such as `file not found`. For example, lowercase `[BERT_BASE_PSG_RETRIEVAL_MODEL_PATH]`, rename `bert_config.json` to `config.json`, and rename `model.ckpt-100000.index` to `model.ckpt.index`.
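The renames in the note above can be scripted. The helper below is a small sketch (`normalize_model_dir` and `RENAMES` are our names, not part of pygaggle); it assumes the checkpoint files sit directly under the model directory and covers only the two renames mentioned, so extend the mapping to match whatever errors you actually see:

```python
import os

# The two renames suggested in the note: transformers expects
# config.json and model.ckpt.index under the model directory.
RENAMES = {
    "bert_config.json": "config.json",
    "model.ckpt-100000.index": "model.ckpt.index",
}

def normalize_model_dir(model_dir):
    """Rename checkpoint files in-place so the library can find them."""
    for old_name, new_name in RENAMES.items():
        old_path = os.path.join(model_dir, old_name)
        if os.path.exists(old_path):
            os.rename(old_path, os.path.join(model_dir, new_name))

# Usage (path is a placeholder): normalize_model_dir("[BERT_BASE_PSG_RETRIEVAL_MODEL_PATH]")
```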
Run Reranker with `rerank_custom_collection.py`
`[RERANKER_OUTPUT_PATH]` is the rerank output file. Each line has the format `qid[\t]query_text[\t]docid[\t]passage_text[\t]score[\n]`.

Screen the results again, and iterate on this workflow!
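To screen the reranked results, the output file can be read back and grouped per query. A minimal sketch, assuming the tab-separated format above (`top_passages` is our helper name, not part of `rerank_custom_collection.py`):

```python
import csv
from collections import defaultdict

def top_passages(rerank_output_path, k=3):
    """Group rerank-output lines by query and keep the k highest-scoring passages."""
    by_query = defaultdict(list)
    with open(rerank_output_path, newline="") as f:
        # Each line: qid \t query_text \t docid \t passage_text \t score
        for qid, query_text, docid, passage_text, score in csv.reader(f, delimiter="\t"):
            by_query[(qid, query_text)].append((docid, passage_text, float(score)))
    return {
        key: sorted(rows, key=lambda r: r[2], reverse=True)[:k]
        for key, rows in by_query.items()
    }

# Usage (path is a placeholder): top_passages("[RERANKER_OUTPUT_PATH]", k=5)
```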