
OpenNIR

An end-to-end neural ad-hoc ranking pipeline.

Quick start

OpenNIR requires Python 3.6 (not tested with other versions).

Install dependencies

pip install -r requirements.txt

Train and validate a model (here, ConvKNRM on ANTIQUE):

scripts/pipeline.sh config/conv_knrm config/antique

(Performance on the test set can be obtained by adding pipeline.test=True)
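For example, to train and validate ConvKNRM on ANTIQUE and also report test-set performance in the same run:

scripts/pipeline.sh config/conv_knrm config/antique pipeline.test=True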

Grid search for BM25 over ANTIQUE, for comparison with neural model performance:

scripts/pipeline.sh config/grid_search config/antique

(Performance on the test set can be obtained by adding pipeline.test=True)

Models, datasets, and vocabularies will be saved in ~/data/onir/. This can be overridden by setting data_dir=~/some/other/place/ as a command line argument, in a configuration file, or in the ONIR_ARGS environment variable.
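For instance, either of the following would redirect artifacts to /mnt/onir-data/ (a hypothetical directory; the inline ONIR_ARGS form is one common shell idiom for setting the variable):

scripts/pipeline.sh config/conv_knrm config/antique data_dir=/mnt/onir-data/   # as a command-line argument

ONIR_ARGS=data_dir=/mnt/onir-data/ scripts/pipeline.sh config/conv_knrm config/antique   # via the environment variable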

Features

Rankers

  • DRMM ranker=drmm paper
  • Duet (local model) ranker=duetl paper
  • MatchPyramid ranker=matchpyramid paper
  • KNRM ranker=knrm paper
  • PACRR ranker=pacrr paper
  • ConvKNRM ranker=conv_knrm paper
  • Vanilla BERT config/vanilla_bert paper
  • CEDR models config/cedr/[model] paper
  • MatchZoo models source
    • MatchZoo's KNRM ranker=mz_knrm
    • MatchZoo's ConvKNRM ranker=mz_conv_knrm
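Since configuration keys can also be passed on the command line (see data_dir above), a sketch like the following should swap in a different ranker; the exact pairing of config file and ranker= override here is an assumption rather than a documented recipe:

scripts/pipeline.sh config/antique ranker=knrm   # assumed: ranker key overrides any config-file default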

Datasets

  • ANTIQUE config/antique (used in the Quick start examples above)

Evaluation Metrics

  • map (from trec_eval)
  • ndcg (from trec_eval)
  • ndcg@X (from trec_eval, gdeval)
  • p@X (from trec_eval)
  • err@X (from gdeval)
  • mrr (from trec_eval)
  • rprec (from trec_eval)
  • judged@X (implemented in python)
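Metrics ending in @X take a rank cutoff, e.g. ndcg@20 or p@10. A minimal sketch of requesting one at validation time, assuming a hypothetical valid_pred.measures key (check the config files for the real key name):

scripts/pipeline.sh config/conv_knrm config/antique valid_pred.measures=ndcg@20   # hypothetical key name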

Vocabularies

  • Binary term matching vocab=binary (i.e., changes the interaction matrix from cosine similarity to binary indicators)
  • Pretrained word vectors vocab=wordvec
    • vocab.source=fasttext
      • vocab.variant=wiki-news-300d-1M, vocab.variant=crawl-300d-2M
      • (information about FastText variants can be found here)
    • vocab.source=glove
      • vocab.variant=cc-42b-300d, vocab.variant=cc-840b-300d
      • (information about GloVe variants can be found here)
    • vocab.source=convknrm
      • vocab.variant=knrm-bing, vocab.variant=knrm-sogou, vocab.variant=convknrm-bing, vocab.variant=convknrm-sogou
      • (information about ConvKNRM word embedding variants can be found here)
    • vocab.source=bionlp
      • vocab.variant=pubmed-pmc
      • (information about BioNLP variants can be found here)
  • Pretrained word vectors w/ single UNK vector for unknown terms vocab=wordvec_unk
    • (with above word embedding sources)
  • Pretrained word vectors w/ hash-based random selection for unknown terms vocab=wordvec_hash (default)
    • (with above word embedding sources)
  • BERT contextualized embeddings vocab=bert
    • Core models (from HuggingFace): vocab.bert_base=bert-base-uncased (default), vocab.bert_base=bert-large-uncased, vocab.bert_base=bert-base-cased, vocab.bert_base=bert-large-cased, vocab.bert_base=bert-base-multilingual-uncased, vocab.bert_base=bert-base-multilingual-cased, vocab.bert_base=bert-base-chinese, vocab.bert_base=bert-base-german-cased, vocab.bert_base=bert-large-uncased-whole-word-masking, vocab.bert_base=bert-large-cased-whole-word-masking, vocab.bert_base=bert-large-uncased-whole-word-masking-finetuned-squad, vocab.bert_base=bert-large-cased-whole-word-masking-finetuned-squad, vocab.bert_base=bert-base-cased-finetuned-mrpc
    • SciBERT: vocab.bert_base=scibert-scivocab-uncased, vocab.bert_base=scibert-scivocab-cased, vocab.bert_base=scibert-basevocab-uncased, vocab.bert_base=scibert-basevocab-cased
    • BioBERT vocab.bert_base=biobert-pubmed-pmc, vocab.bert_base=biobert-pubmed, vocab.bert_base=biobert-pmc
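Putting the vocabulary keys together, a ConvKNRM run over GloVe embeddings might look like the following; the key names come from the list above, but this particular combination is illustrative rather than a documented recipe:

scripts/pipeline.sh config/conv_knrm config/antique vocab=wordvec_hash vocab.source=glove vocab.variant=cc-42b-300d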

Citing OpenNIR

If you use OpenNIR, please cite the following WSDM demonstration paper:

@InProceedings{macavaney:wsdm2020-onir,
  author = {MacAvaney, Sean},
  title = {{OpenNIR}: A Complete Neural Ad-Hoc Ranking Pipeline},
  booktitle = {{WSDM} 2020},
  year = {2020}
}

Acknowledgements

I gratefully acknowledge support for this work from the ARCS Endowment Fellowship. I thank Andrew Yates, Arman Cohan, Luca Soldaini, Nazli Goharian, and Ophir Frieder for valuable feedback on the manuscript and/or code contributions to OpenNIR.
