This is a clone of Indri 5.12 with minor customizations.
Unless noted, all parameters are thrown on the command line just like other IndriRunQuery parameters.
NB. Input queries are assumed to be natural language, not Indri query language.
parameter | type | default | description |
---|---|---|---|
rerankSize | int | 0 | relevance model reranks the top rerankSize results from an initial query likelihood retrieval (0=retrieve). |
Using condensed list relevance models can substantially improve speed without degrading effectiveness of relevance models. This means you can run massive query expansion very quickly.
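For example, a hypothetical invocation that expands from the top 10 feedback documents and reranks a length 1000 query likelihood retrieval (fbDocs, fbTerms, and fbOrigWeight are standard Indri feedback parameters; the path is a placeholder),

```
IndriRunQuery -index=/path/to/index -query="hello world" -fbDocs=10 -fbTerms=50 -fbOrigWeight=0.5 -rerankSize=1000
```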
parameter | type | default | description |
---|---|---|---|
externalIndex | path | NONE | RM is built from an initial query likelihood retrieval against externalIndex. |
Using external expansion with a large external index can substantially improve effectiveness of query expansion. This can be combined with condensed list relevance models (from the target corpus) if you are concerned about speed.
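For example, a sketch of an external expansion run, assuming a target index and a large external index (paths are placeholders),

```
IndriRunQuery -index=/path/to/target/index -externalIndex=/path/to/external/index -query="hello world" -fbDocs=50 -fbTerms=100 -fbOrigWeight=0.5
```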
To construct and run a dependence model query, just use -dm=<dm parameter:dm parameter value>[,<dm parameter:dm parameter value>]+. Defaults are used for any DM parameters that are not set.
parameter | type | default | description |
---|---|---|---|
order | int | 1 | dependence model order |
combineWeight | float | 0.85 | weight for combine subquery |
owWeight | float | 0.10 | weight for ordered window subquery |
uwWeight | float | 0.05 | weight for unordered window subquery |
uwSize | int | 8 | unordered window size |
rerankSize | int | 0 | number of query likelihood results to rerank (0=retrieve) |
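For example, a hypothetical sequential dependence model run with a wider unordered window (values illustrative),

```
IndriRunQuery -index=/path/to/index -query="hello world" -dm=order:1,uwSize:12
```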
Order k=1 dependence models are the classic sequential dependence model. Order k computes the dependencies between all query terms within a k-word window. Order k=-1 is the classic full dependence model.
order | indri query |
---|---|
0 | #combine( colorless green ideas sleep furiously ) |
1 | #weight( 0.85 #combine( colorless green ideas sleep furiously ) 0.1 #combine( #1( colorless green ) #1( green idea ) #1( idea sleep ) #1( sleep furiously ) ) 0.05 #combine( #uw8( colorless green ) #uw8( green idea ) #uw8( idea sleep ) #uw8( sleep furiously ) ) ) |
2 | #weight( 0.85 #combine( colorless green ideas sleep furiously ) 0.1 #combine( #1( colorless green ) #1( colorless idea ) #1( green idea ) #1( green sleep ) #1( idea sleep ) #1( idea furiously ) #1( sleep furiously ) ) 0.05 #combine( #uw8( colorless green ) #uw8( colorless idea ) #uw8( green idea ) #uw8( green sleep ) #uw8( idea sleep ) #uw8( idea furiously ) #uw8( sleep furiously ) ) ) |
-1 | #weight( 0.85 #combine( colorless green ideas sleep furiously ) 0.1 #combine( #1( colorless green ) #1( colorless idea ) #1( colorless sleep ) #1( colorless furiously ) #1( green idea ) #1( green sleep ) #1( green furiously ) #1( idea sleep ) #1( idea furiously ) #1( sleep furiously ) ) 0.05 #combine( #uw8( colorless green ) #uw8( colorless idea ) #uw8( colorless sleep ) #uw8( colorless furiously ) #uw8( green idea ) #uw8( green sleep ) #uw8( green furiously ) #uw8( idea sleep ) #uw8( idea furiously ) #uw8( sleep furiously ) ) ) |
This implementation also includes condensed list dependence models which, like condensed list relevance models, first conduct a query likelihood retrieval and then rerank the results using the dependence model. This can substantially improve speed, especially for longer queries.
For example, to run a classic SDM reranking of a length 100 query likelihood initial retrieval,

```
IndriRunQuery -index=/path/to/index -query="hello world" -dm=order:1,rerankSize:100
```
Dependence models are built after internally stopping and stemming the query terms.
parameter | type | default | description |
---|---|---|---|
passageLength | int | 0 | length, in terms, of passages to retrieve (0=retrieve whole documents). |
passageOverlap | int | 0 | overlap, in terms, between consecutive passages. |
fbRankDocuments | bool | false | used in conjunction with passageLength > 0 and PRF for document ranking based on passage RMs. |
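For example, a hypothetical passage retrieval run with half-overlapping 100-term passages (values illustrative),

```
IndriRunQuery -index=/path/to/index -query="hello world" -passageLength=100 -passageOverlap=50
```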
parameter | type | default | description |
---|---|---|---|
field | string | NONE | restrict retrieval to this field. |
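For example, assuming the index was built with a title field,

```
IndriRunQuery -index=/path/to/index -query="hello world" -field=title
```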
To use advanced RM parameters, just use -rm=<rm parameter:rm parameter value>[,<rm parameter:rm parameter value>]+. Defaults are used for any RM parameters that are not set.
parameter | type | default | description |
---|---|---|---|
passageLength | int | 0 | passage length for passage-based RM (0=doc). |
passageOverlap | int | 0 | passage overlap for passage-based RM (0=doc). |
field | string | NONE | field restriction for field-based RM. |
fbDocs | int | 0 | num feedback documents |
fbTerms | int | 0 | num feedback terms |
fbOrigWeight | float | 0.0 | interpolation weight of the original query (RM3-style). |
targetPassages | bool | false | rank passages instead of documents in the final retrieval. |
condensed | int | 0 | rerank the top condensed results from an initial query likelihood retrieval instead of re-retrieving (0=retrieve). |
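For example, a hypothetical condensed list RM3 run with passage-based feedback (values illustrative),

```
IndriRunQuery -index=/path/to/index -query="hello world" -rm=fbDocs:10,fbTerms:50,fbOrigWeight:0.5,passageLength:100,passageOverlap:50,condensed:1000
```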
If you want the initial retrieval to be a DM (see above), you can pass the following into the rm parameter list,
parameter | type | default | description |
---|---|---|---|
dm.combineWeight | float | 0.85 | weight for combine subquery |
dm.owWeight | float | 0.10 | weight for ordered window subquery |
dm.uwWeight | float | 0.05 | weight for unordered window subquery |
dm.uwSize | int | 8 | unordered window size |
dm.order | int | 1 | dependence model order |
dm.rerankSize | int | 0 | number of query likelihood results to rerank (0=retrieve) |
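For example, a hypothetical run that builds the relevance model from an SDM initial retrieval (values illustrative),

```
IndriRunQuery -index=/path/to/index -query="hello world" -rm=fbDocs:10,fbTerms:50,fbOrigWeight:0.5,dm.order:1
```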
Here's a simple cheatsheet for good baselines for your text retrieval experiments. I'm assuming that you are evaluating on a target corpus (e.g. Associated Press documents) and might either train your model with documents from the target corpus or from some large external corpus (e.g. Gigaword). You also might incorporate term proximity or other linear term relationships (e.g. n- or skip-grams).
proximity \ training corpus | none | target | external |
---|---|---|---|
no | QL | RM3 | EE |
yes | DM | DM+RM3 | DM+EE |
Whichever baseline is appropriate for your experiment, you should train its parameters on the same training set you're using for your model. For RM3/EE, good ranges are,
parameter | min | max |
---|---|---|
fbDocs | 5 | 500 |
fbTerms | 5 | 500 |
fbOrigWeight | 0 | 1 |
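For example, a sketch of a grid sweep over these ranges using the rm syntax above (grid values and output naming are illustrative),

```
for d in 5 10 25 50 100 250 500; do
  for t in 5 10 25 50 100 250 500; do
    for w in 0.0 0.25 0.5 0.75 1.0; do
      # one run per (fbDocs, fbTerms, fbOrigWeight) setting
      IndriRunQuery -index=/path/to/index -query="hello world" \
        -rm=fbDocs:$d,fbTerms:$t,fbOrigWeight:$w > run.d$d.t$t.w$w
    done
  done
done
```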
Don's original DM weights seem to be robust across conditions but you may still want to play with the parameters above.
If you find DM/RM3/EE slow, instead of limiting the parameter ranges, you should opt for the condensed list versions of the methods. In general, you can rerank 1000 documents from a QL initial retrieval and preserve rank-equivalence with re-retrieval, especially at the top of the final ranking.
If you are not sure where your model fits in the cheatsheet above, a strong baseline is DM+EE.
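For example, one plausible DM+EE invocation, assuming externalIndex composes with the rm-embedded dependence model parameters described above (paths and values are placeholders),

```
IndriRunQuery -index=/path/to/target/index -externalIndex=/path/to/external/index -query="hello world" -rm=fbDocs:50,fbTerms:100,fbOrigWeight:0.5,dm.order:1
```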
Note to paper readers (including reviewers): You'll come across papers that use poorly implemented pseudo-relevance feedback baselines. Here are common mistakes,
- no parameter tuning (i.e. a fixed number of feedback documents, expansion terms, or interpolation weight).
- poor parameter tuning (e.g. only testing 5-25 terms and/or documents). I've seen effectiveness peak at hundreds of documents/terms for some collections.
- unweighted document feedback (i.e. treating all feedback documents as equally relevant).
- unweighted term expansion (i.e. treating the new terms as just additional query terms).
Use the standard Indri citation,
```
@inproceedings{strohman:indri,
  author = {Trevor Strohman and Donald Metzler and Howard Turtle and W. B. Croft},
  booktitle = {Proceedings of the International Conference on Intelligence Analysis},
  title = {Indri: A language model-based search engine for complex queries},
  year = {2004}
}
```
as well as the following for reproducibility,
```
@online{diaz:indri,
  author = {Fernando Diaz},
  title = {indri},
  year = {2018},
  url = {https://github.com/diazf/indri}
}
```
and references to any specific algorithms above you are using.
To build on OSX, be sure to set CXXFLAGS=-fno-tree-vectorize. Thanks to Luke Gallagher.
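For example, assuming the standard Indri configure/make build,

```
CXXFLAGS=-fno-tree-vectorize ./configure
make
```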