Solrini: Anserini Integration with Solr

This page documents code for reproducing results from the following paper:

Ryan Clancy, Toke Eskildsen, Nick Ruest, and Jimmy Lin. Solr Integration in the Anserini Information Retrieval Toolkit. Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), July 2019, Paris, France.

We provide instructions for setting up a single-node SolrCloud instance running locally and indexing into it from Anserini. Instructions for setting up SolrCloud clusters can be found by searching the web.

Setting up a Single-Node SolrCloud Instance

Download Solr version 8.11.2 (binary release) from here and extract the archive:

mkdir solrini && tar -zxvf solr*.tgz -C solrini --strip-components=1
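If you'd rather fetch the release from the command line before extracting, the Apache archive typically hosts it at the path below; the exact URL is an assumption based on the archive's standard layout, so verify it before use:

wget https://archive.apache.org/dist/lucene/solr/8.11.2/solr-8.11.2.tgz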

Solr 8.11.2 is the last release in the 8.x series, and unfortunately, these instructions do not work for Solr 9.x.

Start Solr:

solrini/bin/solr start -c -m 16G

When you're done, remember to stop Solr:

solrini/bin/solr stop

Adjust memory usage (i.e., -m 16G) as appropriate.
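To confirm the instance actually came up, Solr's status subcommand (or a quick request to the admin info endpoint) is usually enough; both checks below are illustrative sketches:

solrini/bin/solr status
curl "http://localhost:8983/solr/admin/info/system?wt=json"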

Run the Solr bootstrap script to copy the Anserini JAR into Solr's classpath and upload the configsets to Solr's internal ZooKeeper:

pushd src/main/resources/solr && ./solr.sh ../../../../solrini localhost:9983 && popd
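To verify that the configsets were uploaded, Solr's ZooKeeper CLI can list what is stored in the embedded ZooKeeper; this is a sketch, and the /configs path reflects Solr's default layout:

solrini/bin/solr zk ls /configs -z localhost:9983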

Solr should now be available at http://localhost:8983/ for browsing.

The Solr index schema can also be modified using the Schema API. This is useful for specifying field types and other properties, including multiValued fields. Schemas for specific collections can be found in the src/main/resources/solr/schemas/ folder. To set the schema, we can make a request to the Schema API:

curl -X POST -H 'Content-type:application/json' \
  --data-binary @src/main/resources/solr/schemas/SCHEMA_NAME.json \
  http://localhost:8983/solr/COLLECTION_NAME/schema
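As an illustration of what such a request carries (this is a generic Schema API payload, not the contents of the bundled schema files; the field name authors is made up for illustration), adding a multiValued field looks roughly like:

curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-field": {"name": "authors", "type": "string", "stored": true, "multiValued": true}}' \
  http://localhost:8983/solr/COLLECTION_NAME/schema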

For the Robust04 example below, this isn't necessary.

Indexing into SolrCloud from Anserini

We can use Anserini as a common "front-end" for indexing into SolrCloud, thus supporting the same range of test collections that's already included in Anserini (when directly building local Lucene indexes). Indexing into Solr is similar to indexing to disk with Lucene, with a few added parameters. Most notably, we replace the -index parameter (which specifies the Lucene index path on disk) with Solr parameters. Alternatively, Solr can also be configured to read pre-built Lucene indexes, since Solr uses Lucene indexes under the hood (more details below).

We'll index Robust04 as an example. First, create the robust04 collection in Solr:

solrini/bin/solr create -n anserini -c robust04
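To double-check that the collection was created with the expected configset, the Collections API can list what's there; this is a quick optional sanity check:

curl "http://localhost:8983/solr/admin/collections?action=LIST"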

Run the Solr indexing command for robust04:

sh target/appassembler/bin/IndexCollection \
  -collection TrecCollection \
  -input /path/to/disk45 \
  -generator DefaultLuceneDocumentGenerator \
  -solr \
  -solr.index robust04 \
  -solr.zkUrl localhost:9983 \
  -threads 8 \
  -storePositions -storeDocvectors -storeRaw

Make sure /path/to/disk45 is updated with the appropriate path for the Robust04 collection.
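Once indexing finishes, a quick sanity check on the number of indexed documents can be done with a rows=0 query against the collection (illustrative only; the exact count depends on the collection):

curl "http://localhost:8983/solr/robust04/select?q=*:*&rows=0"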

Once indexing has completed, you should be able to query robust04 from the Solr query interface. You can also run the following command to reproduce Anserini BM25 retrieval:

sh target/appassembler/bin/SearchSolr \
  -topics tools/topics-and-qrels/topics.robust04.txt \
  -topicreader Trec \
  -solr.index robust04 \
  -solr.zkUrl localhost:9983 \
  -output runs/run.solr.robust04.bm25.topics.robust04.txt

Evaluation can be performed using trec_eval:

$ tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 \
    tools/topics-and-qrels/qrels.robust04.txt \
    runs/run.solr.robust04.bm25.topics.robust04.txt

map                   	all	0.2531
P_30                  	all	0.3102

Solrini has also been verified to work with other collections; see the run_solr_regression.py regression script for more details.

Solr with a Pre-built Lucene Index

It is possible for Solr to read pre-built Lucene indexes. To achieve this, some housekeeping is required to "install" the pre-built indexes. The following uses Robust04 as an example. Let's assume the pre-built index is stored at indexes/lucene-index.disk45/.

First, a Solr collection must be created to house the index. Here, we create a collection robust04 with configset anserini.

solrini/bin/solr create -n anserini -c robust04

Along with the collection, Solr will create a core instance, whose name can be found in the Solr UI under collection overview. It'll look something like <collection_name>_shard<id>_replica_<id> (e.g., robust04_shard1_replica_n1). Solr stores configurations and data for the core instances under Solr home, which for us is solrini/server/solr/ by default.
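To find the exact core name on disk, listing the Solr home directory is usually sufficient (a sketch; your replica name may differ):

ls solrini/server/solr/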

Second, make proper Solr schema adjustments if necessary. Here, robust04 is a TREC collection whose schema is already handled by managed-schema in the Solr configset. However, for a collection such as cord19, remember to make proper adjustments to the Solr schema (also see above):

curl -X POST -H 'Content-type:application/json' \
  --data-binary @src/main/resources/solr/schemas/SCHEMA_NAME.json \
  http://localhost:8983/solr/COLLECTION_NAME/schema

Finally, we can copy the pre-built index to the location where Solr expects it. Start by removing the data that's already there:

rm solrini/server/solr/robust04_shard1_replica_n1/data/index/*

Then, simply copy the pre-built Lucene indexes into that location:

cp indexes/lucene-index.disk45/* solrini/server/solr/robust04_shard1_replica_n1/data/index

Restart Solr to make sure changes take effect:

solrini/bin/solr stop
solrini/bin/solr start -c -m 16G

You can confirm that everything works by performing a retrieval run and checking the results (see above).
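For a quicker sanity check than a full retrieval run, the CoreAdmin API reports index details such as the document count; the core name below is illustrative and should match the one Solr created for your collection:

curl "http://localhost:8983/solr/admin/cores?action=STATUS&core=robust04_shard1_replica_n1"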

Solr Integration Test

We have an end-to-end integration testing script run_solr_regression.py. See example usage for Robust04 below:

# Check if Solr server is on
python src/main/python/run_solr_regression.py --ping

# Check if robust04 exists
python src/main/python/run_solr_regression.py --check-index-exists robust04

# Create robust04 if it does not exist
python src/main/python/run_solr_regression.py --create-index robust04

# Delete robust04 if it exists
python src/main/python/run_solr_regression.py --delete-index robust04

# Insert documents from /path/to/disk45 into robust04
python src/main/python/run_solr_regression.py --insert-docs robust04 --input /path/to/disk45

# Search and evaluate on robust04
python src/main/python/run_solr_regression.py --evaluate robust04

To run end-to-end, issue the following command:

python src/main/python/run_solr_regression.py --regression robust04 --input /path/to/disk45

The regression script has been verified to work for robust04, core18, msmarco-passage, and msmarco-doc.

Reproduction Log*