Pyserini has a number of important dependencies.
For sparse retrieval, Pyserini depends on Anserini, which is built on Lucene. PyJNIus is used to interact with the JVM.
For dense retrieval (since it involves neural networks), we need the 🤗 Transformers library, PyTorch, and Faiss (specifically faiss-cpu).
A pip installation automatically pulls in the first of these (Transformers) to satisfy the package requirements. The other two (PyTorch and Faiss) may require platform-specific custom configuration, so they are not explicitly listed in the package requirements.
We leave the installation of these two packages to you.
In general, our development team tries to keep dependent packages at the same versions and upgrade in lockstep.
In preparation for release of Pyserini v0.12.0, our "reference" configuration is a Linux machine running Ubuntu 18.04 with faiss-cpu==1.6.5, transformers==4.0.0, and torch==1.7.1.
This is the configuration used to run our many regression tests.
In most cases results have also been reproduced on macOS with the same dependency versions.
With other versions of the dependent packages, as they say, your mileage may vary...
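As a rough way to see how far an environment is from the reference configuration, the following sketch compares installed package versions against the three pins quoted above. The helper names (installed_version, compare_to_reference) are hypothetical and not part of Pyserini, and the fallback to the importlib_metadata backport assumes that package is available on Python versions older than 3.8.

```python
# Reference versions quoted above for the Pyserini v0.12.0 regression setup.
REFERENCE = {
    "faiss-cpu": "1.6.5",
    "transformers": "4.0.0",
    "torch": "1.7.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        from importlib import metadata  # standard library on Python 3.8+
    except ImportError:
        import importlib_metadata as metadata  # assumed backport on older Pythons
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def compare_to_reference(lookup=installed_version, reference=REFERENCE):
    """Return {package: (expected, found)} for every package that differs."""
    mismatches = {}
    for name, expected in reference.items():
        found = lookup(name)
        # Ignore local build tags such as "+cu111" when comparing.
        base = found.split("+")[0] if found else None
        if base != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in compare_to_reference().items():
        print(f"{name}: reference {expected}, installed {found or 'not installed'}")
```

A mismatch reported here is not necessarily a problem; it only signals that results may differ from those of the regression tests.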
Below is a step-by-step Pyserini installation guide. We assume you have Anaconda installed.
Create new environment:
$ conda create -n pyserini python=3.6
$ conda activate pyserini
Install JDK 11 via conda:
$ conda install -c conda-forge openjdk=11
Install Pyserini and the remaining Python dependencies:
$ pip install pyserini
$ pip install transformers==4.6.0 # https://github.com/castorini/pyserini/issues/734
$ pip install onnxruntime
$ conda install -c conda-forge pyjnius
Install PyTorch based on your environment (see this guide for additional details):
$ pip3 install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio===0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
Install Faiss based on your environment:
$ conda install faiss-cpu -c pytorch
First follow the steps here but run
$ pip install -e . # use this
instead of
$ pip install pyserini # do NOT use this
Install Maven via conda:
$ conda install -c conda-forge maven
Clone Anserini repo and build:
$ cd ..
$ git clone https://github.com/castorini/anserini.git
$ cd anserini
$ mvn clean package appassembler:assemble -Dmaven.test.skip=true
Copy the fatjar to pyserini/pyserini/resources/jars.
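The copy step can also be scripted. The sketch below assumes the Maven build writes the fatjar to target/ inside the Anserini checkout under a name matching anserini-*-fatjar.jar; that pattern and the helper name copy_fatjar are assumptions for illustration, so adjust them if your build output differs.

```python
import glob
import os
import shutil


def copy_fatjar(anserini_dir, pyserini_dir):
    """Copy the Anserini fatjar into pyserini/resources/jars; return the new path."""
    # Assumption: the build produces target/anserini-<version>-fatjar.jar.
    pattern = os.path.join(anserini_dir, "target", "anserini-*-fatjar.jar")
    candidates = sorted(glob.glob(pattern))
    if not candidates:
        raise FileNotFoundError("no fatjar found; did the Maven build succeed?")
    # pyserini_dir is the root of the Pyserini checkout, so the jars live in
    # <pyserini_dir>/pyserini/resources/jars.
    jar_dir = os.path.join(pyserini_dir, "pyserini", "resources", "jars")
    os.makedirs(jar_dir, exist_ok=True)
    # If several jars match, the lexicographically last one is copied.
    return shutil.copy(candidates[-1], jar_dir)
```

With the two repositories cloned side by side as in the steps above, a call such as copy_fatjar("../anserini", ".") from the Pyserini checkout would perform the copy.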
- The above guide handles JVM installation via conda. If you are using your own Java environment and get an error about a Java version mismatch, the likely cause is your JAVA_HOME environment variable. In bash, use echo $JAVA_HOME to find out what the variable is currently set to, and use export JAVA_HOME=/path/to/java/home to change it to the correct path. On a Linux system, the correct path might look something like /usr/lib/jvm/java-11. Unfortunately, we are unable to offer more concrete advice since the actual path depends on your OS, which JDK you're using, and a host of other factors.
- Depending on the system locale, Windows may use GBK (or another non-UTF-8 character encoding) by default, which makes resource file reading in Anserini inconsistent with that in Linux and macOS. To fix this, manually set the environment variable with set _JAVA_OPTIONS=-Dfile.encoding=UTF-8 to use UTF-8 encoding.
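When debugging a JAVA_HOME problem, a quick sanity check from Python can help. The sketch below only verifies that JAVA_HOME is set and that a java launcher exists under its bin directory; it does not check the JDK version, and the function name check_java_home is hypothetical.

```python
import os


def check_java_home(env=None):
    """Return (ok, message) describing whether JAVA_HOME looks usable."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return False, "JAVA_HOME is not set"
    # A JDK normally ships its launcher as bin/java (bin/java.exe on Windows).
    for exe in ("java", "java.exe"):
        if os.path.isfile(os.path.join(java_home, "bin", exe)):
            return True, "JAVA_HOME points to " + java_home
    return False, "no bin/java under " + java_home


if __name__ == "__main__":
    ok, message = check_java_home()
    print(("OK: " if ok else "Problem: ") + message)
```

If the check passes but the version mismatch error persists, JAVA_HOME probably points to a valid JDK of the wrong version.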
At the University of Waterloo, we have two (CPU) development servers, tuna and ocra.
Note that on these two servers, the root disk (where your home directory is mounted) doesn't have much space, so you need to set the Pyserini cache path to scratch space.
- For tuna, create the directory /tuna1/scratch/{username}
- For ocra, create the directory /store/scratch/{username}
Then set the PYSERINI_CACHE environment variable to point to the directory you created above.
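The variable can also be set from inside a Python script rather than in the shell. The following is a minimal sketch; the helper name use_scratch_cache is hypothetical, and calling it before importing pyserini is a cautious assumption about when the variable is read, not documented behavior.

```python
import os


def use_scratch_cache(scratch_dir):
    """Create scratch_dir if needed and point PYSERINI_CACHE at it."""
    os.makedirs(scratch_dir, exist_ok=True)
    # Only affects the current process and any child processes it starts.
    os.environ["PYSERINI_CACHE"] = scratch_dir
    return scratch_dir
```

On tuna, for example, this would be called with /tuna1/scratch/ followed by your username. Exporting the variable in your shell profile achieves the same effect for every session.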
If you are using Compute Canada, follow the above process on a compute node using Anaconda, and in addition:
- clear PYTHONPATH before the steps above, i.e., export PYTHONPATH=
- set PYSERINI_CACHE to somewhere under /scratch before running Pyserini
- reinstall sentencepiece with conda install -c conda-forge sentencepiece if an error occurs