Skip to content

arXiv/arxiv-search

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

arxiv-search

Development quickstart

Running Elasticsearch in Docker

You can start up ES on its own. Be sure to map port 9200 to the host machine, so that arXiv search can find it.

docker build -t "arxiv/elasticsearch" -f ./Dockerfile-elasticsearch .
docker run -it \
    -e "http.host=0.0.0.0" \
    -e "transport.host=127.0.0.1" \
    -p 9200:9200 -p 9300:9300 arxiv/elasticsearch

echo $ES_BASE_URL; # http://127.0.0.1:9200
curl "${ES_BASE_URL}/_cat/health?v";

Create & populate the index

A couple of helper scripts are included to create and populate the search index. Note that you will need to have access to the /docmeta endpoint, which is only accessible from the CUL network.

pipenv install --dev
FLASK_APP=app.py FLASK_DEBUG=1 ELASTICSEARCH_SERVICE_HOST=127.0.0.1 pipenv run python create_index.py
FLASK_APP=app.py FLASK_DEBUG=1 ELASTICSEARCH_SERVICE_HOST=127.0.0.1 pipenv run python bulk_index.py

bulk_index.py without parameters populate the index with the list of papers defined in tests/data/sample.json. It take several minutes to run. Individual paper IDs may be specified with the --paper_id parameter.

Flask dev server

You can spin up the search app directly.

FLASK_APP=app.py FLASK_DEBUG=1 ELASTICSEARCH_SERVICE_HOST=127.0.0.1 pipenv run flask run

This will monitor any of the Python bits for changes and restart the server. Unfortunately static files and templates are not monitored, so you'll have to manually restart to see those changes take effect.

If all goes well... http://127.0.0.1:5000/ should render the basic search page.

You can run the new metadata API in dev mode by changing FLASK_APP to point to wsgi-api.py, i.e.:

FLASK_APP=api.py FLASK_DEBUG=1 ELASTICSEARCH_SERVICE_HOST=127.0.0.1 pipenv run flask run

To run the classic API in dev mode, use wsgi-classic-api.py:

FLASK_APP=classic-api.py FLASK_DEBUG=1 ELASTICSEARCH_SERVICE_HOST=127.0.0.1 pipenv run flask run

Running Elasticsearch + Kibana with docker-compose

The easiest way to spin up ES and Kibana is using the included docker-compose.yml file. This will start ES and Kibana on a custom network. The ES service ports 9200 and 9300, and Kibana service port 5601, will be mapped to localhost.

docker-compose up

Kibana will be available at http://127.0.0.1:5601/. The containers started by docker-compose can be stopped with docker-compose down from the same directory.

Make sure that you have a recent version of docker-compose; this is confirmed to work with version 1.18.

Note that connection configuration variables for the search service are set in search/config.py, where they are read from the environment. The arXiv search service expects the ES service to be available at http://localhost:9200 by default. Hence, you should be able to start ES using docker-compose as above and make no configuration changes to the arXiv search service.

Running the indexing agent.

The indexing agent is responsible for updating the search index as new papers are published. By default, docker-compose will also start the search index and a service called Localstack that provides a local Kinesis stream for testing/development purposes.

To disable the agent and localstack, just comment out those services in docker-compose.yml.

The agent takes a little longer than the other services to start. Early in the startup, you'll see something like:

agent            | application 12/Apr/2018:15:43:13 +0000 - search.agent.base - None - [arxiv:null] - INFO: "New consumer for MetadataIsAvailable (0)"
agent            | application 12/Apr/2018:15:43:13 +0000 - search.agent.base - None - [arxiv:null] - INFO: "Getting a new connection to Kinesis at https://localstack:4568 in region us-east-1, with SSL verification=False"
agent            | application 12/Apr/2018:15:43:13 +0000 - search.agent.base - None - [arxiv:null] - INFO: "Waiting for MetadataIsAvailable to be available"
agent            | application 12/Apr/2018:15:43:13 +0000 - search.agent.base - None - [arxiv:null] - ERROR: "Waiting for stream MetadataIsAvailable"

A little while later, when localstack and the indexing agent are running, you should see something like:

agent            | application 12/Apr/2018:15:44:14 +0000 - search.agent.base - None - [arxiv:null] - ERROR: "Failed to get stream while waiting"
agent            | application 12/Apr/2018:15:44:14 +0000 - search.agent.base - None - [arxiv:null] - INFO: "Could not connect to stream; attempting to create"
agent            | application 12/Apr/2018:15:44:14 +0000 - search.agent.base - None - [arxiv:null] - INFO: "Created; waiting for MetadataIsAvailable again"
agent            | application 12/Apr/2018:15:44:14 +0000 - search.agent.base - None - [arxiv:null] - ERROR: "Waiting for stream MetadataIsAvailable"
localstack       | Ready.
agent            | application 12/Apr/2018:15:44:24 +0000 - search.agent.base - None - [arxiv:null] - INFO: "Ready to start"
agent            | application 12/Apr/2018:15:44:24 +0000 - search.agent.base - None - [arxiv:null] - INFO: "Starting processing from position 49583482132750299344823207796409748205413425533752967170 on stream MetadataIsAvailable and shard 0"

Note that Kinesis will be mounted locally on port 5586. It will be using SSL, but with an invalid certificate. You can connect to this local Kinesis using:

import boto3

client = boto3.client(
    'kinesis',
    region_name='us-east-1',
    endpoint_url="https://localhost:5568",
    aws_access_key_id='foo',
    aws_secret_access_key='bar',
    verify=False
)

To verify that the agent is working correctly, try adding some records to the stream.

import json

to_index = [
    "1712.04442",
    "1511.07473",
    "1604.04228",
    "1403.6219",
    "1404.3450",
    "1703.09067",
    "1408.6682",
    "1607.05107",
    "1509.08727",
    "1710.01597",
    "1708.07156",
    "1401.1012",
]

for document_id in to_index:
    data = bytes(json.dumps({'document_id': document_id}), encoding='utf-8')
    client.put_record(
        StreamName='MetadataIsAvailable',
        Data=data,
        PartitionKey='0'
    )

You should see these records being processed in the agent log output almost immediately. For example:

agent            | application 12/Apr/2018:15:49:18 +0000 - search.agent.consumer - None - [arxiv:null] - INFO: "Processing record 49583482484923667520018808447528512983060659815298629634"
agent            | application 12/Apr/2018:15:49:19 +0000 - search.agent.consumer - None - [arxiv:null] - INFO: "Processing record 49583482484923667520018808447529721908880274444473335810"
agent            | application 12/Apr/2018:15:49:20 +0000 - search.agent.consumer - None - [arxiv:null] - INFO: "Processing record 49583482484923667520018808447530930834699889073648041986"
agent            | application 12/Apr/2018:15:49:20 +0000 - search.agent.consumer - None - [arxiv:null] - INFO: "Processing record 49583482484923667520018808447532139760519503702822748162"
agent            | application 12/Apr/2018:15:49:21 +0000 - search.agent.consumer - None - [arxiv:null] - INFO: "Processing record 49583482484923667520018808447533348686339118400716931074"
agent            | application 12/Apr/2018:15:49:22 +0000 - search.agent.consumer - None - [arxiv:null] - INFO: "Processing record 49583482484923667520018808447534557612158733029891637250"
agent            | application 12/Apr/2018:15:49:23 +0000 - search.agent.consumer - None - [arxiv:null] - INFO: "Processing record 49583482484923667520018808447535766537978347659066343426"
agent            | application 12/Apr/2018:15:49:24 +0000 - search.agent.consumer - None - [arxiv:null] - INFO: "Processing record 49583482484923667520018808447536975463797962288241049602"
agent            | application 12/Apr/2018:15:49:24 +0000 - search.agent.consumer - None - [arxiv:null] - INFO: "Processing record 49583482484923667520018808447538184389617576917415755778"
agent            | application 12/Apr/2018:15:49:25 +0000 - search.agent.consumer - None - [arxiv:null] - INFO: "Processing record 49583482484923667520018808447539393315437191546590461954"
agent            | application 12/Apr/2018:15:49:25 +0000 - search.agent.consumer - None - [arxiv:null] - INFO: "Processing record 49583482484923667520018808447540602241256806175765168130"
agent            | application 12/Apr/2018:15:49:25 +0000 - search.agent.consumer - None - [arxiv:null] - INFO: "Processing record 49583482484923667520018808447541811167076420804939874306"

Other Operations

Audit

To check for missing records, put the IDs in a file like list_of_papers.txt, use audit.py:

ELASTICSEARCH_SERVICE_HOST=127.0.0.1 ELASTICSEARCH_INDEX=arxiv pipenv run python audit.py -l list_of_papers.txt -o missing.txt

Reindexing

ElasticSearch can perform reindexing by copying documents from one index to another index with a different mapping. reindex.py will initiate the reindexing process, and poll for completion until all of the documents are processed. If the destination index does not already exist, it will be created using the current configured mapping.

FLASK_APP=app.py ELASTICSEARCH_SERVICE_HOST=127.0.0.1 pipenv run python reindex.py OLD_INDEX NEW_INDEX

Deploying static assets to S3

Assets in search/static can be deployed to S3 using the included upload_static_assets.py script that leverages the Flask-S3 plugin https://flask-s3.readthedocs.io/en/latest/`_. Note that this requires AWS credentials that have appropriate permissions for the specified bucket.

To enable the S3-based URLs for the static assets in the templates, simply set FLASKS3_ACTIVE=1 when starting the Flask dev server.

Testing & quality

Install testing tools with...

pipenv install --dev

Test suite

Run the main test suite with...

pipenv run nose2 --with-coverage

To include integration tests, the environment variable WITH_INTEGRATION. E.g.

WITH_INTEGRATION=1 pipenv run nose2 --with-coverage

You can also run to avoid all intentional error messages or kinesis print statements created during the tests

pytest --disable-warnings search

Static checking

Goal: zero errors/warnings.

Use # type: ignore to disable mypy messages that do not reveal actual programming errors, and that are impractical to fix. If ignoring without verifying, insert a # TODO: recheck.

If there is an active mypy GitHub issue (i.e. it's a bug/limitation in mypy) relevant to missed check, link that for later follow-up.

pipenv run mypy -p search | grep -v "test.*" | grep -v "defined here"

Note that we filter out messages about test modules, and messages about a known limitation of mypy related to dataclasses support.

Documentation style

Goal: zero errors/warnings.

pipenv run pydocstyle --convention=numpy --add-ignore=D401 search

Linting

Goal: 9/10 or better.

pipenv run pylint search

Documentation

The latest version of the documentation is available at https://arxiv.github.io/arxiv-search.

The source files for the arXiv search service documentation is located in docs/.

To build the service documentation locally:

cd docs
pip install -r requirements.txt
make [format]

where [format] can be html, latexpdf. See the ``Sphinx documentation http://www.sphinx-doc.org/en/master/`_.

Pre commit hooks

To run pre commit hooks install the dev dependencies:

pipenv install --dev

After that you'll need to install the pre commit hooks:

pipenv run pre-commit install

Git will run all the pre-commit hooks on all changed files before you are allowed to commit. You will be allowed to commit only if all checks pass.

You can also run the pre commit hooks manually with:

pipenv run pre-commit run