
Backend documentation


The prototype was written in Python 2.7 (we didn't use Python 3 because the InterMine library does not support it).

The architecture consists of a web server written using the Flask framework (http://flask.pocoo.org/), ElasticSearch (https://www.elastic.co/) for full text search, and a set of scripts that load the data into ElasticSearch (https://github.com/alliance-genome/agr_prototype/tree/master/scripts/elastic_search).

Diagram

Let's take a look at each component and its corresponding files in the repository.

Webserver

The first component is the web server (represented in the diagram by the search.alliancegenome.org box). Its full logic is contained in just two files: server.py and /src/search.py.

The server.py file contains the declaration of each endpoint served by the web server. You can define your own endpoint by declaring a method and adding the decorator @app.route('YOUR_ENDPOINT_HERE'). For the prototype we have 8 endpoints:

frontend endpoints:

  1. GET /
  2. GET /about
  3. GET /help
  4. GET /search
  5. GET /assets/<path:path>

backend endpoints:

  1. GET /api/search
  2. GET /api/search_autocomplete
  3. GET /api/graph_search

Each backend endpoint returns JSON data. In Flask, you can do that by making your method return a dictionary converted by the jsonify method. Query string parameters can be read from the dictionary-like request object provided by Flask.
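As a rough illustration (the endpoint name and parameters here are hypothetical, not copied from server.py), a backend endpoint looks roughly like this:

```python
# Minimal Flask sketch: declare an endpoint with @app.route, read query
# string parameters from request.args, and return a dictionary via jsonify.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/api/search')
def api_search():
    # request.args behaves like a dictionary of query string parameters
    term = request.args.get('q', '')
    limit = int(request.args.get('limit', 10))
    # jsonify converts a plain Python dictionary into a JSON response
    return jsonify({'query': term, 'limit': limit, 'results': []})
```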

The search endpoints basically read the query string parameters, build an ElasticSearch query (just JSON represented as a Python dictionary, in the format specified by the ElasticSearch documentation), send it to the ElasticSearch server, format the response and return it to the user.
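The overall flow looks roughly like this (a hedged sketch: the index name appears in the indexing section below, but the field names and host are assumptions):

```python
# Sketch of the search flow: build the query as a Python dictionary,
# send it to ElasticSearch, and format the hits for the frontend.
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')  # host shown for illustration only

def run_search(term):
    body = {
        'query': {
            'multi_match': {
                'query': term,
                'fields': ['name', 'symbol', 'synonyms']  # assumed searchable fields
            }
        }
    }
    response = es.search(index='searchable_items_blue', body=body)
    # Keep only the stored documents from the raw ElasticSearch response.
    return [hit['_source'] for hit in response['hits']['hits']]
```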

For the autocomplete, just one query is prepared, submitted to the server and then the response is formatted and returned. These methods are defined in /src/search.py and follow the ElasticSearch documentation.
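One possible shape for the autocomplete query (this is an assumption about the query type; the real one in /src/search.py may use a different clause or a dedicated analyzer):

```python
# Hedged sketch: a prefix-style query for autocomplete suggestions.
def build_autocomplete_query(prefix):
    return {
        'query': {
            'match_phrase_prefix': {
                'name': {              # assumed field
                    'query': prefix,
                    'max_expansions': 20
                }
            }
        }
    }
```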

For the standard query, we have two steps: the actual full text search query and an aggregation query responsible for computing the facets. In the same method, we define the list of searchable fields and the list of available facets. The two queries share the same core and differ only in that one performs the full text search and the other the aggregation.
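Conceptually, the two bodies might be assembled like this (the field and facet names are illustrative, not the prototype's actual lists):

```python
# Hedged sketch: both queries reuse the same core clause; the facet query
# adds an aggs section and asks for no hits.
def build_queries(term):
    core = {
        'multi_match': {
            'query': term,
            'fields': ['name', 'symbol', 'description']  # assumed searchable fields
        }
    }
    search_query = {'query': core, 'size': 10}
    facet_query = {
        'query': core,
        'size': 0,  # only aggregation counts are needed
        'aggs': {
            'categories': {'terms': {'field': 'category'}},  # assumed facets
            'species': {'terms': {'field': 'species'}}
        }
    }
    return search_query, facet_query
```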

The graph search was an experiment with a graph data visualization of the search results and can be ignored for now.

ElasticSearch

ElasticSearch requires a schema for any data you provide. Data is stored in indexes, and each index has its own configuration, which basically determines how the data will be analyzed and searched. Although ElasticSearch has a built-in type inference algorithm, it is strongly recommended that you define your schema before indexing any data. The schema and index configuration used for the prototype are in /scripts/elastic_search/mapping.py. The schema is submitted by the indexing script.
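To give an idea of what that looks like (a simplified, era-appropriate sketch; the real fields, analyzers and settings are in mapping.py):

```python
# Sketch: create the index with an explicit mapping before indexing any data.
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')  # host shown for illustration only

index_body = {
    'settings': {'number_of_shards': 1},   # custom analyzers would also go here
    'mappings': {
        'searchable_item': {               # document type (ES 2.x-style mapping)
            'properties': {
                'name': {'type': 'string'},                              # assumed fields
                'symbol': {'type': 'string'},
                'category': {'type': 'string', 'index': 'not_analyzed'}  # used for facets
            }
        }
    }
}

es.indices.create(index='searchable_items_blue', body=index_body)
```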

Indexing script

The main indexing script is located in the file /scripts/elastic_search/index.py. It creates an index on the server defined by the environment variable ES_URI (see Makefile), loads the data (genes, diseases and GO information) and indexes it into an index. The last index name I used was 'searchable_items_blue'; at the time I was running some tests with different indexes, the 'blue' one worked best, and I didn't have time to recreate it with a better name.
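The core of that flow, in hedged form (the document shape and type name are assumptions, not the actual index.py code):

```python
# Sketch: connect to the server from ES_URI and bulk-index the prepared documents.
import os
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(os.environ.get('ES_URI', 'http://localhost:9200'))

def index_documents(docs, index_name='searchable_items_blue'):
    actions = ({
        '_index': index_name,
        '_type': 'searchable_item',  # assumed document type
        '_id': doc['id'],
        '_source': doc
    } for doc in docs)
    helpers.bulk(es, actions)
```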

Each mod has its own particular methods to prepare the data (genes, diseases and GO). They were implemented in separate classes.

Except for the Human class, which just implements the methods necessary for homology data processing, every class implements 3 methods: load_genes, load_go and load_diseases. Each class also defines how to build the URI for its genes. The load_genes method fetches the data from a data source (a mine or flat files) and creates an entry for each gene following a common schema (quickly defined just for the prototype). The load_go and load_diseases methods also fetch data from a data source and just provide (gene id, go id) and (gene id, disease id) pairs respectively.
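A hedged sketch of that per-MOD interface (class and field names are made up for illustration; the real classes live in /scripts/elastic_search):

```python
class MOD(object):
    pass  # stand-in for the real base class, described below


class ExampleMod(MOD):
    species = 'Example species'

    @staticmethod
    def gene_href(gene_id):
        # Each MOD knows how to build the URI for one of its genes.
        return 'http://example-mod.org/gene/' + gene_id

    def load_genes(self):
        # Fetch genes from a mine or flat file and store them using the
        # common gene schema shared by all MODs.
        pass

    def load_go(self):
        # Provide (gene id, GO id) pairs to the base class.
        pass

    def load_diseases(self):
        # Provide (gene id, disease id) pairs to the base class.
        pass
```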

All classes inherit from a MOD class which contains all the common operations that each mod must perform. The MOD class loads the GO and OMIM datasets when instantiated and unifies all the data into three dictionaries: genes, go and diseases. Apart from the gene data loading, the MOD class is responsible for building the disease and GO association data: it receives the (gene id, go id) and (gene id, disease id) pairs and builds dictionaries combining data from the OMIM and GO datasets with the gene ids from the mods.
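In rough terms, that association-building step might look like this (method and key names are assumptions, not the actual MOD implementation):

```python
class MOD(object):
    def __init__(self):
        # Unified dictionaries; go and diseases would be pre-filled from the
        # GO and OMIM datasets loaded at instantiation time.
        self.genes = {}
        self.go = {}
        self.diseases = {}

    def add_go_annotation(self, gene_id, go_id):
        # Link the gene to the GO term and the GO term back to the gene.
        self.go.setdefault(go_id, {'gene_ids': []})['gene_ids'].append(gene_id)
        self.genes.setdefault(gene_id, {}).setdefault('go_ids', []).append(go_id)

    def add_disease_annotation(self, gene_id, disease_id):
        # Same pattern for diseases, backed by the OMIM data.
        self.diseases.setdefault(disease_id, {'gene_ids': []})['gene_ids'].append(gene_id)
        self.genes.setdefault(gene_id, {}).setdefault('disease_ids', []).append(disease_id)
```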

The loading sequence is in the main indexing script. I implemented a few methods to save the data locally once it was fully loaded, to save time while I was developing these scripts (imagine losing over 10 minutes of processing because of silly bugs :P); we don't need them anymore.
