Identifying sub-cellular biological entities is an important consideration when creating and using bioinformatics analysis tools and accessible biological research apps. When research information is uniquely and unambiguously identified, data can be accurately retrieved, cross-referenced, and integrated. In practice, biological entities are “identified” when they are associated with a matching record from a knowledge base that specialises in collecting and organising information of that type (e.g. gene sequences). Our search service makes identifying biological entities more efficient and easier. This identification can power research apps and tools that accept common entity synonyms as input.
For instance, Biofactoid uses this grounding service to allow users to simply specify their preferred synonyms to identify biological entities (e.g. proteins), as shown in the demonstration video biofactoid-grounding.mp4.
To cite the Pathway Commons Grounding Search Service in a paper, please cite the Journal of Open Source Software paper:
Franz et al., (2021). A flexible search system for high-accuracy identification of biological entities and molecules. Journal of Open Source Software, 6(67), 3756, https://doi.org/10.21105/joss.03756
View the paper at JOSS or view the PDF directly.
The Pathway Commons Grounding Search Service is an academic project built and maintained by: Bader Lab at the University of Toronto, Sander Lab at Harvard, and the Pathway and Omics Lab at the Oregon Health & Science University.
This project was funded by the US National Institutes of Health (NIH) [U41 HG006623, U41 HG003751, R01 HG009979 and P41 GM103504].
Install Docker (>=20.10.0) and Docker Compose (>=1.29.0).
Clone this repository, or at least download the docker-compose.yml file, then run:
docker-compose up
Swagger documentation can be accessed at http://localhost:3000.
NB: Server start will take some time in order for Elasticsearch to initialize and for the grounding data to be retrieved and the index restored. If it takes more than 10 minutes, consider increasing the memory allocated to Docker (Preferences > Resources > Memory) and removing this line from docker-compose.yml: ES_JAVA_OPTS=-Xms2g -Xmx2g
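Because startup can take several minutes, a script that depends on the local instance may want to wait for it. The following is a minimal sketch, not part of grounding-search: the helper names `is_up` and `wait_until_up` are hypothetical, and it assumes that any HTTP response from http://localhost:3000 means the web server is accepting requests. It does not check that the index itself has been fully restored.

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout=5.0):
    """Return True if an HTTP GET to `url` gets any response from a server."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timed out, etc.: nothing is listening yet.
        return False


def wait_until_up(url, max_wait=600.0, interval=5.0, probe=is_up, sleep=time.sleep):
    """Poll `probe(url)` until it succeeds or about `max_wait` seconds of sleeping have passed."""
    waited = 0.0
    while True:
        if probe(url):
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval


# Example (assumes the Docker Compose setup above, serving on port 3000):
# ready = wait_until_up("http://localhost:3000")
```

The default `max_wait` of 600 seconds mirrors the 10-minute guideline above; if the function returns False, the memory settings described in the note are the first thing to check.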
With Node.js (>=8) and Elasticsearch (>=6.6.0, <7) installed with default options, run the following in a cloned copy of the repository:
- npm install: Install npm dependencies
- npm run update: Download and index the data
- npm start: Start the server (by default on port 3000)
Swagger documentation is available on a publicly-hosted instance of the service at https://grounding.baderlab.org. You can run queries to test the API on this instance.
Please do not use https://grounding.baderlab.org for your production apps or scripts.
Here, we provide usage examples in common languages for the main search API. For more details, please refer to the Swagger documentation at https://grounding.baderlab.org, which is also accessible when running a local instance.
const response = await fetch('http://hostname:port/search', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ // search options here
    q: 'p53'
  })
});
const responseJSON = await response.json();
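import requests

url = 'http://hostname:port/search'
body = {'q': 'p53'}
# json= serialises the body and sets Content-Type: application/json,
# matching the JavaScript and curl examples (data= would send form-encoded fields)
response = requests.post(url, json=body)
responseJSON = response.json()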
curl -X POST "http://hostname:port/search" -H "accept: application/json" -H "Content-Type: application/json" -d "{ \"q\": \"p53\" }"
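When an app needs to ground several synonyms, the same request can be wrapped in a small helper. The sketch below uses only the Python standard library; the function names `build_search_request`, `search` and `ground_all` are hypothetical, and only the `q` option shown above is used. The response is returned as decoded JSON without assuming its shape; see the Swagger documentation for the actual schema and the other search options.

```python
import json
import urllib.request


def build_search_request(base_url, query):
    """Build a POST request to <base_url>/search with the JSON body {"q": query}."""
    body = json.dumps({"q": query}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/search",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def search(base_url, query, timeout=10.0):
    """Send one search and return the decoded JSON response unchanged."""
    request = build_search_request(base_url, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def ground_all(base_url, synonyms, search_fn=search):
    """Run one search per synonym and map each synonym to its raw response."""
    return {name: search_fn(base_url, name) for name in synonyms}


# Example against a local instance (assumed to be running on port 3000):
# results = ground_all("http://localhost:3000", ["p53", "mdm2"])
```

Passing `search_fn` as a parameter keeps the batching logic testable without a running server.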
Here, we summarise a set of tools that overlap to some degree with the main use case of the Pathway Commons Grounding Search Service, where a user searches for a biological entity grounding by providing only a commonly-used synonym. This table was last updated on 25 October 2021 (2021-10-25).
If you have developed a new tool in this space or your tool supports new features, let us know by making a pull request, and we'll add your revision to this table.
Feature | PC Grounding Search | GProfiler | GNormPlus (PubTator) | Gilda | BridgeDB |
---|---|---|---|---|---|
Allows for searching by synonym | ● | ● | ● | ||
Supports multiple organisms | ● | ● | ● | ● | ● |
Accepts organism ranking preference | ● | ||||
Multiple organisms per query | ● | ● | Partial support (only one organism returned) | ||
Multiple results per query | ● | One per type (e.g. protein) | ● | ||
Multiple results are ranked based on relevance | ● | ● | |||
Speed/Throughput | < 100 ms | < 100 ms | < 100 ms | < 100 ms | < 1000 ms |
Allows querying for a particular grounding by ID | ● | ● | ● | ● | ● |
grounding-search uses data files provided by several public databases:
- NCBI Gene
  - Information about genes
  - Alias: ncbi
  - Data file: gene_info.gz
- ChEBI
  - Information about small molecules of biological interest
  - Alias: chebi
  - Data file: chebi.owl
- UniProt
  - Information about proteins
  - Alias: uniprot
  - Data file: uniprot_sprot.xml.gz
- FamPlex
  - Information about protein families
  - Alias: fplx
  - Data file: famplex-master.zip
If you have followed the Quick Start ("Run from source"), you can download and index the data provided by the source databases ncbi, chebi and uniprot by running:
npm run update
Downloading and building the index from source ensures that the latest information is indexed. Alternatively, to quickly retrieve and recreate the index, a dump of a previously indexed Elasticsearch instance has been published on Zenodo under the following DOI:
This data is published under the Creative Commons Zero v1.0 Universal license.
To restore, create a running Elasticsearch instance and run:
npm run restore
To both restore and start the grounding-search server run:
npm run boot
NB: The index dump published on Zenodo is offered for demonstration purposes only. We do not guarantee that this data will be up-to-date or that releases of grounding-search software will be compatible with any previously published version of the dump data. To ensure you are using the latest data compatible with grounding-search, follow the instructions in "Build the index database from source database files".
To let us know about an issue in the software or to provide feedback, please file an issue on GitHub.
To make a contribution to this project, please start by filing an issue on GitHub that describes your proposal. Once your proposal is ready, you can make a pull request.
The following environment variables can be used to configure the server:
- NODE_ENV: the environment mode, either production or development (default)
- LOG_LEVEL: the level for the log file (info, warn, error)
- PORT: the port on which the server runs (default 3000)
- ELASTICSEARCH_HOST: the host:port that points to Elasticsearch
- MAX_SEARCH_ES: the maximum number of results to return from Elasticsearch
- MAX_SEARCH_WS: the maximum number of results to return in JSON from the web service
- CHUNK_SIZE: how many grounding entries make up a chunk that gets bulk inserted into Elasticsearch
- MAX_SIMULT_CHUNKS: maximum number of chunks to insert simultaneously into Elasticsearch
- INPUT_PATH: the path to the input folder where data files are located
- INDEX: the Elasticsearch index name to store data from all data sources
- UNIPROT_FILE_NAME: name of the file where UniProt data will be read from
- UNIPROT_URL: URL to download the UniProt file from
- CHEBI_FILE_NAME: name of the file where ChEBI data will be read from
- CHEBI_URL: URL to download the ChEBI file from
- NCBI_FILE_NAME: name of the file where NCBI data will be read from
- NCBI_URL: URL to download the NCBI file from
- NCBI_EUTILS_BASE_URL: URL for NCBI EUTILS
- NCBI_EUTILS_API_KEY: NCBI EUTILS API key
- FAMPLEX_URL: URL to download the FamPlex remote from
- FAMPLEX_FILE_NAME: name of the file where FamPlex data will be read from
- FAMPLEX_TYPE_FILTER: entity type to include ('protein', 'complex', 'all' [default])
- ESDUMP_LOCATION: the location (URL, file path) of elasticdump files (note: terminate with '/')
- ZENODO_API_URL: base URL for Zenodo
- ZENODO_ACCESS_TOKEN: access token for Zenodo REST API (Scope: deposit:actions, deposit:write)
- ZENODO_BUCKET_ID: id for Zenodo deposition 'bucket' (Files API)
- ZENODO_DEPOSITION_ID: id for Zenodo deposition (for a published dataset)
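As an illustration of how these variables might be supplied from a wrapper script, here is a minimal sketch. The helper `server_env` is hypothetical and not part of grounding-search; it only fills in the two defaults documented above (NODE_ENV and PORT) and applies caller overrides. The value `localhost:9200` in the usage comment is an assumed example (the default Elasticsearch HTTP port), not a documented default of this project.

```python
import os


def server_env(overrides=None, base=None):
    """Return a copy of the environment with grounding-search settings applied."""
    env = dict(os.environ if base is None else base)
    # Defaults documented above: development mode, port 3000.
    env.setdefault("NODE_ENV", "development")
    env.setdefault("PORT", "3000")
    for key, value in (overrides or {}).items():
        env[key] = str(value)  # environment values must be strings
    return env


# Example: start the server on another port (run from the repository root).
# import subprocess
# subprocess.run(
#     ["npm", "start"],
#     env=server_env({"PORT": 3001, "ELASTICSEARCH_HOST": "localhost:9200"}),
#     check=True,
# )
```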
- npm start: start the server
- npm stop: stop the server
- npm run watch: watch mode (debug mode enabled, autoreload)
- npm run refresh: run clear, update, then start
- npm test: run tests for read-only methods (e.g. search and get), assuming that the data already exists
- npm run test:sample: run tests with sample data
- npm run test:quality: run the search quality tests (expects full db)
- npm run test:quality:csv: run the search quality tests and output a csv file
- npm run lint: lint the project
- npm run benchmark: run all benchmarking
- npm run benchmark:source: run benchmarking for source (e.g. ncbi, chebi)
- npm run clear: clear all data
- npm run clear:source: clear data for source (e.g. ncbi, chebi)
- npm run update: update all data (download then index)
- npm run update:source: update data for source (e.g. ncbi, chebi) in Elasticsearch
- npm run download: download all data
- npm run download:source: download data for source (e.g. ncbi, chebi)
- npm run index: index all data
- npm run index:source: index data for source (e.g. ncbi, chebi) in Elasticsearch
- npm run test:inputgen: generate input test file for each source (e.g. uniprot, ...)
- npm run test:inputgen:source: generate input test file for source (e.g. uniprot, ...)
- npm run dump: dump the information for INDEX to ESDUMP_LOCATION
- npm run restore: restore the information for INDEX from ESDUMP_LOCATION
- npm run boot: run clear, restore, then start; exit on errors
Zenodo lets you store and retrieve digital artefacts related to a scientific project or publication. Here, we use Zenodo to store Elasticsearch index dump data used to quickly recreate the index used by grounding-search.
Briefly, using their RESTful web service API, you can create a 'Deposition' for a record that has a 'bucket' referenced by a ZENODO_BUCKET_ID to which you can upload and download 'files' (i.e. <ZENODO_API_URL>api/files/<ZENODO_BUCKET_ID>/<filename>; list them with https://zenodo.org/api/deposit/depositions/<deposition id>/files). In particular, there are three files required to recreate an index, corresponding to the elasticsearch types: data, mapping and analyzer.
To set up, follow these steps:
- Get a ZENODO_ACCESS_TOKEN by creating a 'Personal access token' (see docs for details). Be sure to add the deposit:actions and deposit:write scopes.
- Create a record 'Deposition' by POSTing to https://zenodo.org/api/deposit/depositions with at least the following information, keeping in mind to set the header Authorization = Bearer <ZENODO_ACCESS_TOKEN>:
{
  "metadata": {
    "title": "Elasticsearch data for biofactoid.org grounding-search service",
    "upload_type": "dataset",
    "description": "This deposition contains files with data describing an Elasticsearch index (https://github.com/PathwayCommons/grounding-search). The files were generated from the elasticdump npm package (https://www.npmjs.com/package/elasticdump). The data are the necessary and sufficient information to populate an Elasticsearch index.",
    "creators": [
      {
        "name": "Biofactoid",
        "affiliation": "biofactoid.org"
      }
    ],
    "access_right": "open",
    "license": "cc-zero"
  }
}
- The POST response should have a 'bucket' (e.g. "bucket": "https://zenodo.org/api/files/<uuid>") within the links object. The variable ZENODO_BUCKET_ID is the value <uuid> in the example URL.
- Publish. You'll want to dump the index and upload to Zenodo (npm run dump). Log in to the Zenodo web page and click 'Publish' to make the deposition public. You may need to add a publication date (YYYY-MM-DD).
- Test. Delete any data files; clear the index (npm run clear); do a restore (npm run restore), being sure to update the ZENODO_DEPOSITION_ID; and run the quality tests (npm run test:quality:csv).
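The deposition request and the bucket lookup in the steps above can be sketched as follows. This is an illustration using the Python standard library, not code from grounding-search: the function names are hypothetical, the access token is whatever you created in the first step, and ZENODO_API_URL is assumed to end with a slash (e.g. `https://zenodo.org/`), as the `<ZENODO_API_URL>api/files/...` pattern suggests.

```python
import json
import urllib.request

ZENODO_DEPOSITIONS_URL = "https://zenodo.org/api/deposit/depositions"


def build_create_request(access_token, metadata):
    """Build the POST request that creates a deposition with the given metadata."""
    body = json.dumps({"metadata": metadata}).encode("utf-8")
    return urllib.request.Request(
        url=ZENODO_DEPOSITIONS_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + access_token,
        },
        method="POST",
    )


def bucket_id_from_response(deposition):
    """Extract <uuid> from links.bucket (https://zenodo.org/api/files/<uuid>)."""
    bucket_url = deposition["links"]["bucket"]
    return bucket_url.rstrip("/").rsplit("/", 1)[-1]


def bucket_file_url(zenodo_api_url, bucket_id, filename):
    """Compose <ZENODO_API_URL>api/files/<ZENODO_BUCKET_ID>/<filename>."""
    return zenodo_api_url + "api/files/" + bucket_id + "/" + filename


# Sending the request (needs network access and a valid token):
# with urllib.request.urlopen(build_create_request(token, metadata)) as resp:
#     deposition = json.loads(resp.read().decode("utf-8"))
#     print(bucket_id_from_response(deposition))
```

The value returned by `bucket_id_from_response` is what you would set as ZENODO_BUCKET_ID before running `npm run dump`.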
Once published, a deposition cannot be updated or altered. However, you can create a new version of a record (below).
In this case, you already have a record which points to a published deposition (i.e. elasticsearch index files) and wish to create a new version for that record. Here, you'll create a new deposition under the same record:
- Make a POST request to https://zenodo.org/api/deposit/depositions/<deposition id>/actions/newversion to create a new version. Alternatively, visit https://zenodo.org/record/<deposition id> where deposition id is that of the latest published version (default).
- Fetch https://zenodo.org/api/deposit/depositions?all_versions to list all your depositions and identify the new deposition bucket id.
- Proceed to upload (i.e. dump) your new files as described in "Create a new deposition", Step 3.
- Notes:
  - New version's files must differ from all previous versions
  - See https://help.zenodo.org/#versioning and https://developers.zenodo.org/#new-version for more info
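The two requests in the versioning steps above can be sketched in the same way. The function names below are hypothetical, and the sketch assumes that each deposition in the listing carries an `id` field and a `links.bucket` URL of the form described earlier; check the Zenodo developer documentation before relying on those fields.

```python
import urllib.request

ZENODO_DEPOSITIONS_URL = "https://zenodo.org/api/deposit/depositions"


def _authorized(url, access_token, method):
    """Build a request carrying the Bearer token header."""
    return urllib.request.Request(
        url=url,
        headers={"Authorization": "Bearer " + access_token},
        method=method,
    )


def build_new_version_request(access_token, deposition_id):
    """POST .../<deposition id>/actions/newversion to start a new version."""
    url = "{}/{}/actions/newversion".format(ZENODO_DEPOSITIONS_URL, deposition_id)
    return _authorized(url, access_token, "POST")


def build_list_versions_request(access_token):
    """GET the listing of all depositions, including all versions."""
    return _authorized(ZENODO_DEPOSITIONS_URL + "?all_versions", access_token, "GET")


def bucket_ids_by_deposition(depositions):
    """Map each deposition id to the <uuid> at the end of its links.bucket URL.

    Assumes each entry has 'id' and 'links.bucket'; entries without a bucket are skipped.
    """
    result = {}
    for deposition in depositions:
        bucket_url = deposition.get("links", {}).get("bucket")
        if bucket_url:
            result[deposition["id"]] = bucket_url.rstrip("/").rsplit("/", 1)[-1]
    return result
```

Comparing the returned mapping before and after the `newversion` request is one way to spot the bucket that belongs to the new draft.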
All files in /test will be run by Mocha. You can run npm test to run all tests, or you can run npm test -- -g specific-test-name to run specific tests.
Chai is included to make the tests easier to read and write.
- Make sure the tests are passing: npm test
- Make sure the linting is passing: npm run lint
- Bump the version number with npm version, in accordance with semver. The version command in npm updates both package.json and git tags, but note that it uses a v prefix on the tags (e.g. v1.2.3).
  - For a bug fix / patch release, run npm version patch.
  - For a new feature release, run npm version minor.
  - For a breaking API change, run npm version major.
  - For a specific version number (e.g. 1.2.3), run npm version 1.2.3.
- Push the release: git push && git push --tags
- Publish a GitHub release so that Zenodo creates a DOI for this version.
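To make the effect of each bump concrete, the sketch below reproduces the basic semver arithmetic for plain MAJOR.MINOR.PATCH versions. It is an illustration only: the functions `bump` and `git_tag` are hypothetical, they do not call npm, and they ignore pre-release and build suffixes.

```python
def bump(version, release):
    """Return the next version for a 'patch', 'minor' or 'major' release."""
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    if release == "patch":
        return "{}.{}.{}".format(major, minor, patch + 1)
    if release == "minor":
        # A new feature resets the patch number.
        return "{}.{}.0".format(major, minor + 1)
    if release == "major":
        # A breaking change resets both minor and patch numbers.
        return "{}.0.0".format(major + 1)
    raise ValueError("release must be 'patch', 'minor' or 'major'")


def git_tag(version):
    """Return the tag name with the v prefix described above."""
    return "v" + version


print(bump("1.2.3", "patch"), bump("1.2.3", "minor"), bump("1.2.3", "major"))
# prints: 1.2.4 1.3.0 2.0.0
```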