Gilda is a Python package and REST service that grounds (i.e., finds appropriate identifiers in various namespaces for) named entities in biomedical text.
Gyori BM, Hoyt CT, Steppi A (2022). Gilda: biomedical entity text normalization with machine-learned disambiguation as a service. Bioinformatics Advances, 2022; vbac034 https://doi.org/10.1093/bioadv/vbac034.
Gilda is deployed as a web service at http://grounding.indra.bio/ (see Usage instructions below); however, it can also be used locally as a Python package.
The recommended method to install Gilda is through PyPI as
pip install gilda
Note that Gilda uses a single large resource file for grounding, which is automatically downloaded into the ~/.data/gilda/<version> folder during runtime (see pystow for options to configure the location of this folder).
Given some additional dependencies, the grounding resource file can also be regenerated locally by running python -m gilda.generate_terms.
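The download folder mentioned above is managed by pystow. As a minimal sketch, assuming pystow reads the `PYSTOW_HOME` environment variable to choose its base directory (as described in pystow's own documentation), the location can be changed before Gilda is first imported. The helper name `configure_data_home` and the example path are hypothetical and only serve as illustration.

```python
import os
from pathlib import Path


def configure_data_home(base_dir):
    """Point pystow at a custom base directory.

    Assumption: pystow resolves its base directory from PYSTOW_HOME,
    so this must run before Gilda loads its resource file.
    """
    path = Path(base_dir).resolve()
    os.environ["PYSTOW_HOME"] = str(path)
    return path


def main():
    # Hypothetical location; Gilda is imported only after the variable is set.
    configure_data_home("gilda_data")
    import gilda  # requires `pip install gilda`

    print(gilda.ground("kras"))
```

Calling `main()` triggers the resource download on first use, so it is best run once on a machine with network access.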
Documentation for Gilda is available at http://gilda.readthedocs.org. We also provide several interactive Jupyter notebooks to help use and customize Gilda:
- Gilda Introduction provides an interactive tutorial for using Gilda.
- Custom Grounders shows several examples of how Gilda can be instantiated with custom grounding resources.
- Model Training provides interactive sample code for training new disambiguation models.
Gilda can be used either as a REST web service or programmatically via its Python API. An introductory Jupyter notebook for using Gilda is available at https://github.com/indralab/gilda/blob/master/notebooks/gilda_introduction.ipynb
To use Gilda as a Python package, see the documentation at http://gilda.readthedocs.org, which provides detailed descriptions of each module of Gilda and its usage. A basic usage example for named entity normalization (NEN), or grounding, is as follows:
import gilda
scored_matches = gilda.ground('ER', context='Calcium is released from the ER.')
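The list returned by `gilda.ground` can be inspected match by match. The sketch below assumes each scored match exposes a `term` attribute (with `db`, `id`, and `entry_name` fields) and a numeric `score`, and that matches are ordered best first; the helper name `summarize_matches` is hypothetical.

```python
def summarize_matches(scored_matches, limit=3):
    """Return (namespace, identifier, name, score) tuples for the top matches.

    Assumes each match has `.term.db`, `.term.id`, `.term.entry_name`
    and `.score`, and that the input is already sorted by score.
    """
    rows = []
    for match in scored_matches[:limit]:
        term = match.term
        rows.append((term.db, term.id, term.entry_name, round(match.score, 3)))
    return rows


def main():
    import gilda  # requires `pip install gilda`

    matches = gilda.ground("ER", context="Calcium is released from the ER.")
    for db, identifier, name, score in summarize_matches(matches):
        print(db, identifier, name, score)
```

Passing a `context` string lets the disambiguation models influence the ranking, so the top row may differ from the result obtained without context.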
Gilda also implements a simple dictionary-based named entity recognition (NER) algorithm that can be used as follows:
import gilda
results = gilda.annotate('Calcium is released from the ER.')
The REST service accepts POST requests with a JSON header on the /ground endpoint. There is a public REST service running at http://grounding.indra.bio, but the service can also be run locally as
python -m gilda.app
which, by default, launches the server at localhost:8001 (for local usage, replace the URL in the examples below with this address).
Below is an example request using curl:
curl -X POST -H "Content-Type: application/json" -d '{"text": "kras"}' http://grounding.indra.bio/ground
The same request using Python's requests package would be as follows:
import requests
requests.post('http://grounding.indra.bio/ground', json={'text': 'kras'})
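For environments where installing requests is not an option, the same call can be sketched with the standard library alone. The helper names and the `LOCAL_URL` constant are hypothetical; sending a `context` key to `/ground` mirrors the `context` argument of the Python API and is an assumption here, as is the response being a JSON list of matches.

```python
import json
import urllib.request

PUBLIC_URL = "http://grounding.indra.bio"
LOCAL_URL = "http://localhost:8001"  # default address of `python -m gilda.app`


def build_ground_request(text, context=None, base_url=PUBLIC_URL):
    """Build (but do not send) a POST request for the /ground endpoint."""
    payload = {"text": text}
    if context is not None:
        payload["context"] = context  # assumed to be accepted by /ground
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/ground",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ground_via_rest(text, context=None, base_url=PUBLIC_URL, timeout=10):
    """Send the request and decode the JSON response body."""
    request = build_ground_request(text, context, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `ground_via_rest("kras", base_url=LOCAL_URL)` would target a locally running server instead of the public one.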
The web service also supports multiple inputs in a single request on the ground_multi endpoint, for instance
import requests
requests.post(
    'http://grounding.indra.bio/ground_multi',
    json=[
        {'text': 'braf'},
        {'text': 'ER', 'context': 'endoplasmic reticulum (ER) is a cellular component'},
    ],
)
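When many strings need grounding, the payload for `ground_multi` can be assembled programmatically and sent in batches. This is a minimal sketch using only the standard library; the helper names and the batch size are hypothetical, and it assumes the endpoint returns a JSON list with one entry per input, in input order.

```python
import json
import urllib.request


def build_multi_payload(items):
    """Turn (text, context) pairs into the list-of-dicts payload.

    A context of None omits the 'context' key for that entry.
    """
    payload = []
    for text, context in items:
        entry = {"text": text}
        if context is not None:
            entry["context"] = context
        payload.append(entry)
    return payload


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def ground_many(items, base_url="http://grounding.indra.bio",
                batch_size=100, timeout=30):
    """POST the items to /ground_multi in batches and concatenate results."""
    url = f"{base_url.rstrip('/')}/ground_multi"
    results = []
    for batch in chunked(list(items), batch_size):
        request = urllib.request.Request(
            url,
            data=json.dumps(build_multi_payload(batch)).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Assumed response shape: one list of matches per input entry.
            results.extend(json.loads(response.read().decode("utf-8")))
    return results
```

Batching keeps each request small, which is considerate when using the public service rather than a local instance.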
Gilda loads grounding terms into memory when first used. If memory usage is an issue, the following options are recommended.
- Run a single instance of Gilda as a local web service that one or more other processes send requests to.
- Create a custom Grounder instance that only loads a subset of terms appropriate for a narrow use case.
- Gilda also offers an optional sqlite back-end, which significantly decreases memory usage at the cost of a minor drop in the number of strings grounded per unit time. The sqlite back-end database can be built as follows with an optional [db_path] argument, which, if used, should use the .db extension. If not specified, the .db file is generated in Gilda's default resource folder.
python -m gilda.resources.sqlite_adapter [db_path]
A Grounder instance can then be instantiated as follows:
from gilda.grounder import Grounder
gr = Grounder(db_path)
matches = gr.ground('kras')
After cloning the repository locally, you can build and run a Docker image of Gilda using the following commands:
$ docker build -t gilda:latest .
$ docker run -d -p 8001:8001 gilda:latest
Alternatively, you can use docker-compose to do both the initial build and run the container based on the docker-compose.yml configuration:
$ docker-compose up
Gilda is customizable with terms coming from different vocabularies. However, Gilda comes with a default set of resources from which terms are collected (almost 2 million entries as of v1.1.0), without any additional configuration needed. These resources include:
- HGNC (human genes)
- UniProt (human and model organism proteins)
- FamPlex (human protein families and complexes)
- ChEBI (small molecules, metabolites, etc.)
- GO (biological processes, molecular functions, complexes)
- DOID (diseases)
- EFO (experimental factors: cell lines, cell types, anatomical entities, etc.)
- HP (human phenotypes)
- MeSH (general: diseases, proteins, small molecules, cell types, etc.)
- Adeft (misc. terms corresponding to ambiguous acronyms)
@article{gyori2022gilda,
author = {Gyori, Benjamin M and Hoyt, Charles Tapley and Steppi, Albert},
title = "{{Gilda: biomedical entity text normalization with machine-learned disambiguation as a service}}",
journal = {Bioinformatics Advances},
year = {2022},
month = {05},
issn = {2635-0041},
doi = {10.1093/bioadv/vbac034},
url = {https://doi.org/10.1093/bioadv/vbac034},
note = {vbac034}
}
The development of Gilda was funded under the DARPA Communicating with Computers program (ARO grant W911NF-15-1-0544) and the DARPA Young Faculty Award (ARO grant W911NF-20-1-0255).