Entrez Gene is the NCBI database of gene-specific information. It provides "tracked, unique identifiers for genes" and reports "information associated with those identifiers for unrestricted public use [source]." We use Entrez Gene as the primary gene vocabulary for our drug repurposing research.
This repository creates user-friendly datasets from Entrez Gene. We currently focus on human genes only.
The Python notebook process.ipynb executes the analysis. Files downloaded from external locations are stored in the download directory. The following created datasets reside in the data directory:
genes-human.tsv: human genes with a select set of fields storing additional attributes
symbols-human.tsv: a table of GeneID, symbol, and symbol type (synonym or primary)
symbols-human.json: a Symbol–GeneID mapping of primary symbols only
synonyms-human.json: a Symbol–GeneIDs mapping for synonyms
symbol-map.json: a Symbol–GeneID mapping with approved symbols and unambiguous synonyms
xrefs-human.tsv: mappings to external resources
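
To make the relationship between the three symbol mappings concrete, here is a minimal sketch of how they could be derived from rows shaped like symbols-human.tsv. It is not the code in process.ipynb. The function name, the "primary" and "synonym" type labels, the toy rows, and the disambiguation rule (a synonym is kept only if it points to exactly one gene and does not collide with a primary symbol) are all assumptions made for illustration.

```python
from collections import defaultdict


def build_symbol_maps(rows):
    """Build three symbol lookups from (gene_id, symbol, symbol_type) rows.

    Hypothetical helper: the row layout and the type labels are assumed,
    not taken from the repository's notebook.
    """
    primary = {}                 # symbol -> single GeneID
    synonyms = defaultdict(set)  # symbol -> set of GeneIDs

    for gene_id, symbol, symbol_type in rows:
        if symbol_type == "primary":
            primary[symbol] = gene_id
        else:
            synonyms[symbol].add(gene_id)

    # Start from the primary symbols, then add only those synonyms that
    # resolve to one gene and do not shadow a primary symbol (assumed rule).
    symbol_map = dict(primary)
    for symbol, gene_ids in synonyms.items():
        if symbol not in primary and len(gene_ids) == 1:
            symbol_map[symbol] = next(iter(gene_ids))

    synonym_lists = {s: sorted(ids) for s, ids in synonyms.items()}
    return primary, synonym_lists, symbol_map


# Toy rows with made-up identifiers, not real Entrez Gene records.
rows = [
    (1, "GENE_A", "primary"),
    (2, "GENE_B", "primary"),
    (1, "ALIAS_X", "synonym"),  # unambiguous: only gene 1
    (1, "ALIAS_Y", "synonym"),  # ambiguous: genes 1 and 2
    (2, "ALIAS_Y", "synonym"),
    (2, "GENE_A", "synonym"),   # collides with a primary symbol
]

primary, synonym_lists, symbol_map = build_symbol_maps(rows)
```

In this sketch, ALIAS_Y is dropped from the combined map because it points to two genes, and the synonym GENE_A is dropped because the primary symbol takes precedence. Whether the repository resolves collisions the same way should be checked against the notebook.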