This repository will host tidyverse code for exploratory analysis and predictive modelling of sequence citations in the literature. Sequences originate from the European Nucleotide Archive (ENA); the literature originates from Europe PMC (ePMC).
https://github.com/alakob/sequence-literature
- Code to parse EMBL-flatfiles (embl_flat_file_parser.pl)
- Code to query ePMC API (query_epmc.py)
- preprocessing/database/initdb/seqref.sql.gz
- ENA release 143 (03/04/2020):
- ftp://ftp.ebi.ac.uk/pub/databases/ena/sequence/release/std/
- The files total 208G compressed and 3.5T uncompressed.
- The release contains 263,421,789 sequence entries comprising 408,005,271,872 nucleotides.
Division | entries |
---|---|
ENV:Environmental Samples | 16,765,544 |
FUN:Fungi | 7,511,473 |
HUM:Human | 27,520,827 |
INV:Invertebrates | 40,534,979 |
MAM:Other Mammals | 16,578,137 |
MUS:Mus musculus | 10,479,013 |
PHG:Bacteriophage | 17,393 |
PLN:Plants | 85,618,575 |
PRO:Prokaryotes | 3,589,696 |
ROD:Rodents | 3,263,952 |
SYN:Synthetic | 10,049,087 |
TGN:Transgenic | 286,472 |
UNC:Unclassified | 15,943,630 |
VRL:Viruses | 3,198,057 |
VRT:Other Vertebrates | 22,064,954 |
Total | 263,421,789 |
- The sequence entry must have the /country qualifier, which represents the locality of isolation of the sequenced organism, given as the political name of a nation, ocean or sea, followed by regions and localities
- Sequence entry must be a non-WGS sequence
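
The two criteria above can be sketched as a small filter over EMBL flat-file entries. This is a minimal illustration, not the logic of `embl_flat_file_parser.pl`: it assumes WGS entries are recognised by the data class field on the `ID` line and that the qualifier appears on an `FT` line as `/country=`. The function names and the sample accessions are hypothetical.

```python
from typing import Iterable, Iterator, List

# Hypothetical three-entry sample in EMBL flat-file layout (made-up accessions).
SAMPLE = """\
ID   AB000001; SV 1; linear; genomic DNA; STD; PRO; 1200 BP.
FT   source          1..1200
FT                   /organism="Example organism"
FT                   /country="Japan: Tokyo"
//
ID   AB000002; SV 1; linear; genomic DNA; WGS; PRO; 900 BP.
FT   source          1..900
FT                   /country="Kenya"
//
ID   AB000003; SV 1; linear; genomic DNA; STD; PRO; 800 BP.
FT   source          1..800
//
"""


def split_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of lines per entry; each entry ends with a '//' line."""
    entry: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    if entry:
        yield entry


def data_class(entry: List[str]) -> str:
    """Return the data class (fifth ';'-separated field) of the ID line."""
    for line in entry:
        if line.startswith("ID"):
            fields = [f.strip() for f in line[2:].split(";")]
            return fields[4] if len(fields) > 4 else ""
    return ""


def has_country(entry: List[str]) -> bool:
    """True if any feature-table line carries a /country qualifier."""
    return any(l.startswith("FT") and "/country=" in l for l in entry)


def keep_entry(entry: List[str]) -> bool:
    """Apply both selection criteria: /country present and not WGS."""
    return has_country(entry) and data_class(entry) != "WGS"


kept = [e for e in split_entries(SAMPLE.splitlines()) if keep_entry(e)]
```

In the sample, the second entry is dropped for being WGS and the third for lacking `/country`.
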
Id | Description |
---|---|
accession | sequence accession |
primary_pmid | Primary PubMed id |
primary_doi | Primary DOI, extracted from the flat file |
primary_pmcid | Primary ePMC id, extracted from the flat file |
origin | Locality of sequence isolation |
country | Country of sequence isolation |
submission_date | Sequence submission date |
first_created | Sequence entry first created |
lat_lon | geolocation |
organism | Sequence organism name |
taxid | Sequence taxonomic id |
code | sequence taxon |
project_acc | Sequence project accession |
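
As an illustration of how the `country`, `origin` and `lat_lon` fields might be derived from flat-file qualifiers, here is a minimal sketch. It assumes that `country` is the text before the first colon of the `/country` value and `origin` the remainder, and that `/lat_lon` values follow the INSDC pattern `<degrees> N|S <degrees> E|W`. Both helper names are hypothetical and the repository's parser may split these values differently.

```python
import re
from typing import Optional, Tuple

# Pattern for values such as "47.94 N 28.12 W".
_LAT_LON = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([NS])\s+(\d+(?:\.\d+)?)\s*([EW])\s*$"
)


def split_country(value: str) -> Tuple[str, Optional[str]]:
    """Split 'Country: region, locality' into (country, origin)."""
    country, _, locality = value.partition(":")
    locality = locality.strip()
    return country.strip(), (locality or None)


def parse_lat_lon(value: str) -> Optional[Tuple[float, float]]:
    """Convert a /lat_lon value to signed decimal degrees, or None."""
    match = _LAT_LON.match(value)
    if match is None:
        return None
    # South and West are represented as negative values.
    lat = float(match.group(1)) * (1 if match.group(2) == "N" else -1)
    lon = float(match.group(3)) * (1 if match.group(4) == "E" else -1)
    return lat, lon
```

A value without a colon, such as `Kenya`, yields a country and no origin; a malformed `lat_lon` string yields `None` rather than raising.
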
Division | Counts |
---|---|
INV | 4895011 |
ENV | 4191787 |
VRL | 2623044 |
PLN | 1952666 |
VRT | 1824633 |
FUN | 877183 |
PRO | 868913 |
MAM | 575165 |
HUM | 129549 |
ROD | 79785 |
PHG | 7958 |
MUS | 7458 |
SYN | 737 |
UNC | 246 |
TGN | 57 |
Total | 18034192 |
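
The two division tables can be compared directly. The sketch below assumes that the second table counts the entries that pass the selection criteria (the text does not state this explicitly) and computes, per division, the fraction of release 143 entries that remain. The variable and function names are illustrative.

```python
from typing import Dict

# Entries per division in ENA release 143 (first table).
RELEASE: Dict[str, int] = {
    "ENV": 16765544, "FUN": 7511473, "HUM": 27520827, "INV": 40534979,
    "MAM": 16578137, "MUS": 10479013, "PHG": 17393, "PLN": 85618575,
    "PRO": 3589696, "ROD": 3263952, "SYN": 10049087, "TGN": 286472,
    "UNC": 15943630, "VRL": 3198057, "VRT": 22064954,
}

# Entries per division in the second table.
SELECTED: Dict[str, int] = {
    "INV": 4895011, "ENV": 4191787, "VRL": 2623044, "PLN": 1952666,
    "VRT": 1824633, "FUN": 877183, "PRO": 868913, "MAM": 575165,
    "HUM": 129549, "ROD": 79785, "PHG": 7958, "MUS": 7458,
    "SYN": 737, "UNC": 246, "TGN": 57,
}


def retention(selected: Dict[str, int], release: Dict[str, int]) -> Dict[str, float]:
    """Fraction of each division's release entries present in `selected`."""
    return {div: selected[div] / release[div] for div in selected}


fractions = retention(SELECTED, RELEASE)
overall = sum(SELECTED.values()) / sum(RELEASE.values())

# Print divisions from most to least retained.
for div, frac in sorted(fractions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{div}\t{frac:.4f}")
print(f"Total\t{overall:.4f}")
```

Under that assumption, roughly 6.8% of the release is retained overall, and the dictionaries sum to the totals given in both tables.
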
- Europe PubMed Central (ePMC)
- Sequence accession, e.g.: AB013190
- Primary pubmed id e.g.: 11050544
- Project accession e.g.: PRJDB3373
https://europepmc.org/RestfulWebService#!/Europe32PMC32Articles32RESTful32API/search
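
A query against the ePMC search endpoint could be sketched as follows. This is not the content of `query_epmc.py`: the base URL and the `query`, `format`, `resultType` and `pageSize` parameters are the ones I understand the ePMC REST search service to accept, and using the bare accession as the search term is an assumption. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the ePMC RESTful search service.
BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(term: str, page_size: int = 25) -> str:
    """Build a search URL for one term (accession, PubMed id or project)."""
    params = {
        "query": term,
        "format": "json",
        "resultType": "core",  # request the full metadata record
        "pageSize": page_size,
    }
    return BASE_URL + "?" + urlencode(params)


def search(term: str, page_size: int = 25, timeout: float = 30.0) -> dict:
    """Run the query and return the decoded JSON response (needs network)."""
    with urlopen(build_search_url(term, page_size), timeout=timeout) as resp:
        return json.load(resp)


url = build_search_url("AB013190")
```

Keeping URL construction separate from the network call makes the query easy to inspect or log before anything is sent.
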
Retrieved Field | Description |
---|---|
accession | Sequence id |
idpmc | unique ePMC id |
source | Literature source, e.g. MEDLINE |
pubtype | Publication type |
issn | ISSN |
isopenaccess | Is the publication open access |
secondary_pmid | PubMed id of the literature hit |
secondary_pmcid | PMC id of the literature hit |
secondary_doi | DOI of the literature hit |
author | Author name |
affiliation | Author affiliation |
country | Author country |
first_pubdate | First publication date |
first_epubdate | First electronic publication date |
orcid | Author ORCID |
language | Publication language |
grantid | Grant identifier |
grant_agency | Grant Agency |
grant_acronym | Grant Acronym |
receipt_date | Publication reception date |
revision_date | Publication revision date |
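
To show how one literature hit might be flattened into a subset of the fields above, here is a minimal sketch. The JSON key names (`id`, `pmid`, `isOpenAccess`, `journalInfo`, `pubTypeList` and so on) are assumptions about the ePMC core result format and should be checked against a live response; the sample hit uses placeholder values, and `flatten_hit` is a hypothetical helper.

```python
from typing import Any, Dict


def _get(data: Any, *keys: str, default: Any = None) -> Any:
    """Walk nested dictionaries, returning `default` if any key is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def flatten_hit(accession: str, hit: Dict[str, Any]) -> Dict[str, Any]:
    """Map one search result (assumed key names) onto the table's field ids."""
    return {
        "accession": accession,
        "idpmc": hit.get("id"),
        "source": hit.get("source"),
        "pubtype": "; ".join(_get(hit, "pubTypeList", "pubType", default=[])),
        "issn": _get(hit, "journalInfo", "journal", "issn"),
        "isopenaccess": hit.get("isOpenAccess"),
        "secondary_pmid": hit.get("pmid"),
        "secondary_pmcid": hit.get("pmcid"),
        "secondary_doi": hit.get("doi"),
        "first_pubdate": hit.get("firstPublicationDate"),
        "language": hit.get("language"),
    }


# Placeholder record, not real data.
sample_hit = {
    "id": "00000000",
    "source": "MED",
    "pmid": "00000000",
    "isOpenAccess": "N",
    "journalInfo": {"journal": {"issn": "0000-0000"}},
    "pubTypeList": {"pubType": ["Journal Article"]},
    "language": "eng",
}
row = flatten_hit("AB013190", sample_hit)
```

Missing keys come back as `None` (or an empty string for `pubtype`), so hits with sparse metadata still produce a complete row.
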
Source | #accessions | Definition |
---|---|---|
MED | 534039 | PubMed/MEDLINE NLM |
AGR | 4981 | Agricola |
PMC | 1756 | PubMed Central |
CBA | 77 | Chinese Biological Abstracts |
PPR | 70 | Preprints |
PAT | 24 | Biological Patents |
CTX | 5 | CiteXplore |