This repository contains a set of tidyverse code for exploratory analysis and predictive modelling of sequence citations in the literature. Sequences originate from the European Nucleotide Archive (ENA); the literature originates from Europe PMC.

alakob/sequence-literature


sequence-literature

This repository will host a set of tidyverse code for exploratory analysis and predictive modelling of sequence citations in the literature. Sequences originate from the European Nucleotide Archive (ENA); the literature originates from Europe PMC.

Data Extraction Workflow

[Figure: data extraction workflow diagram]

Sequence Search stats

[Figure: sequence search statistics]

Code repository

https://github.com/alakob/sequence-literature

  • Code to parse EMBL-flatfiles (embl_flat_file_parser.pl)
  • Code to query ePMC API (query_epmc.py)

Sequence-Literature DB

  • preprocessing/database/initdb/seqref.sql.gz

Data Sources

ENA Data Source:

ENA Release Breakdown by Taxonomy

| Division | Entries |
| --- | ---: |
| ENV: Environmental Samples | 16,765,544 |
| FUN: Fungi | 7,511,473 |
| HUM: Human | 27,520,827 |
| INV: Invertebrates | 40,534,979 |
| MAM: Other Mammals | 16,578,137 |
| MUS: Mus musculus | 10,479,013 |
| PHG: Bacteriophage | 17,393 |
| PLN: Plants | 85,618,575 |
| PRO: Prokaryotes | 3,589,696 |
| ROD: Rodents | 3,263,952 |
| SYN: Synthetic | 10,049,087 |
| TGN: Transgenic | 286,472 |
| UNC: Unclassified | 15,943,630 |
| VRL: Viruses | 3,198,057 |
| VRT: Other Vertebrates | 22,064,954 |
| Total | 263,421,789 |

ENA Release Extraction Condition

  • Sequence entry must have the /country qualifier, which gives the locality of isolation of the sequenced organism as political names for nations, oceans or seas, followed by regions and localities
  • Sequence entry must be a non-WGS sequence
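The two conditions above can be sketched in Python as follows. This is only an illustration and not the repository's Perl parser (embl_flat_file_parser.pl). It assumes the standard EMBL flat-file layout: entries terminated by `//`, an `ID` line whose fifth field is the data class (for example `STD` or `WGS`) and whose sixth field is the taxonomic division, and a single-line `/country="..."` qualifier in the feature table. The function names, the output keys, and the split of the qualifier value at the first colon are assumptions made for this example.

```python
import re
from typing import Dict, Iterable, Iterator, List, Optional

# Matches a feature-table line such as: FT   /country="Japan: Tokyo"
COUNTRY = re.compile(r'^FT\s+/country="([^"]*)"')


def iter_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group flat-file lines into entries, using '//' as the terminator."""
    entry: List[str] = []
    for line in lines:
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line.rstrip("\n"))
    if entry:
        yield entry


def extract(entry: List[str]) -> Optional[Dict[str, str]]:
    """Return a record if the entry has /country and is not WGS, else None."""
    acc = data_class = division = country = None
    for line in entry:
        if line.startswith("ID "):
            # ID   <accession>; SV n; topology; molecule; class; division; length
            fields = [f.strip() for f in line[2:].split(";")]
            if len(fields) >= 6:
                acc, data_class, division = fields[0], fields[4], fields[5]
        elif country is None:
            match = COUNTRY.match(line)
            if match:
                country = match.group(1)
    if acc is None or country is None or data_class == "WGS":
        return None
    # Assumption: the text before the first colon is the country name.
    name, _, locality = country.partition(":")
    return {
        "accession": acc,
        "division": division,
        "country": name.strip(),
        "locality": locality.strip(),
    }
```

A caller could stream a release file with `for entry in iter_entries(open(path)):` and keep only the non-`None` results of `extract(entry)`. Qualifier values that wrap over several `FT` lines are not handled in this sketch.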

ENA Release Extracted Target

| Id | Description |
| --- | --- |
| accession | Sequence accession |
| primary_pmid | Primary PubMed id |
| primary_doi | Primary DOI, extracted from the flat file |
| primary_pmcid | Primary ePMC id, extracted from the flat file |
| origin | Locality of sequence isolation |
| country | Country of sequence isolation |
| submission_date | Sequence submission date |
| first_created | Date the sequence entry was first created |
| lat_lon | Geolocation (latitude and longitude) |
| organism | Sequence organism name |
| taxid | Sequence taxonomic id |
| code | Sequence taxon |
| project_acc | Sequence project accession |

ENA Release Extraction Statistics

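| Division | Counts |
| --- | ---: |
| INV | 4,895,011 |
| ENV | 4,191,787 |
| VRL | 2,623,044 |
| PLN | 1,952,666 |
| VRT | 1,824,633 |
| FUN | 877,183 |
| PRO | 868,913 |
| MAM | 575,165 |
| HUM | 129,549 |
| ROD | 79,785 |
| PHG | 7,958 |
| MUS | 7,458 |
| SYN | 737 |
| UNC | 246 |
| TGN | 57 |
| Total | 18,034,192 |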
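To put the extraction counts in context, the short Python sketch below divides each division's extracted count by its size in the full ENA release, using only the figures from the two tables in this README. The dictionary and function names are chosen for this example and do not come from the repository.

```python
# Entries per division in the ENA release (first table).
RELEASE = {
    "ENV": 16765544, "FUN": 7511473, "HUM": 27520827, "INV": 40534979,
    "MAM": 16578137, "MUS": 10479013, "PHG": 17393, "PLN": 85618575,
    "PRO": 3589696, "ROD": 3263952, "SYN": 10049087, "TGN": 286472,
    "UNC": 15943630, "VRL": 3198057, "VRT": 22064954,
}

# Entries per division that passed the extraction conditions (second table).
EXTRACTED = {
    "INV": 4895011, "ENV": 4191787, "VRL": 2623044, "PLN": 1952666,
    "VRT": 1824633, "FUN": 877183, "PRO": 868913, "MAM": 575165,
    "HUM": 129549, "ROD": 79785, "PHG": 7958, "MUS": 7458,
    "SYN": 737, "UNC": 246, "TGN": 57,
}


def retained_share() -> dict:
    """Percentage of each division's release entries that were extracted."""
    return {d: 100.0 * EXTRACTED[d] / RELEASE[d] for d in RELEASE}


def overall_share() -> float:
    """Percentage of all release entries that were extracted."""
    return 100.0 * sum(EXTRACTED.values()) / sum(RELEASE.values())


if __name__ == "__main__":
    shares = retained_share()
    for division in sorted(shares, key=shares.get, reverse=True):
        print(f"{division}\t{shares[division]:.2f}%")
    print(f"Total\t{overall_share():.2f}%")
```

Overall, a little under 7% of release entries meet the extraction conditions. The share varies widely by division: viruses (VRL) retain the largest fraction, at roughly 82%.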

Literature Data source

  • Europe PubMed Central (ePMC)

ePMC Query terms

  • Sequence accession, e.g. AB013190
  • Primary PubMed id, e.g. 11050544
  • Project accession, e.g. PRJDB3373

ePMC API

https://europepmc.org/RestfulWebService#!/Europe32PMC32Articles32RESTful32API/search
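As a rough illustration of how one of the query terms above could be sent to the search endpoint, the Python sketch below builds a request URL for the Europe PMC REST search service. It is not the repository's query_epmc.py. The base URL and the `query`, `format`, `resultType`, `pageSize` and `cursorMark` parameters are those of the public Europe PMC search API. Passing the term as plain free text is an assumption about how the queries are phrased, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# Public Europe PMC REST search endpoint.
EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(term: str, page_size: int = 100, cursor_mark: str = "*") -> str:
    """Build a search URL for one query term (accession, PubMed id or project)."""
    params = {
        "query": term,            # assumption: the term is sent as free text
        "format": "json",         # ask for a JSON response
        "resultType": "core",     # full metadata for each hit
        "pageSize": page_size,    # number of hits per page
        "cursorMark": cursor_mark,  # "*" requests the first page
    }
    return f"{EPMC_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    for term in ("AB013190", "11050544", "PRJDB3373"):
        print(build_search_url(term))
```

The resulting URL can be fetched with any HTTP client. For result sets larger than one page, the response carries a next cursor mark that can be passed back as `cursor_mark`.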

ePMC extracted Target

| Retrieved field | Description |
| --- | --- |
| accession | Sequence id |
| idpmc | Unique ePMC id |
| source | Literature source, e.g. MEDLINE |
| pubtype | Publication type |
| issn | ISSN |
| isopenaccess | Whether the publication is open access |
| secondary_pmid | PubMed id of the literature hit |
| secondary_pmcid | PMC id of the literature hit |
| secondary_doi | DOI of the literature hit |
| author | Author name |
| affiliation | Author affiliation |
| country | Author country |
| first_pubdate | First publication date |
| first_epubdate | First electronic publication date |
| orcid | Author ORCID |
| language | Publication language |
| grantid | Grant identifier |
| grant_agency | Grant agency |
| grant_acronym | Grant acronym |
| receipt_date | Publication receipt date |
| revision_date | Publication revision date |
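The sketch below shows one way a single hit from an ePMC JSON response could be flattened into some of the fields listed above. It is a hedged illustration, not the repository's code. The keys read from the hit (`id`, `source`, `pmid`, `pmcid`, `doi`, `isOpenAccess`, `language`, `firstPublicationDate`, `electronicPublicationDate`) are top-level fields of Europe PMC `core` results, but which response key feeds which column is an assumption. The nested author, grant and journal fields are left out.

```python
from typing import Any, Dict


def flatten_hit(accession: str, hit: Dict[str, Any]) -> Dict[str, Any]:
    """Map one ePMC search hit onto a flat record keyed by the query accession.

    Missing keys become None, so incomplete hits do not raise.
    """
    return {
        "accession": accession,                       # the sequence id queried
        "idpmc": hit.get("id"),                       # assumed source of idpmc
        "source": hit.get("source"),                  # e.g. MED, PMC, AGR
        "isopenaccess": hit.get("isOpenAccess") == "Y",
        "secondary_pmid": hit.get("pmid"),
        "secondary_pmcid": hit.get("pmcid"),
        "secondary_doi": hit.get("doi"),
        "first_pubdate": hit.get("firstPublicationDate"),
        "first_epubdate": hit.get("electronicPublicationDate"),
        "language": hit.get("language"),
    }
```

Returning `None` for absent keys keeps every record the same shape, which makes it straightforward to load the rows into a table for later analysis.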

ePMC extraction statistics

| Source | # accessions | Definition |
| --- | ---: | --- |
| MED | 534,039 | PubMed/MEDLINE NLM |
| AGR | 4,981 | Agricola |
| PMC | 1,756 | PubMed Central |
| CBA | 77 | Chinese Biological Abstracts |
| PPR | 70 | Preprints |
| PAT | 24 | Biological Patents |
| CTX | 5 | CiteXplore |
