
Genomic Data Retrieval with R

@HajkD HajkD released this 22 Feb 16:38
· 188 commits to master since this release

biomartr 1.0.2

Overall, this new version fixes a major internet connection issue that affected downloads from NCBI and ENSEMBL. Users can
reinstall the new version from CRAN, and downloads that previously failed should now run
without any changes to their code.

New Functions

  • New function check_annotation_biomartr() helps to check whether downloaded GFF or GTF files are corrupt.

  • New function getCollectionSet() allows users to retrieve a collection (genome, proteome, CDS, RNA, GFF, Repeat Masker, and AssemblyStats files) for multiple species

Example:

# define scientific names of species for which
# collections shall be retrieved
organism_list <- c("Arabidopsis thaliana", 
                   "Arabidopsis lyrata", 
                   "Capsella rubella")
# download the collections of all listed species from refseq
# and store the retrieved files in the folder 'set_collections'
biomartr::getCollectionSet(db       = "refseq",
                           organism = organism_list,
                           path     = "set_collections")

New Features

  • the getGFF() function receives a new argument remove_annotation_outliers to enable users to remove corrupt lines from a GFF file
    Example:
Ath_path <- biomartr::getGFF(organism = "Arabidopsis thaliana", remove_annotation_outliers = TRUE)
  • the getGFFSet() function receives a new argument remove_annotation_outliers to enable users to remove corrupt lines from a GFF file

  • the getGTF() function receives a new argument remove_annotation_outliers to enable users to remove corrupt lines from a GTF file

  • adding a new message system to biomartr::organismBM(), biomartr::organismAttributes(), and biomartr::organismFilters() so that large API queries no longer appear unresponsive

  • getCollection() receives new arguments release, remove_annotation_outliers, and gunzip that will now be passed on to downstream retrieval functions

  • the getGTF(), getGenome() and getGenomeSet() functions receive a new argument assembly_type = "toplevel" to enable users to choose between toplevel and primary assembly when using the ensembl database. Setting assembly_type = "primary_assembly" will save a lot of hard drive space for people using large ensembl genomes.

  • all get*() functions with a release argument now check whether the ENSEMBL release is > 45 (Many thanks to @Roleren #31 #61)

  • in all get*() functions, readr::write_tsv(path = ) was replaced by readr::write_tsv(file = ), since the readr package version > 1.4.0 is deprecating the path argument.

  • tbl_df() was deprecated in dplyr 1.0.0 in favor of tibble::as_tibble(); organismBM() was adjusted accordingly

  • custom_download(), getGENOMEREPORT(), and other download functions now specify withr::local_options(timeout = max(30000000, getOption("timeout"))), which raises the default 60 sec timeout to at least 30000000 sec
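
The timeout change in the last item can be illustrated with a small base R sketch. It is not biomartr's code: the helper name with_long_timeout() and its arguments are hypothetical, and options() plus on.exit() stand in for withr::local_options() so that the example has no package dependencies. The idea is the same: raise the timeout option for the duration of one call, never lower a value the user has already set higher, and restore the previous setting afterwards.

```r
# Hypothetical helper (not part of biomartr): temporarily raise the
# download timeout and restore the previous value when the function exits.
with_long_timeout <- function(task, min_timeout = 30000000) {
  # max() keeps a user-defined timeout that is already larger
  old <- options(timeout = max(min_timeout, getOption("timeout")))
  on.exit(options(old), add = TRUE)
  task()
}

# Usage: a real download call would go inside the function body;
# here the task only reports the timeout that is active while it runs.
timeout_inside <- with_long_timeout(function() getOption("timeout"))
```

Because the option is reset on exit, code that runs after the download sees the timeout it had before.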

Bug Fixes

  • Fixing bug where genome availability check in getCollection() was only performed in NCBI RefSeq and not in other databases due to a constant used in is.genome.available() rather than a variable (Many thanks to Takahiro Yamada for catching the bug) #53

  • fixing an issue that caused the read_cds() function to fail in data.table mode (Many thanks to Clement Kent) #57

  • fixing an SSL bug that was found on Ubuntu 20.04 systems #66 (Many thanks to Håkon Tjeldnes)

  • fixing global variable issue that caused clean.retrieval() to fail when no documentation file was in a meta.retrieval() folder

  • The NCBI recently started adding NA values as FTP file paths in their species summary files for species without reference genomes. As a result, meta.retrieval() stopped working, because no FTP paths were found for some species. This issue is now fixed by adding the filter rule !is.na(ftp_path) to all get*() functions (Many thanks to Ashok Kumar Sharma #34 and Dominik Merges #72 for making me aware of this issue)

  • Fixing an issue in custom_download() where the method argument was causing issues when downloading from https directed ftp sites (Many thanks to @cmatKhan) #76

  • Fixing an issue when trying to combine multiple summary-stats files where NAs were present in the list item that was passed along for combination in meta.retrieval() #73 (Many thanks to Dominik Merges)

  • Fixing a bug in download.database.all() where the listed *-metadata.json file was not removed, which corrupted the download process (Many thanks to Jaruwatana Lotharukpong)
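
The !is.na(ftp_path) filter rule mentioned in the NCBI summary file fix above can be sketched as follows. The table is made-up illustration data, not a real NCBI summary file: only the column name ftp_path comes from these notes, while the organism names and URLs are placeholders. Base R subsetting is used to keep the example dependency-free, which may differ from how the get*() functions apply the rule internally.

```r
# Made-up stand-in for an NCBI species summary table; only the
# column name ftp_path is taken from the release notes.
species_summary <- data.frame(
  organism_name = c("Species A", "Species B", "Species C"),
  ftp_path = c("ftp://example.org/a", NA, "ftp://example.org/c"),
  stringsAsFactors = FALSE
)

# Keep only the entries that actually provide an FTP path, so that
# downstream download steps never receive an NA path.
downloadable <- species_summary[!is.na(species_summary$ftp_path), , drop = FALSE]
```

Entries without a path are dropped before any download is attempted, instead of causing a failure later in the retrieval.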