Genomic Data Retrieval with R
biomartr 1.0.2
Overall, this new version fixes a major internet connection issue that affected downloads from NCBI and ENSEMBL. Users can
install the new version from CRAN, and downloads that previously failed should now run
without any changes to their code.
New Functions
- New function check_annotation_biomartr() helps to check whether downloaded GFF or GTF files are corrupt.
- New function getCollectionSet() allows users to retrieve a Collection (Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats) for multiple species
Example:
# define scientific names of species for which
# collections shall be retrieved
organism_list <- c("Arabidopsis thaliana",
                   "Arabidopsis lyrata",
                   "Capsella rubella")
# download the collections of all listed species from refseq
# and store the corresponding files in the folder 'set_collections'
biomartr::getCollectionSet(db = "refseq",
                           organism = organism_list,
                           path = "set_collections")
New Features
- the getGFF() function receives a new argument remove_annotation_outliers to enable users to remove corrupt lines from a GFF file
Example:
Ath_path <- biomartr::getGFF(organism = "Arabidopsis thaliana", remove_annotation_outliers = TRUE)
- the getGFFSet() function receives a new argument remove_annotation_outliers to enable users to remove corrupt lines from a GFF file
- the getGTF() function receives a new argument remove_annotation_outliers to enable users to remove corrupt lines from a GTF file
- adding a new message system to biomartr::organismBM(), biomartr::organismAttributes(), and biomartr::organismFilters() so that large API queries no longer appear unresponsive
- getCollection() receives new arguments release, remove_annotation_outliers, and gunzip that are now passed on to the downstream retrieval functions
- the getGTF(), getGenome(), and getGenomeSet() functions receive a new argument assembly_type = "toplevel" to enable users to choose between the toplevel and primary assembly when using the ensembl database. Setting assembly_type = "primary_assembly" will save a lot of space on hard drives for people using large ensembl genomes.
- all get*() functions with a release argument now check whether the ENSEMBL release is > 45 (Many thanks to @Roleren #31 #61)
- in all get*() functions, readr::write_tsv(path = ) was replaced with readr::write_tsv(file = ), since readr package versions > 1.4.0 are deprecating the path argument.
- tbl_df() was deprecated in dplyr 1.0.0 in favour of tibble::as_tibble(); organismBM() was adjusted accordingly
- custom_download(), getGENOMEREPORT(), and other download functions now set withr::local_options(timeout = max(30000000, getOption("timeout"))), which extends the default 60 sec timeout to 30000000 sec
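The scoped timeout change in the last item can be illustrated with a minimal sketch. It uses base R only (options() plus on.exit()) rather than withr, so that it runs without extra packages; the helper name with_long_timeout() is hypothetical and not part of biomartr.

```r
# Hypothetical helper (not part of biomartr): run a function with a raised
# download timeout and restore the previous setting afterwards.
with_long_timeout <- function(fun, min_timeout = 30000000) {
  # options() returns the previous values, so they can be restored on exit
  old <- options(timeout = max(min_timeout, getOption("timeout")))
  on.exit(options(old), add = TRUE)
  fun()
}

before <- getOption("timeout")                                # session value
inside <- with_long_timeout(function() getOption("timeout"))  # raised value
after  <- getOption("timeout")                                # restored value
```

Using max() means a user who has already configured an even longer timeout keeps it, and restoring on exit leaves the session option untouched once the download function returns.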
Bug Fixes
- Fixing a bug where the genome availability check in getCollection() was only performed in NCBI RefSeq and not in other databases, due to a constant being used in is.genome.available() rather than a variable (Many thanks to Takahiro Yamada for catching the bug) #53
- fixing an issue that caused the read_cds() function to fail in data.table mode (Many thanks to Clement Kent) #57
- fixing an SSL bug that was found on Ubuntu 20.04 systems #66 (Many thanks to Håkon Tjeldnes)
- fixing a global variable issue that caused clean.retrieval() to fail when no documentation file was in a meta.retrieval() folder
- The NCBI recently started adding NA values as FTP file paths in their species summary files for species without reference genomes. As a result, meta.retrieval() stopped working, because no FTP paths were found for some species. This issue is now fixed by adding the filter rule !is.na(ftp_path) to all get*() functions (Many thanks to Ashok Kumar Sharma #34 and Dominik Merges #72 for making me aware of this issue)
- Fixing an issue in custom_download() where the method argument was causing issues when downloading from https directed ftp sites (Many thanks to @cmatKhan) #76
- Fixing an issue when trying to combine multiple summary-stats files where NAs were present in the list item that was passed along for combination in meta.retrieval() #73 (Many thanks to Dominik Merges)
- Fixing a bug in download.database.all() where the lack of removing the listed file *-metadata.json caused corruption of the download process (Many thanks to Jaruwatana Lotharukpong)
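The !is.na(ftp_path) filter rule mentioned in the bug fixes above can be sketched on a toy table. The data frame below is a made-up stand-in for an NCBI species summary file; only the column name ftp_path is taken from the notes above, and the actual code inside the get*() functions may look different.

```r
# Toy stand-in for an NCBI species summary table (all values are made up)
summary_tbl <- data.frame(
  organism_name = c("Species A", "Species B", "Species C"),
  ftp_path = c("ftp://example.org/a", NA, "ftp://example.org/c"),
  stringsAsFactors = FALSE
)

# Keep only the rows that carry an FTP path, so that later download
# steps never receive NA as a file location
usable <- summary_tbl[!is.na(summary_tbl$ftp_path), , drop = FALSE]
n_usable <- nrow(usable)
```

Filtering before any download is attempted means that species without a reference genome are skipped rather than causing the whole retrieval to stop.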