Utility for simplifying bulk downloads of data from next-generation sequencing repositories such as NCBI SRA, MG-RAST, and iMicrobe.
Note: read downloads for some MG-RAST samples are not currently working through the MG-RAST web interface or API (as of 3/20/2020).
Install grabseqs and all dependencies via conda:
conda install grabseqs -c louiejtaylor -c bioconda -c conda-forge
Or with pip (and install the non-Python dependencies yourself):
pip install grabseqs
Note: If you're using SRA data, after you've installed sra-tools, run vdb-config -i and turn off local file caching unless you want extra copies of the downloaded sequences taking up space (read more here).
Download all samples from a single SRA Project:
grabseqs sra SRP#######
Or any combination of projects (S/ERP), runs (S/ERR), and BioProjects (PRJNA):
grabseqs sra SRR######## ERP####### PRJNA######## ERR########
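If you're assembling a mixed list of accessions in a script, it can help to sanity-check them before handing them to grabseqs. Here's a minimal Python sketch that sorts identifiers by the prefixes named above. The helper name `classify_accession` and the regular expressions are assumptions made for illustration; this is not how grabseqs itself parses identifiers, and prefixes not mentioned above are simply reported as unknown.

```python
import re

# Prefix patterns based only on the accession types named in this README.
# Anything else is reported as "unknown" in this sketch.
_PATTERNS = {
    "project": re.compile(r"^[SE]RP\d+$"),
    "run": re.compile(r"^[SE]RR\d+$"),
    "bioproject": re.compile(r"^PRJNA\d+$"),
}


def classify_accession(accession: str) -> str:
    """Return 'project', 'run', 'bioproject', or 'unknown' (hypothetical helper)."""
    for kind, pattern in _PATTERNS.items():
        if pattern.match(accession):
            return kind
    return "unknown"
```

Anything that comes back as "unknown" is worth a second look before you start a long download.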
If you'd like to do a dry run and just get a list of samples that will be downloaded, pass -l:
grabseqs sra -l SRP########
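The dry run is also handy from a script. Below is a minimal Python sketch that wraps it with `subprocess`. The function names are made up for this example, and because the exact format of the listing isn't described here, the sketch returns whatever grabseqs prints without trying to parse it.

```python
import subprocess
from typing import List


def dry_run_command(accessions: List[str]) -> List[str]:
    """Build the argv for a listing-only run; the -l flag comes from the usage text."""
    return ["grabseqs", "sra", "-l", *accessions]


def list_samples(accessions: List[str]) -> str:
    """Run the dry run and return grabseqs' output as-is (hypothetical helper).

    Requires grabseqs on PATH; raises CalledProcessError on a non-zero exit.
    """
    result = subprocess.run(
        dry_run_command(accessions),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```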
Similar syntax works for MG-RAST:
grabseqs mgrast mgp##### mgm#######
And iMicrobe (prefixing the sample numbers with "s" and project numbers with "p"):
grabseqs imicrobe p4 s3
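If you keep iMicrobe project and sample numbers as plain integers, the prefixing rule above is easy to apply in code. This is a small Python sketch; `imicrobe_ids` and `imicrobe_command` are hypothetical helper names, not part of grabseqs.

```python
from typing import Iterable, List


def imicrobe_ids(projects: Iterable[int] = (), samples: Iterable[int] = ()) -> List[str]:
    """Prefix project numbers with 'p' and sample numbers with 's'."""
    ids = [f"p{number}" for number in projects]
    ids += [f"s{number}" for number in samples]
    return ids


def imicrobe_command(projects: Iterable[int] = (), samples: Iterable[int] = ()) -> List[str]:
    """Build the argv for `grabseqs imicrobe` from raw numbers."""
    return ["grabseqs", "imicrobe", *imicrobe_ids(projects, samples)]
```

With `projects=[4]` and `samples=[3]`, this builds the same command as the example above.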
See the grabseqs FAQ for detailed troubleshooting tips!
Fun options:
grabseqs sra -t 10 -m metadata.csv -o proj/ -r 3 SRP#######
(translation: use 10 threads, save metadata to proj/metadata.csv, download to the dir proj/, retry failed downloads 3x, get all samples from SRP#######)
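If you launch many downloads from a pipeline, it may be easier to assemble that command in code. Here's a minimal Python sketch that builds the argument list from keyword arguments. `build_sra_command` is a hypothetical helper written for this example; only the flags (-t, -m, -o, -r) come from grabseqs.

```python
from typing import List, Optional, Sequence


def build_sra_command(
    accessions: Sequence[str],
    threads: Optional[int] = None,
    metadata: Optional[str] = None,
    outdir: Optional[str] = None,
    retries: Optional[int] = None,
) -> List[str]:
    """Assemble a `grabseqs sra` argv list (hypothetical helper, not part of grabseqs)."""
    cmd = ["grabseqs", "sra"]
    # Each flag is taken from the usage text; None means "leave it out".
    options = (("-t", threads), ("-m", metadata), ("-o", outdir), ("-r", retries))
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd + list(accessions)
```

Pass the resulting list to `subprocess.run(cmd, check=True)` to start the download. Using a list rather than a single shell string avoids quoting problems with paths that contain spaces.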
If you'd like to do a dry run and only get a list of samples that will be downloaded, pass -l:
grabseqs sra -l SRP########
If you'd like to pass your own arguments to fasterq-dump to get data in a slightly different format, you can do so like this:
grabseqs sra SRP####### -r 0 --custom_fqdump_args="--split-spot --progress"
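When calling grabseqs from Python, the custom arguments need to travel as a single value. Here's a minimal sketch that mirrors the command above. The helper names are hypothetical, and the sketch assumes grabseqs accepts the `--option=value` form exactly as shown in that command. Arguments that themselves contain spaces are not handled.

```python
from typing import List, Sequence


def custom_fqdump_option(fqdump_args: Sequence[str]) -> str:
    """Join the dump arguments into one --custom_fqdump_args=... token."""
    # The "=" form keeps a value that starts with "--" attached to its option.
    return "--custom_fqdump_args=" + " ".join(fqdump_args)


def sra_command_with_custom_args(accession: str, fqdump_args: Sequence[str]) -> List[str]:
    """Build the argv for the example above (hypothetical helper)."""
    return ["grabseqs", "sra", accession, "-r", "0", custom_fqdump_option(fqdump_args)]
```

Because the command is a list, no shell quoting is needed around the joined arguments.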
Full usage:
grabseqs sra [-h] [-m METADATA] [-o OUTDIR] [-r RETRIES] [-t THREADS]
[-f] [-l] [--no_parsing] [--parse_run_ids]
[--use_fastq_dump]
id [id ...]
positional arguments:
id One or more BioProject, ERR/SRR or ERP/SRP number(s)
optional arguments:
-h, --help show this help message and exit
-m METADATA filename in which to save SRA metadata (.csv format,
relative to OUTDIR)
-o OUTDIR directory in which to save output. created if it doesn't
exist
-r RETRIES number of times to retry download
-t THREADS threads to use (for fasterq-dump/pigz)
-f force re-download of files
-l list (but do not download) samples to be grabbed
--parse_run_ids parse SRR/ERR identifiers (do not pass straight to fasterq-dump)
--custom_fqdump_args CUSTOM_FQD_ARGS
"string" containing args to pass to fastq-dump
--use_fastq_dump use legacy fastq-dump instead of fasterq-dump (no
multithreaded downloading)
Downloads .fastq.gz files to OUTDIR (or the working directory if not specified). If the -m flag is passed, saves metadata to OUTDIR with filename METADATA in csv format.
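To pick up the results in a downstream script, you can compute where the files should land. This is a minimal Python sketch based on the description above; `metadata_path` and `downloaded_reads` are hypothetical helper names, and the sketch assumes the reads sit directly in OUTDIR rather than in subdirectories.

```python
from pathlib import Path
from typing import List, Optional


def metadata_path(outdir: str = ".", metadata: Optional[str] = None) -> Optional[Path]:
    """Return OUTDIR/METADATA, or None when no -m filename was given."""
    if metadata is None:
        return None
    return Path(outdir) / metadata


def downloaded_reads(outdir: str = ".") -> List[Path]:
    """List the .fastq.gz files directly inside OUTDIR, sorted by name."""
    return sorted(Path(outdir).glob("*.fastq.gz"))
```

For the earlier example (`-m metadata.csv -o proj/`), `metadata_path("proj/", "metadata.csv")` points at proj/metadata.csv.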
Similar options are available for downloading from MG-RAST:
grabseqs mgrast [-h] [-m METADATA] [-o OUTDIR] [-r RETRIES]
[-t THREADS] [-f] [-l]
rastid [rastid ...]
And iMicrobe:
grabseqs imicrobe [-h] [-m METADATA] [-o OUTDIR] [-r RETRIES]
[-t THREADS] [-f] [-l]
imicrobeid [imicrobeid ...]
See the grabseqs FAQ for detailed troubleshooting tips. If the FAQs don't fix your problem, feel free to open an issue!
- Python 3 (external packages req'd: requests, requests-html, pandas, fake-useragent)
- sra-tools>2.9
- pigz
- wget
If you use conda (on Linux), these will be installed for you!
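If you installed with pip, a quick check that the non-Python tools are on your PATH can save a failed run. Here's a minimal Python sketch using `shutil.which`. The executable names are an assumption: `fasterq-dump` stands in for sra-tools, alongside `pigz` and `wget`. The sketch only checks for presence and does not verify the sra-tools version.

```python
import shutil
from typing import Callable, List, Optional, Sequence

# Assumed executable names for the dependencies listed above.
REQUIRED_TOOLS = ("fasterq-dump", "pigz", "wget")


def missing_tools(
    tools: Sequence[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH (hypothetical helper)."""
    return [tool for tool in tools if which(tool) is None]
```

An empty list means everything was found. The `which` parameter exists only so the lookup can be swapped out when testing.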
Grabseqs runs on Mac or Linux. We've tested on these specific OSes:
Linux (conda or pip):
- CentOS 6, 7, and 8
- Debian 9 and 10
- Ubuntu 16.04, 18.04, and 19.10
- Red Hat Enterprise 6, 7, and 8
- SUSE Enterprise 12 and 15
Mac (pip):
- MacOS 10.14
Grabseqs has been tested and works with the following versions of the Python dependencies (though these are neither minimal nor pinned version numbers):
- requests 2.22.0
- requests-html 0.10.0
- pandas 0.25.3
- fake-useragent 0.1.11
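To compare your environment against the tested versions, you can ask Python which versions are installed. Here's a minimal sketch using the standard library's `importlib.metadata` (Python 3.8+). The `TESTED` dictionary just restates the list above, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional

# Versions grabseqs was tested with, copied from the list above.
TESTED = {
    "requests": "2.22.0",
    "requests-html": "0.10.0",
    "pandas": "0.25.3",
    "fake-useragent": "0.1.11",
}


def installed_versions(packages: Iterable[str] = TESTED) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found
```

Since the listed versions are neither minimal nor pinned, a mismatch is a hint for troubleshooting rather than an error.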
If you use grabseqs in your work, please cite:
Louis J Taylor, Arwa Abbas, Frederic D Bushman. "grabseqs: Simple downloading of reads and metadata from multiple next-generation sequencing data repositories." Bioinformatics, (2020), btaa167, https://doi.org/10.1093/bioinformatics/btaa167
Please also cite the researchers who generated the data (and the repository, if appropriate)!
Dev version (not yet released)
- Added a walk-through for adding a new repo using template.py
- Better handling for invalid SRA accession numbers
0.7.0 (2020-01-29)
- Allow users to pass custom args to fast(er)q-dump
- Minor re-writes of download handling code for easier readability
0.6.1 (2019-12-20)
- Validate compressed files (fix #8 and #34)
0.6.0 (2019-12-12)
- Gracefully handle incomplete or missing dependencies
- Major rewrite of test suite
0.5.2 (2019-12-05)
- Improvements to work with multiple versions of Python 3
0.5.1 (2019-11-23)
- Hotfix handling outdated versions of sra-tools
0.5.0 (2019-04-11)
- Metadata available for all sources in .csv format
This project grew out of and incorporates code from hisss; many thanks to ArwaAbbas for helping make this work!