Create routines that automate the downloading and processing of remote data from data repositories. This issue will serve as a summary of work done, preliminary documentation, and proposals.
Requirements
Reprocessing data from large consortia often involves downloading, renaming and storing a large number of large data files. This is tedious and error prone, and can take a very large amount of disk space (TCGA raw data is nearly 100TB).
Specifics
Download and process raw data, such as fastq or BAM files, from remote repositories
Do not store the raw data locally, other than in temporary directories
Manage secure access via tokens etc.
Integrate into existing pipelines with minimal changes
Remote Access
Instead of placing input files in the current or input directory, files with the suffix .remote are used. These files are named the same way as normal input files (i.e. TISSUE-CONDITION-REPLICATE or similar), but contain details for accessing the file from a remote repository.
The data file itself is downloaded to the execution node, processed, and deleted. The raw data is never permanently stored, which reduces the disk space needed for processing large amounts of remote data.
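The download, process, delete cycle described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `process_remote`, `fetch` and `process` are hypothetical names, and the sketch only assumes that the download step can be pointed at a temporary directory.

```python
import tempfile
from pathlib import Path


def process_remote(fetch, process, accession):
    """Download into a temporary directory, process, then discard the raw data.

    `fetch(accession, outdir)` must return the path of the downloaded file and
    `process(path)` must return whatever result should be kept. Both are
    hypothetical callables supplied by the caller.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        raw = fetch(accession, Path(tmpdir))
        result = process(raw)
    # The temporary directory, and the raw file inside it, no longer exists here.
    return result
```

Only the value returned by `process` survives, so the disk space used at any one time is bounded by the files currently being processed.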
.remote files
Each .remote file contains a table with two columns, or three for TCGA.
The first column contains the repository from which to download the data; currently supported values are SRA, ENA and TCGA.
The second column contains the accession of the file (e.g. SRR1016916 or ERR000916 for SRA/ENA or 21ae315a-a823-40c4-8145-ff5260af3084 for TCGA)
The third column is used only for TCGA files, and is the name of the downloaded file (for some reason TCGA saw fit not to give the files the same name as the accession).
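As an illustration of the layout above, a .remote file could be read along these lines. The column separator is not specified here, so this sketch assumes whitespace-separated columns; `RemoteEntry` and `parse_remote` are hypothetical names, not part of the existing code.

```python
from typing import NamedTuple, Optional

SUPPORTED = {"SRA", "ENA", "TCGA"}


class RemoteEntry(NamedTuple):
    repository: str
    accession: str
    filename: Optional[str] = None  # third column, only present for TCGA


def parse_remote(text):
    """Parse the contents of a .remote file into a list of RemoteEntry.

    Assumes whitespace-separated columns (an assumption, not a documented rule).
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (2, 3):
            raise ValueError("expected 2 or 3 columns, got: %r" % line)
        if fields[0] not in SUPPORTED:
            raise ValueError("unsupported repository: %s" % fields[0])
        entries.append(RemoteEntry(*fields))
    return entries
```

For example, `parse_remote("ENA\tERR000916")` yields a single entry whose `filename` is `None`.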
Security
Downloading of secure/encrypted data is currently supported for SRA and TCGA.
For SRA the pipeline must be executed in a directory underneath the directory set up as the user's secure NCBI workspace.
For TCGA the pipeline will look for a file matching the glob gdc-user-token* in the pipeline directory.
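The TCGA token lookup amounts to a glob in the pipeline directory. A minimal sketch, assuming that the first match in sorted order is used when several files match (the behaviour in that case is not stated above); `find_gdc_token` is a hypothetical helper name.

```python
import glob
import os


def find_gdc_token(pipeline_dir="."):
    """Return the path of a file matching gdc-user-token*, or None if absent.

    Taking the first match in sorted order is an assumption made for this sketch.
    """
    matches = sorted(glob.glob(os.path.join(pipeline_dir, "gdc-user-token*")))
    return matches[0] if matches else None
```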
Repositories
Both SRA and ENA support download via ascp, a high-speed download protocol. SRA only supports the download of SRA files this way; these are reference compressed and must be extracted. Further, these files are downloaded to the user's SRA cache directory, meaning that they are not automatically removed when the tasks using them are finished, but must be deleted with a call to Sra.clean_cache() or by running cache-mgr --clean. Dumped fastq files are automatically removed, and SRA files are reference compressed, so they are smaller than fastq.gz. It does mean that if several tasks use the same file it will only be downloaded once.
ENA supports high-speed download of fastq files. All public files on SRA are also on ENA, so if you are downloading public data, ENA is generally preferred. It is envisaged that SRA will mainly be used for encrypted data.
TCGA does not support ascp, but currently does support fastq download (although this is in danger of being discontinued in favor of BAM only).
My recommendation is to use ENA where possible.
Implementation
Most of these features are implemented through additions to the preprocess method of the base SequenceCollectionProcessor class, and so should function transparently with any pipeline that uses the PipelineMapping or PipelinePreprocess classes, simply by passing .remote files as the infiles to the build method.
Additions have also been made to CGAT.Sra, which now includes prefetch, clean_cache, fetch_ENA, fetch_ENA_files(names) and fetch_TCGA_fastq methods. As this module no longer specifically deals with Sra, perhaps these functions should be moved or the module renamed.
A small number of changes need to be made to pipelines for these to work, mostly in recognizing input files. So far this has been done for pipeline_readqc and pipeline_mapping. It should probably be implemented for pipeline_transcriptdiffexpression shortly.
TODO: implement for TCGA BAM files.
Requirements
ENA download currently requires the installation of Aspera's ascp, and the setting of the environment variables $ASCP_BIN_PATH and $ASCP_KEY_PATH. SRA download is very much sped up by the installation of ascp.
TCGA download requires installation of gdc-client from the Genomic Data Commons.
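A pipeline could check these requirements up front rather than failing mid-run. The following is a sketch only: `missing_requirements` is a hypothetical helper, and it assumes that gdc-client is expected to be found on the PATH.

```python
import os
import shutil


def missing_requirements(repository, environ=None):
    """List what is missing for downloads from the given repository.

    ENA needs $ASCP_BIN_PATH and $ASCP_KEY_PATH to be set. TCGA needs the
    gdc-client executable (assumed here to be on the PATH). ascp is optional
    for SRA, so nothing is reported as missing for it.
    """
    environ = os.environ if environ is None else environ
    missing = []
    if repository == "ENA":
        for var in ("ASCP_BIN_PATH", "ASCP_KEY_PATH"):
            if not environ.get(var):
                missing.append("$" + var)
    elif repository == "TCGA":
        if shutil.which("gdc-client") is None:
            missing.append("gdc-client")
    return missing
```

Passing `environ` explicitly makes the check easy to exercise without changing the real environment.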