Transparent access to data repositories #233

Open
2 of 5 tasks
IanSudbery opened this issue Aug 4, 2016 · 3 comments
IanSudbery commented Aug 4, 2016

Create routines that automate the downloading and processing of remote data from data repositories. This issue will serve as a summary of work done, preliminary documentation, and proposals.

Requirements

Reprocessing data from large consortia often involves downloading, renaming and storing a large number of large data files. This is tedious and error prone, and can take a very large amount of disk space (the TCGA raw data is nearly 100TB).

Specifics

  • Download and process raw data, such as fastq or BAM files, from remote repositories
  • Do not store the raw data locally, other than in temporary directories
  • Manage secure access via tokens etc.
  • Integrate into existing pipelines with minimal changes

Remote Access

Instead of placing input files in the current or input directory, files with a .remote suffix are used. These files are named in the same way as normal input files (i.e. TISSUE-CONDITION-REPLICATE or similar), but contain the details needed to access the file from a remote repository.

The data file itself is downloaded to the execution node, processed, and then deleted; the raw data is never permanently stored, reducing the disk space needed for processing large amounts of remote data.

.remote files

Each .remote file contains a small table of two or three columns (see the illustrative example after this list).

  1. The first column contains the repository from which to download the data; currently supported values are SRA, ENA and TCGA.
  2. The second column contains the accession of the file (e.g. SRR1016916 or ERR000916 for SRA/ENA or 21ae315a-a823-40c4-8145-ff5260af3084 for TCGA)
  3. The third column is used only for TCGA files, and is the name of the downloaded file (for some reason TCGA saw fit not to give the files the same name as the accession).
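For illustration only (the exact column separator is not specified here, and the TCGA file name below is hypothetical), a .remote file pointing at a public ENA run might contain a single line:

```
ENA ERR000916
```

while a TCGA .remote file adds the name of the downloaded file as a third column:

```
TCGA 21ae315a-a823-40c4-8145-ff5260af3084 example_gdc_realn_rehead.bam
```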

Security

Downloading of secure/encrypted data is currently supported for SRA and TCGA.

For SRA, the pipeline must be executed in a directory underneath the directory set up as the user's secure NCBI workspace.

For TCGA, the pipeline will look for a file matching the glob gdc-user-token* in the pipeline directory.
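As a rough sketch of the token discovery described above (the helper name and the error handling are illustrative, not the actual implementation):

```python
import glob
import os


def find_gdc_token(pipeline_dir="."):
    """Return the path of a GDC user token file in the pipeline directory.

    Mirrors the behaviour described above: any file matching the glob
    gdc-user-token* is taken to be the token.
    """
    matches = glob.glob(os.path.join(pipeline_dir, "gdc-user-token*"))
    if not matches:
        raise IOError("no file matching gdc-user-token* found in %s" % pipeline_dir)
    return matches[0]
```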

Repositories

Both SRA and ENA support download via ascp, a high-speed download protocol. SRA only supports the download of SRA files this way, which are reference compressed and must be extracted. Further, these files are downloaded to the user's SRA cache directory, meaning that they are not automatically removed when processing is finished, but must be deleted with a call to Sra.clean_cache() or by running cache-mgr --clean. Dumped fastq files are removed automatically, and SRA files are reference compressed, so smaller than fastq.gz. The cache does mean that if several tasks use the same file it will only be downloaded once.

ENA supports high-speed download of fastq files. All public files on SRA are also on ENA, so if you are downloading public data, ENA is generally preferred. It is envisaged that SRA will mainly be used for encrypted data.

TCGA does not support ascp, but currently does support fastq download (although this is in danger of being discontinued in favor of BAM only).

My recommendation is to use ENA where possible.

Implementation

Most of these features are implemented through additions to the preprocess method of the base SequenceCollectionProcessor class, and so should function transparently with any pipeline that uses the PipelineMapping or PipelinePreprocess classes, simply passing .remote files as the infiles to the build method.
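As a sketch of what this looks like from a pipeline's point of view (the mapper class, import path and file names here are illustrative, not taken from the branch):

```python
import CGATPipelines.PipelineMapping as PipelineMapping

# liver-normal-R1.remote describes where to fetch the reads from,
# rather than being a local fastq file
infile = "liver-normal-R1.remote"
outfile = "liver-normal-R1.bam"

m = PipelineMapping.Hisat()  # any mapper built on SequenceCollectionProcessor
statement = m.build((infile,), outfile)
# the returned statement now also covers download, extraction and cleanup
```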

Additions have also been made to CGAT.Sra, which now includes prefetch, clean_cache, fetch_ENA, fetch_ENA_files(names) and fetch_TCGA_fastq methods. As this module no longer specifically deals with Sra, perhaps these functions should be moved or the module renamed.
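A rough usage sketch of those helpers; only the function names come from this issue, so the argument names, ordering and defaults below are guesses:

```python
from CGAT import Sra

# fetch fastq files for a public ENA run
Sra.fetch_ENA("ERR000916")

# fetch fastq for a protected TCGA file; the filename and token arguments
# shown here are assumptions about the interface
Sra.fetch_TCGA_fastq("21ae315a-a823-40c4-8145-ff5260af3084",
                     "example_gdc_realn_rehead.bam",
                     token="gdc-user-token.txt")

# prefetch an SRA run into the local SRA cache and tidy up afterwards
Sra.prefetch("SRR1016916")
Sra.clean_cache()
```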

A small number of changes need to be made to pipelines for this to work, mostly in recognizing input files. So far this has been done for pipeline_readqc and pipeline_mapping. It should probably be implemented for pipeline_transcriptdiffexpression shortly.

  • TODO: implement for TCGA BAM files.

Dependencies

ENA download currently requires the installation of Aspera's ascp, and the setting of the environment variables $ASCP_BIN_PATH and $ASCP_KEY_PATH. SRA download is also greatly sped up by installing ascp.
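A small sanity-check sketch, assuming only that the two environment variables above must be set for ascp downloads to work:

```python
import os

# fail early if the Aspera client is not configured
for var in ("ASCP_BIN_PATH", "ASCP_KEY_PATH"):
    if var not in os.environ:
        raise ValueError("%s is not set; high-speed download via ascp "
                         "will not be available" % var)
```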

TCGA download requires installation of gdc-client from the Genomic Data Commons.

@IanSudbery IanSudbery self-assigned this Aug 4, 2016
@IanSudbery
Member Author

Branches sudlab/CGATPiplines/IS_remote_access and sudlab/cgat/IS_remote_access_upstream_ready have the first implementations of this.

@sebastian-luna-valero
Member

Many thanks!

It all looks good to me, but I would be grateful if @CGATOxford/contributors had a look as well before merging.

@AndreasHeger
Member

@IanSudbery , @sebastian-luna-valero, I think this is a great capability to have, many thanks!
