This is a Nextflow workflow designed to download data for an EGA dataset and arrange in a form suitable for analysis, with raw FASTQ files and a metadata table.
The workflow uses the pyega3 client to pull files directly from EGA.
- Nextflow installed. Current version tested with nextflow 24.04.3
- Formal access to the datasets of interest in EGA
- SLURM cluster management and job scheduling system
- An account and access the Seqera Platform (optional)
- Create a directory for the dataset (with appropriate access restrictions - this is controlled access data)
- Create 'metadata' and 'credentials' subdirectories
The python client authenticates via your user account, place a file called 'ega.credentials' in the credentials folder. It will look like:
{
"username": "me@foo.bar.uk",
"password": "abc123",
}
See the EGA documentation for more info.
Download the metadata bundle from the EGA page for each dataset. You'll have the option to download a zipped file with metadata in TSV, CSV and JSON. Download the CSV version, unzip it and place the files under 'metadata':
metadata
|- analyses.csv
|- analysis_sample.csv
|- ...
We now have all the information we need to download the raw data and process the metadata.
Clone this repository to the top directory.
Then run:
source envs.sh
nextflow run main.nf -c nextflow.config --EGA_DATASET_ID $EGA_DATASET_ID
... or
nextflow run main.nf -c nextflow.config --EGA_DATASET_ID $EGA_DATASET_ID -with-tower
to leverage the Seqera Platform capabilities (optional). You'll need to obtain a token and add it to envs.sh
.
The result will be:
- A metadata summary at a location like work/..../ega_metadata/EGAD00011223344.merged.csv
- FASTQ files at data/.../ega_data/(EGAFxxxxx)/fastq
Nexflow leaves a few things lying around, so once the above has succeeded, remove them:
rm -rf .nextflow*
A previous implementation was used that required BAM/ CRAM files downloaded from EGA to be converted to fastq in an endedness-specific manner (i.e. paired endedness detected and handled correctly). The previous workflow, in addition of using the pyega3 client to pull files directly from EGA, included also the Aspera dropbox method to download files from a 'dropbox' provided to you from EGA staff - this method is now deprecated. See release v1.0.0 for more information on the earlier version.