This repository contains automation scripts to process submissions from the Covid-19 data portal project: https://www.covid19dataportal.org/
Retrieve the latest version
pip install git+https://github.com/EBIvariation/covid19dp-submission.git@master
Retrieve a tagged version
pip install git+https://github.com/EBIvariation/covid19dp-submission.git@v0.1.2
ingest_covid19dp_submission.py --project-dir /path/to/project/dir/PRJEB45554 --num-analyses 10000 --processed-analyses-file /file/containing/list/of/analyses/already/processed --app-config-file /path/to/app_config.yml --nextflow-config-file /path/to/nextflow.config
See the application configuration and Nextflow configuration examples.
The above command will run the following steps (see workflow definition):
- Download analysis files using the ENA REST services.
- Run VCF validation on all the downloaded VCF files.
- Run bgzip compression and indexing on the VCF files.
- Run multi-stage vertical concatenation to combine the VCF files.
- Accession the resulting combined VCF file from the step above.
- Publish the accessioned files to the Covid-19 DP project directory on the public FTP.
- Cluster the variants in the SARS-CoV-2 assembly in the accessioning warehouse.
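The multi-stage vertical concatenation step can be pictured as merging files in fixed-size batches, then merging the batch outputs, until a single file remains. The sketch below only plans those stages and is an illustration of the idea, not the repository's workflow: the function name `plan_concat_stages`, the default batch size, and the intermediate file names are all hypothetical, and no VCF tool is invoked.

```python
from typing import List


def plan_concat_stages(files: List[str], batch_size: int = 500) -> List[List[List[str]]]:
    """Return, for each stage, the batches of input files to concatenate.

    Hypothetical helper: batch size and intermediate names are assumptions.
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    stages = []
    current = list(files)
    stage_index = 0
    # Keep merging until at most one file is left.
    while len(current) > 1:
        # Split the current inputs into consecutive batches.
        batches = [current[i:i + batch_size]
                   for i in range(0, len(current), batch_size)]
        stages.append(batches)
        # Each batch yields one (hypothetically named) intermediate file,
        # which becomes an input of the next stage. A batch holding a
        # single file is simply carried forward under a new name.
        current = [f"stage{stage_index}_batch{n}.vcf.gz"
                   for n in range(len(batches))]
        stage_index += 1
    return stages
```

With ten inputs and a batch size of three, this plans three stages: four batches, then two, then one final merge. Batching keeps the number of files opened by any single concatenation bounded, which matters when thousands of analyses are processed in one run.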
For usage on the EBI cluster, see here (EBI internal users only).
To resume a previous run:
ingest_covid19dp_submission.py --project-dir /path/to/project/dir/PRJEB45554 --num-analyses 10000 --processed-analyses-file /file/containing/list/of/analyses/already/processed --app-config-file /path/to/app_config.yml --nextflow-config-file /path/to/nextflow.config --resume-snapshot <processing_directory_name>
where <processing_directory_name> is the name of a processing directory inside the 30_eva_valid folder, formatted like 2022_05_18_11_00_41.
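To pick the value for --resume-snapshot without browsing the folder by hand, a small helper can list the timestamped directories and return the most recent one. This is a minimal sketch under assumptions: the name pattern `%Y_%m_%d_%H_%M_%S` is inferred from the example 2022_05_18_11_00_41, and `latest_snapshot` is a hypothetical function that is not part of this repository.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

# Assumed from the example name 2022_05_18_11_00_41.
SNAPSHOT_FORMAT = "%Y_%m_%d_%H_%M_%S"


def latest_snapshot(valid_dir: Path) -> Optional[str]:
    """Return the newest timestamp-named subdirectory of valid_dir, or None."""
    candidates = []
    for entry in valid_dir.iterdir():
        if not entry.is_dir():
            continue
        try:
            stamp = datetime.strptime(entry.name, SNAPSHOT_FORMAT)
        except ValueError:
            # Skip directories whose names are not timestamps.
            continue
        candidates.append((stamp, entry.name))
    # Tuples sort by timestamp first, so max() gives the most recent run.
    return max(candidates)[1] if candidates else None
```

For example, `latest_snapshot(Path("/path/to/project/dir/PRJEB45554/30_eva_valid"))` would return the directory name to pass to --resume-snapshot, assuming that path layout.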