Install the dependencies and take note of where you put them. The executables must either be on the PATH of the executing environment, or their full paths must be supplied in the Nextflow configs.
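A quick way to see which executables are not yet discoverable on the PATH is sketched below. The function name `check_tools` is hypothetical, and the executable names in the example call (`nextflow`, `vep`, `bam`) are assumptions about how the tools are named on your system; substitute the ones you installed.

```shell
# Print one line per executable that cannot be found on PATH.
# Returns 0 when everything was found, 1 otherwise.
check_tools() {
    _missing=0
    for _tool in "$@"; do
        if ! command -v "$_tool" >/dev/null 2>&1; then
            echo "missing: $_tool"
            _missing=1
        fi
    done
    return "$_missing"
}

# Example call; the executable names here are assumed, not prescribed.
check_tools nextflow vep bam || echo "Add the tools above to PATH or set their full paths in the Nextflow configs."
```

Any tool reported as missing needs either a PATH entry or a full path in the configs.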
Data processing workflows are written in Nextflow's DSL.
Basis data is loaded directly into the backing database.
- Genenames: HUGO Gene Nomenclature Committee (HGNC)
- Canonical Transcripts: Ensembl ID to Ensembl Transcript ID mapping
- OMIM File: Ensembl ID to OMIM ID mapping
- Gencode File: Source list of genes

See for where to obtain this data.
The CADD scores that are added to the VCF backing data during the prepare_VCF pipeline originate from here.
The pregenerated "All possible SNVs of GRCh38/hg38" scores provide the backing data.
A pipeline to generate CADD scores locally is not in the scope of this project.
curl -o hs38DH.fa
curl -o hs38DH.fa.fai
Clone and install from the bamUtil repo.
Installation instructions on
The master branch of the LoF plugin does not work for GRCh38, per this issue.
git clone --depth 1 --branch grch38 --single-branch
Running loftee in the prepare vcf workflow
loftee_human_ancestor_fa = "/path/to/VEP/Plugins/loftee_data/human_ancestor.fa.gz"
loftee_conservation_file = "/path/to/VEP/Plugins/loftee_data/loftee.sql"
loftee_gerp_bigwig = "/path/to/VEP/Plugins/loftee_data/"
mkdir -p loftee/data
curl -o loftee/data/human_ancestor.fa.gz
curl -o loftee/data/human_ancestor.fa.gz.fai
curl -o loftee/data/human_ancestor.fa.gz.gzi
echo "f8c79d45c8fdffb52ef6926d540f2dd3 loftee/data/human_ancestor.fa.gz" | md5sum -c
echo "205a31051be5f1a312c31abf8a298ed7 loftee/data/human_ancestor.fa.gz.fai" | md5sum -c
echo "121343b868d5da87cc04646d55e806c3 loftee/data/human_ancestor.fa.gz.gzi" | md5sum -c
curl -o loftee/data/phylocsf_gerp.sql.gz
gunzip loftee/data/phylocsf_gerp.sql.gz
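An interrupted download or decompression can leave a truncated file that only fails later inside the workflow. The sketch below assumes that the unpacked conservation file is an SQLite 3 database, which is not stated in these notes, and checks only the file header. The function name `is_sqlite` is hypothetical.

```shell
# Succeeds when the file starts with the SQLite 3 header string.
# Assumption: the conservation file is an SQLite 3 database.
is_sqlite() {
    [ "$(head -c 15 "$1" 2>/dev/null)" = "SQLite format 3" ]
}

if is_sqlite loftee/data/phylocsf_gerp.sql; then
    echo "phylocsf_gerp.sql looks like an SQLite database"
else
    echo "phylocsf_gerp.sql is missing or is not an SQLite database"
fi
```

A header check does not prove the file is complete; it only catches a missing file or one that is clearly of the wrong type.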
GERP scores for GRCh38 were lifted over from GRCh37; see tweet thread.
Information about conservation scores on the Ensembl site
curl -o loftee/data/