OSMP Annotation Workflow

OSMP is a tool that enables researchers to retrieve variants and phenotypic information from various institutions. An important goal in uniting these data is standardized annotation across data nodes. We therefore implement a workflow, submitted as a job to the Slurm scheduler on HPC4Health, that retrieves annotations for single-nucleotide variants. The dedicated compute node on HPC allows the annotation to run as close to on-the-fly as possible, an important criterion for fast retrieval of genomic information.

Nextflow

Nextflow is used for data orchestration and parallel execution of various tasks required for annotation.

The input to the workflow is a JSON string of variants, each containing a chromosome, reference allele, alternative allele, and position, as seen in sample_variants.json. Specifically, the chromosome, start, and end are used to perform an annotation query.
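As a rough illustration of the input described above, the sketch below parses a JSON string of variants and checks that each one carries the fields needed for a query. The key names (`chromosome`, `start`, `end`, `ref`, `alt`) and the example values are assumptions made for illustration; consult sample_variants.json for the actual keys used by the workflow.

```python
import json

# Hypothetical field names; sample_variants.json is the authority on the real keys.
REQUIRED_FIELDS = ("chromosome", "start", "end", "ref", "alt")


def parse_variants(json_string):
    """Parse a JSON string into a list of variant dicts and check required keys."""
    variants = json.loads(json_string)
    if not isinstance(variants, list):
        raise ValueError("expected a JSON array of variants")
    for index, variant in enumerate(variants):
        missing = [field for field in REQUIRED_FIELDS if field not in variant]
        if missing:
            raise ValueError(f"variant {index} is missing fields: {missing}")
    return variants


# Illustrative values only, not real test data from the repository.
example_json = (
    '[{"chromosome": "1", "start": 12345, "end": 12345, "ref": "A", "alt": "G"}]'
)
variants = parse_variants(example_json)
```

Validating up front means a malformed variant fails with a clear message before any annotation query is attempted.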

Two main sources for annotation are used: CADD VEP and gnomAD. The diagram below illustrates the fields extracted from each.

Running your workflow

To run the workflow locally, first make sure you are on a qlogin node on HPC and have loaded the appropriate modules:

module load nextflow Singularity

Singularity

Singularity is a container engine and an alternative to Docker that can be run with unprivileged permissions and doesn’t require a separate daemon process. These, along with other features, make Singularity better suited to the requirements of HPC workloads.

To run Nextflow with Singularity:

cd jobs
nextflow run annotation.nf -profile annotation --json <your_fake_json>

Note that there's a JSON string for test variants in annotation.nf, so you can simply drop the --json flag and argument if you want to use the same test data.
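Passing a JSON string on the command line requires careful quoting. The sketch below is one way to build the command shown above from a Python list of variants; the helper name `nextflow_command` and the variant keys are hypothetical, while the script name, profile, and `--json` parameter are taken from the command above.

```python
import json
import shlex


def nextflow_command(variants, script="annotation.nf", profile="annotation"):
    """Return a shell-safe command line that passes variants via --json."""
    # Compact separators keep the JSON payload short on the command line.
    payload = json.dumps(variants, separators=(",", ":"))
    args = ["nextflow", "run", script, "-profile", profile, "--json", payload]
    # shlex.join quotes each argument so the shell sees the JSON as one token.
    return shlex.join(args)


# Illustrative variant with assumed key names.
command = nextflow_command(
    [{"chromosome": "1", "start": 12345, "end": 12345, "ref": "A", "alt": "G"}]
)
print(command)
```

The printed line can be pasted into a shell from the jobs directory; `shlex.join` requires Python 3.8 or later.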

Docker

Singularity can use existing Docker images and pull from Docker registries. The image defined by the Dockerfile in this repository is built and published to GitHub whenever there is a push to the main development branch.

Workflow Summary

A JSON string containing variants is converted into a CSV.
Unique variants are extracted, and their coordinates (chr:start-end) are printed.
tabix is used to pull annotation information from CADD VEP and gnomAD.
Annotations and original variants are merged. A JSON is returned, which is subsequently sent through the Slurm API to OSMP's backend.
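The steps above can be sketched in plain Python to make the data flow concrete. This is not the code in annotation.nf: the function names, variant keys, and the `annotation` field are assumptions, and the merge is simplified to match on coordinates only, whereas a real merge would likely also compare alleles.

```python
import csv
import io

FIELDS = ["chromosome", "start", "end", "ref", "alt"]  # assumed key names


def region_of(variant):
    """Format a variant's coordinates as chr:start-end."""
    return f"{variant['chromosome']}:{variant['start']}-{variant['end']}"


def variants_to_csv(variants):
    """Step 1: convert the parsed JSON variants into CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(variants)
    return buffer.getvalue()


def unique_regions(variants):
    """Step 2: collect distinct chr:start-end strings, preserving input order."""
    return list(dict.fromkeys(region_of(variant) for variant in variants))


def merge_annotations(variants, annotations_by_region):
    """Step 4: attach annotation fields to each variant sharing a region."""
    return [
        {**variant, **annotations_by_region.get(region_of(variant), {})}
        for variant in variants
    ]


# Illustrative input; two variants share a position and so share one query region.
variants = [
    {"chromosome": "1", "start": 100, "end": 100, "ref": "A", "alt": "G"},
    {"chromosome": "1", "start": 100, "end": 100, "ref": "A", "alt": "T"},
    {"chromosome": "2", "start": 5, "end": 5, "ref": "C", "alt": "T"},
]
regions = unique_regions(variants)
# Step 3 (the tabix lookup) is stubbed here with a placeholder result.
merged = merge_annotations(variants, {"1:100-100": {"annotation": "placeholder"}})
```

Deduplicating regions before the lookup keeps the number of tabix queries proportional to the number of distinct positions rather than the number of variants.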

Tabix

The main annotation step relies on tabix, a tool that indexes a TAB-delimited genome position file and creates an index file. After indexing, tabix can quickly retrieve data lines overlapping regions specified in the format "chr:beginPos-endPos". (Coordinates specified in this region format are 1-based and inclusive.)
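A region query of this kind might be issued from Python as sketched below, assuming the `tabix` executable is on the PATH and the target file has already been indexed. The helper names and the file path in the usage comment are hypothetical; the command form `tabix <file> <region> [<region> ...]` is the standard tabix command-line usage.

```python
import subprocess


def tabix_command(indexed_file, regions):
    """Build the argument list for querying one or more chr:begin-end regions."""
    return ["tabix", indexed_file, *regions]


def query_tabix(indexed_file, regions):
    """Run tabix and return each matching line split into its TAB-separated fields."""
    result = subprocess.run(
        tabix_command(indexed_file, regions),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tabix exits with an error
    )
    return [line.split("\t") for line in result.stdout.splitlines()]


# Example call (hypothetical path; requires tabix and an indexed file):
# rows = query_tabix("/path/to/annotations.tsv.gz", ["1:100-100", "2:5-5"])
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file paths and region strings.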

The CADD VEP and gnomAD TSVs, along with their tabix indexes, are stored on HPF's data node.

Visualization

To see a DAG of your workflow execution, run:

nextflow run annotation.nf -with-dag flowchart.png
