Skip to content

Job script to run annotation on variants pulled back from the One-sided Matching Portal

Notifications You must be signed in to change notification settings

ccmbioinfo/osmp-annotation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

OSMP Annotation Workflow

OSMP is a tool that enables researchers to pull back variants and phenotypic information from various institutions. An important goal for uniting the data to have standardized annotation across different data nodes. We thereby implement a workflow that will be submitted as a job to the slurm scheduler on HPC4Health to retrieve annotation on single-nucleotide variants. The dedicated compute node on HPC allows the annotation to run as "on-the-fly" as possible, an important criteria for fast retrieval of genomic information.

Nextflow

Nextflow is used for data orchestration and parallel execution of various tasks required for annotation.

The input to the workflow is a JSON string of variants that contain chromosome, reference allele, alternative alelle, and position as seen in sample_variants.json. Specifically, the chromosome, start, and end are used to perform an annotation query.

Two main sources for annotation are used: CADD VEP and gnomAD. The diagram below illustrates the fields extracted from each.

Static Annotation Flow

Running your workflow

To run the workflow locally, first make sure you are on a qlogin node on HPC and have loaded the appropriate modules:

module load nextflow Singularity

Singularity

Singularity is a container engine alternative to Docker, which can be used with unprivileged permissions and doesn’t require a separate daemon process. These, along other features, makes Singularity a container engine better suited the requirements of HPC workloads.

To run Nextflow with Singularity:

cd jobs
nextflow run annotation.nf -profile annotation --json <your_fake_json> 

Note that there's a JSON string for test variants in annotation.nf, so you can simply drop the --json flag and argument if you want to use the same test data.

Docker

Singularity is able to use existing Docker images, and pull from Docker registries. The Dockerfile here is built and published to GitHub here when there's a push to the main development branch.

Workflow Summary

  1. JSON string containing variants are converted into a CSV.
  2. Unique variants are extracted, and their coordiates (chr:start-end) are printed.
  3. tabix is used to pull annotation information from CADD VEP and gnomAD.
  4. Annotations and original variants are merged. A JSON is returned, which will subsequently be sent through the Slurm API to OSMP's backend.

Annotation DAG

Tabix

The main annotation step relies on tabix, which is a tool used to index a TAB-delimited genome position file and creates an index file. After indexing, tabix is able to quickly retrieve data lines overlapping regions specified in the format "chr:beginPos-endPos". (Coordinates specified in this region format are 1-based and inclusive.)

Both CADD VEP and gnomAD TSVs and their indexes produced from tabix are stored on HPF's data node.

Visualization

To see a DAG of your workflow execution, run:

nextflow run annotation.nf -with-dag flowchart.png

About

Job script to run annotation on variants pulled back from the One-sided Matching Portal

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages