GDC DNA-Seq Alignment Workflow

This workflow takes a set of input WGS/WXS/Targeted Sequencing FASTQ/BAM files and generates a harmonized BAM file and a SQLite database containing the metrics collected during the run.

License Note

This repository is licensed under Apache License Version 2.0. Exceptions are code blocks licensed under CC-BY-SA-4.0.
The CC-BY-SA-4.0 code blocks are delimited by the markers /begin <AUTHOR> CC-BY-SA-4.0 and /end <AUTHOR> CC-BY-SA-4.0.

Environment

The workflows are tested under multiple Ubuntu versions:

Ubuntu 14.04
Ubuntu 16.04
Ubuntu 18.04

The Docker images are tested under multiple Docker versions.
https://docs.docker.com/engine/reference/builder/
The most tested ones are:

Docker version 19.03.2, build 6a30dfc
Docker version 18.09.1, build 4c52b90
Docker version 18.03.0-ce, build 0520e24
Docker version 17.12.1-ce, build 7390fc6

The CWL workflows are tested under multiple cwltool versions.
https://www.commonwl.org/
The most tested one is:

cwltool 1.0.20180306163216

For external users

The repository has only been tested on GDC data and in the particular environment GDC runs in. Some of the reference data required to run the workflow are hosted as GDC reference files. For any questions related to GDC data, please contact the GDC Help Desk at support@nci-gdc.datacommons.io.

The entrypoint CWL workflow for external users is workflows/main/gdc_dnaseq_main_workflow.cwl.

An example input JSON is provided in example/main_workflow_example_wgs_input.json.
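A run with the example input can be launched through the cwltool command line, which takes the workflow document followed by the job file. The sketch below only assembles that command. `build_cwltool_command` and the `output` directory name are illustrative and not part of this repository, and an actual run assumes that cwltool and Docker are installed and that the files referenced by the example JSON are available locally.

```python
import shlex

WORKFLOW = "workflows/main/gdc_dnaseq_main_workflow.cwl"
EXAMPLE_INPUT = "example/main_workflow_example_wgs_input.json"


def build_cwltool_command(workflow=WORKFLOW, job=EXAMPLE_INPUT, outdir="output"):
    """Return the argv list for a cwltool run (hypothetical helper)."""
    # cwltool expects the workflow document first, then the job (input) file.
    # --outdir selects the directory that receives the final outputs.
    return ["cwltool", "--outdir", outdir, workflow, job]


if __name__ == "__main__":
    cmd = build_cwltool_command()
    # Print the command so it can be reviewed or pasted into a shell.
    print(shlex.join(cmd))
    # To execute it from Python instead:
    #   import subprocess; subprocess.run(cmd, check=True)
```

Run the command from the repository root so that the relative workflow and example paths resolve.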

Inputs

Name	Type	Description
`bam_name`	`string`	basename of the final harmonized bam
`job_uuid`	`string`	unique identifier for the workflow run
`collect_wgs_metrics`	`boolean`	set to `true` to generate metrics for WGS data
`amplicon_kit_set_file_list`	`amplicon_kit_set_file[]`	array of objects containing the paths to the amplicon and target files (only for amplicon-based targeted/WXS sequencing)
`capture_kit_set_file_list`	`capture_kit_set_file[]`	array of objects containing the paths to the target and bait files (only for hybrid-selection targeted/WXS sequencing)
`readgroup_fastq_pe_file_list`	`readgroup_fastq_file[]`	array of objects containing the paths to paired-end fastq files and their associated readgroup metadata
`readgroup_fastq_se_file_list`	`readgroup_fastq_file[]`	array of objects containing the paths to single-end fastq files and their associated readgroup metadata
`readgroups_bam_file_list`	`readgroups_bam_file[]`	array of objects containing the paths to BAM files and their associated readgroup metadata
`common_biallelic_vcf`	`File`	tabix-indexed common biallelic VCF (e.g., gnomad)
`known_snp`	`File`	tabix-indexed dbSNP VCF
`run_markduplicates`	`boolean`	this should be `true` in all cases except for amplicon-based PCR sequencing libraries
`reference_sequence`	`File`	the reference fasta file and its associated BWA/fai/dict index files
`thread_count`	`long`	the number of cores to use for multi-threaded tools
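As a rough illustration of how the inputs in the table above fit together, the following sketch assembles a job document for WGS data supplied as paired-end FASTQ files. The helper names (`cwl_file`, `build_wgs_job`) and all paths are hypothetical. Passing empty arrays for the unused list inputs is an assumption, as is the expectation that index files sit next to their primary files; the example JSON in the repository is the authoritative reference for the exact layout.

```python
import json


def cwl_file(path):
    """Return a CWL File object for a local path."""
    return {"class": "File", "path": path}


def build_wgs_job(bam_name, job_uuid, fastq_pe_entries, reference,
                  known_snp, common_vcf, thread_count=8):
    """Assemble a job document keyed by the input names in the table above."""
    return {
        "bam_name": bam_name,
        "job_uuid": job_uuid,
        "collect_wgs_metrics": True,        # WGS data, so collect WGS metrics
        "amplicon_kit_set_file_list": [],   # assumption: unused lists are empty
        "capture_kit_set_file_list": [],
        "readgroup_fastq_pe_file_list": fastq_pe_entries,
        "readgroup_fastq_se_file_list": [],
        "readgroups_bam_file_list": [],
        "common_biallelic_vcf": cwl_file(common_vcf),
        "known_snp": cwl_file(known_snp),
        "run_markduplicates": True,         # not an amplicon-based library
        "reference_sequence": cwl_file(reference),
        "thread_count": thread_count,
    }


if __name__ == "__main__":
    # All paths and identifiers below are placeholders.
    job = build_wgs_job(
        bam_name="sample.harmonized.bam",
        job_uuid="00000000-0000-0000-0000-000000000000",
        fastq_pe_entries=[],
        reference="/refs/genome.fa",
        known_snp="/refs/dbsnp.vcf.gz",
        common_vcf="/refs/common_biallelic.vcf.gz",
    )
    print(json.dumps(job, indent=2))
```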

Custom Data Types

amplicon_kit_set_file - contains amplicon sequencing kit files

Name	Type	Description
`amplicon_kit_amplicon_file`	`File`	amplicon baits interval file
`amplicon_kit_target_file`	`File`	amplicon target interval file

capture_kit_set_file - contains the hybrid-selection targeted sequencing kit files

Name	Type	Description
`capture_kit_bait_file`	`File`	capture kit baits interval file
`capture_kit_target_file`	`File`	capture kit targets interval file

readgroup_fastq_file - contains readgroup level fastq files and the associated readgroup metadata

Name	Type	Description
`forward_fastq`	`File`	required R1 fastq file
`reverse_fastq`	`File?`	optional R2 fastq file (for paired-end reads)
`readgroup_meta`	`readgroup_meta`	object containing the readgroup metadata

readgroups_bam_file - contains a BAM file and the associated readgroup metadata

Name	Type	Description
`bam`	`File`	the BAM file
`readgroup_meta_list`	`readgroup_meta[]`	array of objects containing the readgroup metadata

readgroup_meta - contains readgroup metadata

Name	Type	Description
`CN`	`string?`	optional sequencing center
`DS`	`string?`	optional description
`DT`	`string?`	optional ISO8601 sequencing date
`FO`	`string?`	optional flow order: the array of nucleotide bases that correspond to the nucleotides used for each flow of each read
`ID`	`string`	required read group ID
`KS`	`string?`	optional array of nucleotide bases that correspond to the key sequence of each read
`LB`	`string?`	optional library ID
`PI`	`string?`	optional predicted median insert size
`PL`	`string`	required platform
`PM`	`string?`	optional platform model
`PU`	`string?`	optional platform unit
`SM`	`string`	required sample ID
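The custom types above nest: a readgroup_fastq_file entry carries its FASTQ files plus one readgroup_meta object, in which only `ID`, `PL`, and `SM` are required. The sketch below builds such an entry and rejects metadata that is missing a required tag. The function names are hypothetical, the file paths are placeholders, and the check is a convenience for preparing inputs, not the validation the workflow itself performs.

```python
REQUIRED_TAGS = ("ID", "PL", "SM")
OPTIONAL_TAGS = ("CN", "DS", "DT", "FO", "KS", "LB", "PI", "PM", "PU")


def make_readgroup_meta(**tags):
    """Return a readgroup_meta object, checking tag names and required tags."""
    unknown = sorted(set(tags) - set(REQUIRED_TAGS) - set(OPTIONAL_TAGS))
    if unknown:
        raise ValueError(f"unknown readgroup tags: {unknown}")
    missing = [tag for tag in REQUIRED_TAGS if not tags.get(tag)]
    if missing:
        raise ValueError(f"missing required readgroup tags: {missing}")
    # Every field is typed as a string in the table, so coerce values.
    return {tag: str(value) for tag, value in tags.items() if value is not None}


def make_readgroup_fastq_file(forward_fastq, readgroup_meta, reverse_fastq=None):
    """Return a readgroup_fastq_file entry; reverse_fastq is optional (R2)."""
    entry = {
        "forward_fastq": {"class": "File", "path": forward_fastq},
        "readgroup_meta": readgroup_meta,
    }
    if reverse_fastq is not None:
        entry["reverse_fastq"] = {"class": "File", "path": reverse_fastq}
    return entry


if __name__ == "__main__":
    meta = make_readgroup_meta(ID="rg1", PL="ILLUMINA", SM="sample1", LB="lib1")
    entry = make_readgroup_fastq_file("/data/rg1_R1.fq.gz", meta,
                                      reverse_fastq="/data/rg1_R2.fq.gz")
    print(entry)
```

Entries built this way would go into `readgroup_fastq_pe_file_list` when both mates are present, or into `readgroup_fastq_se_file_list` when only `forward_fastq` is set.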

Outputs

Name	Type	Description
`output_bam`	`File`	harmonized and indexed BAM file
`sqlite`	`File`	sqlite file containing metrics data
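The schema of the metrics database is not documented here, so a reasonable first step is to list the tables it contains. The sketch below does that with Python's standard `sqlite3` module; `summarize_tables` is a hypothetical helper and `metrics.sqlite` stands in for wherever the `sqlite` output was written.

```python
import sqlite3
from contextlib import closing


def summarize_tables(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    with closing(sqlite3.connect(db_path)) as conn:
        # sqlite_master lists every schema object; keep only tables.
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )]
        counts = {}
        for name in names:
            # Quote the identifier because table names may contain
            # characters that are not valid in a bare identifier.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts


if __name__ == "__main__":
    # Placeholder path: point this at the workflow's sqlite output.
    for table, rows in summarize_tables("metrics.sqlite").items():
        print(f"{table}\t{rows}")
```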

GDC Users

The entrypoint CWL workflow for GDC users is workflows/gdc_dnaseq.bamfastq_align.workflow.cwl.

OpenGenomics/gdc-dnaseq-cwl
