Enable Spark for GATK #76

Merged
12 commits merged on Dec 18, 2019
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -125,6 +125,6 @@ jobs:
run: |
docker pull nfcore/sarek:dev
docker tag nfcore/sarek:dev nfcore/sarek:dev
- name: Run vriant calling test on specific tools
- name: Run variant calling test on specific tools
run: |
nextflow run . -profile test_tool,docker --verbose --tools ${{ matrix.tool }}
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -19,6 +19,8 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) a
- [#60](https://github.com/nf-core/sarek/pull/60) - Add new minimal genomes (`TAIR10`, `EB2`, `UMD3.1`, `bosTau8`, `WBcel235`, `ce10`, `CanFam3.1`, `canFam3`, `GRCz10`, `danRer10`, `BDGP6`, `dm6`, `EquCab2`, `equCab2`, `EB1`, `Galgal4`, `galGal4`, `Gm01`, `hg38`, `hg19`, `Mmul_1`, `mm10`, `IRGSP-1.0`, `CHIMP2.1.4`, `panTro4`, `Rnor_6.0`, `rn6`, `R64-1-1`, `sacCer3`, `EF2`, `Sbi1`, `Sscrofa10.2`, `susScr3`, `AGPv3`) to `igenomes.config`
- [#61](https://github.com/nf-core/sarek/pull/61) - Add params `split_fastq`
- [#61](https://github.com/nf-core/sarek/pull/61) - Add test `SPLITFASTQ`
- [#66](https://github.com/nf-core/sarek/pull/66) - Add `Sentieon` possibilities to Sarek
- [#76](https://github.com/nf-core/sarek/pull/76) - Add `GATK Spark` possibilities to Sarek

### `Changed`

@@ -38,11 +40,14 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) a
- [#74](https://github.com/nf-core/sarek/pull/74) - Update docs
- [#74](https://github.com/nf-core/sarek/pull/74) - Improve CI tests (both Jenkins and GitHub actions tests)
- [#74](https://github.com/nf-core/sarek/pull/74) - Move all ci from `ci-extra.yml` to `ci.yml`
- [#76](https://github.com/nf-core/sarek/pull/76) - Use `MarkDuplicatesSpark` instead of `MarkDuplicates`
- [#76](https://github.com/nf-core/sarek/pull/76) - Use `gatk4-spark` instead of `gatk4` in `environment.yml`

### `Removed`

- [#46](https://github.com/nf-core/sarek/pull/46) - Remove mention of old `build.nf` script which was included in `main.nf`
- [#74](https://github.com/nf-core/sarek/pull/74) - Remove `download_image.sh` and `run_tests.sh` scripts
- [#76](https://github.com/nf-core/sarek/pull/76) - Remove `runOptions = "-u \$(id -u):\$(id -g)"` in `nextflow.config` to enable `Spark` possibilities

### `Fixed`

3 changes: 2 additions & 1 deletion docs/containers.md
@@ -21,7 +21,8 @@ For annotation, the main container can be used, but the cache has to be downloaded
- Contain **[Control-FREEC](https://github.com/BoevaLab/FREEC)** 11.5
- Contain **[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)** 0.11.8
- Contain **[FreeBayes](https://github.com/ekg/freebayes)** 1.3.1
- Contain **[GATK4](https://github.com/broadinstitute/gatk)** 4.1.4.0
- Contain **[GATK4](https://github.com/broadinstitute/gatk)** 4.1.4.1
- Contain **[GATK4-spark](https://github.com/broadinstitute/gatk)** 4.1.4.1
- Contain **[GeneSplicer](https://ccb.jhu.edu/software/genesplicer/)** 1.0
- Contain **[HTSlib](https://github.com/samtools/htslib)** 1.9
- Contain **[Manta](https://github.com/Illumina/manta)** 1.6.0
6 changes: 3 additions & 3 deletions docs/output.md
@@ -10,7 +10,7 @@ The pipeline processes data using the following steps:
- [Map to Reference](#map-to-reference)
- [BWA mem](#bwa-mem)
- [Mark Duplicates](#mark-duplicates)
- [GATK MarkDuplicates](#gatk-markduplicates)
- [GATK MarkDuplicatesSpark](#gatk-markduplicatesspark)
- [Base (Quality Score) Recalibration](#base-quality-score-recalibration)
- [GATK BaseRecalibrator](#gatk-baserecalibrator)
- [GATK ApplyBQSR](#gatk-applybqsr)
@@ -66,9 +66,9 @@ Such files are intermediate and not kept in the final files delivered to users.

### Mark Duplicates

#### GATK MarkDuplicates
#### GATK MarkDuplicatesSpark

[GATK MarkDuplicates](https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.4.0/picard_sam_markduplicates_MarkDuplicates.php) locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA.
[GATK MarkDuplicatesSpark](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_spark_transforms_markduplicates_MarkDuplicatesSpark.php) is a Spark implementation of [Picard MarkDuplicates](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_markduplicates_MarkDuplicates.php) and locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA.

This directory is the location for the BAM files delivered to users.
Besides the duplicate marked BAM files, the recalibration tables (`*.recal.table`) are also stored, and can be used to create base recalibrated files.
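
As a rough sketch of what this step runs under the hood (mirroring the `MarkDuplicatesSpark` call introduced in `main.nf`; the sample name, memory settings, and metrics path are placeholders):

```bash
# Mark duplicates with the Spark implementation of Picard MarkDuplicates.
# By default GATK Spark tools use a local Spark context, so no external
# cluster is required.
gatk --java-options "-Xms4g -Xmx8g" \
    MarkDuplicatesSpark \
    -I sample.bam \
    -O sample.md.bam \
    -M sample.bam.metrics \
    --tmp-dir . \
    --create-output-bam-index true
```

Note that the index is written next to the output as `sample.md.bam.bai`, which is why the TSV paths in this PR switch from `.md.bai` to `.md.bam.bai`.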
2 changes: 1 addition & 1 deletion environment.yml
@@ -14,7 +14,7 @@ dependencies:
- ensembl-vep=98.2
- fastqc=0.11.8
- freebayes=1.3.1
- gatk4=4.1.4.0
- gatk4-spark=4.1.4.1
- genesplicer=1.0
- htslib=1.9
- manta=1.6.0
33 changes: 15 additions & 18 deletions main.nf
@@ -940,39 +940,37 @@ process IndexBamFile {

// STEP 2: MARKING DUPLICATES

process MarkDuplicates {
process MarkDuplicatesSpark {
label 'cpus_16'

tag {idPatient + "-" + idSample}

publishDir params.outdir, mode: params.publishDirMode,
saveAs: {
if (it == "${idSample}.bam.metrics" && 'markduplicates' in skipQC) null
else if (it == "${idSample}.bam.metrics") "Reports/${idSample}/MarkDuplicates/${it}"
if (it == "${idSample}.bam.metrics") "Reports/${idSample}/MarkDuplicates/${it}"
else "Preprocessing/${idSample}/DuplicateMarked/${it}"
}

input:
set idPatient, idSample, file("${idSample}.bam") from mergedBam

output:
set idPatient, idSample, file("${idSample}.md.bam"), file("${idSample}.md.bai") into duplicateMarkedBams
file ("${idSample}.bam.metrics") into markDuplicatesReport
set idPatient, idSample, file("${idSample}.md.bam"), file("${idSample}.md.bam.bai") into duplicateMarkedBams
file ("${idSample}.bam.metrics") optional true into markDuplicatesReport

when: params.knownIndels

script:
markdup_java_options = task.memory.toGiga() > 8 ? params.markdup_java_options : "\"-Xms" + (task.memory.toGiga() / 2).trunc() + "g -Xmx" + (task.memory.toGiga() - 1) + "g\""
metrics = 'markduplicates' in skipQC ? '' : "-M ${idSample}.bam.metrics"
"""
gatk --java-options ${markdup_java_options} \
MarkDuplicates \
--MAX_RECORDS_IN_RAM 50000 \
--INPUT ${idSample}.bam \
--METRICS_FILE ${idSample}.bam.metrics \
--TMP_DIR . \
--ASSUME_SORT_ORDER coordinate \
--CREATE_INDEX true \
--OUTPUT ${idSample}.md.bam
MarkDuplicatesSpark \
-I ${idSample}.bam \
-O ${idSample}.md.bam \
${metrics} \
--tmp-dir . \
--create-output-bam-index true
"""
}

@@ -1132,7 +1130,7 @@ recalTableTSV.map { idPatient, idSample ->
status = statusMap[idPatient, idSample]
gender = genderMap[idPatient]
bam = "${params.outdir}/Preprocessing/${idSample}/DuplicateMarked/${idSample}.md.bam"
bai = "${params.outdir}/Preprocessing/${idSample}/DuplicateMarked/${idSample}.md.bai"
bai = "${params.outdir}/Preprocessing/${idSample}/DuplicateMarked/${idSample}.md.bam.bai"
recalTable = "${params.outdir}/Preprocessing/${idSample}/DuplicateMarked/${idSample}.recal.table"
"${idPatient}\t${gender}\t${status}\t${idSample}\t${bam}\t${bai}\t${recalTable}\n"
}.collectFile(
@@ -1145,7 +1143,7 @@ recalTableSampleTSV
status = statusMap[idPatient, idSample]
gender = genderMap[idPatient]
bam = "${params.outdir}/Preprocessing/${idSample}/DuplicateMarked/${idSample}.md.bam"
bai = "${params.outdir}/Preprocessing/${idSample}/DuplicateMarked/${idSample}.md.bai"
bai = "${params.outdir}/Preprocessing/${idSample}/DuplicateMarked/${idSample}.md.bam.bai"
recalTable = "${params.outdir}/Preprocessing/${idSample}/DuplicateMarked/${idSample}.recal.table"
["duplicateMarked_${idSample}.tsv", "${idPatient}\t${gender}\t${status}\t${idSample}\t${bam}\t${bai}\t${recalTable}\n"]
}
@@ -1155,12 +1153,10 @@ bamApplyBQSR = bamMDToJoin.join(recalTable, by:[0,1])
if (step == 'recalibrate') bamApplyBQSR = inputSample

bamApplyBQSR = bamApplyBQSR.dump(tag:'BAM + BAI + RECAL TABLE')
// [DUMP: recal.table] ['normal', 'normal', normal.md.bam, normal.md.bai, normal.recal.table]

bamApplyBQSR = bamApplyBQSR.combine(intApplyBQSR)

bamApplyBQSR = bamApplyBQSR.dump(tag:'BAM + BAI + RECAL TABLE + INT')
// [DUMP: BAM + BAI + RECAL TABLE + INT] ['normal', 'normal', normal.md.bam, normal.md.bai, normal.recal.table, 1_1-200000.bed]

// STEP 4: RECALIBRATING

@@ -1517,7 +1513,8 @@ process GenotypeGVCFs {
// Using -L is important for speed and we have to index the interval files also
"""
gatk --java-options -Xmx${task.memory.toGiga()}g \
IndexFeatureFile -F ${gvcf}
IndexFeatureFile \
-I ${gvcf}

gatk --java-options -Xmx${task.memory.toGiga()}g \
GenotypeGVCFs \
1 change: 0 additions & 1 deletion nextflow.config
@@ -121,7 +121,6 @@ profiles {
docker {
enabled = true
fixOwnership = true
runOptions = "-u \$(id -u):\$(id -g)"
}
singularity.enabled = false
}