Merge branch 'flexible_input_reqts' into 'master'

Preps QuaC for public availability

Closes #45, #49, #52, #53, and #56

See merge request center-for-computational-genomics-and-data-science/sciops/pipelines/quac!6
uab-cgds-worthey · Jan 20, 2023 · 92db1d1
2 parents df9cde5 + bec7502
commit 92db1d1
Showing 48 changed files with 987 additions and 226 deletions.
diff --git a/.gitmodules b/.gitmodules
diff --git a/.test/README.md b/.test/README.md
@@ -1,29 +1,34 @@
 # Testing
 
-Output from [Small variant caller
-pipeline](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/pipelines/small_variant_caller_pipeline)
-are the inputs to QuaC pipeline. Hence following datasets are necessary for testing:
+Input directory structure to QuaC is based on the output directory structure of the [Small variant caller
+pipeline](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/pipelines/small_variant_caller_pipeline).
+Following files are necessary for testing:
 
 1. bams
 2. vcfs
-3. QC output (from tools fastqc, fastq-screen and picard-markduplicates)
-4. Sample rename config
+3. Capture regions bed file - Required only for exome mode
+4. QC output from tools fastqc, fastq-screen and picard-markduplicates - Required only if `priorQC` is used
+5. Sample rename config - Required only if `priorQC` is used
 
-Note: Be sure to preserve directory structure used in the output of Small variant caller
+**Note**: If `priorQC` is used, be sure to preserve directory structure used in the output of CGDS Small variant caller
 pipeline.
 
 ## Setup test datasets
 
-* To setup test bam and vcf files, which are from sub-sampled NA12878 data, run:
+### Required
+
+* To setup test bam, vcf and capture region bed files, which are from sub-sampled NA12878 data, run:
 
 ```sh
 cd .test
 ./setup_test_datasets.sh
 ```
 
-* QuaC also needs test QC outputs for fastq (and sample rename config), which get created by small var caller pipeline.
-  This was achieved by running the small variant caller pipeline using its test datasets with some modifications. Steps
-  are briefly shown here:
+### Optional - priorQC mode
+
+* If used in `priorQC` mode, QuaC also needs test QC outputs for fastq (and sample rename config), which at CGDS get
+  created by the small var caller pipeline. Below, we create fastq QC and sample rename config using the small variant
+  caller pipeline for samples `A` and `B`.
 
 ```sh
 cd <small_var_caller_pipeline_dir>

diff --git a/.test/configs/project_1_sample.ped → ...nfigs/include_priorQC/project_1sample.ped b/.test/configs/project_1_sample.ped → ...nfigs/include_priorQC/project_1sample.ped
diff --git a/.test/configs/project_2_samples.ped → ...figs/include_priorQC/project_2samples.ped b/.test/configs/project_2_samples.ped → ...figs/include_priorQC/project_2samples.ped
diff --git a/.test/configs/no_priorQC/project_1sample.ped b/.test/configs/no_priorQC/project_1sample.ped
@@ -0,0 +1,2 @@
+#family_id	sample_id	paternal_id	maternal_id	sex	phenotype
+unknown	C	father_1	mother_1	-9	-9
diff --git a/.test/configs/no_priorQC/project_2samples.ped b/.test/configs/no_priorQC/project_2samples.ped
@@ -0,0 +1,3 @@
+#family_id	sample_id	paternal_id	maternal_id	sex	phenotype
+unknown	C	father_1	mother_1	-9	-9
+unknown	D	father_1	mother_1	-9	-9
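
The two `no_priorQC` pedigree files above share a six-column, tab-separated layout named by their header line. As a minimal sketch of how such a file could be sanity-checked before a test run, the snippet below recreates the two-sample file in a scratch location and uses `awk` to count malformed rows and list sample IDs. The variable names (`PED_FILE`, `BAD_ROWS`, `SAMPLE_IDS`) and the check itself are illustrative assumptions, not part of QuaC.

```sh
#!/usr/bin/env sh
set -eu

# Hypothetical scratch copy of the two-sample pedigree added in this commit
PED_FILE="$(mktemp)"
printf '#family_id\tsample_id\tpaternal_id\tmaternal_id\tsex\tphenotype\n' > "$PED_FILE"
printf 'unknown\tC\tfather_1\tmother_1\t-9\t-9\n' >> "$PED_FILE"
printf 'unknown\tD\tfather_1\tmother_1\t-9\t-9\n' >> "$PED_FILE"

# Count data rows that do not have exactly six tab-separated fields
BAD_ROWS="$(awk -F '\t' '!/^#/ && NF != 6 { bad++ } END { print bad + 0 }' "$PED_FILE")"

# Collect the sample IDs (second column), skipping the header line
SAMPLE_IDS="$(awk -F '\t' '!/^#/ { print $2 }' "$PED_FILE" | paste -s -d ' ' -)"

echo "malformed rows: $BAD_ROWS"
echo "samples: $SAMPLE_IDS"
```

The sample IDs recovered this way should correspond to the per-sample directories that `setup_test_datasets.sh` creates under `ngs-data/test_project/analysis`.
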
diff --git a/.test/ngs-data/test_project/analysis/A/qc/dedup/A-1.metrics.txt b/.test/ngs-data/test_project/analysis/A/qc/dedup/A-1.metrics.txt
@@ -1,5 +1,5 @@
 ## htsjdk.samtools.metrics.StringHeader
-# MarkDuplicates INPUT=[/data/scratch/manag/test_pipeline/small_variant_caller/interim/A/mapped/A-1.sorted.bam] OUTPUT=/data/scratch/manag/test_pipeline/small_variant_caller/interim/A/dedup/A-1.bam METRICS_FILE=/data/scratch/manag/test_pipeline/small_variant_caller/analysis/A/qc/dedup/A-1.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp]    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
+# MarkDuplicates INPUT=[/test_pipeline/small_variant_caller/interim/A/mapped/A-1.sorted.bam] OUTPUT=/test_pipeline/small_variant_caller/interim/A/dedup/A-1.bam METRICS_FILE=/test_pipeline/small_variant_caller/analysis/A/qc/dedup/A-1.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp]    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
 ## htsjdk.samtools.metrics.StringHeader
 # Started on: Fri Apr 02 19:39:58 UTC 2021
 

diff --git a/.test/ngs-data/test_project/analysis/A/qc/dedup/A-2.metrics.txt b/.test/ngs-data/test_project/analysis/A/qc/dedup/A-2.metrics.txt
@@ -1,5 +1,5 @@
 ## htsjdk.samtools.metrics.StringHeader
-# MarkDuplicates INPUT=[/data/scratch/manag/test_pipeline/small_variant_caller/interim/A/mapped/A-2.sorted.bam] OUTPUT=/data/scratch/manag/test_pipeline/small_variant_caller/interim/A/dedup/A-2.bam METRICS_FILE=/data/scratch/manag/test_pipeline/small_variant_caller/analysis/A/qc/dedup/A-2.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp]    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
+# MarkDuplicates INPUT=[/test_pipeline/small_variant_caller/interim/A/mapped/A-2.sorted.bam] OUTPUT=/test_pipeline/small_variant_caller/interim/A/dedup/A-2.bam METRICS_FILE=/test_pipeline/small_variant_caller/analysis/A/qc/dedup/A-2.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp]    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
 ## htsjdk.samtools.metrics.StringHeader
 # Started on: Fri Apr 02 19:40:06 UTC 2021
 

diff --git a/.test/ngs-data/test_project/analysis/B/qc/dedup/B-1.metrics.txt b/.test/ngs-data/test_project/analysis/B/qc/dedup/B-1.metrics.txt
@@ -1,5 +1,5 @@
 ## htsjdk.samtools.metrics.StringHeader
-# MarkDuplicates INPUT=[/data/scratch/manag/test_pipeline/small_variant_caller/interim/B/mapped/B-1.sorted.bam] OUTPUT=/data/scratch/manag/test_pipeline/small_variant_caller/interim/B/dedup/B-1.bam METRICS_FILE=/data/scratch/manag/test_pipeline/small_variant_caller/analysis/B/qc/dedup/B-1.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp]    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
+# MarkDuplicates INPUT=[/test_pipeline/small_variant_caller/interim/B/mapped/B-1.sorted.bam] OUTPUT=/test_pipeline/small_variant_caller/interim/B/dedup/B-1.bam METRICS_FILE=/test_pipeline/small_variant_caller/analysis/B/qc/dedup/B-1.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp]    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
 ## htsjdk.samtools.metrics.StringHeader
 # Started on: Fri Apr 02 19:39:58 UTC 2021
 

diff --git a/.test/ngs-data/test_project/analysis/B/qc/dedup/B-2.metrics.txt b/.test/ngs-data/test_project/analysis/B/qc/dedup/B-2.metrics.txt
@@ -1,5 +1,5 @@
 ## htsjdk.samtools.metrics.StringHeader
-# MarkDuplicates INPUT=[/data/scratch/manag/test_pipeline/small_variant_caller/interim/B/mapped/B-2.sorted.bam] OUTPUT=/data/scratch/manag/test_pipeline/small_variant_caller/interim/B/dedup/B-2.bam METRICS_FILE=/data/scratch/manag/test_pipeline/small_variant_caller/analysis/B/qc/dedup/B-2.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp]    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
+# MarkDuplicates INPUT=[/test_pipeline/small_variant_caller/interim/B/mapped/B-2.sorted.bam] OUTPUT=/test_pipeline/small_variant_caller/interim/B/dedup/B-2.bam METRICS_FILE=/test_pipeline/small_variant_caller/analysis/B/qc/dedup/B-2.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp]    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
 ## htsjdk.samtools.metrics.StringHeader
 # Started on: Fri Apr 02 19:40:06 UTC 2021
 

diff --git a/.test/ngs-data/test_project/analysis/C/bam/C.bam b/.test/ngs-data/test_project/analysis/C/bam/C.bam
diff --git a/.test/ngs-data/test_project/analysis/C/bam/C.bam.bai b/.test/ngs-data/test_project/analysis/C/bam/C.bam.bai
diff --git a/.test/ngs-data/test_project/analysis/C/configs/small_variant_caller/capture_regions.bed b/.test/ngs-data/test_project/analysis/C/configs/small_variant_caller/capture_regions.bed
@@ -0,0 +1,2 @@
+chr20	59992	3653078
+
diff --git a/.test/ngs-data/test_project/analysis/C/vcf/C.vcf.gz b/.test/ngs-data/test_project/analysis/C/vcf/C.vcf.gz
diff --git a/.test/ngs-data/test_project/analysis/C/vcf/C.vcf.gz.tbi b/.test/ngs-data/test_project/analysis/C/vcf/C.vcf.gz.tbi
diff --git a/.test/ngs-data/test_project/analysis/D/bam/D.bam b/.test/ngs-data/test_project/analysis/D/bam/D.bam
diff --git a/.test/ngs-data/test_project/analysis/D/bam/D.bam.bai b/.test/ngs-data/test_project/analysis/D/bam/D.bam.bai
diff --git a/.test/ngs-data/test_project/analysis/D/configs/small_variant_caller/capture_regions.bed b/.test/ngs-data/test_project/analysis/D/configs/small_variant_caller/capture_regions.bed
@@ -0,0 +1,2 @@
+chr20	59992	3653078
+
diff --git a/.test/ngs-data/test_project/analysis/D/vcf/D.vcf.gz b/.test/ngs-data/test_project/analysis/D/vcf/D.vcf.gz
diff --git a/.test/ngs-data/test_project/analysis/D/vcf/D.vcf.gz.tbi b/.test/ngs-data/test_project/analysis/D/vcf/D.vcf.gz.tbi
diff --git a/.test/setup_test_datasets.sh b/.test/setup_test_datasets.sh
@@ -19,7 +19,8 @@ TARGET_REGION="chr20:59993-3653078"
 samtools view -s 0.03 -b $NA12878_BAM $TARGET_REGION > $SUBSAMPLED_BAM
 
 PROJECT_DIR="ngs-data/test_project/analysis"
-for sample in A B; do
+SAMPLES="A B C D"
+for sample in $SAMPLES; do
     ### bams ###
     BAM_DIR="${PROJECT_DIR}/${sample}/bam"
     mkdir -p $BAM_DIR
@@ -44,7 +45,7 @@ rm -f $SUBSAMPLED_BAM
 echo "Setting up test vcf files..."
 NA12878_VCF="/data/project/worthey_lab/samples/NA12878/analysis/small_variants/na12878.vcf.gz"
 
-for sample in A B; do
+for sample in $SAMPLES; do
     VCF_DIR="${PROJECT_DIR}/${sample}/vcf"
     mkdir -p $VCF_DIR
     OUT_vcf=${VCF_DIR}/${sample}.vcf.gz
@@ -57,7 +58,9 @@ done
 
 ############# Regions file #############
 
-# Treat sample B as exome dataset and add a capture-regions bed file
-CAPTURE_FILE="${PROJECT_DIR}/B/configs/small_variant_caller/capture_regions.bed"
-mkdir -p $(dirname $CAPTURE_FILE)
-echo -e "chr20\t59992\t3653078\n" > $CAPTURE_FILE
+# For exome mode testing, add capture-regions bed file
+for sample in $SAMPLES; do
+    CAPTURE_FILE="${PROJECT_DIR}/${sample}/configs/small_variant_caller/capture_regions.bed"
+    mkdir -p $(dirname $CAPTURE_FILE)
+    echo -e "chr20\t59992\t3653078\n" > $CAPTURE_FILE
+done
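
The script's `TARGET_REGION` (`chr20:59993-3653078`) is a 1-based, inclusive region string, while the bed line it writes (`chr20	59992	3653078`) uses BED's 0-based start. The following is a minimal sketch, not the committed script, of deriving the bed line from the region string instead of hardcoding it. It writes into a temporary directory (an assumption made so it cannot touch a real checkout) and uses `printf` with a single newline, whereas the `echo -e "...\n"` form in the diff leaves a second, empty line in each committed bed file.

```sh
#!/usr/bin/env sh
set -eu

TARGET_REGION="chr20:59993-3653078"   # 1-based, inclusive region string
SAMPLES="A B C D"
# Hypothetical scratch location standing in for the repo's .test directory
PROJECT_DIR="$(mktemp -d)/ngs-data/test_project/analysis"

# Split "chrom:start-end" into its parts with parameter expansion
chrom="${TARGET_REGION%%:*}"
range="${TARGET_REGION#*:}"
start="${range%-*}"
end="${range#*-}"

for sample in $SAMPLES; do
    CAPTURE_FILE="${PROJECT_DIR}/${sample}/configs/small_variant_caller/capture_regions.bed"
    mkdir -p "$(dirname "$CAPTURE_FILE")"
    # BED starts are 0-based, so subtract 1 from the 1-based region start
    printf '%s\t%d\t%d\n' "$chrom" "$((start - 1))" "$end" > "$CAPTURE_FILE"
done
```

Deriving the coordinates in one place keeps the subsampled bam region and the capture regions in step if the target region is ever changed.
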
diff --git a/Changelog.md b/Changelog.md
@@ -39,4 +39,16 @@ YYYY-MM-DD  John Doe
 2022-04-07  Manavalan Gajapathy
 
 * Previously hardcoded hardware resources for snakemake rules can now be supplied via `configs/workflow.yaml` (closes #48)
-* Modified multiqc conda env config to use explicit dependencies to get around installation issues (closes #47)
+* Modified multiqc conda env config to use explicit dependencies to get around installation issues (closes #47)
+
+
+2023-01-20  Manavalan Gajapathy
+
+As part of making QuaC publicly available, following updates were made to make it more generic to the environment and user friendly:
+
+* Removes prerun QC from small variant caller pipeline as requirement to QuaC (closes #45)
+* Explicitly defines conda environments (closes #49)
+* Uses container solution for `covviz` installation instead of conda to avoid pip based installation (closes #52)
+* Removes git submodules and instead saves their local copy to repo (closes #53)
+* Loads singularity module loading prior to executing the runner script
+* Uses minimal snakemake instead of full-featured snakemake (closes #56)