Merge branch 'flexible_input_reqts' into 'master'
Preps QuaC for public availability

Closes #45, #49, #52, #53, and #56

See merge request center-for-computational-genomics-and-data-science/sciops/pipelines/quac!6
Manavalan Gajapathy committed Jan 20, 2023
2 parents df9cde5 + bec7502 commit 92db1d1
Showing 48 changed files with 987 additions and 226 deletions.
6 changes: 0 additions & 6 deletions .gitmodules

This file was deleted.

25 changes: 15 additions & 10 deletions .test/README.md
@@ -1,29 +1,34 @@
# Testing

Output from [Small variant caller
pipeline](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/pipelines/small_variant_caller_pipeline)
are the inputs to QuaC pipeline. Hence following datasets are necessary for testing:
Input directory structure to QuaC is based on the output directory structure of the [Small variant caller
pipeline](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/pipelines/small_variant_caller_pipeline).
Following files are necessary for testing:

1. bams
2. vcfs
3. QC output (from tools fastqc, fastq-screen and picard-markduplicates)
4. Sample rename config
3. Capture regions bed file - Required only for exome mode
4. QC output from tools fastqc, fastq-screen and picard-markduplicates - Required only if `priorQC` is used
5. Sample rename config - Required only if `priorQC` is used

Note: Be sure to preserve directory structure used in the output of Small variant caller
**Note**: If `priorQC` is used, be sure to preserve directory structure used in the output of CGDS Small variant caller
pipeline.

## Setup test datasets

* To setup test bam and vcf files, which are from sub-sampled NA12878 data, run:
### Required

* To setup test bam, vcf and capture region bed files, which are from sub-sampled NA12878 data, run:

```sh
cd .test
./setup_test_datasets.sh
```

* QuaC also needs test QC outputs for fastq (and sample rename config), which get created by small var caller pipeline.
This was achieved by running the small variant caller pipeline using its test datasets with some modifications. Steps
are briefly shown here:
### Optional - priorQC mode

* If used in `priorQC` mode, QuaC also needs test QC outputs for fastq (and sample rename config), which at CGDS get
created by the small var caller pipeline. Below, we create fastq QC and sample rename config using the small variant
caller pipeline for samples `A` and `B`.

```sh
cd <small_var_caller_pipeline_dir>
File renamed without changes.
File renamed without changes.
2 changes: 2 additions & 0 deletions .test/configs/no_priorQC/project_1sample.ped
@@ -0,0 +1,2 @@
#family_id sample_id paternal_id maternal_id sex phenotype
unknown C father_1 mother_1 -9 -9
3 changes: 3 additions & 0 deletions .test/configs/no_priorQC/project_2samples.ped
@@ -0,0 +1,3 @@
#family_id sample_id paternal_id maternal_id sex phenotype
unknown C father_1 mother_1 -9 -9
unknown D father_1 mother_1 -9 -9
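The two pedigree files added above use the six-column layout named in their header line (`family_id`, `sample_id`, `paternal_id`, `maternal_id`, `sex`, `phenotype`). As a minimal sketch, assuming bash and awk are available, such a file could be sanity-checked before a run. The function name `ped_sample_ids` is hypothetical and not part of QuaC:

```sh
#!/usr/bin/env bash
# Hypothetical helper (not part of QuaC): print the sample IDs found in a
# pedigree file and return non-zero if any data row does not have exactly
# six whitespace-separated fields.
ped_sample_ids() {
    awk '
        /^#/ || NF == 0 { next }   # skip the header/comment line and blank lines
        NF != 6 {
            printf "line %d: expected 6 fields, got %d\n", NR, NF > "/dev/stderr"
            bad = 1
            next
        }
        { print $2 }               # column 2 is sample_id
        END { exit bad + 0 }       # non-zero exit if any row was malformed
    ' "$1"
}
```

Called as `ped_sample_ids .test/configs/no_priorQC/project_2samples.ped`, this should print `C` and `D`, one per line, given the file contents shown in the hunk.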
@@ -1,5 +1,5 @@
## htsjdk.samtools.metrics.StringHeader
# MarkDuplicates INPUT=[/data/scratch/manag/test_pipeline/small_variant_caller/interim/A/mapped/A-1.sorted.bam] OUTPUT=/data/scratch/manag/test_pipeline/small_variant_caller/interim/A/dedup/A-1.bam METRICS_FILE=/data/scratch/manag/test_pipeline/small_variant_caller/analysis/A/qc/dedup/A-1.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
# MarkDuplicates INPUT=[/test_pipeline/small_variant_caller/interim/A/mapped/A-1.sorted.bam] OUTPUT=/test_pipeline/small_variant_caller/interim/A/dedup/A-1.bam METRICS_FILE=/test_pipeline/small_variant_caller/analysis/A/qc/dedup/A-1.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Fri Apr 02 19:39:58 UTC 2021

@@ -1,5 +1,5 @@
## htsjdk.samtools.metrics.StringHeader
# MarkDuplicates INPUT=[/data/scratch/manag/test_pipeline/small_variant_caller/interim/A/mapped/A-2.sorted.bam] OUTPUT=/data/scratch/manag/test_pipeline/small_variant_caller/interim/A/dedup/A-2.bam METRICS_FILE=/data/scratch/manag/test_pipeline/small_variant_caller/analysis/A/qc/dedup/A-2.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
# MarkDuplicates INPUT=[/test_pipeline/small_variant_caller/interim/A/mapped/A-2.sorted.bam] OUTPUT=/test_pipeline/small_variant_caller/interim/A/dedup/A-2.bam METRICS_FILE=/test_pipeline/small_variant_caller/analysis/A/qc/dedup/A-2.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Fri Apr 02 19:40:06 UTC 2021

@@ -1,5 +1,5 @@
## htsjdk.samtools.metrics.StringHeader
# MarkDuplicates INPUT=[/data/scratch/manag/test_pipeline/small_variant_caller/interim/B/mapped/B-1.sorted.bam] OUTPUT=/data/scratch/manag/test_pipeline/small_variant_caller/interim/B/dedup/B-1.bam METRICS_FILE=/data/scratch/manag/test_pipeline/small_variant_caller/analysis/B/qc/dedup/B-1.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
# MarkDuplicates INPUT=[/test_pipeline/small_variant_caller/interim/B/mapped/B-1.sorted.bam] OUTPUT=/test_pipeline/small_variant_caller/interim/B/dedup/B-1.bam METRICS_FILE=/test_pipeline/small_variant_caller/analysis/B/qc/dedup/B-1.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Fri Apr 02 19:39:58 UTC 2021

@@ -1,5 +1,5 @@
## htsjdk.samtools.metrics.StringHeader
# MarkDuplicates INPUT=[/data/scratch/manag/test_pipeline/small_variant_caller/interim/B/mapped/B-2.sorted.bam] OUTPUT=/data/scratch/manag/test_pipeline/small_variant_caller/interim/B/dedup/B-2.bam METRICS_FILE=/data/scratch/manag/test_pipeline/small_variant_caller/analysis/B/qc/dedup/B-2.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
# MarkDuplicates INPUT=[/test_pipeline/small_variant_caller/interim/B/mapped/B-2.sorted.bam] OUTPUT=/test_pipeline/small_variant_caller/interim/B/dedup/B-2.bam METRICS_FILE=/test_pipeline/small_variant_caller/analysis/B/qc/dedup/B-2.metrics.txt REMOVE_DUPLICATES=true TMP_DIR=[/tmp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Fri Apr 02 19:40:06 UTC 2021
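The four metrics-file hunks above differ only in that the `/data/scratch/manag` prefix was dropped from the paths in the `MarkDuplicates` header line. The commit does not show how that edit was made. The following is only a sketch of one way such a rewrite could be scripted with `sed`, using a hypothetical function name `strip_prefix` and assuming the prefix contains neither `|` nor regular-expression metacharacters:

```sh
#!/usr/bin/env bash
# Hypothetical helper (not part of QuaC): read text on stdin and remove a
# site-specific directory prefix from every absolute path that starts with it.
# Assumption: the prefix has no "|" and no regex metacharacters.
strip_prefix() {
    local prefix=$1
    # "|" is used as the sed delimiter so the slashes in paths need no escaping.
    sed "s|${prefix}/|/|g"
}
```

For example, `strip_prefix /data/scratch/manag < A-1.metrics.txt` would turn `/data/scratch/manag/test_pipeline/...` into `/test_pipeline/...` on every line and leave all other text unchanged.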

Binary file added .test/ngs-data/test_project/analysis/C/bam/C.bam
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,2 @@
chr20 59992 3653078

Binary file not shown.
Binary file not shown.
Binary file added .test/ngs-data/test_project/analysis/D/bam/D.bam
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,2 @@
chr20 59992 3653078

Binary file not shown.
Binary file not shown.
15 changes: 9 additions & 6 deletions .test/setup_test_datasets.sh
@@ -19,7 +19,8 @@ TARGET_REGION="chr20:59993-3653078"
samtools view -s 0.03 -b $NA12878_BAM $TARGET_REGION > $SUBSAMPLED_BAM

PROJECT_DIR="ngs-data/test_project/analysis"
for sample in A B; do
SAMPLES="A B C D"
for sample in $SAMPLES; do
### bams ###
BAM_DIR="${PROJECT_DIR}/${sample}/bam"
mkdir -p $BAM_DIR
@@ -44,7 +45,7 @@ rm -f $SUBSAMPLED_BAM
echo "Setting up test vcf files..."
NA12878_VCF="/data/project/worthey_lab/samples/NA12878/analysis/small_variants/na12878.vcf.gz"

for sample in A B; do
for sample in $SAMPLES; do
VCF_DIR="${PROJECT_DIR}/${sample}/vcf"
mkdir -p $VCF_DIR
OUT_vcf=${VCF_DIR}/${sample}.vcf.gz
@@ -57,7 +58,9 @@ done

############# Regions file #############

# Treat sample B as exome dataset and add a capture-regions bed file
CAPTURE_FILE="${PROJECT_DIR}/B/configs/small_variant_caller/capture_regions.bed"
mkdir -p $(dirname $CAPTURE_FILE)
echo -e "chr20\t59992\t3653078\n" > $CAPTURE_FILE
# For exome mode testing, add capture-regions bed file
for sample in $SAMPLES; do
CAPTURE_FILE="${PROJECT_DIR}/${sample}/configs/small_variant_caller/capture_regions.bed"
mkdir -p $(dirname $CAPTURE_FILE)
echo -e "chr20\t59992\t3653078\n" > $CAPTURE_FILE
done
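The BED start coordinate written by the script (`59992`) is one less than the start of `TARGET_REGION` (`chr20:59993-3653078`). That is consistent with samtools regions being 1-based and inclusive while BED intervals are 0-based and half-open. As a minimal sketch of that conversion, assuming bash, the hypothetical helper `region_to_bed` below derives the BED line from the region string rather than hardcoding it:

```sh
#!/usr/bin/env bash
# Hypothetical helper (not part of QuaC): convert a samtools-style region
# "chrom:start-end" (1-based, inclusive) into a BED line (0-based, half-open).
region_to_bed() {
    local region=$1
    local chrom=${region%%:*}   # text before the first ":"
    local range=${region#*:}    # text after the first ":"
    local start=${range%-*}     # 1-based start
    local end=${range#*-}       # inclusive end; unchanged in BED
    printf '%s\t%d\t%d\n' "$chrom" "$((start - 1))" "$end"
}

region_to_bed "chr20:59993-3653078"
```

With the region used in the script, this prints `chr20`, `59992` and `3653078` separated by tabs. Unlike `echo -e "...\n"`, which in bash emits a trailing blank line (matching the two-line bed hunks above), `printf` here writes exactly one newline.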
14 changes: 13 additions & 1 deletion Changelog.md
@@ -39,4 +39,16 @@ YYYY-MM-DD John Doe
2022-04-07 Manavalan Gajapathy

* Previously hardcoded hardware resources for snakemake rules can now be supplied via `configs/workflow.yaml` (closes #48)
* Modified multiqc conda env config to use explicit dependencies to get around installation issues (closes #47)
* Modified multiqc conda env config to use explicit dependencies to get around installation issues (closes #47)


2023-01-20 Manavalan Gajapathy

As part of making QuaC publicly available, following updates were made to make it more generic to the environment and user friendly:

* Removes prerun QC from small variant caller pipeline as requirement to QuaC (closes #45)
* Explicitly defines conda environments (closes #49)
* Uses container solution for `covviz` installation instead of conda to avoid pip based installation (closes #52)
* Removes git submodules and instead saves their local copy to repo (closes #53)
* Loads singularity module loading prior to executing the runner script
* Uses minimal snakemake instead of full-featured snakemake (closes #56)