Update ATAC-seq pipeline with latest working scripts #160

marinafloresp · 2024-07-16T14:30:00Z

Description

This pull request updates the scripts used in steps 0 to 6 of the ATAC-seq pipeline. Older unused scripts have been removed and the remaining scripts have been updated to their latest working versions. The README file has been updated with extra information. The pipeline's main scripts now print more output messages so the user knows which processes have run successfully, and the subscripts check whether the expected output has been produced and inform the user if it has not.

Type of pull request

Bug fix
New feature/enhancement
Code refactor
Documentation update

Checklist

I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have tested my code to check that it is functional
I have used linters to check for common sources of errors
I have implemented fail safes in my code to account for edge cases
I have made the corresponding changes to the documentation

sof202 · 2024-07-16T15:03:29Z

sequencing/ATACSeq/jobSubmission/6_batchRunGenotypeCheck.sh

+ echo " "
+ echo "|| Running STEP 6.1 of ATAC-seq pipeline: COMPARE. Samples will be compared to their matching genotype data.||"
+ echo " "
+ echo "Output directory will be: ${ALIGNED_DIR}/genotypeConcordance"
+ echo "Output directory will be: ${ALIGNED_DIR}/baseRecalibrate"
+ echo " "


You might find the following useful for whenever you want to print a big chunk of text to the standard output:

cat << EOF
text
more text
text
|| text ||
EOF

Just put all of the text you want to print between the two EOF keywords.
This saves you from writing all of these echo statements and "" lines. Most Linux systems have cat installed, so it should still be pretty portable. Look up heredoc for more information.
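As a rough sketch of the suggestion, the banner from the diff could be printed with one heredoc. The function name `print_step_banner` and the fallback value for `ALIGNED_DIR` are made up for illustration; the real script takes `ALIGNED_DIR` from its config. Variables are expanded inside the heredoc because the `EOF` delimiter is unquoted, and the blank lines come out as empty lines rather than the single space that `echo " "` prints.

```shell
#!/usr/bin/env bash
# Hypothetical fallback so the sketch runs on its own; the pipeline sets ALIGNED_DIR itself.
ALIGNED_DIR="${ALIGNED_DIR:-/path/to/aligned}"

# Hypothetical wrapper: prints the STEP 6.1 banner with a single cat call.
print_step_banner() {
  cat << EOF

|| Running STEP 6.1 of ATAC-seq pipeline: COMPARE. Samples will be compared to their matching genotype data.||

Output directory will be: ${ALIGNED_DIR}/genotypeConcordance
Output directory will be: ${ALIGNED_DIR}/baseRecalibrate

EOF
}

print_step_banner
```

Quoting the delimiter (`<< 'EOF'`) would turn off variable expansion, which is not what you want here.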

sof202 · 2024-07-16T15:12:01Z

sequencing/ATACSeq/jobSubmission/6_batchRunGenotypeCheck.sh

+ echo " "
+
+ # process a line from IDMap file
+ IDS=($(head -n ${SLURM_ARRAY_TASK_ID} ${META_DIR}/matchedVCFIDs.txt | tail -1))


Just a quick check: can ${SLURM_ARRAY_TASK_ID} be 1 here? If it can, then the first row of matchedVCFIDs.txt is a header row, so this won't give you what you want.

Looking at an example of samples.txt, the file doesn't have a header row, so I presume this will pull out the incorrect sample for all task IDs (provided my assumption is right that ${SLURM_ARRAY_TASK_ID} corresponds to the sample in samples.txt with the same row number).

I might be being silly and this works regardless due to 0/1 indexing, just checking.

I guess it will make more sense to add a header row to samples.txt so that the first sample will be index 1 and not 0. My programming side assumes that the first index is 0 XD

You can do that, sure. But then ${SLURM_ARRAY_TASK_ID} will need to be 2 to select the first data row in samples.txt and matchedVCFIDs.txt (as head -1 $file | tail -1 gets you the first row which will now be the header).
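To make the off-by-one concrete, here is a small sketch using a throwaway file. The file contents, the `sampleID` column name, and both helper functions are invented for illustration; only the `head | tail` selection mirrors the script.

```shell
#!/usr/bin/env bash
# Build a temporary ID map with a header row (made-up contents).
idmap=$(mktemp)
trap 'rm -f "$idmap"' EXIT
printf 'sampleID\tVCFID\n' > "$idmap"
printf 'sampleA\tvcf_001\n' >> "$idmap"
printf 'sampleB\tvcf_002\n' >> "$idmap"

# Same selection as the script: row N of the file, header included.
nth_row() {
  head -n "$1" "$2" | tail -1
}

# Hypothetical variant: skip one header row so task 1 maps to the first sample.
nth_data_row() {
  head -n "$(( $1 + 1 ))" "$2" | tail -1
}

echo "task 1, no offset:   $(nth_row 1 "$idmap")"
echo "task 1, with offset: $(nth_data_row 1 "$idmap")"
```

Without the offset, task 1 reads the header. With the offset, the array can keep starting at 1, but both files indexed by the task ID need the same header convention.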

sof202 · 2024-07-16T15:22:00Z

sequencing/ATACSeq/jobSubmission/6_batchRunGenotypeCheck.sh

+ echo "Output directory will be: ${ALIGNED_DIR}/baseRecalibrate"
+ echo " "
+
+ cd ${ALIGNED_DIR}/genotypeConcordance/


Does this directory necessarily exist? The only time I see the directory being created is in compareBamWithGenotypes.sh on line 94. However, that script is only being called if you enter COMPARE as the second argument to this script. If you don't have COMPARE but do have SWITCH in the second argument, this line will run (but the directory hasn't been created).

I guess the directory might be created with line 42 in searchBestGenoMatch.sh. But this script won't have run yet (not until line 152 of this script).

If this cd command fails, it looks like the only ramifications will be that lines 146 and 147 will likely fail (as they can't find these files).

Then again, if you were to create this directory here, the awk commands below would still fail, as these *.selfSM files wouldn't be found regardless (the directory would be empty).
Am I missing something here? It feels like these awk commands will fail if the COMPARE section has not been run.

This step of the script will only make sense to run if you have already run the COMPARE step, as it is where the results are output. The SWITCH step checks those samples that performed poorly in the COMPARE step and aims to find better genotype matches. That is why it is assumed that the genotypeConcordance directory exists already.

Is this in the README?
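One possible way to make the "COMPARE must run before SWITCH" assumption explicit in the script itself is a guard before the cd. This is only a sketch: `compare_output_ready` is a hypothetical helper, not something in the pipeline, and the messages are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the COMPARE output directory exists
# and contains at least one *.selfSM file.
compare_output_ready() {
  local dir="$1"
  if [ ! -d "$dir" ]; then
    echo "Directory not found: $dir. Run the COMPARE step first." >&2
    return 1
  fi
  # If the glob matches nothing, ls receives the literal pattern and fails.
  if ! ls "$dir"/*.selfSM > /dev/null 2>&1; then
    echo "No .selfSM files in $dir. Run the COMPARE step first." >&2
    return 1
  fi
  return 0
}

# Usage sketch (ALIGNED_DIR comes from the pipeline config):
# compare_output_ready "${ALIGNED_DIR}/genotypeConcordance" || exit 1
```

With a check like this, running SWITCH on its own stops with a readable message instead of failing later in the awk commands.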

sof202 · 2024-07-16T15:39:48Z

sequencing/ATACSeq/subScripts/compareBamWithGenotypes.sh

+cd ${ALIGNED_DIR}/
+
+## If directory does not exist, create
+mkdir -p ${ALIGNED_DIR}/baseRecalibrate

 sampleName=$1
 vcfid=$2
 vcfid="${vcfid%"${vcfid##*[![:space:]]}"}" 


I'm struggling to tell what this is doing. I've taken a row from sortedBDR/0_metadata/matchedVCFIDs.txt and tried out this line here using a value from the VCFID column.

However, this doesn't seem to actually do anything to the string.
To elaborate:

vcfid=201023670019_R06C02_EX162 # Some VCFID from the file I mentioned
vcfid="${vcfid%"${vcfid##*[![:space:]]}"}"
echo $vcfid # returns 201023670019_R06C02_EX162 (the same value)

I know it's not your code, just checking whether this matters or not.
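For what it's worth, this nested parameter expansion is a common idiom for stripping trailing whitespace, which would explain why a clean value comes back unchanged. Whether any VCFID in matchedVCFIDs.txt actually carries trailing whitespace (for example a carriage return from Windows line endings) is an assumption here. The helper name `trim_trailing_space` and the padded value are made up for the sketch.

```shell
#!/usr/bin/env bash
# Same expansion as in compareBamWithGenotypes.sh, wrapped in a hypothetical helper.
trim_trailing_space() {
  local value="$1"
  # Inner expansion: delete everything up to the last non-whitespace character,
  # leaving only the trailing whitespace. Outer expansion: strip that suffix.
  value="${value%"${value##*[![:space:]]}"}"
  printf '%s' "$value"
}

clean="201023670019_R06C02_EX162"
# Made-up example with a trailing space, tab and carriage return.
padded=$(printf '201023670019_R06C02_EX162 \t\r')

# Brackets make any leftover whitespace visible.
printf '[%s]\n' "$(trim_trailing_space "$clean")"
printf '[%s]\n' "$(trim_trailing_space "$padded")"
```

Both calls should print the same bracketed ID, so the line only matters when the input is not already clean.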

marinafloresp · 2024-07-17T08:39:04Z

@sof202 Thank you for your comments, I will go over them and make the appropriate changes. Some scripts have older parts that might be outdated or inefficient.

sof202 · 2024-07-29T09:04:17Z

sequencing/ATACSeq/Rscripts/collateSexChecks.r

+colnames(xPeaks)[c(4,17,18)] <- c("sex-gene","counts","sampleID")
+colnames(yPeaks)[c(4,10,11)] <- c("peak-name","counts","sampleID")
+
+xPeaks$sampleID<-basename(xPeaks$"sampleID")
+yPeaks$sampleID<-basename(yPeaks$"sampleID")


This is much clearer now, thank you very much.

sof202 · 2024-08-01T15:52:53Z

sequencing/ATACSeq/Rscripts/collateDataQualityStats.Rmd

 colnames(corMergeStats)<- c('total\nreads', 'dedup', 'poor\nqual', 'mt\nreads', 'alignt\nrate', 'distinct\nreads', 
- 'NFR','PBC1', "PropNFR", "PropMono", "DipP","NPeaks_PE", "FRIP_PE")
+ 'NFR','PBC1', "PropNFR", "PropMono", "DipP","Periodicity","NPeaks_PE", "FRIP_PE")


I didn't know you could put escape sequences in colnames, thanks Marina

sof202 · 2024-08-29T12:16:30Z

sequencing/ATACSeq/subScripts/alignment.sh

@@ -73,7 +73,8 @@ samtools stats ${ALIGNED_DIR}/${sampleName}_noMT.bam > ${ALIGNED_DIR}/QCOutput/$
 # only keep properly paired reads
 echo "filtering aligned reads"
 samtools view -F 524 -f 2 -q 30 -u ${ALIGNED_DIR}/${sampleName}_noMT.bam | samtools sort -n /dev/stdin -o ${ALIGNED_DIR}/${sampleName}_q30.tmp.nmsrt.bam
-samtools view -h ${ALIGNED_DIR}/${sampleName}_q30.tmp.nmsrt.bam | $(which assign_multimappers.py) -k $multimap --paired-end | samtools fixmate -r /dev/stdin ${ALIGNED_DIR}/${sampleName}_q30.tmp.nmsrt.fixmate.bam
+echo $MULTIMAP/assign_multimappers.py
+samtools view -h ${ALIGNED_DIR}/${sampleName}_q30.tmp.nmsrt.bam | $MULTIMAP/assign_multimappers.py -k $multimap --paired-end | samtools fixmate -r /dev/stdin ${ALIGNED_DIR}/${sampleName}_q30.tmp.nmsrt.fixmate.bam


I forgot that this line requires assign_multimappers.py to be on $PATH, good find.
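If it helps, a small check could fail early when the script cannot be found, rather than part-way through the samtools pipe. This is a sketch only: `resolve_script` is a hypothetical helper, and falling back to a PATH lookup is an assumption about what you would want, not what the pipeline does.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the path of a script, preferring an explicit
# directory and falling back to a PATH lookup; fail if neither works.
resolve_script() {
  local dir="$1" name="$2"
  if [ -n "$dir" ] && [ -x "$dir/$name" ]; then
    printf '%s\n' "$dir/$name"
  elif command -v "$name" > /dev/null 2>&1; then
    command -v "$name"
  else
    echo "Cannot find executable $name in '$dir' or on PATH" >&2
    return 1
  fi
}

# Usage sketch: MULTIMAP is the directory variable used in the diff above.
# multimap_script=$(resolve_script "$MULTIMAP" assign_multimappers.py) || exit 1
```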

ejh243

Approving this to merge with main. Any bugs can be fixed as issues if needed.

marinafloresp and others added 30 commits March 13, 2024 13:55

Update step 1 ATAC pipeline

01275e2

Updated step 1 script with suggested changes

d1fca73

Update step 1 script ATAC

9107b38

Update script step 2 ATAC pipeline

2f63350

Update script step 1.3, step 2

adba657

Script for pipeline set-up

52d22fe

Add file to set up R env

b6322f4

Update files step 1 and 2 ATACSeq pipeline

05e131d

Update scripts step 2,3 ATACseq pipeline

1dd5b05

Add HMMRATAC script

bc12775

Update scripts with only MACS3 PE

846b62e

Update setUp file for ATACseq pipeline

90acd7f

Update README file

4577b54

Delete outdated files

8707326

Merge branch 'master' into atac-update

0cba971

Fix trimming step

bc36dfd

Merge remote-tracking branch 'origin/atac-update' into atac-update

d861ec7

Add example config files for running pipeline

edc3566

Clean files step 1

0aab117

Update intallLibraries.r

4bb86dd

Update intallLibraries.r

24cfdfd

Update config files examples

8d51c33

Merge remote-tracking branch 'origin/atac-update' into atac-update

90a2245

Update config.txt

de2e262

Update pipeline files

989eade

Merge remote-tracking branch 'origin/atac-update' into atac-update

297905f

Merge branch 'master' into atac-update

3c4c5c1

Update ATACseq pipeline

2dff18b

Update ATAC-seq pipeline

d06980a

Merge remote-tracking branch 'origin/atac-update' into atac-update

7f086e1

marinafloresp added 2 commits July 16, 2024 14:41

Update ATAC-seq pipeline

4b402aa

Update ATAC-seq pipeline

dc14651

marinafloresp added the ATACSeq ATAC-seq data label Jul 16, 2024

marinafloresp added 2 commits July 16, 2024 15:31

Update README.md

937bcd2

Update README.md

2390865