Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ATAC-seq pipeline with latest working scripts #160

Open
wants to merge 41 commits into
base: master
Choose a base branch
from

Conversation

marinafloresp
Copy link
Collaborator

Description

This pull request will update the scripts used in the ATAC-seq pipeline within steps 0 to 6. Older unused scripts have been removed and scripts have been updated to the latest working version. The README file has been updated with extra information. The pipeline's main scripts now include more output messages to let the user know what processes have been successfully run and subscripts check whether the right output have been produced, and inform the user if not.

Type of pull request

  • Bug fix
  • New feature/enhancement
  • Code refactor
  • Documentation update

Checklist

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have tested my code to check that it is functional
  • I have used linters to check for common sources of errors
  • I have implemented fail safes in my code to account for edge cases
  • I have made the corresponding changes to the documentation

@marinafloresp marinafloresp added the ATACSeq ATAC-seq data label Jul 16, 2024
Comment on lines 96 to 101
echo " "
echo "|| Running STEP 6.1 of ATAC-seq pipeline: COMPARE. Samples will be compared to their matching genotype data.||"
echo " "
echo "Output directory will be: ${ALIGNED_DIR}/genotypeConcordance"
echo "Output directory will be: ${ALIGNED_DIR}/baseRecalibrate"
echo " "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might find the following useful for whenever you want to print a big chunk of text to the standard output:

cat << EOF
text
more text

text


|| text ||
EOF

Just put all of your text you want to print inside the two EOF keywords.
This saves you from putting all of these echo statements and "" lines. Most Linux systems have cat installed so it should still be pretty portable. Look up heredoc for more information.

echo " "

# process a line from IDMap file
IDS=($(head -n ${SLURM_ARRAY_TASK_ID} ${META_DIR}/matchedVCFIDs.txt | tail -1))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick check, can ${SLURM_ARRAY_TASK_ID} here be 1? If it can be then the first row of matchedVCFIDs.txt is a header row and so this won't give you what you want.

Looking at an example of samples.txt the file doesn't have a header row, so I presume this will be pulling out the incorrect sample for all task ids (provided that my assumption that ${SLURM_ARRAY_TASK_ID} corresponds to the sample in samples.txt with the same row number).

I might be being silly and this works regardless due to 0/1 indexing, just checking.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it will make more sense to add a header row to samples.txt so that the first sample will be index 1 and not 0. My programming side assumes that the first index is 0 XD

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do that, sure. But then ${SLURM_ARRAY_TASK_ID} will need to be 2 to select the first data row in samples.txt and matchedVCFIDs.txt (as head -1 $file | tail -1 gets you the first row which will now be the header).

echo "Output directory will be: ${ALIGNED_DIR}/baseRecalibrate"
echo " "

cd ${ALIGNED_DIR}/genotypeConcordance/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this directory necessarily exist? The only time I see the directory being created is in
compareBamWithGenotypes.sh on line 94. However, that script is only being called if you enter COMPARE as the second argument to this script. If you don't have COMPARE but do have SWITCH in the second argument, this line will run (but the directory hasn't been created).

I guess the directory might be created with line 42 in searchBestGenoMatch.sh. But this script won't have run yet (not until line 152 of this script).

If this cd command fails, it looks like the only ramifications will be that lines 146 and 147 will likely fail (as they can't find these files).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then again, if you were to create this directory here, then the awk commands below will still fail as these *.selfSM files won't be found regardless (as the directory would be empty).
Am I missing something here? It feels like these awk commands will fail if the COMPARE section is not ran.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This step of the script will only make sense to run if you have already run the COMPARE step, as it is where the results are output. The SWITCH step checks those samples that performed poorly in the COMPARE step and aims to find better genotype matches. That is why it is assumed that the genotypeConcordance directory exists already.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this in the README?

cd ${ALIGNED_DIR}/

## If directory does not exist, create
mkdir -p ${ALIGNED_DIR}/baseRecalibrate

sampleName=$1
vcfid=$2
vcfid="${vcfid%"${vcfid##*[![:space:]]}"}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm struggling to tell what this is doing. I've taken a row from sortedBDR/0_metadata/matchedVCFIDs.txt and tried out this line here using a value from the VCFID column.

However, this doesn't seem to actually do anything to the string.
To elaborate:

vcfid=201023670019_R06C02_EX162 # Some VCFID from the file I mentioned
vcfid="${vcfid%"${vcfid##*[![:space:]]}"}"
echo $vcfid # returns 201023670019_R06C02_EX162 (the same value)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know its not your code, just checking if this matters or not

@marinafloresp
Copy link
Collaborator Author

@sof202 Thank you for your comments, I will go over then and make appropriate changes. Some scripts have older parts that might be outdated or non-efficient.

Comment on lines +36 to +40
colnames(xPeaks)[c(4,17,18)] <- c("sex-gene","counts","sampleID")
colnames(yPeaks)[c(4,10,11)] <- c("peak-name","counts","sampleID")

xPeaks$sampleID<-basename(xPeaks$"sampleID")
yPeaks$sampleID<-basename(yPeaks$"sampleID")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much clearer now, thank you very much.

Comment on lines 728 to 729
colnames(corMergeStats)<- c('total\nreads', 'dedup', 'poor\nqual', 'mt\nreads', 'alignt\nrate', 'distinct\nreads',
'NFR','PBC1', "PropNFR", "PropMono", "DipP","NPeaks_PE", "FRIP_PE")
'NFR','PBC1', "PropNFR", "PropMono", "DipP","Periodicity","NPeaks_PE", "FRIP_PE")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know you could put escape sequences in colnames, thanks Marina

@@ -73,7 +73,8 @@ samtools stats ${ALIGNED_DIR}/${sampleName}_noMT.bam > ${ALIGNED_DIR}/QCOutput/$
# only keep properly paired reads
echo "filtering aligned reads"
samtools view -F 524 -f 2 -q 30 -u ${ALIGNED_DIR}/${sampleName}_noMT.bam | samtools sort -n /dev/stdin -o ${ALIGNED_DIR}/${sampleName}_q30.tmp.nmsrt.bam
samtools view -h ${ALIGNED_DIR}/${sampleName}_q30.tmp.nmsrt.bam | $(which assign_multimappers.py) -k $multimap --paired-end | samtools fixmate -r /dev/stdin ${ALIGNED_DIR}/${sampleName}_q30.tmp.nmsrt.fixmate.bam
echo $MULTIMAP/assign_multimappers.py
samtools view -h ${ALIGNED_DIR}/${sampleName}_q30.tmp.nmsrt.bam | $MULTIMAP/assign_multimappers.py -k $multimap --paired-end | samtools fixmate -r /dev/stdin ${ALIGNED_DIR}/${sampleName}_q30.tmp.nmsrt.fixmate.bam
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot that this line requires assign_multimappers.py to be on $PATH, good find.

Copy link
Owner

@ejh243 ejh243 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving this to merge with main. Any bugs can be fixed as issues if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ATACSeq ATAC-seq data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants