-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update ATAC-seq pipeline with latest working scripts #160
base: master
Are you sure you want to change the base?
Conversation
echo " " | ||
echo "|| Running STEP 6.1 of ATAC-seq pipeline: COMPARE. Samples will be compared to their matching genotype data.||" | ||
echo " " | ||
echo "Output directory will be: ${ALIGNED_DIR}/genotypeConcordance" | ||
echo "Output directory will be: ${ALIGNED_DIR}/baseRecalibrate" | ||
echo " " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might find the following useful for whenever you want to print a big chunk of text to the standard output:
cat << EOF
text
more text
text
|| text ||
EOF
Just put all of your text you want to print inside the two EOF
keywords.
This saves you from putting all of these echo
statements and ""
lines. Most Linux systems have cat
installed so it should still be pretty portable. Look up heredoc
for more information.
echo " " | ||
|
||
# process a line from IDMap file | ||
IDS=($(head -n ${SLURM_ARRAY_TASK_ID} ${META_DIR}/matchedVCFIDs.txt | tail -1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a quick check, can ${SLURM_ARRAY_TASK_ID}
here be 1? If it can be then the first row of matchedVCFIDs.txt
is a header row and so this won't give you what you want.
Looking at an example of samples.txt
the file doesn't have a header row, so I presume this will be pulling out the incorrect sample for all task ids (provided that my assumption that ${SLURM_ARRAY_TASK_ID}
corresponds to the sample in samples.txt
with the same row number).
I might be being silly and this works regardless due to 0/1 indexing, just checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it will make more sense to add a header row to samples.txt so that the first sample will be index 1 and not 0. My programming side assumes that the first index is 0 XD
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do that, sure. But then ${SLURM_ARRAY_TASK_ID}
will need to be 2
to select the first data row in samples.txt
and matchedVCFIDs.txt
(as head -1 $file | tail -1
gets you the first row which will now be the header).
echo "Output directory will be: ${ALIGNED_DIR}/baseRecalibrate" | ||
echo " " | ||
|
||
cd ${ALIGNED_DIR}/genotypeConcordance/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this directory necessarily exist? The only time I see the directory being created is in
compareBamWithGenotypes.sh
on line 94. However, that script is only being called if you enter COMPARE
as the second argument to this script. If you don't have COMPARE
but do have SWITCH
in the second argument, this line will run (but the directory hasn't been created).
I guess the directory might be created with line 42 in searchBestGenoMatch.sh
. But this script won't have run yet (not until line 152 of this script).
If this cd
command fails, it looks like the only ramifications will be that lines 146 and 147 will likely fail (as they can't find these files).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then again, if you were to create this directory here, then the awk
commands below will still fail as these *.selfSM
files won't be found regardless (as the directory would be empty).
Am I missing something here? It feels like these awk
commands will fail if the COMPARE
section is not ran.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This step of the script will only make sense to run if you have already run the COMPARE step, as it is where the results are output. The SWITCH step checks those samples that performed poorly in the COMPARE step and aims to find better genotype matches. That is why it is assumed that the genotypeConcordance directory exists already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this in the README?
cd ${ALIGNED_DIR}/ | ||
|
||
## If directory does not exist, create | ||
mkdir -p ${ALIGNED_DIR}/baseRecalibrate | ||
|
||
sampleName=$1 | ||
vcfid=$2 | ||
vcfid="${vcfid%"${vcfid##*[![:space:]]}"}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm struggling to tell what this is doing. I've taken a row from sortedBDR/0_metadata/matchedVCFIDs.txt
and tried out this line here using a value from the VCFID column.
However, this doesn't seem to actually do anything to the string.
To elaborate:
vcfid=201023670019_R06C02_EX162 # Some VCFID from the file I mentioned
vcfid="${vcfid%"${vcfid##*[![:space:]]}"}"
echo $vcfid # returns 201023670019_R06C02_EX162 (the same value)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know its not your code, just checking if this matters or not
@sof202 Thank you for your comments, I will go over then and make appropriate changes. Some scripts have older parts that might be outdated or non-efficient. |
colnames(xPeaks)[c(4,17,18)] <- c("sex-gene","counts","sampleID") | ||
colnames(yPeaks)[c(4,10,11)] <- c("peak-name","counts","sampleID") | ||
|
||
xPeaks$sampleID<-basename(xPeaks$"sampleID") | ||
yPeaks$sampleID<-basename(yPeaks$"sampleID") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much clearer now, thank you very much.
colnames(corMergeStats)<- c('total\nreads', 'dedup', 'poor\nqual', 'mt\nreads', 'alignt\nrate', 'distinct\nreads', | ||
'NFR','PBC1', "PropNFR", "PropMono", "DipP","NPeaks_PE", "FRIP_PE") | ||
'NFR','PBC1', "PropNFR", "PropMono", "DipP","Periodicity","NPeaks_PE", "FRIP_PE") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know you could put escape sequences in colnames
, thanks Marina
@@ -73,7 +73,8 @@ samtools stats ${ALIGNED_DIR}/${sampleName}_noMT.bam > ${ALIGNED_DIR}/QCOutput/$ | |||
# only keep properly paired reads | |||
echo "filtering aligned reads" | |||
samtools view -F 524 -f 2 -q 30 -u ${ALIGNED_DIR}/${sampleName}_noMT.bam | samtools sort -n /dev/stdin -o ${ALIGNED_DIR}/${sampleName}_q30.tmp.nmsrt.bam | |||
samtools view -h ${ALIGNED_DIR}/${sampleName}_q30.tmp.nmsrt.bam | $(which assign_multimappers.py) -k $multimap --paired-end | samtools fixmate -r /dev/stdin ${ALIGNED_DIR}/${sampleName}_q30.tmp.nmsrt.fixmate.bam | |||
echo $MULTIMAP/assign_multimappers.py | |||
samtools view -h ${ALIGNED_DIR}/${sampleName}_q30.tmp.nmsrt.bam | $MULTIMAP/assign_multimappers.py -k $multimap --paired-end | samtools fixmate -r /dev/stdin ${ALIGNED_DIR}/${sampleName}_q30.tmp.nmsrt.fixmate.bam |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot that this line requires assign_multimappers.py
to be on $PATH
, good find.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving this to merge with main. Any bugs can be fixed as issues if needed.
Description
This pull request will update the scripts used in the ATAC-seq pipeline within steps 0 to 6. Older unused scripts have been removed and scripts have been updated to the latest working version. The README file has been updated with extra information. The pipeline's main scripts now include more output messages to let the user know what processes have been successfully run and subscripts check whether the right output have been produced, and inform the user if not.
Type of pull request
Checklist