Hi @leepc12 , first, thank you for the great pipeline here 🙏

Describe the bug

The failure is partially due to my input fastqs having identical file names, differing only by their parent directories (paths). For example,

These two technical replicates, when processed by cutadapt and merged, get overwritten and merged on top of each other. This leads to bad fastqs, which trigger an error later on.
OS/Platform
Slurm HPC with Singularity (converted from the Docker image)
Input JSON file

Paste contents of your input JSON file.

Troubleshooting result

The failure is on the filter task with the following direct complaint:
```
Traceback (most recent call last):
  File "/software/atac-seq-pipeline/src/encode_task_filter.py", line 438, in <module>
    main()
  File "/software/atac-seq-pipeline/src/encode_task_filter.py", line 344, in main
    args.nth, args.mem_gb, args.out_dir)
  File "/software/atac-seq-pipeline/src/encode_task_filter.py", line 178, in rm_unmapped_lowq_reads_pe
    'No reads found aftering filtering "samtools fixmate"d PE BAM with '
ValueError: No reads found aftering filtering "samtools fixmate"d PE BAM with "samtools view -F 1804 -f 2". Reads are not properly paired even though mapping rate is good?
...
```
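For reference, my reading of what that filter requires (standard SAM flag arithmetic; this summary is mine, not from the pipeline docs):

```python
# "samtools view -f 2 -F 1804" in plain terms:
# -f 2    keep only reads flagged as properly paired
# -F 1804 drop reads with any of: unmapped (4), mate unmapped (8),
#         secondary (256), QC fail (512), duplicate (1024)
assert 4 + 8 + 256 + 512 + 1024 == 1804
# If the merged fastqs have scrambled R1/R2 pairing, few or no reads end up
# flagged as properly paired, so the -f 2 filter removes everything.
```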
Honestly, it's still not clear to me how this ends up with bad read pairing. But I noticed that the reads are messed up in two ways: 1) there are duplicated reads, and 2) there are regions where the paired ends are misaligned. I managed to trace this back to the align task, specifically encode_task_trim_adapter.py. My reading of that script is that it first runs cutadapt on each fastq, then merges the trimmed fastqs from the same biological replicate, triggering shell commands to do so.
One important detail is that the output filenames supplied to those shell commands are bare filenames with no particular path, i.e. everything lands in the current working directory. Because my input (tech-rep) fastqs from the same bio-rep have identical filenames and only differ by parent directory (path), their cutadapt results all get the same output name and, I think, overwrite each other. Then, during the merge step, where the trimmed fastqs are supplied as a list, a single file (because of the shared filename) gets passed three times and three identical copies are merged together. I think this is where things get messed up.
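To make the collision concrete, here is a minimal sketch (not the pipeline's actual code; the paths and suffix handling are made up for illustration):

```python
import os

# Two technical replicates of the same biological replicate, identical basenames:
fastqs = [
    '/data/techrep1/sample_R1.fastq.gz',
    '/data/techrep2/sample_R1.fastq.gz',
]

# If the trimmed output name is derived from the basename only, both inputs
# map to the same file in the current working directory:
trimmed = [os.path.basename(f).replace('.fastq.gz', '.trim.fastq.gz') for f in fastqs]
print(trimmed)
# ['sample_R1.trim.fastq.gz', 'sample_R1.trim.fastq.gz']  -> the second run overwrites the first

# One way to avoid it: make the intermediate name unique per technical replicate:
trimmed_unique = [
    'trim.rep{}.{}'.format(i, os.path.basename(f)) for i, f in enumerate(fastqs)
]
print(trimmed_unique)
# ['trim.rep0.sample_R1.fastq.gz', 'trim.rep1.sample_R1.fastq.gz']
```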
If this makes sense and the root-cause analysis is correct, I'm wondering if it would be possible to improve the handling of intermediate results by giving the cutadapt outputs temporary unique names? To be clear, Cromwell handles the input side of this fine, since it localizes each file into its own unique directory (so it works even when the original names are kept).
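For comparison, a rough sketch of the per-input localization I mean, in the style of what Cromwell does for inputs (localize_fastq and work_dir are placeholder names, not pipeline functions):

```python
import os
import shutil
import tempfile

def localize_fastq(fastq_path, work_dir):
    # Copy every input into its own scratch directory so identical basenames
    # can never collide, even if downstream commands only use the basename.
    sub_dir = tempfile.mkdtemp(dir=work_dir)                        # unique dir per input
    local_copy = os.path.join(sub_dir, os.path.basename(fastq_path))
    shutil.copy(fastq_path, local_copy)                             # original basename kept
    return local_copy
```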
Yeah, this looks like an issue of conflicting file names between R1 and R2 fastqs. cutadapt in encode_task_trim_adapter.py will try to write to the same output filename for R1 and R2.
Please rename the R2 fastq and try again.
Hi @leepc12 , thank you for the quick check and sorry for my delay 😄 Renaming works fine. To clarify, the reason I raised this issue is to see whether there is any interest in updating the pipeline to handle such cases so that renaming inputs is not needed. The naming is a fixed pattern from our data source, so renaming would be an extra step for us before running the WDL. I'm wondering whether handling this inside the ATAC pipeline would be appropriate.
Have also observed this issue. FYI, when running on Slurm this resulted in an OOM error (for encode_task_filter.py), so the problem wasn't obvious until I ran the same fastq files interactively and didn't get the OOM error.
Workaround by renaming worked fine.
I second @yunhailuo's suggestion to handle this in the pipeline