Hi @leepc12 , first, thank you for the great pipeline here 🙏

Describe the bug

The failure is partially due to my input fastqs having identical file names, differing only by their parent directories (paths). For example,

These two technical replicates, when processed by cutadapt and merged, get overwritten and merged on top of each other. This leads to bad fastqs, which trigger an error later on.
OS/Platform
Slurm HPC with Singularity (converted from the Docker image)
Input JSON file

Paste contents of your input JSON file.

Troubleshooting result

The failure is on the filter task with the following direct complaint:
```
Traceback (most recent call last):
  File "/software/atac-seq-pipeline/src/encode_task_filter.py", line 438, in <module>
    main()
  File "/software/atac-seq-pipeline/src/encode_task_filter.py", line 344, in main
    args.nth, args.mem_gb, args.out_dir)
  File "/software/atac-seq-pipeline/src/encode_task_filter.py", line 178, in rm_unmapped_lowq_reads_pe
    'No reads found aftering filtering "samtools fixmate"d PE BAM with '
ValueError: No reads found aftering filtering "samtools fixmate"d PE BAM with "samtools view -F 1804 -f 2". Reads are not properly paired even though mapping rate is good?
...
```
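For reference, my reading of what that filter requires (standard SAM flag arithmetic; this summary is mine, not from the pipeline docs):

```python
# "samtools view -f 2 -F 1804" in plain terms:
# -f 2    keep only reads flagged as properly paired
# -F 1804 drop reads with any of: unmapped (4), mate unmapped (8),
#         secondary (256), QC fail (512), duplicate (1024)
assert 4 + 8 + 256 + 512 + 1024 == 1804
# If the merged fastqs have scrambled R1/R2 pairing, few or no reads end up
# flagged as properly paired, so the -f 2 filter removes everything.
```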
Honestly, it's still not clear to me how this ends up with bad read pairing. But I noticed that the reads are messed up in two ways: 1) there are duplicated reads, and 2) there are regions where the paired ends are misaligned. I managed to trace this back to the align task, specifically encode_task_trim_adapter.py. My reading of that script is that it first runs cutadapt on each fastq, then merges the trimmed fastqs from the same biological replicate, triggering shell commands to do so.
One important detail is that the output filenames supplied to those shell commands are bare filenames with no particular path, i.e. everything lands in the current working directory. Because my input (tech-rep) fastqs from the same bio-rep have identical filenames and only differ by parent directory (path), their cutadapt results all get the same output name and, I think, overwrite each other. Then, during the merge step, where the trimmed fastqs are supplied as a list, a single file (because of the shared filename) gets passed three times and three identical copies are merged together. I think this is where things get messed up.
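To make the collision concrete, here is a minimal sketch (not the pipeline's actual code; the paths and suffix handling are made up for illustration):

```python
import os

# Two technical replicates of the same biological replicate, identical basenames:
fastqs = [
    '/data/techrep1/sample_R1.fastq.gz',
    '/data/techrep2/sample_R1.fastq.gz',
]

# If the trimmed output name is derived from the basename only, both inputs
# map to the same file in the current working directory:
trimmed = [os.path.basename(f).replace('.fastq.gz', '.trim.fastq.gz') for f in fastqs]
print(trimmed)
# ['sample_R1.trim.fastq.gz', 'sample_R1.trim.fastq.gz']  -> the second run overwrites the first

# One way to avoid it: make the intermediate name unique per technical replicate:
trimmed_unique = [
    'trim.rep{}.{}'.format(i, os.path.basename(f)) for i, f in enumerate(fastqs)
]
print(trimmed_unique)
# ['trim.rep0.sample_R1.fastq.gz', 'trim.rep1.sample_R1.fastq.gz']
```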
If this makes sense and the root-cause analysis is correct, I'm wondering if it would be possible to improve the handling of intermediate results by giving the cutadapt outputs temporary unique names? To be clear, Cromwell handles the input side of this fine, since it localizes each file into its own unique directory (so it works even when the original names are kept).
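For comparison, a rough sketch of the per-input localization I mean, in the style of what Cromwell does for inputs (localize_fastq and work_dir are placeholder names, not pipeline functions):

```python
import os
import shutil
import tempfile

def localize_fastq(fastq_path, work_dir):
    # Copy every input into its own scratch directory so identical basenames
    # can never collide, even if downstream commands only use the basename.
    sub_dir = tempfile.mkdtemp(dir=work_dir)                        # unique dir per input
    local_copy = os.path.join(sub_dir, os.path.basename(fastq_path))
    shutil.copy(fastq_path, local_copy)                             # original basename kept
    return local_copy
```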
Yeah, this looks like an issue of conflicting file names between R1 and R2 fastqs. cutadapt in encode_task_trim_adapter.py will try to write to the same output filename for R1 and R2.
Please rename the R2 fastq and try again.
Hi @leepc12 , thank you for the quick check and sorry for my delay 😄 Renaming works fine. To clarify, the reason I raised this issue is to see whether there is any interest in updating the pipeline to handle such cases so that renaming inputs is not needed. The naming is a fixed pattern from our data source, so renaming would be an extra step for us before running the WDL. I'm wondering whether handling this inside the ATAC pipeline would be appropriate.
Have also observed this issue. FYI, when running on Slurm this resulted in an OOM error (for encode_task_filter.py), so the problem wasn't obvious until I ran the same fastq files interactively and didn't get the OOM error.
Workaround by renaming worked fine.
I second @yunhailuo's suggestion to handle this in the pipeline