Fixes issue with fastq.gz files having _I1_ etc. #109

Open
wants to merge 3 commits into
base: master
Changes from 1 commit
37 changes: 29 additions & 8 deletions sequence_processing_pipeline/FastQCJob.py
@@ -1,10 +1,11 @@
from os import listdir, makedirs
from os.path import exists, join, basename
from os.path import basename, exists, join, split
charles-cowart marked this conversation as resolved.
from sequence_processing_pipeline.Job import Job
from sequence_processing_pipeline.PipelineError import PipelineError
from functools import partial
from json import dumps
import logging
from re import compile


class FastQCJob(Job):
@@ -94,13 +95,33 @@ def _find_projects(self, path_to_run_id_data_fastq_dir, is_raw_input):
'zero_files' not in x]

# break files up into R1, R2, I1, I2
# assume _R1_ does not occur in the path as well.
r1_only = [x for x in files if '_R1_' in x]
r2_only = [x for x in files if '_R2_' in x]

# amplicon runs may or may not have an i2. this is okay.
i1_only = [x for x in files if '_I1_' in x]
i2_only = [x for x in files if '_I2_' in x]
# use capturing to handle both raw files as well as trimmed and
# filtered files. We don't need to process the captured string.
i1_files = compile(r"^.*_L\d{3}_I1_\d{3}\.(trimmed\.|filtered"
Member

Why the capture group? It doesn't seem to be used.

Collaborator Author

A side effect of using the capture group is that I can use one regex to handle '.trimmed.fastq.gz', '.filtered.fastq.gz', and '.fastq.gz' files. Brackets ('[', ']') and ORs ('|') didn't work for me.

Member

It is not good practice to rely on side effects... Suggest adding ?: to make the group non-capturing, see:

>>> import re
>>> re.compile(r'(?:foo|bar)?baz').match('baz')
<re.Match object; span=(0, 3), match='baz'>
>>> re.compile(r'(?:foo|bar)?baz').match('foobaz')
<re.Match object; span=(0, 6), match='foobaz'>
>>> re.compile(r'(?:foo|bar)?baz').match('barbaz')
<re.Match object; span=(0, 6), match='barbaz'>
>>> re.compile(r'(?:foo|bar)?baz').match('fbaz')
>>>
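
Applied to the pattern in this diff, the suggestion could look roughly like the sketch below. It is only an illustration: the name I1_PATTERN and the sample file names are made up for the example and do not come from the repository.

```python
import re

# Hypothetical rewrite of the I1 pattern from this diff. The optional
# infix ('trimmed.' or 'filtered.') sits in a non-capturing group, so the
# regex records no groups at all.
I1_PATTERN = re.compile(
    r"^.*_L\d{3}_I1_\d{3}\."
    r"(?:trimmed\.|filtered\.)?"
    r"fastq\.gz$"
)

# Illustrative file names only (assumed, not taken from a real run).
names = [
    "sample_S1_L001_I1_001.fastq.gz",
    "sample_S1_L001_I1_001.trimmed.fastq.gz",
    "sample_S1_L001_I1_001.filtered.fastq.gz",
    "sample_S1_L001_R1_001.fastq.gz",
]

# Keep only the names the I1 pattern accepts.
matched = [name for name in names if I1_PATTERN.match(name)]
```

With this form, the three I1 variants match and the R1 file does not, while the pattern's groups attribute stays at 0.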

r"\.|)fastq\.gz$")
i2_files = compile(r"^.*_L\d{3}_I2_\d{3}\.(trimmed\.|filtered"
r"\.|)fastq\.gz$")
r1_files = compile(r"^.*_L\d{3}_R1_\d{3}\.(trimmed\.|filtered"
r"\.|)fastq\.gz$")
r2_files = compile(r"^.*_L\d{3}_R2_\d{3}\.(trimmed\.|filtered"
r"\.|)fastq\.gz$")

# i1_only, i2_only, r1_only, r2_only = ([] for i in range(4))
charles-cowart marked this conversation as resolved.
i1_only = []
i2_only = []
r1_only = []
r2_only = []

for some_path in files:
charles-cowart marked this conversation as resolved.
Member

Please add a regression test to verify that the bug exists and these changes fix it

Collaborator Author

Sure, I'll add a test with the old filenames to show how they fail.

Member

Thanks. It's important that bug fixes are coupled with regression tests to ensure the bugs are fixed and don't arise again.
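
A regression test along these lines might look like the sketch below. It assumes the failure mode hinted at by the removed comment ("assume _R1_ does not occur in the path as well"), namely that a marker such as _R1_ appears in a directory name. The helper names, the test class, and the paths are hypothetical stand-ins, not the project's actual test code.

```python
import re
import unittest
from os.path import basename

# Hypothetical stand-in for the R1 pattern in this diff.
R1_PATTERN = re.compile(
    r"^.*_L\d{3}_R1_\d{3}\.(?:trimmed\.|filtered\.)?fastq\.gz$"
)

# Assumed example paths: the directory name itself contains '_R1_'.
PATHS = [
    "/runs/proj_R1_test/sample_S1_L001_R1_001.fastq.gz",
    "/runs/proj_R1_test/sample_S1_L001_R2_001.fastq.gz",
]


def old_r1_filter(paths):
    # Previous behavior: substring test against the whole path.
    return [p for p in paths if '_R1_' in p]


def new_r1_filter(paths):
    # New behavior: anchored regex applied to the file name only.
    return [p for p in paths if R1_PATTERN.match(basename(p))]


class ReadClassificationRegressionTest(unittest.TestCase):
    def test_old_filter_misclassifies_r2(self):
        # Documents the bug: the R2 file is picked up as R1.
        self.assertEqual(old_r1_filter(PATHS), PATHS)

    def test_new_filter_selects_only_r1(self):
        self.assertEqual(new_r1_filter(PATHS), [PATHS[0]])
```

The first test pins down the old, incorrect behavior so the bug is demonstrated; the second checks that matching on the file name alone resolves it.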

    _, file_name = split(some_path)
    if i1_files.match(file_name) is not None:
        i1_only.append(some_path)
    elif i2_files.match(file_name) is not None:
        i2_only.append(some_path)
    elif r1_files.match(file_name) is not None:
        r1_only.append(some_path)
    elif r2_files.match(file_name) is not None:
        r2_only.append(some_path)
charles-cowart marked this conversation as resolved.

if not self.is_amplicon and len(i1_only) != len(i2_only):
    raise PipelineError('counts of I1 and I2 files do not match')
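
As a possible alternative to four nearly identical patterns and an if/elif chain, the read type could be captured once with a named group and used as a dictionary key. This is only a sketch of that idea, not what the PR does: READ_PATTERN and group_by_read are hypothetical names, and here the capture group is used on purpose.

```python
import re
from os.path import basename

# Hypothetical single pattern: the named group 'read' captures I1, I2,
# R1 or R2; the optional infix stays non-capturing.
READ_PATTERN = re.compile(
    r"^.*_L\d{3}_(?P<read>[IR][12])_\d{3}\."
    r"(?:trimmed\.|filtered\.)?fastq\.gz$"
)


def group_by_read(paths):
    """Sort paths into I1/I2/R1/R2 lists based on the file name only."""
    groups = {key: [] for key in ("I1", "I2", "R1", "R2")}
    for path in paths:
        match = READ_PATTERN.match(basename(path))
        if match:
            groups[match.group("read")].append(path)
    return groups


# Assumed example paths; files that match no read type are skipped.
groups = group_by_read([
    "/run_I1_x/s_S1_L001_R1_001.fastq.gz",
    "/run/s_S1_L001_I1_001.trimmed.fastq.gz",
    "/run/s_S1_L001_I2_001.filtered.fastq.gz",
    "/run/notes.txt",
])
```

The I1/I2 count check from the diff would then compare len(groups["I1"]) with len(groups["I2"]).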