Unexpected input file name changes output file name format on split_on_adapter #30

Open
groodri opened this issue Jan 12, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@groodri

groodri commented Jan 12, 2023

When an input FASTX file name contains a dot (.) that is not part of the file extension (for example, testfile.1.fastq.gz), split_on_adapter does not treat .fastq.gz alone as the suffix. As a result, the output file is named testfile.fastq.gz instead of testfile.1_split.fastq.gz. This can break downstream pipeline steps whenever such a naming scheme is introduced, because the output file name no longer matches what those steps expect.

This is due to lines 123-126 in split_on_adapter.py.

For example:

>>> from natsort import natsorted
>>> from pathlib import Path
>>> fastxs = natsorted(list(Path('.').rglob('*.fastq*')), key=str)
>>> fastx = fastxs[2]
>>> fastx.name
'testfile.1.fastq.gz'
>>> fastx.with_name(fastx.name.replace('.fastq', '').replace('.gz', '') + '_split').with_suffix('.fastq.gz')
PosixPath('testfile.fastq.gz')

This can be fixed by appending the full suffix directly instead of calling with_suffix:

>>> fastx.with_name(fastx.name.replace('.fastq', '').replace('.gz', '') + '_split.fastq.gz')
PosixPath('testfile.1_split.fastq.gz')

Essentially, the current code overwrites its own '_split' addition whenever the name contains an extra dot: with_suffix treats everything after the last dot (here, .1_split) as the suffix and replaces it with .fastq.gz.
Accounting for these unexpected suffixes with the --pattern flag can be quite difficult (what pattern would work for this case, assuming the folder contains more files named .2.fastq.gz, ..., .600.fastq.gz?), so this seems a pertinent change.

File names can contain non-suffix dots for a variety of reasons, for example when a FASTQ file is split into multiple files of N reads each for better memory management.
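
For illustration, the fix described above could be wrapped in a small helper. This is only a sketch under assumptions: the function name `split_output_path` and its parameters are hypothetical and not part of split_on_adapter.py, and it assumes the output should always end in .fastq.gz, as in the one-liner above. Unlike chained `replace` calls, it strips `.gz` and `.fastq` only when they appear at the end of the name.

```python
from pathlib import Path


def split_output_path(fastx: Path, tag: str = "_split",
                      out_suffix: str = ".fastq.gz") -> Path:
    """Build the output path by inserting `tag` before the FASTQ extension.

    Hypothetical helper, not the tool's actual code. Only a trailing
    '.gz' and a trailing '.fastq' are removed, so any other dots in the
    file name (e.g. a chunk number) are preserved.
    """
    stem = fastx.name
    for ext in (".gz", ".fastq"):
        if stem.endswith(ext):
            stem = stem[: -len(ext)]
    # Append tag and suffix as plain text; with_suffix() is avoided on
    # purpose because it would replace everything after the last dot.
    return fastx.with_name(stem + tag + out_suffix)


for name in ("testfile.fastq.gz", "testfile.1.fastq.gz", "testfile.600.fastq"):
    print(split_output_path(Path(name)))
```

With these three inputs the helper yields testfile_split.fastq.gz, testfile.1_split.fastq.gz and testfile.600_split.fastq.gz, so names without extra dots keep their current output format.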

@onordesjo

Thanks @groodri, I would agree with you that it would be sensible to fix this.

In the meantime, if you need an immediate workaround (and for the benefit of others who may need one), feel free to rename the files as below. The dot is escaped so that it matches only a literal dot rather than any character:

rename 's/testfile\./testfile_/g' *.fastq.gz
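
The command above uses Perl-style `rename` syntax; on some systems `rename` is the util-linux variant, which takes different arguments. Where that is a concern, the same workaround could be sketched in Python. Everything here is an assumption for illustration: the names `underscored` and `rename_chunks` are hypothetical, and the pattern assumes chunked files are named `<base>.<digits>.fastq` or `<base>.<digits>.fastq.gz`.

```python
import re
from pathlib import Path

# Assumed naming scheme: '<base>.<digits>.fastq' with an optional '.gz'.
CHUNK = re.compile(r"^(?P<base>.+)\.(?P<chunk>\d+)(?P<ext>\.fastq(?:\.gz)?)$")


def underscored(name: str) -> str:
    """Replace the dot before a numeric chunk with an underscore.

    Names that do not match the assumed scheme are returned unchanged.
    """
    m = CHUNK.match(name)
    if m is None:
        return name
    return f"{m['base']}_{m['chunk']}{m['ext']}"


def rename_chunks(folder: Path, dry_run: bool = True) -> None:
    """Print the planned renames in `folder`; apply them if dry_run is False."""
    for path in sorted(folder.glob("*.fastq*")):
        new_name = underscored(path.name)
        if new_name != path.name:
            print(f"{path.name} -> {new_name}")
            if not dry_run:
                path.rename(path.with_name(new_name))
```

Calling `rename_chunks(Path('.'))` only prints the planned renames; passing `dry_run=False` applies them. Files without a numeric chunk, such as testfile.fastq.gz, are left untouched.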

@ollenordesjo ollenordesjo added the bug Something isn't working label Feb 1, 2023