Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix exception when scanning small fastq files #28

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

zwets
Copy link

@zwets zwets commented Oct 19, 2021

Scanning a file with fewer than 100 reads yields a StopIteration exception, due to an exhausted iterator in next(all_lines).

@zwets
Copy link
Author

zwets commented Oct 20, 2021

Actually, looking over the code for check_sequence_identifier_format(file) again, that whole block of code doesn't do what it is supposed to do, namely check the first 100 and last 100 reads for the 'new format'. Instead it returns true if and only if the last read in the file has the new format (but does so in a very roundabout way).

There is also no reason to collect the 2x100 read headers. All the code needs to do is scan headers until it encounters a "new format" header - which will most likely happen on the first read already. I'll be happy to submit a patch.

@zwets
Copy link
Author

zwets commented Oct 20, 2021

Given that checking just the final read has apparently always worked (or downstream bugs would have shown up), I'd suggest replacing the whole block by just checking the first and last read in the file. That would also eliminate the brittle tail -100 system call.

@zwets
Copy link
Author

zwets commented Oct 20, 2021

Given that checking just the final read has apparently always worked (or downstream bugs would have shown up), I'd suggest replacing the whole block by just checking the first and last read in the file.

I have just added a second commit to the PR that does just that: check the first and last read.

If you'd prefer to check more reads at the head of the file, then that should be doable with two extra lines lines of code, but remember that the old code never even checked the first read, so it's probably not worth the bother.

@wbazant
Copy link

wbazant commented Apr 19, 2022

I can't see clearly what your PR does, but based on issue description I might have fixed this too. This is what I thought what the fix was, when I ran into it: rdemko2332@379a7c8

@wbazant
Copy link

wbazant commented Apr 19, 2022

Right actually the latest commit is about this too! 23a3834

@ljmciver sorry to open a conversation here rather than on your forum but I hope it works well enough!

You may not need to kill the run if the file is too short - I was able to run kneaddata successfully on files with less than a 100 lines, with rdemko2332@379a7c8 . I think I've found and fixed this when I was doing some testing for the memory issue I once opened.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants