Look at ParaText for insight into how to speed up our FASTQ parsing #1500

Open
ctb opened this issue Nov 2, 2016 · 2 comments

Comments

@ctb
Member

ctb commented Nov 2, 2016

http://www.wise.io/tech/paratext

@betatim
Member

betatim commented Dec 13, 2016

#1554 was started to understand this a bit more.

From a first look at their code, it seems they split the input file into chunks and process each chunk in a separate thread, at least for plain text files. They determine the chunk boundaries by seeking forward a predetermined number of bytes, then continuing until they find a newline (while handling quoted sections that span newlines). After this first seek()-based pass through the file, they start the threads that do the actual work. I'm not sure how we would copy this idea if we want to support streaming input, which can't seek.
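To make the boundary-finding pass concrete, here is a rough sketch of how I read the idea. It is not ParaText's code: `chunk_offsets` and `chunk_size` are names made up for illustration, it ignores quoted sections entirely, and it assumes a seekable file opened in binary mode. For FASTQ a newline is also not a record boundary (records span four lines), so a real version would need an extra step to resynchronise on a record start.

```python
import io


def chunk_offsets(f, chunk_size):
    """Return newline-aligned (start, end) byte ranges covering a seekable binary file.

    Hypothetical helper: jump ahead chunk_size bytes, then read on to the
    next newline so that every chunk ends on a line boundary.
    """
    f.seek(0, io.SEEK_END)
    size = f.tell()
    bounds = [0]
    while bounds[-1] < size:
        target = bounds[-1] + chunk_size
        if target >= size:
            bounds.append(size)
            break
        f.seek(target)
        f.readline()  # consume the rest of the current line
        bounds.append(min(f.tell(), size))
    return list(zip(bounds[:-1], bounds[1:]))
```

Each `(start, end)` pair could then be handed to a worker thread that seeks to `start` and parses up to `end`. This only works when the input supports `seek()`, which is exactly the limitation for streamed input.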

@betatim
Member

betatim commented Dec 13, 2016

I'm not sure I understand why they do this. Reading the bytes from disk shouldn't take much CPU or time. Pushing records into a queue needs extra memory but removes the need for complicated seeking.

How big a buffer do we need to keep N>1 consumers busy?
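As a sketch of the queue alternative: a single reader pulls records off the stream (no seeking needed) and pushes batches into a bounded queue that N consumer threads drain. Everything here is an assumption made for illustration: the names `read_records` and `count_bases`, the strict four-lines-per-record FASTQ layout, and the toy workload of summing sequence lengths. The `max_batches` parameter is the buffer size in question; this does not say what value keeps the consumers busy, it only gives a knob to measure with. In CPython the GIL will also limit any speedup for pure-Python work, so this shows the structure rather than the performance.

```python
import io
import queue
import threading

SENTINEL = None  # tells a consumer that the input is exhausted


def read_records(stream, batch_size):
    """Yield lists of 4-line FASTQ records from a (possibly non-seekable) text stream."""
    batch = []
    while True:
        lines = [stream.readline() for _ in range(4)]
        if not lines[0]:  # end of stream
            break
        batch.append(tuple(line.rstrip("\n") for line in lines))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def count_bases(stream, n_consumers=2, batch_size=2, max_batches=4):
    """Toy workload: total sequence length, summed by n_consumers threads."""
    # Bounded queue: the producer blocks when the consumers fall behind,
    # so memory use is capped at roughly max_batches * batch_size records.
    q = queue.Queue(maxsize=max_batches)
    totals = [0] * n_consumers  # one slot per consumer, so no lock is needed

    def consume(i):
        while True:
            batch = q.get()
            if batch is SENTINEL:
                break
            totals[i] += sum(len(seq) for _name, seq, _plus, _qual in batch)

    workers = [threading.Thread(target=consume, args=(i,)) for i in range(n_consumers)]
    for w in workers:
        w.start()
    for batch in read_records(stream, batch_size):
        q.put(batch)
    for _ in workers:
        q.put(SENTINEL)  # one sentinel per consumer
    for w in workers:
        w.join()
    return sum(totals)
```

Timing this with different `max_batches` and `batch_size` values on real input would be one way to approach the buffer-size question empirically.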
