Look at ParaText for insight into how to speed up our FASTQ parsing #1500

ctb · 2016-11-02T13:52:44Z

http://www.wise.io/tech/paratext

betatim · 2016-12-13T15:05:41Z

#1554 was started to understand this a bit more.

From a first look at their code, it seems they split the input file into chunks and process each chunk in a separate thread, at least for plain-text files. They determine the chunk boundaries by seeking ahead a predetermined number of bytes, then reading on until they find a newline (handling quoted sections that span newlines). Only after this first seek()-based pass through the file do they start the threads that do the actual work. Not sure how we would copy this idea if we want to support streaming input, which can't seek.
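A rough sketch of the chunking step as I understand it from the description above. This is not ParaText's code: `chunk_offsets` and its parameters are names made up for illustration, and it ignores quoted sections entirely. It also only aligns chunks to line boundaries, which would not be enough for FASTQ, where a record spans several lines.

```python
import os


def chunk_offsets(path, n_chunks):
    """Return (start, end) byte ranges that each begin at the start of a line.

    Hypothetical helper: jump ahead by a fixed number of bytes, then read on
    to the next newline so that no chunk starts in the middle of a line.
    """
    size = os.path.getsize(path)
    if size == 0:
        return []
    step = max(1, size // n_chunks)
    bounds = [0]
    with open(path, "rb") as fh:
        for i in range(1, n_chunks):
            target = i * step
            if target <= bounds[-1]:
                continue  # an earlier long line already covered this target
            fh.seek(target)
            fh.readline()  # consume the rest of the line we landed in
            pos = fh.tell()
            if pos >= size:
                break  # ran off the end; the last chunk absorbs the remainder
            if pos > bounds[-1]:
                bounds.append(pos)
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))
```

Each worker thread would then open the file itself, seek to its `start`, and stop reading at `end`. That requirement to seek is exactly what rules this approach out for piped input.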

betatim · 2016-12-13T15:08:11Z

Not sure I understand why they do this. Reading the bytes from disk shouldn't take much CPU or time. Pushing records into a queue needs extra memory but removes the need for complicated seeking.

How big a buffer do we need to keep N>1 consumers busy?
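For comparison, a minimal sketch of the queue alternative, assuming one reader thread and N consumer threads sharing a bounded `queue.Queue`. All names here (`read_records`, `consume`, `parse_stream`, `max_buffered`) are hypothetical, records are assumed to be exactly four lines, and the "parsing" is just a counter standing in for real work.

```python
import io
import queue
import threading


def read_records(stream, out_q, n_consumers, lines_per_record=4):
    """Producer: read sequentially (no seek needed) and enqueue whole records."""
    record = []
    for line in stream:
        record.append(line)
        if len(record) == lines_per_record:
            out_q.put(record)  # blocks while the queue is full
            record = []
    # A trailing incomplete record is silently dropped in this sketch.
    for _ in range(n_consumers):
        out_q.put(None)  # one sentinel per consumer signals end of input


def consume(in_q, results, lock):
    """Consumer: pull records until the sentinel arrives."""
    count = 0
    while True:
        record = in_q.get()
        if record is None:
            break
        count += 1  # placeholder for the real parsing work
    with lock:
        results.append(count)


def parse_stream(stream, n_consumers=2, max_buffered=1024):
    """Run one reader and n_consumers workers; return the records processed."""
    q = queue.Queue(maxsize=max_buffered)  # the bound caps memory use
    results = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=consume, args=(q, results, lock))
        for _ in range(n_consumers)
    ]
    for w in workers:
        w.start()
    read_records(stream, q, n_consumers)
    for w in workers:
        w.join()
    return sum(results)


if __name__ == "__main__":
    fake_fastq = io.StringIO("@r1\nACGT\n+\nIIII\n" * 10)
    print(parse_stream(fake_fastq, n_consumers=3, max_buffered=4))
```

Here `max_buffered` is the buffer size being asked about: the reader blocks once that many records are waiting, so memory stays bounded. The right value would have to be measured. Note also that in CPython the GIL keeps pure-Python consumers from running in parallel, so any real speedup would depend on the parsing happening in code that releases it.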

ctb added the optimization label Dec 8, 2016
