
IO experiments #1554

Open
betatim opened this issue Dec 13, 2016 · 5 comments

betatim commented Dec 13, 2016

This is a small study of different ways to handle the input. The idea is to gather some data on what works, what doesn't, and what is slow or fast.

https://gist.github.com/betatim/d712d0b47a6136998c16561c8f1ca686

Interesting observations:

  • gunzip'ing is slow
  • reading a plain-text file with pure Python is as fast as using ReadParser (but this is a dumb, non-robust Python version)
  • open(fname, 'rb') beats everything if you do not decode the bytes (treat everything as ints)
  • open(..., 'r', encoding='ascii') is faster than decoding each line ourselves
  • messing about with buffering=... doesn't seem to do much

Not many conclusions yet.
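
For the record, a minimal sketch of the kind of comparison behind these observations (not the gist code; `reads.fa`/`reads.fa.gz` are placeholder file names):

```python
import gzip
import time

def read_bytes(fname):
    # open(fname, 'rb'): never decode, treat each line as raw bytes/ints.
    n = 0
    with open(fname, 'rb') as f:
        for line in f:
            n += len(line)
    return n

def read_ascii(fname):
    # Let the io layer decode with a fixed codec instead of decoding
    # each line ourselves.
    n = 0
    with open(fname, 'r', encoding='ascii') as f:
        for line in f:
            n += len(line)
    return n

def read_gzip(fname):
    # gunzip on the fly; this is the slow variant from the list above.
    n = 0
    with gzip.open(fname, 'rb') as f:
        for line in f:
            n += len(line)
    return n

for func, fname in [(read_bytes, 'reads.fa'),
                    (read_ascii, 'reads.fa'),
                    (read_gzip, 'reads.fa.gz')]:
    start = time.time()
    func(fname)
    print('%s %.2fs' % (func, time.time() - start))
```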


betatim commented Dec 13, 2016

Some example output from running on the ecoli file:

```
<function read_bytes at 0x101cf3620> 12.95s
<function read_readparser at 0x101cf38c8> 25.55s
<function read_gzcat2 at 0x101cf37b8> 25.81s
<function read_plain at 0x101cf3510> 24.93s
<function read_plain2 at 0x101cf3598> 17.61s
```


ctb commented Dec 13, 2016

nice! Do we know yet why @camillescott's Cython version of read parsing was so much faster? Is it just the call-into-Python problem?


betatim commented Dec 13, 2016

One more note: there must be something wrong with how I used mmap, because it runs "forever". Any wisdom?

I gave the Cython ReadParser a try here, but it didn't speed up the overall script.
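
For comparison, here is a minimal mmap-based reader that should terminate (a sketch, not what's in the gist). One classic way to make mmap loop forever is to miss that `mm.readline()` signals EOF by returning an empty bytestring rather than raising:

```python
import mmap

def read_mmap(fname):
    # Map the file read-only and iterate line by line; mm.readline()
    # returns b'' at EOF, which is what ends the loop.
    n = 0
    with open(fname, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            line = mm.readline()
            while line:
                n += len(line)
                line = mm.readline()
    return n
```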


betatim commented Dec 16, 2016

In light of the experiments in #1553, I'm thinking that worrying about how to read things from disk faster won't give us big wins compared to speeding up how we enter things into a countgraph and friends.

I think we should do some thinking about how to take advantage of the fact that a super simple bit of Python can be ~2x faster than ReadParser (even read_plain2 is faster). For example, could we use a simple, fast "parser" with no error handling until we encounter something that confuses it, and only then switch to a robust/slow parser to try to recover from the error? Ideally all without making the code more complex. Something like the sketch below.
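
A rough sketch of that fast-path/fallback idea. `parse_robust` is a stand-in for something like ReadParser, and for simplicity this restarts from the beginning on failure, which would re-yield records; a real version would need to resume or deduplicate:

```python
def parse_fasta_fast(fname):
    # Deliberately dumb: assumes strictly alternating '>' header lines and
    # single-line sequences, and bails out on anything else.
    with open(fname, 'rb') as f:
        while True:
            header = f.readline()
            if not header:
                return
            seq = f.readline()
            if not header.startswith(b'>') or not seq:
                raise ValueError('fast parser confused at %r' % header[:40])
            yield header[1:].rstrip(), seq.rstrip()

def parse_with_fallback(fname, parse_robust):
    # Optimistic fast path; on the first record that confuses the fast
    # "parser", fall back to the robust/slow one.
    try:
        yield from parse_fasta_fast(fname)
    except ValueError:
        yield from parse_robust(fname)
```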


ctb commented Dec 16, 2016 via email
