
IO experiments #1554

Open
betatim opened this issue Dec 13, 2016 · 5 comments

betatim commented Dec 13, 2016

This is a small study of different ways to handle the input. The idea is to gather some data on what works, what doesn't, and what is slow or fast.

https://gist.github.com/betatim/d712d0b47a6136998c16561c8f1ca686

Interesting observations:

  • gunzip'ing is slow
  • reading a plain-text file with pure Python is as fast as using ReadParser (but this is a dumb, non-robust Python version)
  • open(fname, 'rb') beats everything if you do not decode the bytes (treat everything as ints)
  • open(..., 'r', encoding='ascii') is faster than decoding each line ourselves
  • messing about with buffering=... doesn't seem to do much

Not many conclusions yet.
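
For the record, a minimal sketch of the kind of comparison behind these observations (not the gist code; `reads.fa`/`reads.fa.gz` are placeholder file names):

```python
import gzip
import time

def read_bytes(fname):
    # open(fname, 'rb'): never decode, treat each line as raw bytes/ints.
    n = 0
    with open(fname, 'rb') as f:
        for line in f:
            n += len(line)
    return n

def read_ascii(fname):
    # Let the io layer decode with a fixed codec instead of decoding
    # each line ourselves.
    n = 0
    with open(fname, 'r', encoding='ascii') as f:
        for line in f:
            n += len(line)
    return n

def read_gzip(fname):
    # gunzip on the fly; this is the slow variant from the list above.
    n = 0
    with gzip.open(fname, 'rb') as f:
        for line in f:
            n += len(line)
    return n

for func, fname in [(read_bytes, 'reads.fa'),
                    (read_ascii, 'reads.fa'),
                    (read_gzip, 'reads.fa.gz')]:
    start = time.time()
    func(fname)
    print('%s %.2fs' % (func, time.time() - start))
```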


betatim commented Dec 13, 2016

Some example output from running on the ecoli file:

```
<function read_bytes at 0x101cf3620> 12.95s
<function read_readparser at 0x101cf38c8> 25.55s
<function read_gzcat2 at 0x101cf37b8> 25.81s
<function read_plain at 0x101cf3510> 24.93s
<function read_plain2 at 0x101cf3598> 17.61s
```


ctb commented Dec 13, 2016

nice! Do we know yet why @camillescott's Cython version of read parsing was so much faster? Is it just the call-into-Python problem?


betatim commented Dec 13, 2016

One more note: there must be something wrong with how I used mmap, because it runs "forever". Any wisdom?

I gave the Cython ReadParser a try here, but it didn't speed up the overall script.
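
For comparison, here is a minimal mmap-based reader that should terminate (a sketch, not what's in the gist). One classic way to make mmap loop forever is to miss that `mm.readline()` signals EOF by returning an empty bytestring rather than raising:

```python
import mmap

def read_mmap(fname):
    # Map the file read-only and iterate line by line; mm.readline()
    # returns b'' at EOF, which is what ends the loop.
    n = 0
    with open(fname, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            line = mm.readline()
            while line:
                n += len(line)
                line = mm.readline()
    return n
```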


betatim commented Dec 16, 2016

In light of the experiments in #1553, I'm thinking that worrying about how to read things from disk faster won't give us big wins compared to speeding up how we enter things into a countgraph and friends.

I think we should do some thinking about how to take advantage of the fact that a super simple bit of Python can be ~2x faster than ReadParser (even read_plain2 is faster). For example, could we use a simple, fast "parser" with no error handling until we encounter something that confuses it, and only then switch to a robust/slow parser to try to recover from the error? Ideally all without making the code more complex. Something like the sketch below.
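
A rough sketch of that fast-path/fallback idea. `parse_robust` is a stand-in for something like ReadParser, and for simplicity this restarts from the beginning on failure, which would re-yield records; a real version would need to resume or deduplicate:

```python
def parse_fasta_fast(fname):
    # Deliberately dumb: assumes strictly alternating '>' header lines and
    # single-line sequences, and bails out on anything else.
    with open(fname, 'rb') as f:
        while True:
            header = f.readline()
            if not header:
                return
            seq = f.readline()
            if not header.startswith(b'>') or not seq:
                raise ValueError('fast parser confused at %r' % header[:40])
            yield header[1:].rstrip(), seq.rstrip()

def parse_with_fallback(fname, parse_robust):
    # Optimistic fast path; on the first record that confuses the fast
    # "parser", fall back to the robust/slow one.
    try:
        yield from parse_fasta_fast(fname)
    except ValueError:
        yield from parse_robust(fname)
```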


ctb commented Dec 16, 2016 via email
