
Memory allocation vs file load time: the cost of NUMA? #1665

Open
ctb opened this issue Apr 5, 2017 · 9 comments
@ctb
Member

ctb commented Apr 5, 2017

In the vein of @betatim's I/O experiments #1554, I did some benchmarking on my laptop with hashsizes and load times; results here:

https://gist.github.com/ctb/beb9a7eebf1a1282562ff0272efb4dd6

On my laptop, for table sizes between 5 MB and 2 GB, we see a nearly linear slowdown (slope m = 0.7) in sequence file => count table loading.

This is likely due to some nasty combination of NUMA and the cache-unfriendly random access patterns of our Bloom filters/Count-Min sketches.
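
The shape of the curve is easy to reproduce without khmer itself. Here's a minimal sketch (the table sizes, insert count, and one-byte-counter layout are my own choices for illustration, not khmer's) that times random counter increments into progressively larger tables:

```python
import random
import time

def time_inserts(table_size, n_inserts=200_000, seed=42):
    """Time n_inserts increments at random offsets in a table_size-byte
    table (one saturating 8-bit counter per byte)."""
    rng = random.Random(seed)
    table = bytearray(table_size)
    positions = [rng.randrange(table_size) for _ in range(n_inserts)]
    t0 = time.perf_counter()
    for pos in positions:
        if table[pos] < 255:   # saturating 8-bit counter
            table[pos] += 1
    return time.perf_counter() - t0

# Small tables fit in L2/L3 cache; large ones force a main-memory
# round trip on nearly every increment.
for size in (5_000_000, 50_000_000, 500_000_000):
    print(f"{size / 1e6:5.0f} MB table: {time_inserts(size):.3f} s")
```

The increment loop itself is identical at every size; only the table's cache residency changes.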

@ctb
Member Author

ctb commented Apr 5, 2017

The tl;dr is that if we do nothing but increase memory/table size by 10x, k-mer loading slows down by a factor of 7.

See comments #1553 (comment) and a rundown of two possible solutions in #1553 (comment). Note that speeding up insertion will still not change the speed of retrieval, which should be subject to the same badness.

The ultimate goal should probably be to use more cache-coherent data structures (see some comments in e.g. #1215 and #1198).
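
As a concrete example of what "more cache-coherent" can mean, here's a toy blocked Bloom filter (my own sketch, not khmer's or anyone else's implementation): all k probes for a key land inside a single 64-byte cache line, so an add or lookup touches one line instead of k scattered ones.

```python
import hashlib

BLOCK_BITS = 512  # one 64-byte x86 cache line

class BlockedBloom:
    """Blocked Bloom filter sketch: pick one cache-line-sized block per
    key, then set/test k bits within that block only."""

    def __init__(self, n_blocks, k=4):
        self.n_blocks = n_blocks
        self.k = k
        self.bits = bytearray(n_blocks * BLOCK_BITS // 8)

    def _positions(self, kmer):
        h = int.from_bytes(
            hashlib.blake2b(kmer.encode(), digest_size=16).digest(), "little")
        base = (h % self.n_blocks) * BLOCK_BITS   # choose the block
        h >>= 40
        for _ in range(self.k):
            yield base + (h & (BLOCK_BITS - 1))   # bit offset in the block
            h >>= 9

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def get(self, kmer):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(kmer))
```

The trade-off is a slightly worse false-positive rate for the same bit budget, in exchange for one cache miss per operation instead of k.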

@ctb
Member Author

ctb commented Apr 5, 2017

@rob-p has some cool stuff going on with efficient alternatives to (counting) Bloom filters; I wonder if he would be interested in solving our problems for us? Should he stumble across this comment, I would direct his attention to our newly factored out Storage class in storage.hh, and point out that we only need a get_count and an add implementation to try out new data types.
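
To illustrate how small that surface is, here's a toy Python stand-in (the real Storage class is C++, in storage.hh; the method names mirror it, but this is only a sketch): any backend exposing add and get_count on hashed k-mers could be slotted in.

```python
class DictStorage:
    """Toy stand-in for a Storage-style counting backend: a plain dict
    keyed by k-mer hash. Exact rather than probabilistic, so only useful
    for illustrating the two-method interface."""

    def __init__(self):
        self._counts = {}

    def add(self, khash):
        """Increment the count for a hashed k-mer."""
        self._counts[khash] = self._counts.get(khash, 0) + 1

    def get_count(self, khash):
        """Return the count for a hashed k-mer (0 if never seen)."""
        return self._counts.get(khash, 0)
```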

@ctb
Member Author

ctb commented Apr 5, 2017

Ahh, I see @rob-p's cool stuff is available! https://github.com/splatlab/cqf thx @luizirber

@standage
Member

standage commented Apr 5, 2017

This paper popped up on Biorxiv the other day, linking to this technical report describing the CQF in detail.

@betatim
Member

betatim commented Apr 6, 2017

Some benchmarking of my own: https://gist.github.com/betatim/24886949ded51b90a3ca066c2a7f0439

I ran the benchmark with the process pinned to a CPU (and then varied which CPU it was pinned to). The idea is that pinning our (single-threaded) process to one CPU keeps it from being rescheduled onto a different socket whose caches contain nothing useful for us. Pinning is the best you can do, and it doesn't seem to change anything, so I'm unconvinced by the NUMA explanation (or I'm missing something).
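
For reference, pinning can be done with `taskset -c <cpu>` at launch or from inside the process; a minimal Linux-only sketch using `os.sched_setaffinity`:

```python
import os

# Pin this (single-threaded) process to one CPU so the scheduler can't
# migrate it to a core/socket with cold caches. Linux-only; equivalent
# to launching under `taskset -c <cpu>`.
cpu = min(os.sched_getaffinity(0))   # lowest CPU we're allowed to run on
os.sched_setaffinity(0, {cpu})
print("pinned to CPU", cpu, "- affinity now:", sorted(os.sched_getaffinity(0)))
```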

The last plot shows the runtime after subtracting the time to run the same benchmark on a file containing only 20 k-mers. This is an attempt to measure how long it takes to allocate and zero out the memory we request; for reasonably large allocations, that time grows linearly with size. So some of the linear slowdown is simply the time it takes to allocate the memory.
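
The allocate-and-zero cost is easy to time directly; a rough sketch (the sizes are arbitrary; `bytearray(n)` allocates n bytes and zero-fills them, roughly what creating an empty count table does before any k-mers are loaded):

```python
import time

# Time allocate-and-zero for increasingly large tables; the zero-fill
# has to touch every byte, so the cost grows linearly with size.
for size in (10_000_000, 100_000_000, 400_000_000):
    t0 = time.perf_counter()
    table = bytearray(size)
    dt = time.perf_counter() - t0
    print(f"{size / 1e6:6.0f} MB: {dt * 1000:7.1f} ms")
    del table
```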

@betatim
Member

betatim commented Apr 6, 2017

Edited the gist to add a few more points for the "delta" plot. I'd now claim that up to ~1.5GB it takes a constant amount of time to process the data once you subtract the 'startup' time.

@betatim
Member

betatim commented Apr 6, 2017

What happens if you change -N from 1 to 4?

[plot: benchmark runtimes with -N varied from 1 to 4]

Not quite sure what this tells us; it somehow measures the extra cost of computing N indices and accessing those extra bits of memory.
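
For intuition, here's a toy Count-Min-style table (not khmer's implementation; the seeded-crc32 hashing is my own choice for illustration) where the `-N`-like `n_tables` parameter directly multiplies both the hashes computed and the memory locations touched per k-mer:

```python
import zlib

class CountMin:
    """Toy Count-Min-style counting table: one hash and one memory
    access per table, so per-insert cost scales with n_tables."""

    def __init__(self, table_size, n_tables):
        self.table_size = table_size
        self.tables = [bytearray(table_size) for _ in range(n_tables)]

    def _indices(self, kmer):
        # one differently-seeded hash per table
        return [zlib.crc32(kmer.encode(), seed) % self.table_size
                for seed in range(len(self.tables))]

    def add(self, kmer):
        for table, idx in zip(self.tables, self._indices(kmer)):
            if table[idx] < 255:   # saturating 8-bit counter
                table[idx] += 1

    def get_count(self, kmer):
        return min(table[idx]
                   for table, idx in zip(self.tables, self._indices(kmer)))
```

With N tables, every insert computes N indices and dirties N (likely distinct) cache lines, which matches the extra cost the plot is picking up.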

@ctb
Member Author

ctb commented Apr 6, 2017

well, more N means more compute, so this all makes straightforward sense to me...

@luizirber
Member

Another point: https://fgiesen.wordpress.com/2017/04/11/memory-bandwidth/
But it doesn't help us much at all: we don't have extra compute to do while waiting for memory access...
