Memory allocation vs file load time: the cost of NUMA? #1665
The tl;dr is that if we do nothing but increase memory/table size by 10x, we slow down k-mer loading by a factor of 7x. See the comments at #1553 (comment) and a rundown of two possible solutions in #1553 (comment). Note that speeding up insertion will still not change the speed of retrieval, which should be subject to the same badness. The ultimate goal should probably be to use more cache-coherent data structures (see some comments in e.g. #1215 and #1198).
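For anyone skimming, here is a minimal sketch (not khmer's actual implementation) of why bigger Count-Min-style tables make insertion slower: every update touches one effectively random slot per table, so once the tables outgrow the CPU caches each insert pays several cache/TLB misses to main memory, and retrieval pays the same cost.

```python
# Toy Count-Min-style counter, for illustration only -- not khmer's code.
# Each update writes to n_tables pseudo-random slots; when table_size is
# large, almost every one of those writes is a cache miss.
import random

class ToyCountMin:
    def __init__(self, table_size, n_tables):
        self.table_size = table_size
        self.tables = [bytearray(table_size) for _ in range(n_tables)]
        # One independent hash seed per table.
        self.seeds = [random.randrange(2**61) for _ in range(n_tables)]

    def _slots(self, kmer):
        h = hash(kmer)
        return [(h ^ seed) % self.table_size for seed in self.seeds]

    def add(self, kmer):
        # n_tables scattered memory writes per k-mer -- the expensive part.
        for table, slot in zip(self.tables, self._slots(kmer)):
            if table[slot] < 255:
                table[slot] += 1

    def get(self, kmer):
        # Retrieval touches the same scattered slots, hence the same badness.
        return min(table[slot]
                   for table, slot in zip(self.tables, self._slots(kmer)))
```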
@rob-p has some cool stuff going on with efficient alternatives to (counting) Bloom filters; I wonder if he would be interested in solving our problems for us? Should he stumble across this comment, I would direct his attention to our newly factored out …
Ahh, I see @rob-p's cool stuff is available! https://github.com/splatlab/cqf thx @luizirber
This paper popped up on bioRxiv the other day, linking to this technical report describing the CQF in detail.
Some benchmarking of my own: https://gist.github.com/betatim/24886949ded51b90a3ca066c2a7f0439

I ran the benchmark with the process pinned to a CPU (and then varied which one it was pinned to). The thinking is that by pinning our (single-threaded) process to a particular CPU, we don't end up being rescheduled onto a different socket whose caches contain nothing useful for us. Pinning the process to a CPU is the best you can do, and it doesn't seem to change anything, so I am unconvinced by the NUMA explanation (or I am missing something).

The last plot shows the runtime after subtracting the time to run the same benchmark on a file containing only 20 k-mers. This is an attempt at measuring how long it takes to allocate and zero out the memory we request. For reasonable (read: large) amounts of memory this takes a linear amount of time, so some of the linear slowdown is simply the time needed to allocate the memory.
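For reference, a rough sketch of the pinning + "startup cost" measurement described above (Linux-only, stdlib; the buffer sizes and the use of a plain zeroed buffer instead of a real khmer table are assumptions for illustration):

```python
# Pin this process to one CPU, then time how long it takes just to allocate
# and fault in increasingly large buffers. Linux-only; sizes are illustrative.
import os
import time

def pin_to_cpu(cpu):
    # Keep the (single-threaded) process on one core / one NUMA node.
    os.sched_setaffinity(0, {cpu})

def time_allocation(nbytes):
    start = time.perf_counter()
    buf = bytearray(nbytes)        # allocate (zero-initialized)
    for i in range(0, nbytes, 4096):
        buf[i] = 1                 # touch every page so it is actually faulted in
    return time.perf_counter() - start

if __name__ == "__main__":
    pin_to_cpu(0)
    for mb in (64, 256, 1024, 4096):
        t = time_allocation(mb * 2**20)
        print(f"{mb:5d} MB: {t:.3f} s")
```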
Edited the gist to add a few more points for the "delta" plot. I'd now claim that up to ~1.5 GB it takes a constant amount of time to process the data once you subtract the 'startup' time.
well, more N means more compute, so this all makes straightforward sense to me...
Another point: https://fgiesen.wordpress.com/2017/04/11/memory-bandwidth/
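The point of that post (a single core rarely gets close to the machine's full memory bandwidth) can be sanity-checked with a quick streaming-copy microbenchmark; a sketch, assuming numpy is available:

```python
# Quick-and-dirty single-core bandwidth check, for illustration only.
import time
import numpy as np

nbytes = 1 << 30                      # 1 GiB working set
src = np.zeros(nbytes, dtype=np.uint8)
dst = np.empty_like(src)

start = time.perf_counter()
np.copyto(dst, src)                   # streaming copy: reads + writes ~2 GiB
elapsed = time.perf_counter() - start
print(f"effective bandwidth: {2 * nbytes / elapsed / 1e9:.1f} GB/s")
```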
In the vein of @betatim's I/O experiments in #1554, I did some benchmarking on my laptop of hash sizes vs. load times; results here:
https://gist.github.com/ctb/beb9a7eebf1a1282562ff0272efb4dd6
On my laptop, between 5 MB and 2 GB we get a nearly linear slowdown (m=0.7) in sequencefile => counttable loading.
This is likely due to some nasty combination of NUMA and cache (in)coherence in our Bloom filters/CountMin sketches.
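A sketch of how the slope quoted above could be estimated from (table size, load time) pairs; the numbers below are placeholders, not the gist's data:

```python
# Fit load time against table size on a log-log scale to estimate the slope m.
# The (size_mb, seconds) pairs are placeholders -- plug in the gist's measurements.
import numpy as np

measurements = [
    (5, 1.0),       # table size in MB, load time in seconds (placeholder values)
    (50, 5.0),
    (500, 25.0),
    (2000, 80.0),
]

sizes = np.array([s for s, _ in measurements], dtype=float)
times = np.array([t for _, t in measurements], dtype=float)

# time ~ size**m  =>  log(time) = m * log(size) + c
m, c = np.polyfit(np.log(sizes), np.log(times), 1)
print(f"estimated slope m = {m:.2f}")
```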