PERF, TODO: Ensure 128-bit alignment all rows/columns for SIMD friendliness #3146
I'm already almost 99% sure that arrays are not always 128-bit aligned at the start (I tried testing for it manually at runtime from Cython; requiring both input and output arrays to be 128-bit aligned basically eliminates the effect of my refinement, meaning they're almost never both aligned unless randomly so). I don't know if that's causing the entire performance/variability issue, but it can't be helping.
This sounds like it might involve some pain to apply throughout? Just getting some perf numbers on a toy example, measuring the … This really is turning into an ATLAS/BLAS style line of research.. :)
@y-p can probably just target a few key cases (block consolidation, DataFrame construction, etc.) to handle 90%...and with a helper function …
Making performant code often involves making the codebase more fragmented. Would this be something users can employ in their own code, so that if they …
@y-p I think just handling allocations that Python does internally is good enough...if someone passes us a non-aligned array then we just won't trigger the optimized case (detecting alignment at runtime is easy, just convert the pointer to an integer and mod 16)...to be honest, just handling the block consolidation case might do 75% of what we need in practice (but not for the test suite, since it almost never consolidates), and that's just one line of code.
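The runtime check described here (convert the pointer to an integer and mod 16) is easy to demonstrate from Python with NumPy. This is a minimal sketch for illustration; `is_aligned` is a hypothetical helper name, not pandas code:

```python
import numpy as np

def is_aligned(arr, boundary=16):
    # True if the array's data buffer starts on a `boundary`-byte address:
    # the "convert the pointer to an integer and mod 16" check from above.
    return arr.ctypes.data % boundary == 0

a = np.zeros(1000, dtype=np.float64)
# numpy makes no 16-byte promise, so this may print True or False
print(is_aligned(a))
```

This is the same test the optimized copy path would run before choosing the SIMD branch; when it returns False, the code simply falls back to the generic loop.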
Also, notice that on the low-end, I've gotten results with my SIMD-triggering small-array optimization like:
This happened more than once and on an overnight run with no competing processes, so it's probably not a fluke. The high end results:
make sense too, because it's repeatedly calling the vectorized code in a loop, but the vectorized code needs to check for alignment at the beginning and end of both input and output source arrays. If the stride is not a multiple of 128 bits, then the alignments will be random (since the indexing is random as well) and branch prediction will not only be defeated, you'll actually pay a significant penalty having to revert the mispredicted pipeline. (See "Why is processing a sorted array faster than an unsorted array?" on Stack Overflow.)
That looks really promising. |
@wesm any thoughts on the space overhead? (Again, the worst case would be at 288 bits per row, bumped up to 384 bits per row, all downhill from there. Guess we have to see how much performance improves at that size first, of course; just curious if you think it's enough overhead to be relevant.)
I suspect there might be a nice trade-off here, say you increase memory by 1.5x but get a 1.5x speed improvement
@jreback looks more like absolute-worst-case 33% (i.e. 1.33x) space for up to 125% (i.e. 2.25x) improvement (or possibly more, with further refinement); we'll see after more testing :)...we can always bump up the minimum size for alignment, but that means that anyone with fewer columns per block than the minimum gets short shrift even if they'd prefer the speed
then that's even better; wonder how that scales to lots of rows...
@y-p non-cached builds have the same variability, apparently, so your build cacher is vindicated :) (sorry for the bad guess!) |
if 288 -> 384 is ok, then 192 -> 256 should be ok too (same ratio)...so this allows a speedup anytime >= 3 int64/float64 values are stored together
The percent improvement should stay constant (or even get better, due to better branch prediction) as the number of rows increases; the real variable is the size of the copied row (not sure if that's what you meant...). Presumably it should improve the most with smaller rows and then gradually even out until you get to 256 bytes (which is where we start calling …)
@stephenwlin, do you have a direct way of checking the alignment of arrays? Or are you … I thought perhaps it would be worthwhile to repeat the tests with py33, just in case some …
@y-p I can check at runtime easily, it's just hard to get that information out as output from the test suite directly. Basically, I just turned off the optimization except in 128-bit-alignment cases, and almost all the performance changes (positive and negative) disappeared (because doubles are only 32-bit aligned on my machine by default, so two arrays that happen to both be 128-bit aligned are rare). Turning it off except in 64-bit-alignment cases made the effect larger than in the 128-bit-requirement case, but not as large as turning off the requirement entirely. (That's where I'm getting the results that vary from ~0.45 to ~3.4.) EDIT: I have definitely confirmed that numpy arrays are not 128-bit aligned by default
this works for allocating SIMD-friendly arrays: … output: … (It handles row/column alignment as long as it can do so within 4/3 of the original size, and it's better than the version above in that it uses the largest usable type for allocation rather than wasting a constant 15 bytes every time...i.e. if uint64 is usable, then the maximum penalty is 8 bytes, if uint32 then 12 bytes, if uint16 then 14 bytes...it even works with float96. EDIT: will now also fall back to smaller power-of-2 alignments if possible within 4/3 of the original size.) Can one of you on 64-bit run it and make sure all the assertions pass? I can't imagine that they wouldn't...
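The script and its output were lost in extraction, but the core over-allocate-and-pad idea can be sketched as follows. This simplified version (hypothetical `aligned_rows_empty` name) pads each row of a C-contiguous 2-D array out to a 16-byte stride; it does not implement the largest-usable-type or smaller-power-of-2 fallback refinements described above:

```python
import numpy as np

def aligned_rows_empty(nrows, ncols, dtype, boundary=16):
    dtype = np.dtype(dtype)
    row_bytes = ncols * dtype.itemsize
    # round each row's stride up to a multiple of the alignment boundary
    stride = -(-row_bytes // boundary) * boundary
    # over-allocate raw bytes so the start can be shifted onto the boundary
    raw = np.empty(nrows * stride + boundary - 1, dtype=np.uint8)
    offset = (-raw.ctypes.data) % boundary
    buf = raw[offset:offset + nrows * stride]
    # view as a wide 2-D array with padded rows, then hide the pad columns
    wide = buf.view(dtype).reshape(nrows, stride // dtype.itemsize)
    return wide[:, :ncols]

a = aligned_rows_empty(5, 3, np.float64)
print(a.strides)  # (32, 8): each 24-byte row is padded to 32 bytes
```

With 3 float64s per row this hits exactly the 4/3 worst case (24 bytes padded to 32), matching the ">= 3 int64/float64" threshold discussed above.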
fyi...this is how I set up a virtual 32-bit box; you can easily do the same for 64-bit. It has maintained a persistent ssh connection (I have not restarted it), and I've built many times (e.g. I just installed numpy 1.7.0 to test out a change the other day). This is not the exact link I used, but it should get you started: http://www.vagrantup.com/
Hmm, can you do a 64-bit virtual box within a 32-bit host?
I guess your machine doesn't support 'float96' (at least in 64-bit mode?). 96 doesn't divide evenly into 64 bits, so that makes sense. Anyway, I'll take a look when I get a 64-bit build either way; 'float96' is a corner case anyway, but I wanted to test a type whose size isn't a factor of 128 bits.
these are remote hosts, so you can find exactly what you want (arch and os)
oh ok...well, I might just dual boot, it won't be that hard
OK, something very screwy is going on with … I'll update this in a week or so after doing a lot of systematic runs (50 vb-suite overnights) with targeted changes; that should get to the bottom of this. EDIT: ok, just found this, testing a dummy commit (just changed a release notes file) against its immediate parent:
Something is definitely amiss, and it will make it hard to get clean performance numbers until we figure this out... Independent of the …
for the top 10 improvements in …
for the top 10 worst regressions, across 50 tests (0.84586 mean +- 0.17275 standard deviation)...these two tests specifically copy small arrays over and over, so they would be particularly alignment-sensitive. So I'm still betting on alignment as part of the issue: if we fix the alignment problem we ought to be able to lock in the improvements (modulo whatever random variation is coming from the unknown noise affecting …)
tweaked test_perf a bit; it should have less overhead, which is significant if you're running …
actually, just realized the geometric mean is more important here, since these are ratios (ratios skew the arithmetic mean higher, since the average of 1/x and x is at least 1 for positive x, with equality only at x = 1)
so it's an average of 17% improvement, +- 23% std deviation...if the deviation is due to a factor that can be controlled somehow, then the improvement could easily be greater than the current average.
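The arithmetic-vs-geometric point about ratios is easy to see with made-up numbers (a standalone sketch, not the actual vb_suite data):

```python
import math

# ratio of head time to baseline time: < 1.0 means an improvement
ratios = [0.5, 2.0]  # one test twice as fast, one twice as slow

arith = sum(ratios) / len(ratios)  # 1.25: looks like a net regression
geom = math.exp(sum(math.log(r) for r in ratios) / len(ratios))  # ~1.0: a wash
print(arith, geom)
```

The geometric mean treats a 2x speedup and a 2x slowdown as cancelling out, which is the right summary for timing ratios; the arithmetic mean would report a spurious 25% regression here.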
Here are performance results from C with SSE, 32-bit Ubuntu Linux with GCC 4.7.2, -O3 flag.
"count" is the number of doubles to be copied in a loop, "naive" is the equivalent of the simple cython loop (variable strides), "memmove" is a simple call to memmove (no threshold), "contig" is a loop that assumes contiguous copies, and "sse" is an aligned sse version; higher numbers are better. So SSE and alignment can definitely make a difference (probably even more so in other algorithms, this is just a basic example). EDIT: Updated the results above with a further refinement.
In [22]: pd.load("vb_suite/o.pickle")
Out[22]:
                               t_head  t_baseline     ratio
name
stats_rank2d_axis0_average  28.916097   28.847098  1.002392
frame_reindex_axis0          0.826338    0.814686  1.014302
@y-p thanks--I'm already scripted to use … to …
results are updated above; now small arrays w/ <=10 doubles improve 3x-5x, w/ <=64 doubles improve up to 2x, and big arrays seem to have a steady 1.36x improvement.
Wow! Here's what you get when you apply this to copying 32-bit values (i.e. float32/int32):
that's on a 32-bit arch
The performance of the optimized version for 64-bit types should be similar to the first results I posted before, since the SSE2 instructions are the same (also, I'm actually using a 64-bit processor anyway, just in 32-bit mode; I don't think pure SSE2 code is different depending on the mode). However, the baseline speed for 64-bit types is probably faster on 64-bit arch, so the speedup might not be as noticeable (i.e. the optimized version will be just as optimized, but the baseline it's compared against will not be as slow).
makes sense. the point of 64-bit sse is to do more operations at a time. the advantage we have (with your knowledge) is that we know things about the data that the general case cannot know and thus cannot handle
Here are performance numbers on x86-64. Instead of normalizing per row, I decided to print the raw times this time; however, I normalized the number of iterations to keep the total number of bytes copied as close to constant as possible, barring rounding... (learned a lot of fun facts about x86 performance tuning while doing this, by the way!) First, the results using …
And here's using …
Results for other data types are somewhere in between these two extremes. Interestingly, the naive version (basically what Cython does by default, with generic code handling variable strides) is basically just as performant as any of the optimized ones for large enough … Anyway, so this definitely does make a difference, just not in the case we happen to be exercising in vbench (…). In particular, since …
@stephenwlin any follow-up on this?
I'm sorry that @stephenwlin is no longer with us. I have created #33 to ensure we do aligned/padded allocations in pandas 2.0 where we are allocating our own memory.
Split off from #3114 ...
It doesn't seem (from my testing) that numpy guarantees 128-bit alignment at the start of an array (which is odd, since malloc usually does that no matter what, from my testing), and it certainly doesn't 128-bit align each row/column, since they're always contiguous. Apparently it can be done manually via a method like the one described here:
http://mail.scipy.org/pipermail/scipy-user/2009-March/020283.html
(This just aligns the start of the array, but if you overallocate more and play with strides you can do the same thing to every c-contig row or f-contig column)
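That recipe boils down to over-allocating by boundary-1 bytes and slicing the buffer at the first aligned address. A minimal NumPy sketch of start-of-array alignment only (the `empty_aligned` name is made up for illustration):

```python
import numpy as np

def empty_aligned(n, dtype=np.float64, boundary=16):
    # Over-allocate boundary-1 extra raw bytes, then slice the buffer at
    # the first boundary-aligned address and reinterpret as the target dtype.
    itemsize = np.dtype(dtype).itemsize
    raw = np.empty(n * itemsize + boundary - 1, dtype=np.uint8)
    offset = (-raw.ctypes.data) % boundary
    return raw[offset:offset + n * itemsize].view(dtype)

a = empty_aligned(17)
print(a.ctypes.data % 16)  # 0: the data pointer is 16-byte aligned
```

Note that the returned array keeps the raw uint8 buffer alive through its `base` attribute, so no extra bookkeeping is needed to prevent the memory from being freed.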
The downside is that this has to be done everywhere allocation currently happens...even with a helper function to do it, it's going to be somewhat messy.
Also, to avoid too much overhead, I think this should only be done when each row/column is at least 256 bits (32 bytes) naturally, so the absolute worst case space penalty is 33% + 96 bits constant (on a 32-bit machine) or 20% + 64 bits constant (on a 64-bit machine)...
We can bump that up a bit higher to 512 bits if that still seems like too much, but it means that DataFrames with 5-7 float64/int64 columns could be pathologically misaligned. (So basically, this won't help at all until at least 9 columns are present...)
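The worst-case figures above can be double-checked numerically: round each row's byte count up to a 16-byte multiple, restricted to rows of at least 32 bytes (256 bits), with 4-byte elements for a 32-bit build and 8-byte elements for a 64-bit build (a quick sketch, ignoring the small constant start-alignment overhead):

```python
def padded(row_bytes, boundary=16):
    # round a row's byte count up to the next multiple of the boundary
    return -(-row_bytes // boundary) * boundary

worst32 = max(padded(b) / b for b in range(32, 4096, 4))  # 48/36: +33%
worst64 = max(padded(b) / b for b in range(32, 4096, 8))  # 48/40: +20%
print(worst32, worst64)
```

The worst offenders are a 36-byte row (nine 32-bit values) padded to 48 bytes, and a 40-byte row (five 64-bit values) padded to 48 bytes, matching the 33% and 20% figures quoted above.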