Optimize matrix functions, color conversion performance #183
Conversation
```diff
 } else {
-    ((x + from_f64(0.055)) / from_f64(1.055)).powf(from_f64(2.4))
+    ((x + from_f64(0.055)) * from_f64::<T>(1.055).recip()).powf(from_f64(2.4))
```
It seems like, in some cases where a variable is divided by a constant, this recip trick helps.
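Outside of palette's generics, the trick can be sketched with plain `f64` (the function names below are illustrative stand-ins, not palette's API):

```rust
// Illustrative stand-in for the change above: dividing by a constant vs.
// multiplying by its precomputed reciprocal.
fn srgb_to_linear_div(x: f64) -> f64 {
    ((x + 0.055) / 1.055).powf(2.4)
}

fn srgb_to_linear_recip(x: f64) -> f64 {
    // 1.055f64.recip() is 1.0 / 1.055, foldable at compile time.
    ((x + 0.055) * 1.055f64.recip()).powf(2.4)
}

fn main() {
    let x = 0.5;
    let a = srgb_to_linear_div(x);
    let b = srgb_to_linear_recip(x);
    // The two forms agree to within floating-point rounding.
    assert!((a - b).abs() < 1e-12);
}
```

The two forms are not bit-identical in general, since the reciprocal introduces one extra rounding step, but they agree to well within the precision that matters here.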
```diff
-    z: (c[6] * f.x) + (c[7] * f.y) + (c[8] * f.z),
+    x: x1 + x2 + x3,
+    y: y1 + y2 + y3,
+    z: z1 + z2 + z3,
```
I tried this with `multiply_xyz_to_rgb` directly below, and it consistently regressed while this one showed improvements over the previous version.
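The pattern in the diff above can be sketched as follows (a hypothetical stand-alone version, not palette's actual function): each product goes into a named temporary before the sums, instead of one long inline expression per component.

```rust
// Hypothetical sketch of the named-temporaries pattern discussed above.
type Mat3 = [f64; 9];
type Vec3 = [f64; 3];

fn multiply_3x3_vec(c: &Mat3, f: &Vec3) -> Vec3 {
    // Each product is bound to a temporary, then summed per row.
    let (x1, x2, x3) = (c[0] * f[0], c[1] * f[1], c[2] * f[2]);
    let (y1, y2, y3) = (c[3] * f[0], c[4] * f[1], c[5] * f[2]);
    let (z1, z2, z3) = (c[6] * f[0], c[7] * f[1], c[8] * f[2]);
    [x1 + x2 + x3, y1 + y2 + y3, z1 + z2 + z3]
}

fn main() {
    let identity: Mat3 = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0];
    let v: Vec3 = [0.25, 0.5, 0.75];
    // Multiplying by the identity matrix returns the vector unchanged.
    assert_eq!(multiply_3x3_vec(&identity, &v), v);
}
```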
```diff
-    out
+    [o0, o1, o2, o3, o4, o5, o6, o7, o8]
```
The destructuring seemed to help here. There's no need to zero out an array, since we never do anything with those initial values.
I think it's a nice stylistic improvement too. 👍
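A minimal sketch of the idea (function name and body are illustrative): bind the inputs by array pattern instead of indexing, and return an array literal instead of writing into a zero-initialized buffer.

```rust
// Sketch of the destructure-and-return-literal pattern discussed above.
fn matrix_times_scalar(m: &[f64; 9], s: f64) -> [f64; 9] {
    // Pattern-binding all nine elements; no indexing, no bounds checks.
    let [m0, m1, m2, m3, m4, m5, m6, m7, m8] = *m;
    // No `let mut out = [0.0; 9];` followed by nine stores: the zeroing
    // would be dead work that the array literal avoids by construction.
    [
        m0 * s, m1 * s, m2 * s,
        m3 * s, m4 * s, m5 * s,
        m6 * s, m7 * s, m8 * s,
    ]
}

fn main() {
    let m = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0];
    let doubled = matrix_times_scalar(&m, 2.0);
    assert_eq!(doubled[0], 2.0);
    assert_eq!(doubled[8], 18.0);
}
```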
```diff
     if !det.is_normal() {
         panic!("The given matrix is not invertible")
     }
+    let mut det = a[0] * d0 - a[1] * d1 + a[2] * d2;
```
The placement here is actually very important, which makes sense because it's right after d0, d1, d2. Putting it at the end of this section, just before the `is_normal` check, causes a massive slowdown.
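The ordering point can be illustrated with a trimmed-down determinant function (a sketch, not palette's exact code): the determinant is computed immediately after the three cofactors it reuses, rather than just before the validity check.

```rust
// Minimal sketch of the placement discussed above.
fn det3(a: &[f64; 9]) -> f64 {
    let d0 = a[4] * a[8] - a[5] * a[7];
    let d1 = a[3] * a[8] - a[5] * a[6];
    let d2 = a[3] * a[7] - a[4] * a[6];
    // Placed here, right after d0, d1, d2, per the comment above.
    let det = a[0] * d0 - a[1] * d1 + a[2] * d2;
    // ...in the full inverse, the remaining cofactors would follow here...
    if !det.is_normal() {
        panic!("The given matrix is not invertible")
    }
    det
}

fn main() {
    // det of diag(2, 3, 4) is 2 * 3 * 4 = 24.
    let a = [2.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 4.0];
    assert_eq!(det3(&a), 24.0);
}
```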
```rust
    t6 * s_matrix.red,
    t7 * s_matrix.green,
    t8 * s_matrix.blue,
]
```
I noticed an improvement for this; I'll have to see if it's just my machine or not.
```diff
-    let z = y - (color.b / from_f64(200.0));
+    let y = (color.l + from_f64(16.0)) * from_f64::<T>(116.0).recip();
+    let x = y + (color.a * from_f64::<T>(500.0).recip());
+    let z = y - (color.b * from_f64::<T>(200.0).recip());
```
Same concept as the Srgb-into-linear optimization. This had the most dramatic effect: it doubled the throughput.
```
Cie family/lab to xyz
    time:   [861.87 ns 864.43 ns 867.12 ns]
    thrpt:  [161.45 Melem/s 161.96 Melem/s 162.44 Melem/s]
change:
    time:   [-34.197% -33.915% -33.587%] (p = 0.00 < 0.05)
    thrpt:  [+50.573% +51.321% +51.968%]
```
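The changed pre-step can be sketched on plain `f64` values (the struct below is a stand-in for palette's `Lab` type; the constants are the standard CIE Lab ones):

```rust
// Sketch of the Lab -> XYZ pre-step from the diff above, with divisions
// replaced by multiplications with precomputed reciprocals.
struct Lab {
    l: f64,
    a: f64,
    b: f64,
}

fn lab_pre(color: &Lab) -> (f64, f64, f64) {
    let y = (color.l + 16.0) * (1.0 / 116.0);
    let x = y + color.a * (1.0 / 500.0);
    let z = y - color.b * (1.0 / 200.0);
    (x, y, z)
}

fn main() {
    // For L* = 100 with a* = b* = 0, all three intermediates are 116/116 = 1.
    let lab = Lab { l: 100.0, a: 0.0, b: 0.0 };
    let (x, y, z) = lab_pre(&lab);
    assert!((x - 1.0).abs() < 1e-12);
    assert!((y - 1.0).abs() < 1e-12);
    assert!((z - 1.0).abs() < 1e-12);
}
```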
It's interesting how these have an effect when it's essentially the same thing. I'm aware that division is usually relatively slow, so I would understand the case with the determinant, where the number of divisions is reduced. In this case you seem to add an instruction, but maybe there's a faster instruction for 1 / x specifically.
I was thinking maybe it helped the compiler see a better path for evaluating the values at compile time and pipelining instructions. In most other cases, trying to do this had no effect.
Compilers are strange beings.
This is quite interesting and I will have to come back to it with a more well-rested brain. Just a question, though: was there a significant difference between asserts and destructuring? If not, I prefer destructuring, to not even introduce the panic paths, even if they are likely just removed by the optimizer. And I think it reads nicer. 🙂
I started by focusing on the matrix multiplications. The asserts definitely showed improvements, and in some cases I could take it further. For the smaller improvements in the range of 2-3%, I thought they were worth including because I saw a consistent improvement when I put them in, and consistently lower performance with them removed, over a few runs.

A quick thing I was using to check the difference was the number of samples collected in the 5 seconds. When the numbers were substantially higher, it seemed like a change worth keeping, especially on the Rgb-to-linear path. I'd need you to double-check to make sure the numbers aren't only on my end. This does seem to bring some speed to the linear conversions, which were noticed as a slow path from Rgb to Xyz, and apparently a huge roadblock was removed from Lab to Xyz.
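The asserts-vs-destructuring question above can be made concrete with a small sketch (names are illustrative): a single length assertion up front lets the compiler elide the per-index bounds checks on a slice, while destructuring into a fixed-size array avoids the panic path entirely.

```rust
// One length check up front instead of one bounds check per index.
fn sum9_assert(a: &[f64]) -> f64 {
    assert!(a.len() > 8); // single check; indexing below needs no more
    a[0] + a[1] + a[2] + a[3] + a[4] + a[5] + a[6] + a[7] + a[8]
}

// Fixed-size array plus pattern binding: no bounds checks, no panic path.
fn sum9_destructure(a: &[f64; 9]) -> f64 {
    let [a0, a1, a2, a3, a4, a5, a6, a7, a8] = *a;
    a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8
}

fn main() {
    let a = [1.0; 9];
    assert_eq!(sum9_assert(&a), 9.0);
    assert_eq!(sum9_destructure(&a), 9.0);
}
```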
These were the benches with notable changes. Improvements:
I can't explain these, but I know benchmarking isn't perfect. Regressions:
It looks like the problem with the benchmark is here? https://github.com/matchai/criterion-compare-action/blob/master/entrypoint.js#L30
Ah, right, it may have been rate limited after all of the requests I caused. I'll run it again later. Those are some good improvements! Well done! Floats and assembly are not my strong side either, so a lot of it is a mystery to me. I will at least run it on my computer and see if I get the same results. It will, however, have to wait until at least tomorrow. Those regressions seem odd; something to investigate a bit more.
```rust
}

/// Invert a 3x3 matrix and panic if matrix is not invertible.
pub fn matrix_inverse<T: Float>(a: &Mat3<T>) -> Mat3<T> {
    assert!(a.len() > 8);
```
It's probably good to add a comment here to explain why it's different. The same in other places where there's a risk of someone coming by and "fixing" them. 🙂
I just came to think of the fact that it can sometimes help to add #[inline].
I noticed that too, but wasn't sure if I wanted to go down that rabbit hole; it seemed likely to blow up with the size of some of the functions. The TransferFn for Srgb is also missing an inline attribute. I tried playing with the convert `int_to_float` and `float_to_int` macros, but I'm not sure what more can be done in the current state. Any improvements seemed to cause regressions somewhere else when I played with inline.
It has been forgotten in many places, so don't assume it's for any particular reason. 😄 But just leave it as it is if there's no clear improvement.
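For reference, the attribute being discussed looks like this (the function body is just the standard sRGB decoding formula, used here as a plain example): without `#[inline]` or LTO, a small non-generic function like this is not inlined across crate boundaries.

```rust
// Illustration of the inline hint mentioned above.
#[inline]
pub fn srgb_to_linear(x: f64) -> f64 {
    // Standard sRGB electro-optical transfer function (decoding).
    if x <= 0.04045 {
        x / 12.92
    } else {
        ((x + 0.055) / 1.055).powf(2.4)
    }
}

fn main() {
    assert_eq!(srgb_to_linear(0.0), 0.0);
    // Decoded values sit below the encoded value in this range.
    assert!(srgb_to_linear(0.5) < 0.5);
}
```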
Chopping down some more time for two important functions, small impact on the others, and some are neutral. I think this is about the limit of what can be done for now.
I reran the master baseline since I was unsure of it before I started looking at the regressions. Compared to the current feature branch, the tests listed as regressions now see speedups.
Here are the results from my own run. No regressions and really good improvements! Improvements:

Unchanged or minor difference:
As for the automatic benchmarks, it seems to hit what's described in https://github.community/t5/GitHub-Actions/GitHub-actions-are-severely-limited-on-PRs/m-p/54669#M9249. That's very unfortunate, and there doesn't seem to be a good alternative that makes it available for forks.
That makes actions seem pretty pointless for a lot of purposes, then. Otherwise, it's great to see that the big wins were relatively consistent all along 🙂. Do you have any idea why Hsl to Hsv and Hwb to Hsv would have improvements? I'm really happy that the conversion to and from linear is so much faster, but those benches make little sense to me.
A bit, yes. Or at least less convenient. But there may be some way to do this without commenting on the PR. It can run everything else, so I'm thinking it could print the results in the logs instead. Not as nice, but almost there.
My CPU is only a bit newer (2014 it seems), but I got really stable results by not having anything heavier than the console and system monitor running.
I don't know about HSL to HSV, but for HWB to HSV I accidentally left RGB space conversion in there, meaning it pulls in RGB <-> XYZ code too. It's just a theory, but it could be why it's affected. I'll see what happens if I change that function, but then I have to close this browser to get the stable results, so I will post this comment first.
I was doing the same thing of having just the bare minimum open and getting stable results. I was surprised by how many outliers were in the results you posted, though, compared to what I get on my desktop and laptop. Just to make sure: you don't have to run the whole benchmark, you can type in a pattern that matches the name of the bench. It helps to iterate over changes faster that way. I think printing the log with the compare tool results would be the best course; we don't need the other capabilities as much.
I removed the RGB space conversion and got
compared to your changes, but it did also cause a somewhat large improvement to hsv to hsl, and a regression to hsl to hsv. 🤷‍♂️ Still compared to your change. To explain the rest, I think it needs much deeper analysis than just looking at the code and numbers. But the code tells me there is at least some room for improvement. They are pretty much implemented straight off of mathematical formulas, so they may be more elaborate than necessary. I don't know why I had so many outliers in some places; maybe they are extra sensitive? Or did I not reduce the interference enough? No idea. They didn't seem noisy when I ran master against master. I will have to come back to all of this later and look into the details.
That's encouraging to hear that there are still more improvements to be made. I don't really use any of the hue types other than Lch, and am more concerned with the time for srgb-linsrgb-xyz-lab, but at least we have a test bench now to measure things. I tried shuffling the ordering of some of the benchmarks around to see if the numbers would stay the same, and they mostly did, except for hwb to hsl and hsv to hsl. I guess the black box really isn't enough for those. Is there anything else I should do with the current code?
Hmm, the black box may be an issue. It's apparently not perfect on
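What the black box is guarding against can be sketched briefly: without it, the optimizer can constant-fold a benchmark body whose inputs are known at compile time. The example uses `std::hint::black_box` as an illustration; criterion provides its own equivalent.

```rust
// Sketch of keeping a benchmark body opaque to the optimizer.
use std::hint::black_box;

fn transfer(x: f64) -> f64 {
    ((x + 0.055) / 1.055).powf(2.4)
}

fn main() {
    // Opaque input and opaque result: with the input hidden behind
    // black_box, the call can't be folded to a compile-time constant,
    // and hiding the result keeps the call from being considered dead.
    let y = black_box(transfer(black_box(0.5)));
    assert!(y > 0.0 && y < 1.0);
}
```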
It's perhaps out of scope for this, but could you do me a favor and remove the RGB space conversion when going from HWB to HSV? They are supposed to have the same color spaces, like when converting from HSL. It's a small change in hsv.rs. I don't have anything else that needs to be fixed, so feel free to wrap it up. 🙂
- Destructure slices to avoid bounds checking and panic paths
- Improve Rgb into linear transfer function
- Improve Lab to Xyz
- Remove RgbSpace conversion for Hsv from Hwb
I removed the RGB space conversion from Hwb to Hsv. I glossed over that conversion and didn't think anything of it when I was looking at all of them. I agree that something with a bigger test load would be better, since it'd result in lower iteration counts. I didn't want to add any images in the first bench PR, but now it'd be fine. I'm not sure how you feel, but I personally think the images should be small enough that the benches can run in the ~5 seconds like they do now. I don't think I'm impatient, but it already feels really long running through all the benches and not being able to touch the computer during it.
Perfect, thank you! It bothered me after I discovered it. And thank you for finding all of these improvements! There are already a few images in the repo that can be used, but randomized data, mixed with the test data, may be a better challenge. I don't know enough about CPUs, but it wouldn't do to have it branch predict its way through it. It seems to group them as 100 samples and always run close to 5 seconds, so as long as it can run at least a few hundred iterations in 5 seconds, it's probably no issue. Let's hope the broken benchmark isn't going to block the merge, but I don't think it's required by default. bors r+
Build succeeded:
- Destructure slices to avoid bounds checking and panic path
- Add #[inline] attribute to matrix functions
- Improve Rgb into linear
- Improve Lab to Xyz
- Remove RgbSpace conversion from Hwb to Hsv