Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimizing the types used for intermediate calculations #477

Closed
kimikage opened this issue May 14, 2021 · 5 comments · Fixed by #506
Closed

Optimizing the types used for intermediate calculations #477

kimikage opened this issue May 14, 2021 · 5 comments · Fixed by #506

Comments

@kimikage
Copy link
Collaborator

Currently, many Float64-based calculations are used for color space conversions and color difference calculations, regardless of the input/output type. However, Float32 is accurate enough for practical use in many color and image related applications.

I intend to speed up RGB{N0f8}-->XYZ{Float32}-->Lab{Float32} conversions and colordiff.

Probably the breaking change is that the colordiff for RGB{N0f8} will return Float32 instead of Float64.

@kimikage
Copy link
Collaborator Author

BTW, I was wondering why the required accuracy of colordiff (DE_2000) was so low.

eps_cdiff = 0.01

And then I found two typos in the test data.

((60.2574, -34.0099, 36.2577), (60.4626, -34.1751, 39.4387), 1.2644),
((63.0109, -31.0961, -5.8663), (62.8187, -29.7946, -4.0864), 1.2530),

((60.2574, -34.0099, 36.2"6"77), (60.4626, -34.1751, 39.4387), 1.2644),
((63.0109, -31.0961, -5.8663), (62.8187, -29.7946, -4.0864), 1.2"6"30),

The fixes will allow us to minimize the tolerance (i.e. eps_cdiff = 0.00005).

@kimikage
Copy link
Collaborator Author

One of the most costly parts of the DE_2000 is the following t ( T ) calculation.

# hue weight
t = 1 - 0.17 * cosd(mh - 30) +
0.24 * cosd(2mh) +
0.32 * cosd(3mh + 6) -
0.20 * cosd(4mh - 63)
sh = 1 + 0.015 * mc * t

When t is computed with Float32, the accuracy is about 8 ULP (the average is about 7 ULP), even though it requires about 40 ns.
I think this is a polynomial approximation of the whole t, rather than optimizing cosd(::Float32), which would be more advantageous in terms of accuracy.
I don't have a good idea about the Float64 case yet.

@kimikage
Copy link
Collaborator Author

The accuracy bottleneck in calculating DE_2000 is in delta_h. delta_h involves the calculation of sin(Δh/2), where Δh is the angle between the ( a*, b*) vectors of the two colors.

Prior to PR #476, this calculation was done in LCHab space. However, the conversion to LCHab space introduces numerical errors in the hue (atand). (PR #495 could also improve the accuracy of DE_2000, but it is not a solution for thee delta_h accuracy.)

PR #476 changed the code to calculate delta_h without conversion to LCHab space. (FYI, the expression is used in CIE94.) However, even with the method, if Δh is small, cancellation of significant digits will occur. This is essentially due to the fact that the method uses 2 * sin(Δh/2)^2 = 1 - cos(Δh). Probably, it would be more accurate to calculate sin(Δh/2) from tan(Δh) without going through cos(Δh).

Another problem related to delta_h is the rounding error in modifying the a* component (i.e. scaling with (1 + g)). The difference between the a* components of the two colors is important for the delta_h calculation, but currently the mdification results are rounded independently for the two colors.

a_Lab_r = Lab(a_Lab.l, a_Lab.a * (1 + g), a_Lab.b)
b_Lab_r = Lab(b_Lab.l, b_Lab.a * (1 + g), b_Lab.b)

@kimikage
Copy link
Collaborator Author

kimikage commented Jul 1, 2021

The following is effective against the precompile problem in that it does not call exp, but it did not improve the speed, possibly due to stalling caused by subnormal numbers. 🤔

Edit: I added the workaround for subnormal numbers. The Float64 version will not be adopted because the speed improvement is not significant compared to the total DE_2000 costs.

const DE2000_SINEXP_F32 = [Float32/3 * exp(-i)) for i = 0.0:0.25:87.25]
@inline function _de2000_rot(mh::Float32)
    dh2 = ((mh - 275.0f0) * (1.0f0 / 25))^2
    di = reinterpret(UInt32, dh2 + Float32(0x3p20))
    i = di % UInt16 # round(UInt16, dh2 * 4.0)
    i >= UInt16(350) && return 0.0f0 # avoid subnormal numbers
    t = (reinterpret(Float32, di) - Float32(0x3p20)) - dh2 # |t| <= 0.125
    sinexp = @inbounds DE2000_SINEXP_F32[i + 1] # π/3 * exp(-dh2) = (π/3 * exp(-i/4)) * exp(t)
    em1 = @evalpoly(t, 1.0f0, 0.49999988f0, 0.16666684f0, 0.041693877f0, 0.008323605f0) * t
    ex = muladd(sinexp, em1, sinexp)
    ex < eps(0.5f0) && return ex
    sn = @evalpoly(ex^2, -0.16666667f0, 0.008333333f0, -0.00019841234f0, 2.7550889f-6, -2.4529042f-8)
    return muladd(sn * ex, ex^2, ex)
end

@kimikage
Copy link
Collaborator Author

kimikage commented Aug 2, 2021

There is still room for this kind of optimization in many places. However, in the interest of importance, I would like to close this issue with PR #506.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant