Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster minimum/maximum/extrema #45581

Draft
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

mikmoore
Copy link
Contributor

@mikmoore mikmoore commented Jun 4, 2022

This PR uses a new method to greatly accelerate minimum/maximum/extrema for IEEEFloat values. The concept is to convert floats to signed integers that respect the same ordering that min/max do for floats. Since integers can be min/max'd much more quickly than floats, this leads to a considerable speedup. This is a minor extension to the concept used by isless since #39090. I used the changes here to slightly accelerate that version of isless as well.

Note that the old versions had special-cased _mapreduce_impl for min/max. The reckless application of these led to very poor performance for integer minimum/maximum. This implementation eliminates those specializations, improving integer performance as well.

Benchmark:

using BenchmarkTools
a = randn(1000);
b = zeros(1000);
c = @. ifelse(rand() < 0.1, missing, a);
j = rand(Int,1000);
@btime minimum($a); # "typical" case
@btime minimum($b); # pathological case for old version
@btime extrema($a);
@btime minimum($c); # with missing
@btime minimum($j); # integer performance
@btime searchsortedlast($a, x) setup=(x=randn()); # isless performance

Before:

  812.222 ns (0 allocations: 0 bytes)
  1.620 μs (0 allocations: 0 bytes)
  4.629 μs (0 allocations: 0 bytes)
  1.220 μs (0 allocations: 0 bytes)
  262.315 ns (0 allocations: 0 bytes)
  14.729 ns (0 allocations: 0 bytes)

This PR:

  173.764 ns (0 allocations: 0 bytes)
  173.721 ns (0 allocations: 0 bytes)
  278.276 ns (0 allocations: 0 bytes)
  954.839 ns (0 allocations: 0 bytes)
  92.453 ns (0 allocations: 0 bytes)
  11.311 ns (0 allocations: 0 bytes)

#43725: EDIT: never mind - it sounds like I didn't get that benchmark right

Results are similar for maximum.
This idea does not support values that may be Missing whereas the original specialization did. Interestingly, eliminating that specialization (none of the new code here is run with Missing) still led to a performance improvement.

For Float64, this PR makes minimum/maximum 4.7x faster (more in pathological case), extrema 16x faster, and isless 1.3x faster. Note that sort already has specializations for floats, which remains faster than the naive application of this updated isless. This also makes minimum/maximum 2.8x faster for Int.

The technical idea here is finished, but I could use some help on the architecture. The current version makes _mapreduce(f,op,...) check for whether a registered acceleration function exists for op for the type of f applied to the return type. It only takes effect if the return_type of f applied to the input is known to be a IEEEFloat. In the case that such a function exists, it wraps f in the accelerating transformation and applies an appropriate inverse transformation to the result. Even if the architectural idea is sound, I'm not sure I put the corresponding functions in reasonable places within the code.

@N5N3
Copy link
Member

N5N3 commented Jun 5, 2022

I'm affraid you didn't set up #43725 successfully. As all-zero input should not be a pain there.

This solution is good. The biggest concern might be it treats NaNs as ordered numbers:

julia> using QNaNs

julia> minimum([qnan(2),qnan(1)]) |> qnan
1
julia> minimum([qnan(1),qnan(2)]) |> qnan
1

I not sure whether @tkf would be happy with this. (This would generate inference-dependent result.)

If the core-devs are happy with this change.
I think we'd better do this optimization at mapreduce_impl level. This would helps solve the Currently breakage on our type-based reduce_empty.

Also, I think it would be good to also accelerate slow dimensional reduction. e.g. (minimum(a, dims = (1, 3))).
To achieve this, I think we'd better make floatorder_min return a float and define a new reduction function to handle them.

@N5N3 N5N3 requested a review from tkf June 5, 2022 02:11
@N5N3 N5N3 added the performance Must go faster label Jun 5, 2022
@mikmoore
Copy link
Contributor Author

mikmoore commented Jun 5, 2022

Thanks for the feedback.

I imagined I might have had that benchmarking wrong for that other PR. I removed the comparison from my earlier post.

You are correct that this version of minimum would differ from a nonspecializing reduce(min,...) in its preference among multiple NaNs. This may be objectionable.

I've fixed the breaking-empty-collections issue, although it still looks brittle to me.

I agree that this should also apply to dimensional reductions. I'm not sure I understand what you mean by "make floatorder_min return a float". While it's totally possible to define a pairwise min function based on this method, using it in a reduction involves redundant transformations to the integer space and so carries a significant performance penalty (such that other methods tend to perform better). Can you elaborate a little?
For these reductions, the representations are bit-compatible so I might imagine doing the dimensional reduction into a reinterpret(Int64,...) array, then applying the inverse-transforms in-place and reinterpreting as Float64. Although that approach feels brittle, too.

The perfect world would involve a float min function that would vectorize under reduce -- none of this specialization would be needed at all. Unfortunately, I don't know compilers well enough to win that fight. My tinkering has been unsuccessful.

@nalimilan
Copy link
Member

Radix sort uses a similar approach:

julia/base/sort.jl

Lines 1434 to 1444 in bd8dbc3

uint_map(x::Float32, ::Left) = ~reinterpret(UInt32, x)
uint_unmap(::Type{Float32}, u::UInt32, ::Left) = reinterpret(Float32, ~u)
uint_map(x::Float32, ::Right) = reinterpret(UInt32, x)
uint_unmap(::Type{Float32}, u::UInt32, ::Right) = reinterpret(Float32, u)
UIntMappable(::Type{Float32}, ::Union{Left, Right}) = UInt32
uint_map(x::Float64, ::Left) = ~reinterpret(UInt64, x)
uint_unmap(::Type{Float64}, u::UInt64, ::Left) = reinterpret(Float64, ~u)
uint_map(x::Float64, ::Right) = reinterpret(UInt64, x)
uint_unmap(::Type{Float64}, u::UInt64, ::Right) = reinterpret(Float64, u)
UIntMappable(::Type{Float64}, ::Union{Left, Right}) = UInt64

Cc: @LilithHafner

Copy link
Member

@LilithHafner LilithHafner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to re-use from Sort.jl, but I believe the specifications here are sufficiently different with respect to handling NaNs that I don't think re-use is viable.

base/reduce.jl Outdated Show resolved Hide resolved
base/float.jl Show resolved Hide resolved
base/reduce.jl Outdated Show resolved Hide resolved
base/reduce.jl Outdated Show resolved Hide resolved
Co-authored-by: Lilith Orion Hafner <60898866+LilithHafner@users.noreply.github.com>
@mikmoore
Copy link
Contributor Author

mikmoore commented Jun 6, 2022

I'm increasingly unhappy with how I've bolted this together. Hijacking the f function to do the transformation seems a little fragile. While the implementation here isn't too complicated, I'm not too excited to write the dimensional version (for, eg, minimum(...;dims=...)). I'm open to other ideas if people can think of something more composable.

@N5N3
Copy link
Member

N5N3 commented Jun 6, 2022

@mikmoore f85099c is a very primitive extension for slow dimensional reduction.
I removed the acceleration for foldl branch, as I believe we should inplement a general fast mapreduce_impl for non-linear reduction.

@mikmoore
Copy link
Contributor Author

mikmoore commented Jun 6, 2022

@mikmoore f85099c is a very primitive extension for slow dimensional reduction.

I like that approach much better! I'll incorporate it when I have some time.

@mcabbott
Copy link
Contributor

mcabbott commented Jan 6, 2023

I didn't manage to build this branch, but doing @eval Base ... to load its new functions into master, I see ... [probably just a mistake]

the following problems: ``` julia> extrema([1,2,Inf]) (1.0, 3.999999999999999)

julia> extrema([1,2,NaN])
(NaN, -1.3482698511467371e308)

These don't occur with `maximum`. They do occur with Float32. 

@N5N3
Copy link
Member

N5N3 commented Jan 6, 2023

How did you set it?
My local build has pulled in a further version of this PR. And it has no such problem.

julia> extrema([1,2,Inf])
(1.0, Inf)

julia> extrema([1,2,NaN])
(NaN, NaN)

julia> extrema([1,2,NaN32])
(NaN32, NaN32)

julia> extrema([1,2,Inf32])
(1.0f0, Inf32)

@N5N3
Copy link
Member

N5N3 commented Jan 6, 2023

@mikmoore would you mind me push f85099c here if you have no time to finish this.
Since triage said we dont care about which NaN to return, I think it would be good to finish this PR and see whether arm platforms really need it?

@mikmoore
Copy link
Contributor Author

mikmoore commented Jan 6, 2023

Sorry. I did spend some time a bit ago incorporating the sort of reorganization you suggested. I pushed that just now but wouldn't be upset if you integrated parts or all of your changes instead.

I spent some more time on this over the holidays but have been stuck on minimum!/maximum!. It's been a week or two, but as I recall they're causing allocations and are not as fast as they should be -- probably something to do with not specializing function arguments but I've been unsuccessfully in my attempts to fix it. Until that gets fixed, I don't think this can be finished. Feel free to take a shot at that as well, if you like.

Here is the set I've been using to test this:

using BenchmarkTools
a = randn(1000);
b = @. ifelse(a <= 0, zero(eltype(a)), a); # convert all negatives to +0.0
d = @. ifelse(rand() < 0.1, missing, a); # convert some values to missing
m = reshape(a,20,50);
r = Matrix{Float64}(undef,size(m,1),1); c = Matrix{Float64}(undef,1,size(m,2));
j = rand(Int,1000);
@btime minimum($a); # "typical" case
@btime minimum($b); # pathological case for old version
@btime minimum($m;dims=1);
@btime minimum($m;dims=2);
@btime minimum!($c,$m);
@btime minimum!($r,$m);
@btime map!(minimum,$c,eachcol($m)); # this appears to be faster than the previous but shouldn't be
@btime map!(minimum,$r,eachrow($m)); # this appears to be faster than the previous but shouldn't be
@btime extrema($a);
@btime minimum($d); # with missing
@btime minimum($j); # integer performance
@btime searchsortedlast($a, x) setup=(x=randn()); # isless performance

@mikmoore mikmoore marked this pull request as draft January 6, 2023 05:59
@mcabbott mcabbott mentioned this pull request Jan 6, 2023
@N5N3
Copy link
Member

N5N3 commented Jan 6, 2023

have been stuck on minimum!/maximum!.

I'm afraid your change didn't touch the minimum!/maximum!. As they call mapreducedim! directly.
And I think mapreducedim! wont like _makefast_mapreduction as there's no guarantee that the dest array has the same eltype as f.(A).
(And only the dim1 reduction would benefit from the optimization)

N5N3 added 2 commits January 6, 2023 15:42
It helps to avoid extra allocation (caused by non-specialization)
The accuracy won't be better.
@N5N3 N5N3 force-pushed the floatminimummaximum branch from 6077812 to 068082a Compare January 6, 2023 07:50
@N5N3
Copy link
Member

N5N3 commented Jan 6, 2023

With 413b739 I have

julia> @btime minimum!($c,$m);
  861.905 ns (0 allocations: 0 bytes)

julia> @btime minimum!($r,$m);
  450.000 ns (0 allocations: 0 bytes)

julia> @btime map!(minimum,$c,eachcol($m));
  1.220 μs (0 allocations: 0 bytes)

julia> @btime map!(minimum,$r,eachrow($m));
  915.789 ns (0 allocations: 0 bytes)

So I think the original concern get resolved.

Copy link
Member

@LilithHafner LilithHafner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered using dispatch and a lazy map instead of hooks? i.e.

mapreduce_impl(f, op, A::AbstractArrayOrBroadcasted, [...]) =
    mapreduce_impl(_mapped_eltype(f, A), f, op, A, [...])
mapreduce_impl(_, f, op, A::AbstractArrayOrBroadcasted, [...]) =
    # existing implementation
for op in (min, max)
    @eval mapreduce_impl(::IEEEFloat, f, typeof($op), A::AbstractArrayOrBroadcasted, [...]) =
        postprocess(mapreduce_impl(f, fast_op, Iterators.map(preprocess, A), [...]))
end

base/reduce.jl Outdated Show resolved Hide resolved
base/reduce.jl Outdated Show resolved Hide resolved
base/float.jl Show resolved Hide resolved
Co-Authored-By: Lilith Orion Hafner <60898866+LilithHafner@users.noreply.github.com>
@N5N3
Copy link
Member

N5N3 commented Jan 6, 2023

Iterators.map would return a Generator and fallback to mapfoldl. I'm not sure we want to add a similar hook there.

@LilithHafner
Copy link
Member

I mistakenly assumed that Iterators.map(::AbstractArray)::AbstractArray 💔

@LilithHafner
Copy link
Member

What about

for op in (min, max)
    @eval mapreduce_impl(::IEEEFloat, f, typeof($op), A::AbstractArrayOrBroadcasted, [...]) =
        postprocess(mapreduce_impl(preprocess  f, fast_op, A, [...]))
end

@N5N3
Copy link
Member

N5N3 commented Jan 6, 2023

@mikmoore tried this at _mapreduce level (see 81a0ff5).
But this approach seems not work well with mapreducedim! (as we only want to activate the optimization for dim1 reduction.)
For me it makes sense to keep mapreducedim! and mapreduce_impl share a similar code style.

@LilithHafner
Copy link
Member

That makes sense. Thanks for humoring me. It's frustrating how complex the mapreduce implementation is.

end
R[i1,IR] = r
R[i1,IR] = post(r)
end
else
@inbounds for IA in CartesianIndices(indsAt)
Copy link
Contributor Author

@mikmoore mikmoore Jan 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be possible to apply the same series of transformations when reducing along non-first dimensions? Or am I missing where that has already been handled?

EDIT: maybe now I see. This was your comment about the destination array possibly being of a different type. Could this still be done if it has a same-size type (or at least the exact same type) so that they are reinterpret-able?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. And all the optimizations here are reinterpret-able.

Copy link
Member

@N5N3 N5N3 Jan 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think non-dim1 cases would be accelerated a lot as we will have to add a map!(pre, dest, dest) berfore the kernal loop, and a map!(post, dest, dest) after it.
reinterpret might be 0-cost, but flipneg+add offset+mem access not.

And non dim1 minimum/maximum is vectorlized quite well

julia> A = randn(512, 512);

julia> @btime minimum($A, dims = 1);
  65.400 μs (7 allocations: 4.33 KiB)

julia> @btime minimum($A, dims = 2);
  56.500 μs (7 allocations: 4.33 KiB)

On my PC it's faster than dim1 with optimizaiton.

Copy link
Member

@N5N3 N5N3 Jan 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A simple bench

f(b, c) = begin
    pre, op, post = Base._makefast_reducution(min, Float64)
    Base.initarray!(b, identity, min, true, c)
    map!(pre, b, b)
    Base.mapreducedim!(pre, op, b, c);
    map!(post, b, b)
end

A = randn(512, 512); B = randn(512);
----------------------------------------
julia> @btime f($B, $A);
  58.500 μs (0 allocations: 0 bytes)

julia> @btime minimum!($B, $A);
  57.700 μs (0 allocations: 0 bytes)

EDIT: profile shows that the overhead of map!(pre, b, b) & map!(post, b, b) is neglible is this case.

@dpinol
Copy link

dpinol commented Oct 9, 2024

Hi,
any chance this PR could be moved on?
thanks

@mikmoore
Copy link
Contributor Author

mikmoore commented Oct 9, 2024

I'm increasingly un-enamored with this PR. However, I still think that the current specializations should still be removed (or at least limited only to IEEEFloat) and the isless improvement seems fine.

But the rest of it is a lot of code and complication (and maintenance) for some modest improvements. Further, I suspect that this would be a regression on AMD systems where the desired min/max semantics are available as native instructions.

Currently, the reduce family of functions are just too complicated to make specializations like this palatable. I suspect that's why those of us that worked on this eventually burned out. And there's further work happening on those families (universal pairwise reduce, changes to init semantics) that may some adjustment here.

But feel free to pick this up and get it over the line yourself, if you like. It would be easier (almost done, probably?) if you abandoned dimensional reduction and merely aimed to improve the whole-collection case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants