
@wilzbach
Contributor

As #4257 (exposing the extremum function) depends on @andralex, I thought I'd separate out the optimization for the no-op function case into its own PR. Here it goes. The idea is to avoid the unnecessary element copy and function call when no mapping is specified, e.g. from:

MapType mapElement = mapFun(r[i]);
if (selectorFun(mapElement, extremeElementMapped))
{
    extremeElement = r[i];
    extremeElementMapped = mapElement;
}

this PR reduces it down to:

if (selectorFun(r[i], extremeElement))
{
    extremeElement = r[i];
}

Moreover, a small goodie was added that checks whether the first template argument can be used as a mapping function (unary function); otherwise it falls back to treating it as the selector (binary function). This is useful for #4257.
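For illustration, the unary-vs-binary detection could be sketched like this (a hypothetical, simplified version - the PR's actual code may differ, and the sketch assumes a non-empty, sliceable random-access range):

```d
import std.functional : unaryFun, binaryFun;

// Does `pred` compile as a unary function over element type T?
// (is(typeof(...)) gags the speculative instantiation on failure.)
enum bool isUnaryOver(alias pred, T) = is(typeof(unaryFun!pred(T.init)));

auto extremum(alias pred = "a < b", Range)(Range r)
{
    alias E = typeof(r[0]);
    static if (isUnaryOver!(pred, E))
    {
        // pred is a mapping function: track both the element
        // and its mapped value, and compare the mapped values
        alias mapFun = unaryFun!pred;
        auto best = r[0];
        auto bestMapped = mapFun(best);
        foreach (e; r[1 .. $])
        {
            auto m = mapFun(e);
            if (m < bestMapped)
            {
                best = e;
                bestMapped = m;
            }
        }
        return best;
    }
    else
    {
        // pred is a binary selector: compare elements directly
        alias selectorFun = binaryFun!pred;
        auto best = r[0];
        foreach (e; r[1 .. $])
            if (selectorFun(e, best))
                best = e;
        return best;
    }
}

unittest
{
    assert([3, 1, 2].extremum == 1);            // default selector "a < b"
    assert([3, 1, 2].extremum!"a > b" == 3);    // binary selector
    assert([3, 1, 2].extremum!(a => -a) == 3);  // unary mapping
}
```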

@wilzbach
Contributor Author

  1. Assembler diff with dmd: http://www.mergely.com/966byvI1 (right is the current version; you can see that unaryFun is still included)
  2. Benchmarks (1 is the new function)

tl;dr it does make a huge difference!

dmd -release foo.d && ./foo
0 6 secs, 736 ms, 286 μs, and 1 hnsec
1 4 secs, 407 ms, 872 μs, and 2 hnsecs
./foo  11.16s user 0.00s system 100% cpu 11.155 total
dmd -release -O foo.d && ./foo
0 4 secs, 610 ms, 609 μs, and 5 hnsecs
1 3 secs, 769 ms, 89 μs, and 4 hnsecs
ldc -release foo.d && ./foo
0 5 secs, 444 ms, 236 μs, and 7 hnsecs
1 4 secs, 244 ms, 484 μs, and 4 hnsecs
ldc -release -O3 foo.d && ./foo
0 1 sec, 345 ms, and 914 μs
1 216 ms, 347 μs, and 1 hnsec

Source for benchmarks: http://sprunge.us/QSWP

@wilzbach
Contributor Author

@9il nope, it doesn't change anything. With optimization the existing variant is still 5x slower!

ldc -O -release -boundscheck=off foo.d && ./foo
0 1 sec, 345 ms, 145 μs, and 3 hnsecs
1 216 ms, 700 μs, and 2 hnsecs
ldc -O3 -release -boundscheck=off foo.d && ./foo
0 1 sec, 345 ms, 432 μs, and 1 hnsec
1 216 ms, 223 μs, and 1 hnsec

Furthermore, with ldc -O -release -boundscheck=off -output-s, unaryFun is still called.

(I tried with both ldc 0.17 and 1.0-beta)

@9il
Member

9il commented Apr 30, 2016

This is very strange and it is a significant issue. Maybe it is not inlined because this function exists in static Phobos?

@wilzbach
Contributor Author

This is very strange and it is a significant issue. Maybe it is not inlined because this function exists in static Phobos?

Just tried by copying over - same results :/

@9il
Member

9il commented Apr 30, 2016

Is the common lambda a => a the same?

@9il
Member

9il commented Apr 30, 2016

I am interested in sort performance ...

@9il
Member

9il commented Apr 30, 2016

Probably this is a Pandora's box :)

@wilzbach
Contributor Author

Is the common lambda a => a the same?

Yes :/

I am interested in sort performance ...

Luckily no measurable difference.

Probably this is a Pandora's box :)

I hope not - as far as I checked, inlining is done correctly when no intermediate value is saved. So maybe it might be even more efficient not to store the mapResult at all, even if the map isn't the identityFun?

@9il
Member

9il commented Apr 30, 2016

Looks like a DMD FE bug

@wilzbach
Contributor Author

maybe it might be even more efficient not to store the mapResult at all

Turns out to be even slower.

Looks like a DMD FE bug

So the huge difference is between using unaryFun and not using it.

@wilzbach
Contributor Author

@9il - can you have a look at this benchmark?
a & b seem to be a lot slower.

auto a(alias map = "a", R)(R r)
{
    import std.functional;
    alias mapFun = unaryFun!map;
    auto extremeElement = r[0];
    auto m = mapFun(r[0]);
    foreach (const i; 0 .. r.length)
    {
        auto k = mapFun(r[i]);
        if (k < m)
        {
            extremeElement = r[i];
            m = k;
        }
    }
    return extremeElement;
}

auto b(alias map = "a", R)(R r)
{
    import std.functional;
    alias mapFun = unaryFun!map;
    auto extremeElement = r[0];
    auto m = mapFun(r[0]);
    foreach (const i; 0 .. r.length)
    {
        if (mapFun(r[i]) < m)
        {
            extremeElement = r[i];
            m = mapFun(r[i]);
        }
    }
    return extremeElement;
}

auto c(alias map = "a", R)(R r)
{
    import std.functional;
    alias mapFun = unaryFun!map;
    auto extremeElement = r[0];
    foreach (const i; 0 .. r.length)
    {
        if (mapFun(r[i]) < extremeElement)
        {
            extremeElement = mapFun(r[i]);
        }
    }
    return extremeElement;
}

auto d(R)(R r)
{
    auto extremeElement = r[0];
    foreach (const i; 0 .. r.length)
    {
        if (r[i] < extremeElement)
        {
            extremeElement = r[i];
        }
    }
    return extremeElement;
}

void main() {
    import std.datetime: benchmark, Duration;
    import std.stdio: writeln;
    import std.array: array;
    import std.conv: to;
    import std.random: randomShuffle;
    import std.range: iota;
    auto arr = iota(100_000).array;
    arr.randomShuffle;
    auto i = 0;
    void f0(){ i += arr.a; }
    void f1(){ i += arr.b; }
    void f2(){ i += arr.c; }
    void f3(){ i += arr.d; }
    auto rs = benchmark!(f0, f1, f2, f3)(10_000);
    foreach (j, r; rs)
        writeln(j, " ", r.to!Duration);
    writeln(i);
}

and runtime:

> ldc2 -O3 -release -boundscheck=off foo.d && ./foo
0 1 sec, 747 ms, and 892 μs
1 1 sec, 748 ms, 739 μs, and 3 hnsecs
2 223 ms, 562 μs, and 4 hnsecs
3 227 ms, 73 μs, and 3 hnsecs

@wilzbach
Contributor Author

wilzbach commented May 1, 2016

btw if you run it with dmd everything is a lot slower, but funnily variant 3, which is the optimization proposed here for identity mapping functions, is still a lot faster.

0 2 secs, 121 ms, 660 μs, and 9 hnsecs
1 2 secs, 534 ms, and 509 μs
2 2 secs, 950 ms, 672 μs, and 6 hnsecs
3 844 ms, 454 μs, and 6 hnsecs

If we use a custom mapping function, variant c (=2), which has only one assignment in the loop but calls the mapping function twice per iteration, is faster:

0 1 sec, 747 ms, 555 μs, and 5 hnsecs
1 1 sec, 748 ms, 328 μs, and 9 hnsecs
2 1 sec, 480 ms, 504 μs, and 5 hnsecs

At least for dmd, the natural intuition that calling the map function twice should be more expensive holds:

0 11 secs, 722 ms, 248 μs, and 4 hnsecs
1 11 secs, 721 ms, 43 μs, and 7 hnsecs
2 14 secs, 227 ms, 179 μs, and 6 hnsecs

I used tuples to test with a non-trivial map function.

My summary: we should definitely add the optimization for the identity function (5x faster on ldc, 2x on dmd); whether we should call the mapping function twice is a more complicated question.

@wilzbach
Contributor Author

wilzbach commented May 6, 2016

@klickverbot could you have a short look at this?
Is it possible for ldc to optimize this?
Afaict the main problem is that having just one assignment in the loop is a lot faster, and neither ldc nor dmd can see that the second assignment isn't needed.

@wilzbach
Contributor Author

@9il do you already have an opinion on this?
Should the backend really be able to infer that these two loop bodies are equivalent for a no-op mapFun?

        // as in c: call mapFun twice, only one tracked value
        if (mapFun(r[i]) < extremeElement)
        {
            extremeElement = mapFun(r[i]);
        }

        // as in a: call mapFun once, track element and mapped value
        auto k = mapFun(r[i]);
        if (k < m)
        {
            extremeElement = r[i];
            m = k;
        }

@9il
Member

9il commented May 13, 2016

I think so

}
}

private struct IdentityFun {};
Contributor

@JackStouffer commented Jun 2, 2016


Please add a comment here describing why this exists.

@wilzbach force-pushed the optimize_extremum branch 2 times, most recently from a79f3c0 to 09ebd0f on June 4, 2016 14:28
@wilzbach
Contributor Author

wilzbach commented Jun 4, 2016

Can't you just test for "if map is a string and it == a".

Turns out my knowledge about D's dark magic has grown and yeah we can do that :)

Should the backend really be able to infer that these two loop statement are equal for a noop mapFun?

@klickverbot is such an optimization possible in the foreseeable future? Otherwise this adds an optimization for the quite popular identity case ([1,2].minElement) that is 2x faster for dmd and 5x faster for ldc without changing the API.

@andralex
Member

andralex commented Jun 4, 2016

Great work, thanks. I'd say we should just proceed with this PR even if implementations will get better in the future.

Tactically, I have wanted for a long time to get rid of the clunky specialization using string predicates. Fortunately, in this case it's easy: define two overloads:

private auto extremum(alias selector = "a < b", Range)(Range r)
    if (isInputRange!Range && !isInfinite!Range);
private auto extremum(alias map, alias selector = "a < b", Range)(Range r)
    if (isInputRange!Range && !isInfinite!Range);

Now there's no need to look at the predicate - if missing, it's identity. Would this work?
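A hypothetical sketch of what that dispatch could look like with bodies filled in (names and loop shape assumed here for illustration, not taken from the PR):

```d
import std.range.primitives : isInputRange, isInfinite;

// No map given: compare elements directly; no mapped copy is kept.
auto extremum(alias selector = "a < b", Range)(Range r)
    if (isInputRange!Range && !isInfinite!Range)
{
    import std.functional : binaryFun;
    alias selectorFun = binaryFun!selector;
    auto best = r.front;
    r.popFront();
    for (; !r.empty; r.popFront())
        if (selectorFun(r.front, best))
            best = r.front;
    return best;
}

// Map given explicitly: track the mapped value alongside the element.
auto extremum(alias map, alias selector = "a < b", Range)(Range r)
    if (isInputRange!Range && !isInfinite!Range)
{
    import std.functional : unaryFun, binaryFun;
    alias mapFun = unaryFun!map;
    alias selectorFun = binaryFun!selector;
    auto best = r.front;
    auto bestMapped = mapFun(best);
    r.popFront();
    for (; !r.empty; r.popFront())
    {
        auto m = mapFun(r.front);
        if (selectorFun(m, bestMapped))
        {
            best = r.front;
            bestMapped = m;
        }
    }
    return best;
}

unittest
{
    assert([2, 5, 1].extremum == 1);                  // no-map overload
    assert([2, 5, 1].extremum!("-a", "a < b") == 5);  // mapped overload
}
```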

@wilzbach
Contributor Author

wilzbach commented Jun 4, 2016

Now there's no need to look at the predicate - if missing, it's identity. Would this work?

Yes, but you would also need to add the overload for the seed variant, and then add the same overloads to minElement and maxElement too - is that acceptable?

private auto extremum(alias selector = "a < b", Range)(Range r)
    if (isInputRange!Range && !isInfinite!Range)

private auto extremum(alias map = "a", alias selector = "a < b", Range)(Range r)
    if (isInputRange!Range && !isInfinite!Range)

private auto extremum(alias selector = "a < b", Range,
                      RangeElementType = ElementType!Range)
                     (Range r, RangeElementType seedElement)
    if (isInputRange!Range && !isInfinite!Range &&
        !is(CommonType!(ElementType!Range, RangeElementType) == void))

private auto extremum(alias map = "a", alias selector = "a < b", Range,
                      RangeElementType = ElementType!Range)
                     (Range r, RangeElementType seedElement)
    if (isInputRange!Range && !isInfinite!Range &&
        !is(CommonType!(ElementType!Range, RangeElementType) == void))

@dnadlinger
Contributor

dnadlinger commented Jun 4, 2016

@wilzbach: The reason for the big difference in speed on LDC is that the c/d case is auto-vectorised. As an interesting aside, on a wider CPU the difference is thus even bigger:

0 1 sec, 180 ms, 949 μs, and 2 hnsecs
1 1 sec, 182 ms, 515 μs, and 9 hnsecs
2 91 ms, 457 μs, and 9 hnsecs
3 100 ms, 229 μs, and 9 hnsecs

The first two are partially unrolled, but for reasons not immediately obvious to me, the loop vectorisation pass doesn't kick in.

I think if you were to have a closer look at what's going on, you'd either find that there is a subtle difference in the IR for the two cases – for example concerning index overflow that we know will never occur, etc. –, or (and probably more likely) that the first just happens not to be detected by the pattern matcher in the loop vectoriser. This would definitely be a fun thing to investigate, but I'm a bit short on time right now.

@dnadlinger
Contributor

On another note, I'm rather concerned by the proposal to duplicate the otherwise identical implementation for indexable ranges, especially since this is not the first such change to pop up recently. There is an obvious argument to be made about maintainability, but of course in this case that could be dismissed by pointing out that the standard library is the one place where providing a variety of heavily optimised implementations is actually worth it (and even expected).

However, I think that we have a larger issue at hand here. Ranges are of prime strategic importance for D as an accessible way of composing complex operations out of reusable primitives. A large part of this comes from the fact that they are supposed to be zero-cost abstractions – after the compiler optimiser has had its way with a piece of code, we expect it to be equivalent to a hand-written loop. This is what turns ranges from a neat toy design into a production-ready feature for a performance-oriented language like D. (There are much more interesting and less error-prone designs to be had if performance was not of concern, cf. the eternal transient front discussion.)

If we now find ourselves compelled to duplicate a simple loop that simply iterates over a range one-by-one to use indices where available, the only possible conclusion is that the concept of a zero-cost generalisation has failed somewhere at a very fundamental level. This is not a case where specialising on a richer type allows us to make use of the additional capabilities; it's just restating the same operation in a different way.

In this specific instance, we might be able to paint over the underlying issue by manually adding a special case, but not only is this prohibitively expensive in terms of maintenance burden for many situations, it might also be plain impossible if a similar issue emerges in the composition of two separate higher-level primitives (think two nested range algorithms not being inlined properly into each other).

I just did a quick check of the range-based identity path against d, and thankfully, they seem to generate almost identical code on LDC (most importantly, both are auto-vectorised). But even in general – where similar cases might actually show a performance difference – I think we should be very cautious about adding such band-aid special cases. The effort spent to write, test and maintain them would be much better directed towards figuring out why the abstractions are not optimised away as they should be.

@wilzbach
Contributor Author

wilzbach commented Jun 4, 2016

However, I think that we have a larger issue at hand here. Ranges are of prime strategic importance for D as an accessible way of composing complex operations out of reusable primitives. A large part of this comes from the fact that they are supposed to be zero-cost abstractions – after the compiler optimiser has had its way with a piece of code, we expect it to be equivalent to a hand-written loop.

I absolutely agree, did a bit of benchmarking and opened a forum discussion:
http://forum.dlang.org/post/mqqkaquqxodqjiqzuzky@forum.dlang.org

The effort spent to write, test and maintain them would be much better directed towards figuring out why the abstractions are not optimised away as they should be.

Agreed, started the testing - see the thread ;-)

@dnadlinger
Contributor

(To clarify: I think we should probably go ahead and add the identity specialisation, but not the random-access one.)

import std.stdio;
assert([-2., 0, 2].extremum!`cmp(a, b) < 0` == -2.0);

// remember there is reduce too
Contributor


Huh, how is this related? This doesn't seem to be in a documented unit test.

@dnadlinger
Contributor

Apparently, there is a performance difference between indexed/primitive versions in the non-identity-map case even for LDC, so let's go with this for now.

@wilzbach force-pushed the optimize_extremum branch from 09ebd0f to f637d7b on June 4, 2016 23:39
@wilzbach
Contributor Author

wilzbach commented Jun 4, 2016

Huh, how is this related? This doesn't seem to be in a documented unit test.

Ouch, it was previously documented (before we decided extremum needs Andrei's approval as well) -> removed.

Apparently, there is a performance difference between indexed/primitive versions in the non-identity-map case even for LDC, so let's go with this for now.

We should really dig deeper into this anyhow!

Should we go with this version or with the overloads? As mentioned above we would need to add six overloads (2x extremum, 2x minElement, 2x maxElement).

@dnadlinger
Contributor

We should really dig deeper into this anyhow!

The LDC performance issue in the identity-mapped case is due to a bug in the LLVM optimizer (instcombine), see https://llvm.org/bugs/show_bug.cgi?id=28006.

As for which implementation to go for, I'll leave it up to @andralex since he brought up the issue. Overloads are conceptually a bit cleaner, but since we have the string literal syntax in place already, the string comparison is really just checking whether the map parameter has been set explicitly or not.

@wilzbach
Contributor Author

wilzbach commented Aug 28, 2016

Should we go with this version or with the overloads? As mentioned above we would need to add six overloads (2x extremum, 2x minElement, 2x maxElement).
As for which implementation to go for, I'll leave it up to @andralex since he brought up the issue.

AFAIK we are waiting here for feedback and not for work -> changed the labels.

@wilzbach
Contributor Author

Should we go with this version or with the overloads? As mentioned above we would need to add six overloads (2x extremum, 2x minElement, 2x maxElement).
As for which implementation to go for, I'll leave it up to @andralex since he brought up the issue. Overloads are conceptually a bit cleaner, but since we have the string literal syntax in place already, the string comparison is really just checking whether the map parameter has been set explicitly or not.

Ping @andralex - for convenience I listed both ways below:

String comparison (currently implemented in this PR)

private auto extremum(alias map = "a", alias selector = "a < b", Range)(Range r)
private auto extremum(alias map = "a", alias selector = "a < b", Range,
                       RangeElementType = ElementType!Range)
                      (Range r, RangeElementType seedElement)	

and then it's a simple check against the value of map

static if (isSomeString!(typeof(map)))
    enum isIdentity = map == "a";
else
    enum isIdentity = false;
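For illustration, a minimal sketch of how this flag could select the fast path (hypothetical names and structure, not the PR's actual diff; assumes a non-empty, sliceable random-access range):

```d
import std.traits : isSomeString;

auto extremum(alias map = "a", alias selector = "a < b", Range)(Range r)
{
    import std.functional : unaryFun, binaryFun;

    // identity detection via the string comparison described above
    static if (isSomeString!(typeof(map)))
        enum isIdentity = map == "a";
    else
        enum isIdentity = false;

    alias selectorFun = binaryFun!selector;
    auto extremeElement = r[0];

    static if (isIdentity)
    {
        // fast path: no mapped copy to maintain
        foreach (e; r[1 .. $])
            if (selectorFun(e, extremeElement))
                extremeElement = e;
    }
    else
    {
        alias mapFun = unaryFun!map;
        auto extremeMapped = mapFun(extremeElement);
        foreach (e; r[1 .. $])
        {
            auto m = mapFun(e);
            if (selectorFun(m, extremeMapped))
            {
                extremeElement = e;
                extremeMapped = m;
            }
        }
    }
    return extremeElement;
}

unittest
{
    assert([4, 1, 3].extremum == 1);          // identity fast path
    assert([4, 1, 3].extremum!"a % 3" == 3);  // mapped path: 3 % 3 == 0
}
```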

Overloads

The biggest disadvantage is the template explosion. As mentioned above we would need to add six overloads (2x extremum, 2x minElement, 2x maxElement).
So extremum would become this:

private auto extremum(alias selector = "a < b", Range)(Range r)
    if (isInputRange!Range && !isInfinite!Range)

private auto extremum(alias map = "a", alias selector = "a < b", Range)(Range r)
    if (isInputRange!Range && !isInfinite!Range)

private auto extremum(alias selector = "a < b", Range,
                      RangeElementType = ElementType!Range)
                     (Range r, RangeElementType seedElement)
    if (isInputRange!Range && !isInfinite!Range &&
        !is(CommonType!(ElementType!Range, RangeElementType) == void))

private auto extremum(alias map = "a", alias selector = "a < b", Range,
                      RangeElementType = ElementType!Range)
                     (Range r, RangeElementType seedElement)
    if (isInputRange!Range && !isInfinite!Range &&
        !is(CommonType!(ElementType!Range, RangeElementType) == void))


// check for identity ("a")
static if (isSomeString!(typeof(map)))
enum isIdentity = map == "a";
Copy link
Member


It would be nicer to specialize on no mapping function. Would that require a lot more code?

@wilzbach
Contributor Author

It would be nicer to specialize on no mapping function. Would that require a lot more code?

I opened it as a new PR so that it's easier to compare both approaches:

#5001

@andralex
Member

Let's go with #5001 - thanks!
