
@wilzbach
Contributor

As #4257 (exposing the extremum function) depends on @andralex, I thought I'd separate out the optimization for the no-op function case into its own PR. Here it goes. The idea is to avoid the unnecessary element copy and function call when no mapping is specified, e.g. from:

MapType mapElement = mapFun(r[i]);
if (selectorFun(mapElement, extremeElementMapped))
{
    extremeElement = r[i];
    extremeElementMapped = mapElement;
}

this PR reduces it down to:

if (selectorFun(r[i], extremeElement))
{
    extremeElement = r[i];
}

Moreover, a small goodie was added that checks whether the first template argument can be used as a mapping function (unary function); otherwise it falls back to treating it as the selector (binary function). This is useful for #4257.
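For illustration, the unary-vs-binary detection could be sketched like this (a hypothetical, simplified version - the PR's actual code may differ, and the sketch assumes a non-empty, sliceable random-access range):

```d
import std.functional : unaryFun, binaryFun;

// Does `pred` compile as a unary function over element type T?
// (is(typeof(...)) gags the speculative instantiation on failure.)
enum bool isUnaryOver(alias pred, T) = is(typeof(unaryFun!pred(T.init)));

auto extremum(alias pred = "a < b", Range)(Range r)
{
    alias E = typeof(r[0]);
    static if (isUnaryOver!(pred, E))
    {
        // pred is a mapping function: track both the element
        // and its mapped value, and compare the mapped values
        alias mapFun = unaryFun!pred;
        auto best = r[0];
        auto bestMapped = mapFun(best);
        foreach (e; r[1 .. $])
        {
            auto m = mapFun(e);
            if (m < bestMapped)
            {
                best = e;
                bestMapped = m;
            }
        }
        return best;
    }
    else
    {
        // pred is a binary selector: compare elements directly
        alias selectorFun = binaryFun!pred;
        auto best = r[0];
        foreach (e; r[1 .. $])
            if (selectorFun(e, best))
                best = e;
        return best;
    }
}

unittest
{
    assert([3, 1, 2].extremum == 1);            // default selector "a < b"
    assert([3, 1, 2].extremum!"a > b" == 3);    // binary selector
    assert([3, 1, 2].extremum!(a => -a) == 3);  // unary mapping
}
```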

@wilzbach
Contributor Author

  1. Assembler diff with dmd: http://www.mergely.com/966byvI1 (right is the current version; you can see that unaryFun is still included)
  2. Benchmarks (1 is the new function)

tl;dr it does make a huge difference!

dmd -release foo.d && ./foo
0 6 secs, 736 ms, 286 μs, and 1 hnsec
1 4 secs, 407 ms, 872 μs, and 2 hnsecs
./foo  11.16s user 0.00s system 100% cpu 11.155 total
dmd -release -O foo.d && ./foo
0 4 secs, 610 ms, 609 μs, and 5 hnsecs
1 3 secs, 769 ms, 89 μs, and 4 hnsecs
ldc -release foo.d && ./foo
0 5 secs, 444 ms, 236 μs, and 7 hnsecs
1 4 secs, 244 ms, 484 μs, and 4 hnsecs
ldc -release -O3 foo.d && ./foo
0 1 sec, 345 ms, and 914 μs
1 216 ms, 347 μs, and 1 hnsec

Source for benchmarks: http://sprunge.us/QSWP

@wilzbach
Contributor Author

@9il nope, it doesn't change anything. With optimization the existing variant is still 5x slower!

ldc -O -release -boundscheck=off foo.d && ./foo
0 1 sec, 345 ms, 145 μs, and 3 hnsecs
1 216 ms, 700 μs, and 2 hnsecs
ldc -O3 -release -boundscheck=off foo.d && ./foo
0 1 sec, 345 ms, 432 μs, and 1 hnsec
1 216 ms, 223 μs, and 1 hnsec

Furthermore, with ldc -O -release -boundscheck=off -output-s, unaryFun is still called.

(I tried with both ldc 0.17 and 1.0-beta)

@9il
Member

9il commented Apr 30, 2016

This is very strange and it is a significant issue. Maybe it is not inlined because this function exists in static Phobos?

@wilzbach
Contributor Author

This is very strange and it is a significant issue. Maybe it is not inlined because this function exists in static Phobos?

Just tried by copying over - same results :/

@9il
Member

9il commented Apr 30, 2016

Is the common lambda a => a the same?

@9il
Member

9il commented Apr 30, 2016

I am interested in sort performance ...

@9il
Member

9il commented Apr 30, 2016

Probably this is a Pandora's box :)

@wilzbach
Contributor Author

Is the common lambda a => a the same?

Yes :/

I am interested in sort performance ...

Luckily no measurable difference.

Probably this is a Pandora's box :)

I hope not - as far as I checked, inlining is done correctly when no intermediate value is saved. So maybe it might be even more efficient not to store the mapResult at all, even if the map isn't the identityFun?

@9il
Member

9il commented Apr 30, 2016

Looks like a DMD FE bug

@wilzbach
Contributor Author

maybe it might be even more efficient not to store the mapResult at all

Turns out to be even slower.

Looks like a DMD FE bug

So the huge difference is between using unaryFun and not using it.

@wilzbach
Contributor Author

@9il - can you have a look at this benchmark?
a & b seem to be a lot slower.

auto a(alias map = "a", R)(R r)
{
    import std.functional;
    alias mapFun = unaryFun!map;
    auto extremeElement = r[0];
    auto m = mapFun(r[0]);
    foreach (const i; 0 .. r.length)
    {
        auto k = mapFun(r[i]);
        if (k < m)
        {
            extremeElement = r[i];
            m = k;
        }
    }
    return extremeElement;
}

auto b(alias map = "a", R)(R r)
{
    import std.functional;
    alias mapFun = unaryFun!map;
    auto extremeElement = r[0];
    auto m = mapFun(r[0]);
    foreach (const i; 0 .. r.length)
    {
        if (mapFun(r[i]) < m)
        {
            extremeElement = r[i];
            m = mapFun(r[i]);
        }
    }
    return extremeElement;
}

auto c(alias map = "a", R)(R r)
{
    import std.functional;
    alias mapFun = unaryFun!map;
    auto extremeElement = r[0];
    foreach (const i; 0 .. r.length)
    {
        if (mapFun(r[i]) < extremeElement)
        {
            extremeElement = mapFun(r[i]);
        }
    }
    return extremeElement;
}

auto d(R)(R r)
{
    auto extremeElement = r[0];
    foreach (const i; 0 .. r.length)
    {
        if (r[i] < extremeElement)
        {
            extremeElement = r[i];
        }
    }
    return extremeElement;
}

void main() {
    import std.datetime: benchmark, Duration;
    import std.stdio: writeln;
    import std.array: array;
    import std.conv: to;
    import std.random: randomShuffle;
    import std.range: iota;
    auto arr = iota(100_000).array;
    arr.randomShuffle;
    auto i = 0;
    void f0(){ i += arr.a; }
    void f1(){ i += arr.b; }
    void f2(){ i += arr.c; }
    void f3(){ i += arr.d; }
    auto rs = benchmark!(f0, f1, f2, f3)(10_000);
    foreach (j, r; rs)
        writeln(j, " ", r.to!Duration);
    writeln(i);
}

and runtime:

> ldc2 -O3 -release -boundscheck=off foo.d && ./foo
0 1 sec, 747 ms, and 892 μs
1 1 sec, 748 ms, 739 μs, and 3 hnsecs
2 223 ms, 562 μs, and 4 hnsecs
3 227 ms, 73 μs, and 3 hnsecs

@wilzbach
Contributor Author

wilzbach commented May 1, 2016

btw if you run it with dmd everything is a lot slower, but funnily variant 3, which is the optimization proposed here for identity mapping functions, is still a lot faster.

0 2 secs, 121 ms, 660 μs, and 9 hnsecs
1 2 secs, 534 ms, and 509 μs
2 2 secs, 950 ms, 672 μs, and 6 hnsecs
3 844 ms, 454 μs, and 6 hnsecs

If we use a custom mapping function, variant c (=2), which has only one assignment in the loop but calls the mapping function twice per iteration, is faster:

0 1 sec, 747 ms, 555 μs, and 5 hnsecs
1 1 sec, 748 ms, 328 μs, and 9 hnsecs
2 1 sec, 480 ms, 504 μs, and 5 hnsecs

At least for dmd, the natural intuition that calling the map function twice should be more expensive holds:

0 11 secs, 722 ms, 248 μs, and 4 hnsecs
1 11 secs, 721 ms, 43 μs, and 7 hnsecs
2 14 secs, 227 ms, 179 μs, and 6 hnsecs

I used tuples to test with a non-trivial map function.

My summary: we should definitely add the optimization for the identity function (5x faster on ldc, 2x on dmd); whether we should call the mapping function twice is a more complicated question.

@wilzbach
Contributor Author

wilzbach commented May 6, 2016

@klickverbot could you have a short look at this?
Is it possible for ldc to optimize this?
Afaict the main problem is that having just one assignment in the loop is a lot faster, and neither ldc nor dmd can see that the second assignment isn't needed.

@wilzbach
Contributor Author

@9il do you already have an opinion on this?
Should the backend really be able to infer that these two loop bodies are equivalent for a no-op mapFun?

        // as in c: call mapFun twice, only one tracked value
        if (mapFun(r[i]) < extremeElement)
        {
            extremeElement = mapFun(r[i]);
        }

        // as in a: call mapFun once, track element and mapped value
        auto k = mapFun(r[i]);
        if (k < m)
        {
            extremeElement = r[i];
            m = k;
        }

@9il
Member

9il commented May 13, 2016

I think so

}
}

private struct IdentityFun {};
Contributor

@JackStouffer commented Jun 2, 2016


Please add a comment here describing why this exists.

@wilzbach force-pushed the optimize_extremum branch 2 times, most recently from a79f3c0 to 09ebd0f on June 4, 2016 14:28
@wilzbach
Contributor Author

wilzbach commented Jun 4, 2016

Can't you just test for "if map is a string and it == a".

Turns out my knowledge about D's dark magic has grown and yeah we can do that :)

Should the backend really be able to infer that these two loop statement are equal for a noop mapFun?

@klickverbot is such an optimization possible in the foreseeable future? Otherwise this adds an optimization for the quite popular identity case ([1,2].minElement) that is 2x faster for dmd and 5x faster for ldc without changing the API.

@andralex
Member

andralex commented Jun 4, 2016

Great work, thanks. I'd say we should just proceed with this PR even if implementations will get better in the future.

Tactically, I have wanted for a long time to get rid of the clunky specialization using string predicates. Fortunately, in this case it's easy: define two overloads:

private auto extremum(alias selector = "a < b", Range)(Range r)
    if (isInputRange!Range && !isInfinite!Range);
private auto extremum(alias map, alias selector = "a < b", Range)(Range r)
    if (isInputRange!Range && !isInfinite!Range);

Now there's no need to look at the predicate - if missing, it's identity. Would this work?
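A hypothetical sketch of what that dispatch could look like with bodies filled in (names and loop shape assumed here for illustration, not taken from the PR):

```d
import std.range.primitives : isInputRange, isInfinite;

// No map given: compare elements directly; no mapped copy is kept.
auto extremum(alias selector = "a < b", Range)(Range r)
    if (isInputRange!Range && !isInfinite!Range)
{
    import std.functional : binaryFun;
    alias selectorFun = binaryFun!selector;
    auto best = r.front;
    r.popFront();
    for (; !r.empty; r.popFront())
        if (selectorFun(r.front, best))
            best = r.front;
    return best;
}

// Map given explicitly: track the mapped value alongside the element.
auto extremum(alias map, alias selector = "a < b", Range)(Range r)
    if (isInputRange!Range && !isInfinite!Range)
{
    import std.functional : unaryFun, binaryFun;
    alias mapFun = unaryFun!map;
    alias selectorFun = binaryFun!selector;
    auto best = r.front;
    auto bestMapped = mapFun(best);
    r.popFront();
    for (; !r.empty; r.popFront())
    {
        auto m = mapFun(r.front);
        if (selectorFun(m, bestMapped))
        {
            best = r.front;
            bestMapped = m;
        }
    }
    return best;
}

unittest
{
    assert([2, 5, 1].extremum == 1);                  // no-map overload
    assert([2, 5, 1].extremum!("-a", "a < b") == 5);  // mapped overload
}
```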

@wilzbach
Contributor Author

wilzbach commented Jun 4, 2016

Now there's no need to look at the predicate - if missing, it's identity. Would this work?

Yes, but you would also need to add the overload for the seed variant, and then add the same overloads to minElement and maxElement too - is that acceptable?

private auto extremum(alias selector = "a < b", Range)(Range r)
    if (isInputRange!Range && !isInfinite!Range)

private auto extremum(alias map = "a", alias selector = "a < b", Range)(Range r)
    if (isInputRange!Range && !isInfinite!Range)

private auto extremum(alias selector = "a < b", Range,
                      RangeElementType = ElementType!Range)
                     (Range r, RangeElementType seedElement)
    if (isInputRange!Range && !isInfinite!Range &&
        !is(CommonType!(ElementType!Range, RangeElementType) == void))

private auto extremum(alias map = "a", alias selector = "a < b", Range,
                      RangeElementType = ElementType!Range)
                     (Range r, RangeElementType seedElement)
    if (isInputRange!Range && !isInfinite!Range &&
        !is(CommonType!(ElementType!Range, RangeElementType) == void))

@dnadlinger
Contributor

dnadlinger commented Jun 4, 2016

@wilzbach: The reason for the big difference in speed on LDC is that the c/d case is auto-vectorised. As an interesting aside, on a wider CPU the difference is thus even bigger:

0 1 sec, 180 ms, 949 μs, and 2 hnsecs
1 1 sec, 182 ms, 515 μs, and 9 hnsecs
2 91 ms, 457 μs, and 9 hnsecs
3 100 ms, 229 μs, and 9 hnsecs

The first two are partially unrolled, but for reasons not immediately obvious to me, the loop vectorisation pass doesn't kick in.

I think if you were to have a closer look at what's going on, you'd either find that there is a subtle difference in the IR for the two cases – for example concerning index overflow that we know will never occur, etc. –, or (and probably more likely) that the first just happens not to be detected by the pattern matcher in the loop vectoriser. This would definitely be a fun thing to investigate, but I'm a bit short on time right now.

@dnadlinger
Contributor

On another note, I'm rather concerned by the proposal to duplicate the otherwise identical implementation for indexable ranges, especially since this is not the first such change to pop up recently. There is an obvious argument to be made about maintainability, but of course in this case that could be dismissed by pointing out that the standard library is the one place where providing a variety of heavily optimised implementations is actually worth it (and even expected).

However, I think that we have a larger issue at hand here. Ranges are of prime strategic importance for D as an accessible way of composing complex operations out of reusable primitives. A large part of this comes from the fact that they are supposed to be zero-cost abstractions – after the compiler optimiser has had its way with a piece of code, we expect it to be equivalent to a hand-written loop. This is what turns ranges from a neat toy design into a production-ready feature for a performance-oriented language like D. (There are much more interesting and less error-prone designs to be had if performance was not of concern, cf. the eternal transient front discussion.)

If we now find ourselves compelled to duplicate a simple loop that simply iterates over a range one-by-one to use indices where available, the only possible conclusion is that the concept of a zero-cost generalisation has failed somewhere at a very fundamental level. This is not a case where specialising on a richer type allows us to make use of the additional capabilities; it's just restating the same operation in a different way.

In this specific instance, we might be able to paint over the underlying issue by manually adding a special case, but not only is this prohibitively expensive in terms of maintenance burden for many situations, it might also be plain impossible if a similar issue emerges in the composition of two separate higher-level primitives (think two nested range algorithms not being inlined properly into each other).

I just did a quick check of the range-based identity path against d, and thankfully, they seem to generate almost identical code on LDC (most importantly, both are auto-vectorised). But even in general – where similar cases might actually show a performance difference – I think we should be very cautious about adding such band-aid special cases. The effort spent to write, test and maintain them would be much better directed towards figuring out why the abstractions are not optimised away as they should be.

@wilzbach
Contributor Author

wilzbach commented Jun 4, 2016

However, I think that we have a larger issue at hand here. Ranges are of prime strategic importance for D as an accessible way of composing complex operations out of reusable primitives. A large part of this comes from the fact that they are supposed to be zero-cost abstractions – after the compiler optimiser has had its way with a piece of code, we expect it to be equivalent to a hand-written loop.

I absolutely agree, did a bit of benchmarking and opened a forum discussion:
http://forum.dlang.org/post/mqqkaquqxodqjiqzuzky@forum.dlang.org

The effort spent to write, test and maintain them would be much better directed towards figuring out why the abstractions are not optimised away as they should be.

Agreed, started the testing - see the thread ;-)

@dnadlinger
Contributor

(To clarify: I think we should probably go ahead and add the identity specialisation, but not the random-access one.)

import std.stdio;
assert([-2., 0, 2].extremum!`cmp(a, b) < 0` == -2.0);

// remember there is reduce too
Contributor


Huh, how is this related? This doesn't seem to be in a documented unit test.

@dnadlinger
Contributor

Apparently, there is a performance difference between indexed/primitive versions in the non-identity-map case even for LDC, so let's go with this for now.

@wilzbach force-pushed the optimize_extremum branch from 09ebd0f to f637d7b on June 4, 2016 23:39
@wilzbach
Contributor Author

wilzbach commented Jun 4, 2016

Huh, how is this related? This doesn't seem to be in a documented unit test.

Ouch, it was previously documented (before we decided extremum needs Andrei's approval as well) -> removed.

Apparently, there is a performance difference between indexed/primitive versions in the non-identity-map case even for LDC, so let's go with this for now.

We should really dig deeper into this anyhow!

Should we go with this version or with the overloads? As mentioned above we would need to add six overloads (2x extremum, 2x minElement, 2x maxElement).

@dnadlinger
Contributor

We should really dig deeper into this anyhow!

The LDC performance issue in the identity-mapped case is due to a bug in the LLVM optimizer (instcombine), see https://llvm.org/bugs/show_bug.cgi?id=28006.

As for which implementation to go for, I'll leave it up to @andralex since he brought up the issue. Overloads are conceptually a bit cleaner, but since we have the string literal syntax in place already, the string comparison is really just checking whether the map parameter has been set explicitly or not.

@wilzbach
Contributor Author

wilzbach commented Aug 28, 2016

Should we go with this version or with the overloads? As mentioned above we would need to add six overloads (2x extremum, 2x minElement, 2x maxElement).
As for which implementation to go for, I'll leave it up to @andralex since he brought up the issue.

AFAIK we are waiting here for feedback and not for work -> changed the labels.

@wilzbach
Contributor Author

Should we go with this version or with the overloads? As mentioned above we would need to add six overloads (2x extremum, 2x minElement, 2x maxElement).
As for which implementation to go for, I'll leave it up to @andralex since he brought up the issue. Overloads are conceptually a bit cleaner, but since we have the string literal syntax in place already, the string comparison is really just checking whether the map parameter has been set explicitly or not.

Ping @andralex - for convenience I listed both ways below:

String comparison (currently implemented in this PR)

private auto extremum(alias map = "a", alias selector = "a < b", Range)(Range r)
private auto extremum(alias map = "a", alias selector = "a < b", Range,
                       RangeElementType = ElementType!Range)
                      (Range r, RangeElementType seedElement)	

and then it's a simple check against the value of map

static if (isSomeString!(typeof(map)))
    enum isIdentity = map == "a";
else
    enum isIdentity = false;
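For illustration, a minimal sketch of how this flag could select the fast path (hypothetical names and structure, not the PR's actual diff; assumes a non-empty, sliceable random-access range):

```d
import std.traits : isSomeString;

auto extremum(alias map = "a", alias selector = "a < b", Range)(Range r)
{
    import std.functional : unaryFun, binaryFun;

    // identity detection via the string comparison described above
    static if (isSomeString!(typeof(map)))
        enum isIdentity = map == "a";
    else
        enum isIdentity = false;

    alias selectorFun = binaryFun!selector;
    auto extremeElement = r[0];

    static if (isIdentity)
    {
        // fast path: no mapped copy to maintain
        foreach (e; r[1 .. $])
            if (selectorFun(e, extremeElement))
                extremeElement = e;
    }
    else
    {
        alias mapFun = unaryFun!map;
        auto extremeMapped = mapFun(extremeElement);
        foreach (e; r[1 .. $])
        {
            auto m = mapFun(e);
            if (selectorFun(m, extremeMapped))
            {
                extremeElement = e;
                extremeMapped = m;
            }
        }
    }
    return extremeElement;
}

unittest
{
    assert([4, 1, 3].extremum == 1);          // identity fast path
    assert([4, 1, 3].extremum!"a % 3" == 3);  // mapped path: 3 % 3 == 0
}
```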

Overloads

The biggest disadvantage is the template explosion. As mentioned above we would need to add six overloads (2x extremum, 2x minElement, 2x maxElement).
So extremum would become this:

private auto extremum(alias selector = "a < b", Range)(Range r)
    if (isInputRange!Range && !isInfinite!Range)

private auto extremum(alias map = "a", alias selector = "a < b", Range)(Range r)
    if (isInputRange!Range && !isInfinite!Range)

private auto extremum(alias selector = "a < b", Range,
                      RangeElementType = ElementType!Range)
                     (Range r, RangeElementType seedElement)
    if (isInputRange!Range && !isInfinite!Range &&
        !is(CommonType!(ElementType!Range, RangeElementType) == void))

private auto extremum(alias map = "a", alias selector = "a < b", Range,
                      RangeElementType = ElementType!Range)
                     (Range r, RangeElementType seedElement)
    if (isInputRange!Range && !isInfinite!Range &&
        !is(CommonType!(ElementType!Range, RangeElementType) == void))


// check for identity ("a")
static if (isSomeString!(typeof(map)))
enum isIdentity = map == "a";
Copy link
Member


It would be nicer to specialize on no mapping function. Would that require a lot more code?

@wilzbach
Contributor Author

It would be nicer to specialize on no mapping function. Would that require a lot more code?

I opened it as a new PR so that it's easier to compare both approaches:

#5001

@andralex
Member

Let's go with #5001 - thanks!
