-
-
Notifications
You must be signed in to change notification settings - Fork 748
std.algorithm.searching: no mapping-specialization for extremum #4265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a39e5b1 to
ab9f35d
Compare
tl;dr it does make a huge difference! Source for benchmarks: http://sprunge.us/QSWP |
|
@9il nope doesn't change anything. with optimization the difference the existing variant is 5x slower! Furthermore with (I tried with both ldc 0.17 and 1.0-beta) |
|
This is very strange and it is significant issue. Maybe it is not inlined because this function exist in static phobos? |
Just tried by copying over - same results :/ |
|
common lambda |
|
I am interested in |
|
probably this is pandora door) |
Yes :/
Luckily no measurable difference.
I don't hope so, as far as I checked inlining is done correctly when no intermediate value is saved. So maybe it might be even more efficient not to store the mapResult at all - even if it's not the identitiyFun? |
|
Looks like DMD FE bug |
Turns out to be even slower.
So the huge difference is between using unaryFun and not using it. |
|
@9il - can you have a look at this benchmark? and runtime: |
|
btw if you run it with dmd it's a lot slower, but funnily 3) which is the proposed optimization here (3) for identity mapping functions is still a lot faster. If we use a custom mapping function, variant c(=2) which only has one assigment in the loop, but calls the mapping function two times in each iteration is faster At least for dmd, the natural feeling that calling the map function twice should be more expensive is true: I used tuples to check the map function. My summary: we should definitely get the optimization for the identity function (5x ldc, 2x dmd), whether we should call the mapping function twice is more complicated. |
|
@klickverbot could you have a short look at this? |
|
@9il do you already have an opinion on this? |
|
i think so |
std/algorithm/searching.d
Outdated
| } | ||
| } | ||
|
|
||
| private struct IdentityFun {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment here describing why this exists.
a79f3c0 to
09ebd0f
Compare
Turns out my knowledge about D's dark magic has grown and yeah we can do that :)
@klickverbot is such an optimization possible in the foreseeable future? Otherwise this adds an optimization for the quite popular identity case ( |
|
Great work, thanks. I'd say we should just proceed with this PR even if implementations will get better in the future. Tactically, I wanted for a long time to get rid of the clunky specialization using string predicates. Fortunately in this case it's easy: define two overloads: private auto extremum(alias selector = "a < b", Range)(Range r)
if (isInputRange!Range && !isInfinite!Range);
private auto extremum(alias map, alias selector = "a < b", Range)(Range r)
if (isInputRange!Range && !isInfinite!Range);Now there's no need to look at the predicate - if missing, it's identity. Would this work? |
Yes, but you would also need to add the overload for the seed and then for these same overloads to private auto extremum(alias selector = "a < b", Range)(Range r)
if (isInputRange!Range && !isInfinite!Range)
private auto extremum(alias map = "a", alias selector = "a < b", Range)(Range r)
if (isInputRange!Range && !isInfinite!Range)
private auto extremum(alias selector = "a < b", Range,
RangeElementType = ElementType!Range)
(Range r, RangeElementType seedElement)
if (isInputRange!Range && !isInfinite!Range &&
!is(CommonType!(ElementType!Range, RangeElementType) == void))
private auto extremum(alias map = "a", alias selector = "a < b", Range,
RangeElementType = ElementType!Range)
(Range r, RangeElementType seedElement)
if (isInputRange!Range && !isInfinite!Range &&
!is(CommonType!(ElementType!Range, RangeElementType) == void)) |
|
@wilzbach: The reason for the big difference in speed on LDC is that the c/d case is auto-vectorised. As an interesting aside, on a wider CPU the difference is thus even bigger: The first two are partially unrolled, but for reasons not immediately obvious to me, the loop vectorisation pass doesn't kick in. I think if you were to have a closer look at what's going on, you'd either find that there is a subtle difference in the IR for the two cases – for example concerning index overflow that we know will never occur, etc. –, or (and probably more likely) that the first just happens not to be detected by the pattern matcher in the loop vectoriser. This would definitely be a fun thing to investigate, but I'm a bit short on time right now. |
|
On another note, I'm rather concerned by the proposal to duplicate the otherwise identical implementation for indexable ranges, especially since this is not the first such change to pop up recently. There is an obvious argument to be made about maintainability, but of course in this case that could be dismissed pointing out that the standard library is the one place where providing a variety of heavily optimised implementation is actually worth it (and even expected). However, I think that we have a larger issue at hand here. Ranges are of prime strategic importance for D as an accessible way of composing complex operations out of reusable primitives. A large part of this comes from the fact that they are supposed to be zero-cost abstractions – after the compiler optimiser has had its way with a piece of code, we expect it to be equivalent to a hand-written loop. This is what turns ranges from a neat toy design into a production-ready feature for a performance-oriented language like D. (There are much more interesting and less error-prone designs to be had if performance was not of concern, cf. the eternal transient front discussion.) If we now find ourselves compelled to duplicate a simple loop that simply iterates over a range one-by-one to use indices where available, the only possible conclusion is that the concept of a zero-cost generalisation has failed somewhere at a very fundamental level. This is not a case where specialising on a richer type allows us to make use of the additional capabilities; it's just restating the same operation in a different way. In this specific instance, we might be able to paint over the underlying issue by manually adding a special case, but not only is this prohibitively expensive in terms of maintenance burden for many situations, it might also be plain impossible if a similar issue emerges in the composition of two separate higher-level primitives (think two nested range algorithms not being inlined properly into each other). I just did a quick check of the range-based identity path against |
I absolutely agree, did a bit of benchmarking and opened a forum discussion:
Agreed, started the testing - see the thread ;-) |
|
(To clarify: I think we should probably go ahead and add the identity specialisation, but not the random-access one.) |
std/algorithm/searching.d
Outdated
| import std.stdio; | ||
| assert([-2., 0, 2].extremum!`cmp(a, b) < 0` == -2.0); | ||
|
|
||
| // remember there is reduce too |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, how is this related? This doesn't seem to be in a documented unit test.
|
Apparently, there is a performance difference between indexed/primitive versions in the non-identity-map case even for LDC, so let's go with this for now. |
09ebd0f to
f637d7b
Compare
Ouch it was previously documented (before we decided extremum needs Andrei's approval as well) -> removed.
We should really dig deeper into this anyhow! Should we go with this version or with the overloads? As mentioned above we would need to add six overloads (2x extremum, 2x minElement, 2x maxElement). |
The LDC performance issue in the identity-mapped case is due to a bug in the LLVM optimizer (instcombine), see https://llvm.org/bugs/show_bug.cgi?id=28006. As for which implementation to go for, I'll leave it up to @andralex since he brought up the issue. Overloads are conceptually a bit cleaner, but since we have the string literal syntax in place already, the string comparison is really just checking whether the map parameter has been set explicitly or not. |
AFAIK we are waiting here for feedback and not for work -> changed the labels. |
Ping @andralex - for convenience I listed both ways below: String comparison (currently implemented in this PR)private auto extremum(alias map = "a", alias selector = "a < b", Range)(Range r)
private auto extremum(alias map = "a", alias selector = "a < b", Range,
RangeElementType = ElementType!Range)
(Range r, RangeElementType seedElement) and then it's a simple check against the value of static if (isSomeString!(typeof(map)))
enum isIdentity = map == "a";
else
enum isIdentity = false;OverloadsThe biggest disadvantage is the template explosion. As mentioned above we would need to add six overloads (2x extremum, 2x minElement, 2x maxElement). private auto extremum(alias selector = "a < b", Range)(Range r)
if (isInputRange!Range && !isInfinite!Range)
private auto extremum(alias map = "a", alias selector = "a < b", Range)(Range r)
if (isInputRange!Range && !isInfinite!Range)
private auto extremum(alias selector = "a < b", Range,
RangeElementType = ElementType!Range)
(Range r, RangeElementType seedElement)
if (isInputRange!Range && !isInfinite!Range &&
!is(CommonType!(ElementType!Range, RangeElementType) == void))
private auto extremum(alias map = "a", alias selector = "a < b", Range,
RangeElementType = ElementType!Range)
(Range r, RangeElementType seedElement)
if (isInputRange!Range && !isInfinite!Range &&
!is(CommonType!(ElementType!Range, RangeElementType) == void)) |
|
|
||
| // check for identity ("a") | ||
| static if (isSomeString!(typeof(map))) | ||
| enum isIdentity = map == "a"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nicer to specialize on no mapping function. Would that require a lot more code?
I opened it as new PR, s.t. it's easier to compare both approaches: |
|
Let's go with #5001 - thanks! |
As #4257 (exposing the extremum function) depends on andralex, I thought I separate the optimization for the noop-function case in a separate PR. Here it goes. The idea is to reduce the unnecessary element and function call if no mapping is specifiecd, e.g. from:
this PR reduces it down to:
Moreover a small goodie was added that checks whether the first template function can be used as an mapping function (unary function) and otherwise it falls back to the selector (binary function). Useful in terms of #4257