[RFC] Make the behavior of calling range primitives clear by JackStouffer · Pull Request #4511 · dlang/phobos

JackStouffer · 2016-06-30T14:01:55Z

Based on this post, many core devs were in agreement that one basic rule of ranges is that many calls to front without a call to popFront should all yield the same value. Some ranges break this rule.

This PR formalizes this rule as to make it crystal clear that these violations are bugs that must be fixed:

Calling front multiple times without calling popFront yields the same result for each call.

Ping @schveiguy @jmdavis

schveiguy · 2016-06-30T14:08:24Z

LGTM.

schveiguy · 2016-06-30T14:11:36Z

I'm flagging @andralex on this, since it's an additional design constraint to a core piece of the library.

wilzbach · 2016-06-30T14:34:49Z

std/range/primitives.d

+    is allowed only if calling `r.empty` has, or would have,
+    returned `false`.)
+    $(LI Calling `r.front` multiple times without calling
+    `r.popFront` yields the same result for each call.)


just a formatting nitpick: it seems that before every method/property has had it's own point. That seems reasonable, Can we keep this?

JackStouffer · 2016-06-30T14:51:38Z

@wilzbach fixed

dnadlinger · 2016-06-30T14:52:54Z

I'm all for clarifying the semantics (in fact, I think these details were a major oversight in the range design, although I'm aware Andrei disagrees). However, I think the currently proposed wording still leaves too many uncertainties – what does "yield the same result" mean? The same return value? Are side-effects evaluated multiple times?

schveiguy · 2016-06-30T14:56:53Z

although I'm aware Andrei disagrees

Where have you seen that? If he disagrees, we probably should just close this now and move on, because it won't happen.

what does "yield the same result" mean

It means that when I call front, I get the same result until I call popFront, whereby I then get the next element in the range when I call front. Essentially, the requirement is that the range doesn't move until I call popFront.

How this is achieved is not relevant or necessary to specify. You can re-calculate, you can use a cache, you can have whatever side effects you need to do, you can do whatever you want. Just make sure that the range's front stays the same.

JackStouffer · 2016-06-30T14:59:29Z

Perhaps I should change the wording to

"r.front called multiple times, without calling any other range primitive, yields the same result for every call"

as calling the set version of front will change the value of front.

schveiguy · 2016-06-30T15:42:55Z

as calling the set version of front will change the value of front.

Actually, your rewording still doesn't imply that calling set version of front is verboten.

Another strawman: "front should return or refer to the same element until popFront is called". Note that empty shouldn't advance the range either.

dnadlinger · 2016-06-30T18:25:45Z

although I'm aware Andrei disagrees

Where have you seen that?

Oh, what I was thinking of is not related to this PR in particular or indeed any other specific change proposal. In one of the more epic range semantics discussions – I think it was one related to transient ranges, maybe to byLine –, Andrei rather nonchalantly dismissed the possibility that these correctness considerations would have an impact on the basic design (e.g. splitting front/popFront vs. having one combined primitive). I don't have any particular, clear issue to highlight (yet?), but in light of all the difficulties with implicit API dependencies and object lifetimes we are facing right now, the position that "it will always be possible to write buggy code, so let's not even bother" seems like too convenient an answer for me. To reiterate, this was just an aside not related to this PR, but e.g. a combined front/advance design would avoid the issue discussed here (but of course come with its own drawbacks).

How this is achieved is not relevant or necessary to specify. You can re-calculate, you can use a cache, you can have whatever side effects you need to do, you can do whatever you want. Just make sure that the range's front stays the same.

Okay, so let's say .front is guaranteed to always "return the same result" before advancing the range. This allows you to declare things like generate as "clearly wrong" – but how much does it actually gain you when wondering how to write range algorithms? Will users actually expect front on the passed-in range to be called multiple times – what if it does have side effects? Should you avoid that, unless you'd need to go unreasonable contortions? What if front is expensive? Should you better make a local copy if you need front three times per element, or is that the user's responsibility, for maximum flexibility? Three hundred times? Three million?

schveiguy · 2016-06-30T18:55:30Z

Will users actually expect front on the passed-in range to be called multiple times – what if it does have side effects?

Yes, they should. If it has side effects, that is up to the range author whether it's important. The range API isn't concerned with side effects, and neither should the user of the range be concerned.

What if front is expensive?

range.cache if you want.

Should you better make a local copy if you need front three times per element, or is that the user's responsibility, for maximum flexibility? Three hundred times? Three million?

I don't think generic code should care. Performance of getting the front element is not technically part of the range definition, but I think everyone will expect the same performance as an array, where front is pretty cheap. If you find your peformance is lacking, then profile and optimize. If this means you make a copy, then you make a copy.

Even with an array, continually indexing the first element is going to cost more than making a copy to go in a register, I've made improvements to code by doing just that.

quickfur · 2016-06-30T19:04:11Z

I've thought about this a little, and realized that it's possible to interconvert between the current range API (empty, front, popFront) and a generator API (empty, next):

struct Generator(Range)
    if (isInputRange!Range)
{
    Range r;
    @property bool empty() { return r.empty; }
    auto next()
    {
        static if (isForwardRange!Range)
        {
            auto s = r.save;
            r.popFront();
            return s.front;
        }
        else
        {
            // Note: assumes non-transience
            auto e = r.front;
            r.popFront();
            return e;
        }
    }
}

struct Range(Generator)
    if (isGenerator!Generator)
{
    Generator g;
    typeof(g.next()) front;
    this()
    {
        if (!g.empty)
            front = g.next();
    }
    @property bool empty() { return g.empty; }
    void popFront() { front = g.next(); }
}

Perhaps what we need is the generator-to-range construct wrapped in a convenient Phobos function, so that it's easy to generate ranges from generators.

quickfur · 2016-06-30T19:06:41Z

And by "generators" I include everything that might be problematic based on the updated range API proposed in this PR, such as front that changes its value every time, etc., front doing too much work every time it's called, etc..

schveiguy · 2016-06-30T19:23:11Z

Yes, a generator can be turned into a range quite easily. But I think we don't need a new concept to do that. A generator is really just a delegate (the "empty" method is mostly useless).

JackStouffer · 2016-06-30T19:52:48Z

@quickfur opApply iterables can be translated in a very similar fashion.

quickfur · 2016-06-30T20:10:51Z

With opApply you need fibers, though. Otherwise you couldn't implement the range primitives with the required semantics.

dnadlinger · 2016-06-30T22:21:16Z

If it has side effects, that is up to the range author whether it's important.

The author of which range?

The range API isn't concerned with side effects, and neither should the user of the range be concerned.

My claim: What you are saying here is that the range API is a leaky abstraction by design.

I don't think generic code should care. Performance of getting the front element is not technically part of the range definition, but I think everyone will expect the same performance as an array,

You are right, generic code (i.e. every single range algorithm) shouldn't have to care. But for that to be possible, it needs to be clear what the code can and cannot (or should not) expect. For example, if I grant you that people will expect front to be cheap – doesn't this mean that map should cache its result, especially because it allows the user to reason about side effects in a self-contained way? On the other hand, what if I'm trying to be cautious about repeat accesses and cache everything in the fancy higher-order range I'm writing, but then somebody uses the code with an element type that's expensive to move/copy, who would much rather prefer no caching to be done because they are only going to in turn call front only once anyway?

schveiguy · 2016-06-30T23:45:01Z

The author of which range?

The author of the range that is creating the side effects.

What you are saying here is that the range API is a leaky abstraction by design.

Leaky how? I can write code that has side effects anywhere, no design can stop that. Unless you want to make all ranges pure?

For example, if I grant you that people will expect front to be cheap – doesn't this mean that map should cache its result?

map should not concern itself with preventing unexpected behavior from someone who creates such a range. Indeed, I have shown how to create an unexpected range with map as shown on the forum post linked from this PR that has nothing to do with caching.

The thing is, caching does not affect the API, just the performance. And premature optimization is not usually desired. It has its own effects (for example, if the underlying data changes through other means, the cached copy is no longer valid). If you want to cache with map, use map!(...).cache. It's not hard to cache on demand.

On the other hand...

I think it's more important to identify the aspects of performance/behavior in the documentation of the higher-order range. Do whatever the minimum is to get it to be a valid range, then allow the user to build on top of that if necessary.

The reality is, as long as we create algorithms that use arbitrary function calls to build the data, we cannot be certain that any range construction has well-behaved semantics. There is a certain degree of responsibility that the function author has to assume, map (and the compiler) can only do so much.

dnadlinger · 2016-07-01T00:05:45Z

Leaky how?

Leaky as in: If the design doesn't specify how side-effect/number of evaluation concerns are to be handled, then I need to crack open the abstraction and review all the source code individually to figure out whether my client code is correct or not.

I can write code that has side effects anywhere, no design can stop that.

The question is not whether you can or not, but whether you should be allowed to in "well-formed" range code, i.e. whether you can then expect those side effects to be evaluated in a certain way.

The thing is, caching does not affect the API, just the performance

This is somewhat of a distraction from my main point, but I disagree. If the user writes a.b!(someSideEffect).c.copy(…) (where a-c are typical range constructor functions, and b is something with side effects in front, e.g. map), then the behaviour of the code manifestly depends on how often c evaluates front per element, not just the performance.

If you want to cache with map, use map!(...).cache. It's not hard to cache on demand.

If you want this to become is the official stance (which is reasonable enough), then let's document it that way: Ranges must expect their front property to be called multiple times and handle that in a valid way (which makes map's behaviour something that needs to be mentioned much more prominently in its documentation), but shouldn't be concerned with optimising for performance in that case – i.e. it's fine for N front calls to take ~N times as long as a single front/popFront pair.

There is a certain degree of responsibility that the function author has to assume […]

Agreed. My point is that when I write a range algorithm, which consume a range but also offers a range interface itself, I can only assume responsibility if I know what guarantees/characteristics client code will expect, and I can in turn expect from the range I'm passed.

schveiguy · 2016-07-01T00:19:44Z

@klickverbot It seems your reasoning comes down to -- can we force algorithms (either by mechanical verification or convention) to require calling front a certain number of times? The answer is no. Any range author and/or function author for range-using code can call front as many times as they want. Does this mean ranges can't have side effects? I don't think it does. But the side effects are the responsibility of the caller, not the algorithm that's calling front N times. Yes, it means that if you are using a function that creates those side effects via calls to front/popFront/empty, then you have to examine the code to see what exactly is going to happen. In essence, you are relying on an implementation detail for correct code.

To put it another way -- I don't think it's invalid to have map!someFunction if someFunction causes side effects (and that's OK with you). But map is perfectly within its right to change the number of calls to front for any reason, so the user should beware of that.

I don't think we need a specific note on map, just a general rule concerning callable parameters and side effects.

schveiguy · 2016-07-01T00:38:11Z

Walter agrees with the essence of the rule: https://forum.dlang.org/post/nl4b8h$2jds$1@digitalmars.com

dnadlinger · 2016-07-01T00:47:22Z

@schveiguy: That's a fair summary, although I'm not just concerned with strict validity conditions, but also "recommendations" for usage, or best practices if you will. For example, a summary of your position could be: Try to implement algorithms with the minimum number of calls to range primitives they naturally require, but don't do any caching internally to reduce the number of empty/front calls to 1 (which is in theory always possible).

You can definitely make the claim that there is a qualitative difference between that and the minimal statement that this PR is concerned with. My point is that writing down (and ideally formalising) these higher-level considerations apart from just "is (not) allowed" is essential to making ranges work in a composable and predictable way. Just specifying that front is allowed to be called multiple times is a first step towards ruling out freak behaviour like generate, but is not enough to reach those goals by itself.

Regarding map, I don't see how there could be a general rule for "callable parameters". The semantics of a callable template argument, and consequently also the user expectations, depend of course on what the algorithm in question is. For example, it seems reasonable for a filter algorithm to specify that the predicate is invoked exactly once per element of the consumed range, but an arbitrary number of times per popFront of the filtered range.

(In case this isn't obvious: I'm not against this PR by itself, since it seems the most solid design from many perspectives, and matches common expectations. But while people are actively thinking and debating about this, it might also make sense to tackle the bigger issue at hand.)

JackStouffer · 2016-07-01T01:15:33Z

Added Walter's rules and make the docs a little clearer.

JackStouffer · 2016-07-01T01:17:39Z

Since we're all here, the rules currently allow r.empty to return things other than bool. Is this intended?

quickfur · 2016-07-06T20:49:03Z

std/range/primitives.d

+    available in the range.)
+    $(LI `r.empty` called multiple times, without calling
+    `r.popFront`, or otherwise mutating the range object,
+    yields the same result for every call.)


I don't like the term "calling", because it implies that empty must be a function. That's not necessarily the case, since it can be a member variable of the range that gets updated by popFront. Maybe "evaluated" is a better word?

Also, multiple evaluations of empty without calling popFront should not cause the value of front to change. Or is this already covered by the following rules?

Also, multiple evaluations of empty without calling popFront should not cause the value of front to change. Or is this already covered by the following rules?

See above comment

JackStouffer · 2016-07-07T13:05:39Z

@quickfur fixed

quickfur · 2016-07-07T16:03:49Z

std/range/primitives.d

+    $(LI `r.front` returns the current element in the range.
+    It may return by value or by reference.)
+    $(LI `r.front` can be legally called iff calling
+    `r.empty` has, or would have, returned `false`.)


Hmm. What about s/call/evaluate/? Since front technically could be a member variable updated by popFront, not necessarily a function.

JackStouffer · 2016-07-07T18:14:23Z

@quickfur Sorry, missed that. Fixed.

quickfur · 2016-07-07T18:35:33Z

Thanks! LGTM once @andralex 's comments are addressed.

JackStouffer · 2016-07-08T16:30:28Z

@quickfur I left some notes on @andralex's comments, I disagree in some areas. If you agree with him could you clarify your position?

jmdavis · 2016-07-08T18:29:53Z

Performance of getting the front element is not technically part of the range definition, but I think everyone will expect the same performance as an array, where front is pretty cheap.

I think that it's pretty clear that all of the range primitives should be O(1) with regards to the number of elements in the range. Exactly what that amounts to in terms of how expensive the operations end up being within O(1) is very much implementation-dependent and not really a concern of the range API IMHO, though it should be expected that empty and front may be called multiple times per call to popFront, and so they should be relatively cheap to call. The primary concern is O(1) though.

Since we're all here, the rules currently allow r.empty to return things other than bool. Is this intended?

I don't see any reason to allow empty to return anything other than bool.

And as for concerns about what returning the same resultmeans, if we want to specify it more concretely, I think that the requirement should be that the same result means equality. There's too much useful code that does stuff like allocate - e.g. range.map!(a => to!string(a))() - for it to make any sense to require that the result be exactly the same object. I'm not sure that we need to be more specific than saying that the result needs to be the same though.

Overall, I think that this looks good.

JackStouffer · 2016-07-08T18:46:37Z

I think that it's pretty clear that all of the range primitives should be O(1) with regards to the number of elements in the range.

Should this be added too? Something like,

"Phobos calls range primitives with the assumption that they are O(1) with respect to the number of elements in the range. Therefore, it's best practice to restrict the range primitives you do provide to those that are O(1). For example, it's not a good idea to implement a singly linked list as a bidirectional range, as the back and popBack attributes would be O(n), and they would be called many times in Phobos code if they are present."

I don't see any reason to allow empty to return anything other than bool.

I will clarify this.

schveiguy · 2016-07-08T19:17:50Z

I think O(1) requirement should not be specified. I think what's important is to know that algorithms will not consider calls to front or empty to be expensive, and may call them more than once for each iteration. If you construct a range with an expensive front, it's on you to worry about the performance of some algorithm that doesn't expect it. But I don't think we need to go as far as saying that's not a valid range.

BTW, typically "sub-linear" is acceptable for fast operations. O(lg(n)) should be sufficient.

quickfur · 2016-07-08T20:08:38Z

I think it's clear that any generic algorithm that takes arbitrary code as parameter (e.g. a lambda, a range whose range API implementation comes from the user) will have a complexity that's keyed on the complexity of the provided code/implementation.

E.g., myRange.sort() is stated to be O(n log n), but it should be clear that what it actually means is n log n composed with the complexity of the range primitives. If popFront is O(n) then the actual complexity is not O(n log n) but O(n^2 log n).

Similarly, myrange.map!lambda supposedly has .front that is O(1), but the actual complexity will be O(lambda).

But it seems unreasonably onerous to require the docs to specify the complexity as "O(f(n) log f(n)) where f(n) is the complexity of parameter X"; we usually assume that range primitives, lambdas, etc., are "cheap" in the sense that we can make the possibly-not-so-accurate assumption that they are O(1) and still derive an overall big-O complexity that gives a good idea of the actual complexity. If the user breaks the O(1) assumption, it's their own problem to deal with the resulting change in complexity; Phobos should simply state that the given big-O complexities are based on the assumption that certain things are O(1), so if your range doesn't obey that, then you're on your own to figure out what the actual complexity will be. I don't think we should outright reject such a range as a non-range.

quickfur · 2016-07-08T20:14:55Z

@JackStouffer I didn't say I agreed with everything @andralex said, just that, being the official "head honcho" of Phobos, his comments do need to be addressed one way or another (either concede with him and implement what he proposed, or win the argument and get his agreement on doing it another way) before we merge.

And for the record, I agree that we should use "if and only if" instead of "iff", even though personally I actually prefer the latter. For public-facing docs using the full phrase doesn't hurt and can only help reduce the likelihood of alienating would-be readers who may or may not know what "iff" means. It's not as though people who understand "iff" wouldn't understand "if and only if", so there is no harm in catering to the wider audience.

jmdavis · 2016-07-09T01:46:13Z

Actually, upon reflection, popFront can't be required to be O(1) because of stuff like binary trees, though I don't see how it makes any sense for empty or front to not be O(1). Regardless, the key thing with regards to performance is that algorithms assume that the range primitives are cheap enough that they basically don't factor into Big-O considerations - particularly front and empty. Probably the most important aspect of that is simply that folks should be aware that it's perfectly legit for an algorithm to call empty and front multiple times per call to popFront, and algorithms are going to assume that those calls are cheap. But ultimately that's an issue of performance and not correctness. So, I don't know if we want to say anything about that or not.

Regardless, it's certainly not the case that a range that has ridiculously expensive range operations isn't a range. It's just a horribly performing one that will violate the complexity assumptions made by any generic, range-based algorithm. It will work perfectly soundly with them but with performance that is worse than those algorithms are supposed to have.

JackStouffer · 2016-07-09T04:13:59Z

Performance is an important part of the range elevator pitch, and complexity is mentioned elsewhere in Phobos. I think I should add a small note, outside of the rules section, about Phobos' assumptions about complexity so it's clear.

quickfur · 2016-07-11T18:41:51Z

@jmdavis In principle, I agree. Creating a range with expensive front or empty is generally unwise, and I think it's safe for Phobos docs to state that O(1) is assumed for these primitives, and that if this isn't actually true, then generic algorithms may not obey the big-O complexity specified in the docs. If the user wants to do it anyway, they just have to live with the consequences.

JackStouffer · 2016-07-11T18:55:14Z

Addressed comments

dnadlinger · 2016-07-12T20:41:15Z

std/range/primitives.d

+)
+
+Also, note that Phobos code assumes that the primitives `r.front` and
+`r.empty` are $(BIGOH 1) complexity wise or "cheap" in terms of running


O(1) doesn't make much sense without stating what the considered parameters are (length of range, ...). "Complexity" can also refer to more than time complexity (space, …).

JackStouffer · 2016-07-18T18:16:32Z

@klickverbot Done

I think this is ready to merge

dnadlinger · 2016-07-18T21:47:29Z

Auto-merge toggled on

dnadlinger · 2016-07-18T21:59:05Z

The language, particularly regarding the time complexity statements, could still be tightened down a bit, but this can be done separately.

JackStouffer force-pushed the patch-19 branch from 65ea770 to 13e604d Compare June 30, 2016 14:03

schveiguy added the @andralex label Jun 30, 2016

wilzbach reviewed Jun 30, 2016
View reviewed changes

JackStouffer force-pushed the patch-19 branch from 13e604d to ec704be Compare June 30, 2016 14:51

JackStouffer changed the title ~~[RFC] Make clear the behavior of calling front~~ [RFC] Make the behavior of calling front clear Jun 30, 2016

schveiguy removed the @andralex label Jul 1, 2016

JackStouffer force-pushed the patch-19 branch from ec704be to d09ff37 Compare July 1, 2016 01:14

JackStouffer changed the title ~~[RFC] Make the behavior of calling front clear~~ [RFC] Make the behavior of calling range primitives clear Jul 1, 2016

quickfur reviewed Jul 6, 2016
View reviewed changes

JackStouffer force-pushed the patch-19 branch from b76621a to 804e827 Compare July 7, 2016 13:05

quickfur reviewed Jul 7, 2016
View reviewed changes

JackStouffer force-pushed the patch-19 branch from 804e827 to 2f73346 Compare July 7, 2016 18:12

JackStouffer force-pushed the patch-19 branch from 2f73346 to 680784c Compare July 7, 2016 18:15

JackStouffer force-pushed the patch-19 branch from 680784c to b0b5afe Compare July 11, 2016 18:54

dnadlinger reviewed Jul 12, 2016
View reviewed changes

Make the Phobos rules/assumptions of input ranges clearer

1dd85d4

JackStouffer force-pushed the patch-19 branch from b0b5afe to 1dd85d4 Compare July 18, 2016 18:16

dnadlinger merged commit f0b5f96 into dlang:master Jul 18, 2016

wilzbach mentioned this pull request Dec 21, 2016

Bugfix for takeOne (issue 16999) #4984

Merged

JackStouffer deleted the patch-19 branch March 10, 2017 19:21

Uh oh!

Conversation

JackStouffer commented Jun 30, 2016

Uh oh!

schveiguy commented Jun 30, 2016

Uh oh!

schveiguy commented Jun 30, 2016

Uh oh!

wilzbach Jun 30, 2016

Choose a reason for hiding this comment

Uh oh!

JackStouffer commented Jun 30, 2016

Uh oh!

dnadlinger commented Jun 30, 2016

Uh oh!

schveiguy commented Jun 30, 2016

Uh oh!

JackStouffer commented Jun 30, 2016

Uh oh!

schveiguy commented Jun 30, 2016

Uh oh!

dnadlinger commented Jun 30, 2016

Uh oh!

schveiguy commented Jun 30, 2016

Uh oh!

quickfur commented Jun 30, 2016

Uh oh!

quickfur commented Jun 30, 2016

Uh oh!

schveiguy commented Jun 30, 2016

Uh oh!

JackStouffer commented Jun 30, 2016

Uh oh!

quickfur commented Jun 30, 2016

Uh oh!

dnadlinger commented Jun 30, 2016

Uh oh!

schveiguy commented Jun 30, 2016

Uh oh!

dnadlinger commented Jul 1, 2016

Uh oh!

schveiguy commented Jul 1, 2016

Uh oh!

schveiguy commented Jul 1, 2016

Uh oh!

dnadlinger commented Jul 1, 2016

Uh oh!

JackStouffer commented Jul 1, 2016

Uh oh!

JackStouffer commented Jul 1, 2016

Uh oh!

quickfur Jul 6, 2016

Choose a reason for hiding this comment

Uh oh!

JackStouffer Jul 6, 2016

Choose a reason for hiding this comment

Uh oh!

JackStouffer commented Jul 7, 2016

Uh oh!

quickfur Jul 7, 2016

Choose a reason for hiding this comment

Uh oh!

JackStouffer commented Jul 7, 2016

Uh oh!

quickfur commented Jul 7, 2016

Uh oh!

JackStouffer commented Jul 8, 2016

Uh oh!

jmdavis commented Jul 8, 2016

Uh oh!

JackStouffer commented Jul 8, 2016

Uh oh!

schveiguy commented Jul 8, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

quickfur commented Jul 8, 2016

Uh oh!

quickfur commented Jul 8, 2016

Uh oh!

jmdavis commented Jul 9, 2016

schveiguy commented Jul 8, 2016 •

edited

Loading