Skip to content

[RFC] Make the behavior of calling range primitives clear#4511

Merged
dnadlinger merged 1 commit intodlang:masterfrom
JackStouffer:patch-19
Jul 18, 2016
Merged

[RFC] Make the behavior of calling range primitives clear#4511
dnadlinger merged 1 commit intodlang:masterfrom
JackStouffer:patch-19

Conversation

@JackStouffer
Copy link
Contributor

Based on this post, many core devs were in agreement that one basic rule of ranges is that many calls to front without a call to popFront should all yield the same value. Some ranges break this rule.

This PR formalizes this rule as to make it crystal clear that these violations are bugs that must be fixed:

Calling front multiple times without calling popFront yields the same result for each call.

Ping @schveiguy @jmdavis

@schveiguy
Copy link
Member

LGTM.

@schveiguy
Copy link
Member

I'm flagging @andralex on this, since it's an additional design constraint to a core piece of the library.

is allowed only if calling `r.empty` has, or would have,
returned `false`.)
$(LI Calling `r.front` multiple times without calling
`r.popFront` yields the same result for each call.)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a formatting nitpick: it seems that before every method/property has had it's own point. That seems reasonable, Can we keep this?

@JackStouffer
Copy link
Contributor Author

@wilzbach fixed

@dnadlinger
Copy link
Contributor

I'm all for clarifying the semantics (in fact, I think these details were a major oversight in the range design, although I'm aware Andrei disagrees). However, I think the currently proposed wording still leaves too many uncertainties – what does "yield the same result" mean? The same return value? Are side-effects evaluated multiple times?

@schveiguy
Copy link
Member

although I'm aware Andrei disagrees

Where have you seen that? If he disagrees, we probably should just close this now and move on, because it won't happen.

what does "yield the same result" mean

It means that when I call front, I get the same result until I call popFront, whereby I then get the next element in the range when I call front. Essentially, the requirement is that the range doesn't move until I call popFront.

How this is achieved is not relevant or necessary to specify. You can re-calculate, you can use a cache, you can have whatever side effects you need to do, you can do whatever you want. Just make sure that the range's front stays the same.

@JackStouffer
Copy link
Contributor Author

Perhaps I should change the wording to

"r.front called multiple times, without calling any other range primitive, yields the same result for every call"

as calling the set version of front will change the value of front.

@schveiguy
Copy link
Member

as calling the set version of front will change the value of front.

Actually, your rewording still doesn't imply that calling set version of front is verboten.

Another strawman: "front should return or refer to the same element until popFront is called". Note that empty shouldn't advance the range either.

@dnadlinger
Copy link
Contributor

although I'm aware Andrei disagrees

Where have you seen that?

Oh, what I was thinking of is not related to this PR in particular or indeed any other specific change proposal. In one of the more epic range semantics discussions – I think it was one related to transient ranges, maybe to byLine –, Andrei rather nonchalantly dismissed the possibility that these correctness considerations would have an impact on the basic design (e.g. splitting front/popFront vs. having one combined primitive). I don't have any particular, clear issue to highlight (yet?), but in light of all the difficulties with implicit API dependencies and object lifetimes we are facing right now, the position that "it will always be possible to write buggy code, so let's not even bother" seems like too convenient an answer for me. To reiterate, this was just an aside not related to this PR, but e.g. a combined front/advance design would avoid the issue discussed here (but of course come with its own drawbacks).

How this is achieved is not relevant or necessary to specify. You can re-calculate, you can use a cache, you can have whatever side effects you need to do, you can do whatever you want. Just make sure that the range's front stays the same.

Okay, so let's say .front is guaranteed to always "return the same result" before advancing the range. This allows you to declare things like generate as "clearly wrong" – but how much does it actually gain you when wondering how to write range algorithms? Will users actually expect front on the passed-in range to be called multiple times – what if it does have side effects? Should you avoid that, unless you'd need to go unreasonable contortions? What if front is expensive? Should you better make a local copy if you need front three times per element, or is that the user's responsibility, for maximum flexibility? Three hundred times? Three million?

@JackStouffer JackStouffer changed the title [RFC] Make clear the behavior of calling front [RFC] Make the behavior of calling front clear Jun 30, 2016
@schveiguy
Copy link
Member

Will users actually expect front on the passed-in range to be called multiple times – what if it does have side effects?

Yes, they should. If it has side effects, that is up to the range author whether it's important. The range API isn't concerned with side effects, and neither should the user of the range be concerned.

What if front is expensive?

range.cache if you want.

Should you better make a local copy if you need front three times per element, or is that the user's responsibility, for maximum flexibility? Three hundred times? Three million?

I don't think generic code should care. Performance of getting the front element is not technically part of the range definition, but I think everyone will expect the same performance as an array, where front is pretty cheap. If you find your peformance is lacking, then profile and optimize. If this means you make a copy, then you make a copy.

Even with an array, continually indexing the first element is going to cost more than making a copy to go in a register, I've made improvements to code by doing just that.

@quickfur
Copy link
Member

I've thought about this a little, and realized that it's possible to interconvert between the current range API (empty, front, popFront) and a generator API (empty, next):

struct Generator(Range)
    if (isInputRange!Range)
{
    Range r;
    @property bool empty() { return r.empty; }
    auto next()
    {
        static if (isForwardRange!Range)
        {
            auto s = r.save;
            r.popFront();
            return s.front;
        }
        else
        {
            // Note: assumes non-transience
            auto e = r.front;
            r.popFront();
            return e;
        }
    }
}

struct Range(Generator)
    if (isGenerator!Generator)
{
    Generator g;
    typeof(g.next()) front;
    this()
    {
        if (!g.empty)
            front = g.next();
    }
    @property bool empty() { return g.empty; }
    void popFront() { front = g.next(); }
}

Perhaps what we need is the generator-to-range construct wrapped in a convenient Phobos function, so that it's easy to generate ranges from generators.

@quickfur
Copy link
Member

And by "generators" I include everything that might be problematic based on the updated range API proposed in this PR, such as front that changes its value every time, etc., front doing too much work every time it's called, etc..

@schveiguy
Copy link
Member

Yes, a generator can be turned into a range quite easily. But I think we don't need a new concept to do that. A generator is really just a delegate (the "empty" method is mostly useless).

@JackStouffer
Copy link
Contributor Author

@quickfur opApply iterables can be translated in a very similar fashion.

@quickfur
Copy link
Member

With opApply you need fibers, though. Otherwise you couldn't implement the range primitives with the required semantics.

@dnadlinger
Copy link
Contributor

If it has side effects, that is up to the range author whether it's important.

The author of which range?

The range API isn't concerned with side effects, and neither should the user of the range be concerned.

My claim: What you are saying here is that the range API is a leaky abstraction by design.

I don't think generic code should care. Performance of getting the front element is not technically part of the range definition, but I think everyone will expect the same performance as an array,

You are right, generic code (i.e. every single range algorithm) shouldn't have to care. But for that to be possible, it needs to be clear what the code can and cannot (or should not) expect. For example, if I grant you that people will expect front to be cheap – doesn't this mean that map should cache its result, especially because it allows the user to reason about side effects in a self-contained way? On the other hand, what if I'm trying to be cautious about repeat accesses and cache everything in the fancy higher-order range I'm writing, but then somebody uses the code with an element type that's expensive to move/copy, who would much rather prefer no caching to be done because they are only going to in turn call front only once anyway?

@schveiguy
Copy link
Member

The author of which range?

The author of the range that is creating the side effects.

What you are saying here is that the range API is a leaky abstraction by design.

Leaky how? I can write code that has side effects anywhere, no design can stop that. Unless you want to make all ranges pure?

For example, if I grant you that people will expect front to be cheap – doesn't this mean that map should cache its result?

map should not concern itself with preventing unexpected behavior from someone who creates such a range. Indeed, I have shown how to create an unexpected range with map as shown on the forum post linked from this PR that has nothing to do with caching.

The thing is, caching does not affect the API, just the performance. And premature optimization is not usually desired. It has its own effects (for example, if the underlying data changes through other means, the cached copy is no longer valid). If you want to cache with map, use map!(...).cache. It's not hard to cache on demand.

On the other hand...

I think it's more important to identify the aspects of performance/behavior in the documentation of the higher-order range. Do whatever the minimum is to get it to be a valid range, then allow the user to build on top of that if necessary.

The reality is, as long as we create algorithms that use arbitrary function calls to build the data, we cannot be certain that any range construction has well-behaved semantics. There is a certain degree of responsibility that the function author has to assume, map (and the compiler) can only do so much.

@dnadlinger
Copy link
Contributor

Leaky how?

Leaky as in: If the design doesn't specify how side-effect/number of evaluation concerns are to be handled, then I need to crack open the abstraction and review all the source code individually to figure out whether my client code is correct or not.

I can write code that has side effects anywhere, no design can stop that.

The question is not whether you can or not, but whether you should be allowed to in "well-formed" range code, i.e. whether you can then expect those side effects to be evaluated in a certain way.

The thing is, caching does not affect the API, just the performance

This is somewhat of a distraction from my main point, but I disagree. If the user writes a.b!(someSideEffect).c.copy(…) (where a-c are typical range constructor functions, and b is something with side effects in front, e.g. map), then the behaviour of the code manifestly depends on how often c evaluates front per element, not just the performance.

If you want to cache with map, use map!(...).cache. It's not hard to cache on demand.

If you want this to become is the official stance (which is reasonable enough), then let's document it that way: Ranges must expect their front property to be called multiple times and handle that in a valid way (which makes map's behaviour something that needs to be mentioned much more prominently in its documentation), but shouldn't be concerned with optimising for performance in that case – i.e. it's fine for N front calls to take ~N times as long as a single front/popFront pair.

There is a certain degree of responsibility that the function author has to assume […]

Agreed. My point is that when I write a range algorithm, which consume a range but also offers a range interface itself, I can only assume responsibility if I know what guarantees/characteristics client code will expect, and I can in turn expect from the range I'm passed.

@schveiguy
Copy link
Member

@klickverbot It seems your reasoning comes down to -- can we force algorithms (either by mechanical verification or convention) to require calling front a certain number of times? The answer is no. Any range author and/or function author for range-using code can call front as many times as they want. Does this mean ranges can't have side effects? I don't think it does. But the side effects are the responsibility of the caller, not the algorithm that's calling front N times. Yes, it means that if you are using a function that creates those side effects via calls to front/popFront/empty, then you have to examine the code to see what exactly is going to happen. In essence, you are relying on an implementation detail for correct code.

To put it another way -- I don't think it's invalid to have map!someFunction if someFunction causes side effects (and that's OK with you). But map is perfectly within its right to change the number of calls to front for any reason, so the user should beware of that.

I don't think we need a specific note on map, just a general rule concerning callable parameters and side effects.

@schveiguy
Copy link
Member

Walter agrees with the essence of the rule: https://forum.dlang.org/post/nl4b8h$2jds$1@digitalmars.com

@dnadlinger
Copy link
Contributor

@schveiguy: That's a fair summary, although I'm not just concerned with strict validity conditions, but also "recommendations" for usage, or best practices if you will. For example, a summary of your position could be: Try to implement algorithms with the minimum number of calls to range primitives they naturally require, but don't do any caching internally to reduce the number of empty/front calls to 1 (which is in theory always possible).

You can definitely make the claim that there is a qualitative difference between that and the minimal statement that this PR is concerned with. My point is that writing down (and ideally formalising) these higher-level considerations apart from just "is (not) allowed" is essential to making ranges work in a composable and predictable way. Just specifying that front is allowed to be called multiple times is a first step towards ruling out freak behaviour like generate, but is not enough to reach those goals by itself.

Regarding map, I don't see how there could be a general rule for "callable parameters". The semantics of a callable template argument, and consequently also the user expectations, depend of course on what the algorithm in question is. For example, it seems reasonable for a filter algorithm to specify that the predicate is invoked exactly once per element of the consumed range, but an arbitrary number of times per popFront of the filtered range.

(In case this isn't obvious: I'm not against this PR by itself, since it seems the most solid design from many perspectives, and matches common expectations. But while people are actively thinking and debating about this, it might also make sense to tackle the bigger issue at hand.)

@JackStouffer
Copy link
Contributor Author

Added Walter's rules and make the docs a little clearer.

@JackStouffer JackStouffer changed the title [RFC] Make the behavior of calling front clear [RFC] Make the behavior of calling range primitives clear Jul 1, 2016
@JackStouffer
Copy link
Contributor Author

Since we're all here, the rules currently allow r.empty to return things other than bool. Is this intended?

available in the range.)
$(LI `r.empty` called multiple times, without calling
`r.popFront`, or otherwise mutating the range object,
yields the same result for every call.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the term "calling", because it implies that empty must be a function. That's not necessarily the case, since it can be a member variable of the range that gets updated by popFront. Maybe "evaluated" is a better word?

Also, multiple evaluations of empty without calling popFront should not cause the value of front to change. Or is this already covered by the following rules?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, multiple evaluations of empty without calling popFront should not cause the value of front to change. Or is this already covered by the following rules?

See above comment

@JackStouffer
Copy link
Contributor Author

@quickfur fixed

$(LI `r.front` returns the current element in the range.
It may return by value or by reference.)
$(LI `r.front` can be legally called iff calling
`r.empty` has, or would have, returned `false`.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. What about s/call/evaluate/? Since front technically could be a member variable updated by popFront, not necessarily a function.

@JackStouffer
Copy link
Contributor Author

@quickfur Sorry, missed that. Fixed.

@quickfur
Copy link
Member

quickfur commented Jul 7, 2016

Thanks! LGTM once @andralex 's comments are addressed.

@JackStouffer
Copy link
Contributor Author

@quickfur I left some notes on @andralex's comments, I disagree in some areas. If you agree with him could you clarify your position?

@jmdavis
Copy link
Member

jmdavis commented Jul 8, 2016

Performance of getting the front element is not technically part of the range definition, but I think everyone will expect the same performance as an array, where front is pretty cheap.

I think that it's pretty clear that all of the range primitives should be O(1) with regards to the number of elements in the range. Exactly what that amounts to in terms of how expensive the operations end up being within O(1) is very much implementation-dependent and not really a concern of the range API IMHO, though it should be expected that empty and front may be called multiple times per call to popFront, and so they should be relatively cheap to call. The primary concern is O(1) though.

Since we're all here, the rules currently allow r.empty to return things other than bool. Is this intended?

I don't see any reason to allow empty to return anything other than bool.

And as for concerns about what returning the same resultmeans, if we want to specify it more concretely, I think that the requirement should be that the same result means equality. There's too much useful code that does stuff like allocate - e.g. range.map!(a => to!string(a))() - for it to make any sense to require that the result be exactly the same object. I'm not sure that we need to be more specific than saying that the result needs to be the same though.

Overall, I think that this looks good.

@JackStouffer
Copy link
Contributor Author

I think that it's pretty clear that all of the range primitives should be O(1) with regards to the number of elements in the range.

Should this be added too? Something like,

"Phobos calls range primitives with the assumption that they are O(1) with respect to the number of elements in the range. Therefore, it's best practice to restrict the range primitives you do provide to those that are O(1). For example, it's not a good idea to implement a singly linked list as a bidirectional range, as the back and popBack attributes would be O(n), and they would be called many times in Phobos code if they are present."

I don't see any reason to allow empty to return anything other than bool.

I will clarify this.

@schveiguy
Copy link
Member

schveiguy commented Jul 8, 2016

I think O(1) requirement should not be specified. I think what's important is to know that algorithms will not consider calls to front or empty to be expensive, and may call them more than once for each iteration. If you construct a range with an expensive front, it's on you to worry about the performance of some algorithm that doesn't expect it. But I don't think we need to go as far as saying that's not a valid range.

BTW, typically "sub-linear" is acceptable for fast operations. O(lg(n)) should be sufficient.

@quickfur
Copy link
Member

quickfur commented Jul 8, 2016

I think it's clear that any generic algorithm that takes arbitrary code as parameter (e.g. a lambda, a range whose range API implementation comes from the user) will have a complexity that's keyed on the complexity of the provided code/implementation.

E.g., myRange.sort() is stated to be O(n log n), but it should be clear that what it actually means is n log n composed with the complexity of the range primitives. If popFront is O(n) then the actual complexity is not O(n log n) but O(n^2 log n).

Similarly, myrange.map!lambda supposedly has .front that is O(1), but the actual complexity will be O(lambda).

But it seems unreasonably onerous to require the docs to specify the complexity as "O(f(n) log f(n)) where f(n) is the complexity of parameter X"; we usually assume that range primitives, lambdas, etc., are "cheap" in the sense that we can make the possibly-not-so-accurate assumption that they are O(1) and still derive an overall big-O complexity that gives a good idea of the actual complexity. If the user breaks the O(1) assumption, it's their own problem to deal with the resulting change in complexity; Phobos should simply state that the given big-O complexities are based on the assumption that certain things are O(1), so if your range doesn't obey that, then you're on your own to figure out what the actual complexity will be. I don't think we should outright reject such a range as a non-range.

@quickfur
Copy link
Member

quickfur commented Jul 8, 2016

@JackStouffer I didn't say I agreed with everything @andralex said, just that, being the official "head honcho" of Phobos, his comments do need to be addressed one way or another (either concede with him and implement what he proposed, or win the argument and get his agreement on doing it another way) before we merge.

And for the record, I agree that we should use "if and only if" instead of "iff", even though personally I actually prefer the latter. For public-facing docs using the full phrase doesn't hurt and can only help reduce the likelihood of alienating would-be readers who may or may not know what "iff" means. It's not as though people who understand "iff" wouldn't understand "if and only if", so there is no harm in catering to the wider audience.

@jmdavis
Copy link
Member

jmdavis commented Jul 9, 2016

Actually, upon reflection, popFront can't be required to be O(1) because of stuff like binary trees, though I don't see how it makes any sense for empty or front to not be O(1). Regardless, the key thing with regards to performance is that algorithms assume that the range primitives are cheap enough that they basically don't factor into Big-O considerations - particularly front and empty. Probably the most important aspect of that is simply that folks should be aware that it's perfectly legit for an algorithm to call empty and front multiple times per call to popFront, and algorithms are going to assume that those calls are cheap. But ultimately that's an issue of performance and not correctness. So, I don't know if we want to say anything about that or not.

Regardless, it's certainly not the case that a range that has ridiculously expensive range operations isn't a range. It's just a horribly performing one that will violate the complexity assumptions made by any generic, range-based algorithm. It will work perfectly soundly with them but with performance that is worse than those algorithms are supposed to have.

@JackStouffer
Copy link
Contributor Author

Performance is an important part of the range elevator pitch, and complexity is mentioned elsewhere in Phobos. I think I should add a small note, outside of the rules section, about Phobos' assumptions about complexity so it's clear.

@quickfur
Copy link
Member

@jmdavis In principle, I agree. Creating a range with expensive front or empty is generally unwise, and I think it's safe for Phobos docs to state that O(1) is assumed for these primitives, and that if this isn't actually true, then generic algorithms may not obey the big-O complexity specified in the docs. If the user wants to do it anyway, they just have to live with the consequences.

@JackStouffer
Copy link
Contributor Author

Addressed comments

)

Also, note that Phobos code assumes that the primitives `r.front` and
`r.empty` are $(BIGOH 1) complexity wise or "cheap" in terms of running
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

O(1) doesn't make much sense without stating what the considered parameters are (length of range, ...). "Complexity" can also refer to more than time complexity (space, …).

@JackStouffer
Copy link
Contributor Author

@klickverbot Done

I think this is ready to merge

@dnadlinger
Copy link
Contributor

Auto-merge toggled on

@dnadlinger dnadlinger merged commit f0b5f96 into dlang:master Jul 18, 2016
@dnadlinger
Copy link
Contributor

The language, particularly regarding the time complexity statements, could still be tightened down a bit, but this can be done separately.

@JackStouffer JackStouffer deleted the patch-19 branch March 10, 2017 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants