iterators that don't know they're done until they've tried #8149

StefanKarpinski · 2014-08-26T18:30:22Z

There's a fundamental awkwardness to using the start/next/done protocol to iterate things like streams or tasks that can't know if they're done or not until they've tried to get the next item. Currently, we work around this mismatch by advancing the iterator in done instead of in next. This kind of works, but it's pretty awkward and intuitive. It also makes composition of these kinds of iterators fail very badly in some cases (zipping multiple streams, etc.).

The only way around this is for next to have two exit routes: normal return of the next value and "abnormal" return when the stream has gone dry. Python handles this by with exceptions: when a generator is done, it throws one. This does not have sufficiently good performance for Julia, however. Another option would be to use a special sentinel return value to indicate that iteration has finished and there's no value to be had. To use this approach, we'd need to make sure the compiler is clever enough to optimize this type instability away and produce fast code.

Another option would be to use exceptions like Python does, but not always use exceptions. It's worth noting that when you're iterating over something like a stream or a task, it is often the case that you want to do some kind of cleanup – e.g. closing a file handle – if the iteration exits via some error.

The text was updated successfully, but these errors were encountered:

johnmyleswhite · 2014-08-26T18:32:00Z

I hesitate to say this, but this is a perfect use case for nullable types.

eschnett · 2014-08-26T18:49:30Z

You could have an interface consisting of start() and next(), where each
returns a tuple of a bool (done?) and a value. The value is only value if
the iterator is not yet done.

This also allows iterators over empty ranges (start returns "done"), and
nicely combines obtaining the next value with the check for being done.
With a boolean, no exception is necessary.

-erik

On Tue, Aug 26, 2014 at 2:30 PM, Stefan Karpinski notifications@github.com
wrote:

There's a fundamental awkwardness to using the start/next/done protocol to
iterate things like streams or tasks that can't know if they're done or not
until they've tried to get the next item. Currently, we work around this
mismatch by advancing the iterator in done instead of in next. This kind of
works, but it's pretty awkward and intuitive. It also makes composition of
these kinds of iterators fail very badly in some cases (zipping multiple
streams, etc.).

The only way around this is for next to have two exit routes: normal
return of the next value and "abnormal" return when the stream has gone
dry. Python handles this by with exceptions: when a generator is done, it
throws one. This does not have sufficiently good performance for Julia,
however. Another option would be to use a special sentinel return value to
indicate that iteration has finished and there's no value to be had. To use
this approach, we'd need to make sure the compiler is clever enough to
optimize this type instability away and produce fast code.

Another option would be to use exceptions like Python does, but not always
use exceptions. It's worth noting that when you're iterating over something
like a stream or a task, it is often the case that you want to do some kind
of cleanup - e.g. closing a file handle - if the iteration exits via some
error.

Reply to this email directly or view it on GitHub
#8149.

Erik Schnetter schnetter@cct.lsu.edu
http://www.perimeterinstitute.ca/personal/eschnetter/

JeffBezanson · 2014-08-26T18:52:59Z

see also #6125

StefanKarpinski · 2014-08-26T19:02:08Z

Thanks for the reference, Jeff – I must be terrible at keyword search on GitHub. Lots of relevant issues referenced from there.

@eschnett, the problem is that you still need to return a value even if it isn't used – where do you get such a value?

JeffBezanson · 2014-08-26T19:05:32Z

The reason an exception is not so crazy is that at least it only happens once per loop, and doesn't affect code that runs on every iteration. However for nested numerical loops it becomes crazy again.

I'm skeptical of whether there is any way to make passing an extra bit out of next efficient enough. Here's a crazy idea: have next return (value, state, done::Bool), but also keep the done function. Then types can pick which mechanism to use. If they want the done() function, they always return false from next. If they want to return a done flag from next, they have done() always return false. The boolean constants will inline nicely and eliminate unused branches. Hey, I said it was crazy.

JeffBezanson · 2014-08-26T19:16:03Z

@johnmyleswhite 's suggestion of nullable types might be just the ticket though. For value types, the value field can be uninitialized, and for reference types the value field can be left as an undefined reference. This is actually a case where immutable types with uninitialized fields are useful.

I have found that LLVM is actually very good at handling structs that contain some value plus a Bool. When everything is inlined, the struct will be converted to scalars and there might be no overhead at all. Eliminating the done call might make up the difference in other cases.

StefanKarpinski · 2014-08-26T19:23:15Z

The Nullable type is a good possibility. One question is whether it's the (value, state) pair that's nullable or just value; having just value be nullable is a bit simpler, but means you still have to produce a state even when the iterable is done (which hasn't really been a big issue). You'll also need to be able to nest nullables – e.g. Nullable{Nullable{T}} – or you wouldn't be able to return nullable values via iteration. But that's perfectly doable, so not really an issue.

JeffBezanson · 2014-08-26T19:39:03Z

I think the Nullable version should definitely be tried to kick the tires for feel and performance.
My instinct is that the state should be separate, and just the value should be nullable. That allows composing with other code that might expect nullable types.

A big problem will be manually using next. Right now you can do v, s = next(...) and it's slightly annoying but not so bad. Needing to check the Nullable seems to make that much worse. Clever syntax needed :)

StefanKarpinski · 2014-08-26T19:47:13Z

Well, I have to confess that if we take the nullable option (see what I did there), then I'd like to get rid of done, so the manually written out equivalent of a for loop would look like this:

state = start(itr)
while true
    value, state = next(itr, state)
    isnull(value) && break
    # loop body
end

JeffBezanson · 2014-08-26T19:49:06Z

Yes, not thinking about done definitely helps make up for a bit more complexity in next.

Keno · 2014-08-26T19:55:26Z

The problem I see with that is that it might not be possible to construct a valid value if we're done.

JeffBezanson · 2014-08-26T20:01:00Z

We can do whatever new(x) does for an object with 2 fields :)

Keno · 2014-08-26T20:04:17Z

Fair enough.

StefanKarpinski · 2014-08-26T20:05:59Z

Solved problem: https://github.com/johnmyleswhite/NullableTypes.jl/blob/master/src/01_typedef.jl:

immutable Nullable{T}
    isnull::Bool
    value::T

    Nullable() = new(true)
    Nullable(value::T) = new(false, value)
end

johnmyleswhite · 2014-08-26T20:11:50Z

Maybe it's time I make a pull request to bring NullableTypes.jl into Base?

JeffBezanson · 2014-08-26T20:22:33Z

Yeah, let's do that. It's the sort of type that is far more useful if part of Base.

johnmyleswhite · 2014-08-26T20:23:09Z

Ok, I'll do that once I get home from work.

simonster · 2014-08-27T00:58:57Z

Won't using option types incur additional overhead for reference types, since immutables with references are allocated on the heap?

JeffBezanson · 2014-08-27T02:31:18Z

Yes, but in that case we're already allocating a tuple if next is not inlined. And we designed this in the first place before anything could be inlined or any allocations could be removed, so we're not afraid of performance issues like that. More important to get it right first.

StefanKarpinski · 2014-08-27T19:34:28Z

Nullable PR: #8152

josevalim · 2014-08-28T14:28:26Z

Question: if you don't have done(), how do you signal the iterator you are no longer interested in more items? For example, if the user stops iteration (with a break) or because an exception was raised from the user code? This would be useful if there is an interest in supporting resources in iterators or even lazy resources (for example, you only open the file when iteration starts).

The other aspect to consider with iterators is if you want to allow lazy functional composition. I am going to write the examples in Elixir (sorry!). In Elixir, we have two modules: Enum (for eager computations) and Stream (for lazy computations):

1..3
|> Enum.map(fn -> IO.inspect(x) end)
|> Enum.map(fn -> x * 2 end)
|> Enum.map(fn -> IO.inspect(x) end)

IO.inspect receives an element, prints it to the terminal, and returns the element itself. Enum always returns a list back, which means the code above will traverse the list 3 times and print:

Streams, on the other hand, allow us to lazily compose:

stream = 1..3
|> Stream.map(fn -> IO.inspect(x) end)
|> Stream.map(fn -> x * 2 end)
|> Stream.map(fn -> IO.inspect(x) end)

At this point, no computation was done. We need to eagerly convert the stream to a list or force it somehow. If we call Enum.to_list(stream), it will print:

Now the list was traversed just once. Laziness allows all kind of interesting composition, working with infinite collections, lazy resources and so on.

I am just bringing this up because choosing to support (or not support) those features affect considerably the design of iterators. The classic start/next iterator, described here, cannot efficiently support laziness. You would need to at least make next return a sum type of Next(element, state) | Skip(state) | Done in order to efficiently provide laziness (that's what the latest Haskell versions use, see the Stream Fusion paper and the follow up one) but that is still painful when working with resources. If you want something that works well with resources and also provides laziness, you need to go with something like Haskell Iteratees or Elixir Reducees (which I can provide more information if there is interest) at the cost of performance.

josevalim · 2014-08-28T14:37:19Z

Just one disclaimer: the solutions linked above aim to be purely functional. Another option is to make the iterator mutable in terms of its state (so it returns only value instead of value, state). Then you should be able to get laziness and resource handling out of it as long as you provide a done operation. That's what the Rx folks use - although it is not my area of expertise.

aaronsheldon · 2016-08-11T16:05:31Z

The design pattern I use in this situation is to introduce a 1 step phase offset between the state and the data produced. This means that between loop start and the first item being delivered there are at most two tries made. But after that everything proceeds as normal. In the context of tasks the pattern would be

start(x::tasks)
    ...
    get the state from advancing the first task
    ....
    return the state
end
next(x::tasks, y::state)
    ....
    get the result from the state
    ...
    get the state from advancing the next task
    ...
    return the result and the state
end
done(x::tasks, y::state) = check for valid state and tasks

The idea being to make "done" as light as possible, and to focus it on strictly Boolean determinations. As a design pattern putting a lot of branching and assignment inside of a function that is called implicitly on a branch, the "while", is very risky.

JeffBezanson · 2017-05-02T16:41:45Z

Subsumed by #18823.

StefanKarpinski added speculative labels Aug 26, 2014

StefanKarpinski changed the title ~~iterators that don't know when they're done until they've tried~~ iterators that don't know they're done until they've tried Aug 27, 2014

nolta mentioned this issue Nov 30, 2014

WIP: split next into nextval & nextstate #9182

Closed

simonster mentioned this issue Dec 12, 2014

Code generation and cartesian iteration #9080

Closed

JeffBezanson mentioned this issue Feb 10, 2016

add Flatten iterator #14805

Merged

timholy mentioned this issue Apr 25, 2016

julep: a return-able break statement (improving iteration performance) #16035

Closed

mauro3 mentioned this issue Aug 11, 2016

Add an optional "at stop" method to the iterator interface #17954

Closed

nalimilan added the missing data Base.missing and related functionality label Sep 6, 2016

StefanKarpinski added design Design of APIs or of the language itself and removed needs decision A decision on this change is needed labels Sep 13, 2016

Keno mentioned this issue Oct 6, 2016

Julep: Revised iteration protocol #18823

Closed

JeffBezanson closed this as completed May 2, 2017

tkelman mentioned this issue Jun 22, 2017

Iterators that know their size after start has been called #22467

Open

tkf mentioned this issue Feb 7, 2018

Iterator slicing/indexing JuliaCollections/IterTools.jl#26

Open

garrison mentioned this issue Feb 21, 2018

Proposal: drop/ignore return value rather than yielding it JuliaDynamics/ResumableFunctions.jl#15

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

iterators that don't know they're done until they've tried #8149

iterators that don't know they're done until they've tried #8149

StefanKarpinski commented Aug 26, 2014

johnmyleswhite commented Aug 26, 2014

eschnett commented Aug 26, 2014

JeffBezanson commented Aug 26, 2014

StefanKarpinski commented Aug 26, 2014

JeffBezanson commented Aug 26, 2014

JeffBezanson commented Aug 26, 2014

StefanKarpinski commented Aug 26, 2014

JeffBezanson commented Aug 26, 2014

StefanKarpinski commented Aug 26, 2014

JeffBezanson commented Aug 26, 2014

Keno commented Aug 26, 2014

JeffBezanson commented Aug 26, 2014

Keno commented Aug 26, 2014

StefanKarpinski commented Aug 26, 2014

johnmyleswhite commented Aug 26, 2014

JeffBezanson commented Aug 26, 2014

johnmyleswhite commented Aug 26, 2014

simonster commented Aug 27, 2014

JeffBezanson commented Aug 27, 2014

StefanKarpinski commented Aug 27, 2014

josevalim commented Aug 28, 2014

josevalim commented Aug 28, 2014

aaronsheldon commented Aug 11, 2016 •

edited

Loading

JeffBezanson commented May 2, 2017

iterators that don't know they're done until they've tried #8149

iterators that don't know they're done until they've tried #8149

Comments

StefanKarpinski commented Aug 26, 2014

johnmyleswhite commented Aug 26, 2014

eschnett commented Aug 26, 2014

JeffBezanson commented Aug 26, 2014

StefanKarpinski commented Aug 26, 2014

JeffBezanson commented Aug 26, 2014

JeffBezanson commented Aug 26, 2014

StefanKarpinski commented Aug 26, 2014

JeffBezanson commented Aug 26, 2014

StefanKarpinski commented Aug 26, 2014

JeffBezanson commented Aug 26, 2014

Keno commented Aug 26, 2014

JeffBezanson commented Aug 26, 2014

Keno commented Aug 26, 2014

StefanKarpinski commented Aug 26, 2014

johnmyleswhite commented Aug 26, 2014

JeffBezanson commented Aug 26, 2014

johnmyleswhite commented Aug 26, 2014

simonster commented Aug 27, 2014

JeffBezanson commented Aug 27, 2014

StefanKarpinski commented Aug 27, 2014

josevalim commented Aug 28, 2014

josevalim commented Aug 28, 2014

aaronsheldon commented Aug 11, 2016 • edited Loading

JeffBezanson commented May 2, 2017

aaronsheldon commented Aug 11, 2016 •

edited

Loading