Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

iterators that don't know they're done until they've tried #8149

Closed
StefanKarpinski opened this issue Aug 26, 2014 · 24 comments
Closed

iterators that don't know they're done until they've tried #8149

StefanKarpinski opened this issue Aug 26, 2014 · 24 comments
Labels
breaking This change will break code design Design of APIs or of the language itself missing data Base.missing and related functionality speculative Whether the change will be implemented is speculative

Comments

@StefanKarpinski
Copy link
Member

There's a fundamental awkwardness to using the start/next/done protocol to iterate things like streams or tasks that can't know if they're done or not until they've tried to get the next item. Currently, we work around this mismatch by advancing the iterator in done instead of in next. This kind of works, but it's pretty awkward and intuitive. It also makes composition of these kinds of iterators fail very badly in some cases (zipping multiple streams, etc.).

The only way around this is for next to have two exit routes: normal return of the next value and "abnormal" return when the stream has gone dry. Python handles this by with exceptions: when a generator is done, it throws one. This does not have sufficiently good performance for Julia, however. Another option would be to use a special sentinel return value to indicate that iteration has finished and there's no value to be had. To use this approach, we'd need to make sure the compiler is clever enough to optimize this type instability away and produce fast code.

Another option would be to use exceptions like Python does, but not always use exceptions. It's worth noting that when you're iterating over something like a stream or a task, it is often the case that you want to do some kind of cleanup – e.g. closing a file handle – if the iteration exits via some error.

@johnmyleswhite
Copy link
Member

I hesitate to say this, but this is a perfect use case for nullable types.

@eschnett
Copy link
Contributor

You could have an interface consisting of start() and next(), where each
returns a tuple of a bool (done?) and a value. The value is only value if
the iterator is not yet done.

This also allows iterators over empty ranges (start returns "done"), and
nicely combines obtaining the next value with the check for being done.
With a boolean, no exception is necessary.

-erik

On Tue, Aug 26, 2014 at 2:30 PM, Stefan Karpinski notifications@github.com
wrote:

There's a fundamental awkwardness to using the start/next/done protocol to
iterate things like streams or tasks that can't know if they're done or not
until they've tried to get the next item. Currently, we work around this
mismatch by advancing the iterator in done instead of in next. This kind of
works, but it's pretty awkward and intuitive. It also makes composition of
these kinds of iterators fail very badly in some cases (zipping multiple
streams, etc.).

The only way around this is for next to have two exit routes: normal
return of the next value and "abnormal" return when the stream has gone
dry. Python handles this by with exceptions: when a generator is done, it
throws one. This does not have sufficiently good performance for Julia,
however. Another option would be to use a special sentinel return value to
indicate that iteration has finished and there's no value to be had. To use
this approach, we'd need to make sure the compiler is clever enough to
optimize this type instability away and produce fast code.

Another option would be to use exceptions like Python does, but not always
use exceptions. It's worth noting that when you're iterating over something
like a stream or a task, it is often the case that you want to do some kind
of cleanup - e.g. closing a file handle - if the iteration exits via some
error.

Reply to this email directly or view it on GitHub
#8149.

Erik Schnetter schnetter@cct.lsu.edu
http://www.perimeterinstitute.ca/personal/eschnetter/

@JeffBezanson
Copy link
Member

see also #6125

@StefanKarpinski
Copy link
Member Author

Thanks for the reference, Jeff – I must be terrible at keyword search on GitHub. Lots of relevant issues referenced from there.

@eschnett, the problem is that you still need to return a value even if it isn't used – where do you get such a value?

@JeffBezanson
Copy link
Member

The reason an exception is not so crazy is that at least it only happens once per loop, and doesn't affect code that runs on every iteration. However for nested numerical loops it becomes crazy again.

I'm skeptical of whether there is any way to make passing an extra bit out of next efficient enough. Here's a crazy idea: have next return (value, state, done::Bool), but also keep the done function. Then types can pick which mechanism to use. If they want the done() function, they always return false from next. If they want to return a done flag from next, they have done() always return false. The boolean constants will inline nicely and eliminate unused branches. Hey, I said it was crazy.

@JeffBezanson
Copy link
Member

@johnmyleswhite 's suggestion of nullable types might be just the ticket though. For value types, the value field can be uninitialized, and for reference types the value field can be left as an undefined reference. This is actually a case where immutable types with uninitialized fields are useful.

I have found that LLVM is actually very good at handling structs that contain some value plus a Bool. When everything is inlined, the struct will be converted to scalars and there might be no overhead at all. Eliminating the done call might make up the difference in other cases.

@StefanKarpinski
Copy link
Member Author

The Nullable type is a good possibility. One question is whether it's the (value, state) pair that's nullable or just value; having just value be nullable is a bit simpler, but means you still have to produce a state even when the iterable is done (which hasn't really been a big issue). You'll also need to be able to nest nullables – e.g. Nullable{Nullable{T}} – or you wouldn't be able to return nullable values via iteration. But that's perfectly doable, so not really an issue.

@JeffBezanson
Copy link
Member

I think the Nullable version should definitely be tried to kick the tires for feel and performance.
My instinct is that the state should be separate, and just the value should be nullable. That allows composing with other code that might expect nullable types.

A big problem will be manually using next. Right now you can do v, s = next(...) and it's slightly annoying but not so bad. Needing to check the Nullable seems to make that much worse. Clever syntax needed :)

@StefanKarpinski
Copy link
Member Author

Well, I have to confess that if we take the nullable option (see what I did there), then I'd like to get rid of done, so the manually written out equivalent of a for loop would look like this:

state = start(itr)
while true
    value, state = next(itr, state)
    isnull(value) && break
    # loop body
end

@JeffBezanson
Copy link
Member

Yes, not thinking about done definitely helps make up for a bit more complexity in next.

@Keno
Copy link
Member

Keno commented Aug 26, 2014

The problem I see with that is that it might not be possible to construct a valid value if we're done.

@JeffBezanson
Copy link
Member

We can do whatever new(x) does for an object with 2 fields :)

@Keno
Copy link
Member

Keno commented Aug 26, 2014

Fair enough.

@StefanKarpinski
Copy link
Member Author

Solved problem: https://github.com/johnmyleswhite/NullableTypes.jl/blob/master/src/01_typedef.jl:

immutable Nullable{T}
    isnull::Bool
    value::T

    Nullable() = new(true)
    Nullable(value::T) = new(false, value)
end

@johnmyleswhite
Copy link
Member

Maybe it's time I make a pull request to bring NullableTypes.jl into Base?

@JeffBezanson
Copy link
Member

Yeah, let's do that. It's the sort of type that is far more useful if part of Base.

@johnmyleswhite
Copy link
Member

Ok, I'll do that once I get home from work.

@simonster
Copy link
Member

Won't using option types incur additional overhead for reference types, since immutables with references are allocated on the heap?

@JeffBezanson
Copy link
Member

Yes, but in that case we're already allocating a tuple if next is not inlined. And we designed this in the first place before anything could be inlined or any allocations could be removed, so we're not afraid of performance issues like that. More important to get it right first.

@StefanKarpinski StefanKarpinski changed the title iterators that don't know when they're done until they've tried iterators that don't know they're done until they've tried Aug 27, 2014
@StefanKarpinski
Copy link
Member Author

Nullable PR: #8152

@josevalim
Copy link

Question: if you don't have done(), how do you signal the iterator you are no longer interested in more items? For example, if the user stops iteration (with a break) or because an exception was raised from the user code? This would be useful if there is an interest in supporting resources in iterators or even lazy resources (for example, you only open the file when iteration starts).

The other aspect to consider with iterators is if you want to allow lazy functional composition. I am going to write the examples in Elixir (sorry!). In Elixir, we have two modules: Enum (for eager computations) and Stream (for lazy computations):

1..3
|> Enum.map(fn -> IO.inspect(x) end)
|> Enum.map(fn -> x * 2 end)
|> Enum.map(fn -> IO.inspect(x) end)

IO.inspect receives an element, prints it to the terminal, and returns the element itself. Enum always returns a list back, which means the code above will traverse the list 3 times and print:

1
2
3
2
4
6

Streams, on the other hand, allow us to lazily compose:

stream = 1..3
|> Stream.map(fn -> IO.inspect(x) end)
|> Stream.map(fn -> x * 2 end)
|> Stream.map(fn -> IO.inspect(x) end)

At this point, no computation was done. We need to eagerly convert the stream to a list or force it somehow. If we call Enum.to_list(stream), it will print:

1
2
2
4
3
6

Now the list was traversed just once. Laziness allows all kind of interesting composition, working with infinite collections, lazy resources and so on.

I am just bringing this up because choosing to support (or not support) those features affect considerably the design of iterators. The classic start/next iterator, described here, cannot efficiently support laziness. You would need to at least make next return a sum type of Next(element, state) | Skip(state) | Done in order to efficiently provide laziness (that's what the latest Haskell versions use, see the Stream Fusion paper and the follow up one) but that is still painful when working with resources. If you want something that works well with resources and also provides laziness, you need to go with something like Haskell Iteratees or Elixir Reducees (which I can provide more information if there is interest) at the cost of performance.

@josevalim
Copy link

Just one disclaimer: the solutions linked above aim to be purely functional. Another option is to make the iterator mutable in terms of its state (so it returns only value instead of value, state). Then you should be able to get laziness and resource handling out of it as long as you provide a done operation. That's what the Rx folks use - although it is not my area of expertise.

@aaronsheldon
Copy link

aaronsheldon commented Aug 11, 2016

The design pattern I use in this situation is to introduce a 1 step phase offset between the state and the data produced. This means that between loop start and the first item being delivered there are at most two tries made. But after that everything proceeds as normal. In the context of tasks the pattern would be

start(x::tasks)
    ...
    get the state from advancing the first task
    ....
    return the state
end
next(x::tasks, y::state)
    ....
    get the result from the state
    ...
    get the state from advancing the next task
    ...
    return the result and the state
end
done(x::tasks, y::state) = check for valid state and tasks

The idea being to make "done" as light as possible, and to focus it on strictly Boolean determinations. As a design pattern putting a lot of branching and assignment inside of a function that is called implicitly on a branch, the "while", is very risky.

@nalimilan nalimilan added the missing data Base.missing and related functionality label Sep 6, 2016
@StefanKarpinski StefanKarpinski added design Design of APIs or of the language itself and removed needs decision A decision on this change is needed labels Sep 13, 2016
@JeffBezanson
Copy link
Member

Subsumed by #18823.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking This change will break code design Design of APIs or of the language itself missing data Base.missing and related functionality speculative Whether the change will be implemented is speculative
Projects
None yet
Development

No branches or pull requests

9 participants