Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

the rangepocalypse #5585

Closed
StefanKarpinski opened this issue Jan 27, 2014 · 51 comments
Closed

the rangepocalypse #5585

StefanKarpinski opened this issue Jan 27, 2014 · 51 comments
Milestone

Comments

@StefanKarpinski
Copy link
Member

I know @JeffBezanson hates umbrella issues, but there are so many range-related problems cropping up that I wanted to have a single place to consolidate the various issues. Here are some of the relevant issues:

Less immediate, but related:

We need a more coherent story for ranges. I suspect that ranges of ordinal types, including integers and chars need to be handled rather differently from floating-point ranges and such.

@jiahao
Copy link
Member

jiahao commented Jan 27, 2014

what a wide range of problems.

@StefanKarpinski
Copy link
Member Author

wah wah wah

@JeffBezanson
Copy link
Member

We have a totally coherent story: ranges are a more compact representation of a certain class of dense vectors.

The only real problem is that the size of a dense vector is obviously limited by Int, which on a 64-bit system is totally reasonable. But with the range representation a much bigger "vector" becomes possible.

The tricky part is factoring and naming the types. We can probably keep Ranges. Range1 was never very useful for floating point, so it can probably become an alias for OrdinalRange1. I'd guess the best fate for Range is to make it an alias for Ranges, and after #1470 have constructors for it that make an appropriate concrete range. Or it could just become a function. If we switch floats to start-stop-len representation though, there will not be any real use for Range in any backwards-compatible sense.

@StefanKarpinski
Copy link
Member Author

What they represent has never been an issue. How they are represented and implemented, however, is clearly a problem or there wouldn't be ten billion issues related to ranges. I think the way this needs to work is that we separate out "ordinal ranges" from "linspace ranges".

The "ordinal ranges" include integer, bigint, character, etc. ranges, and store the start, stop and step for those. Since using a step of 1 is so common, that can be factored out as its own case. The types should probably look like this:

abstract Ranges{T}
abstract OrdinalRange{T,S<:Integer} <: Ranges{T}

immutable UnitRange{T} <: OrdinalRange{T,Int}
    start::T
    stop::T
end

immutable StepRange{T,S<:Integer} <: OrdinalRange{T,S}
    start::T
    stop::T
    step::S
end

When you write 1:10 it would produce a UnitRange{Int} object, while 1:2:10 would be a StepRange{Int,Int} object. The type parameter on the .step field allows large steps to be used for things like -big(2)^1000:big(2)^100:big(2)^1000.

The "linspace ranges" need a different internal representation: start, stop and length. Each value is computed by taking a linear combination of the start and the stop. For the greatest accuracy, we might even want to allow downscaling: i.e. start, stop, length, and divisor, and ref is computed something like this:

function ref(r::LinearRange, k::Int)
    k -= 1; (0 <= k <= r.length) || error(BoundsError)
    ((r.length-k)*r.start + k*r.stop)/r.divisor
end

The type should probably look like this:

immutable LinearRange{T,S<:Integer} <: Ranges{T}
    start::T
    stop::T
    length::S
    divisor::T
end

I'm not sure about the divisor part since that's only necessary to improve a few very nasty corner cases. It would be nice to reclaim Range as the name of the abstract type instead of Ranges, which has always been a bit awkward.

@quinnj
Copy link
Member

quinnj commented Jan 28, 2014

Can I vote against setting the S<:Integer parameter as an Integer subtype? That would disallow date ranges from easily plugging in to this since the step is a Period type. Just leaving it S doesn't hurt, since these are mostly internal, right?

@StefanKarpinski
Copy link
Member Author

Yes, that's entirely fair and making sure this works well with dates and times a crucial design consideration. What kinds of constructions do you need / want for dates and times? Can you list them with examples here?

@lindahua
Copy link
Contributor

I generally agree with @StefanKarpinski's proposal.

Just have to emphasize that a:b and a:s:b always need to be different types. view(a:b, :) and view(a:s:b, :) are very different things, for the former you may directly apply a BLAS function without making a copy.

@StefanKarpinski
Copy link
Member Author

The type of the .length field in the LinearRange type is a bit problematic, actually. Should it be of the same type as .start and .stop? That may make the most sense since otherwise if it's say Int when the start/stop are Float64, there are lots of perfectly reasonable ranges that we can't represent because Int doesn't have enough range. I know you can't materialize ranges that are that big, but we should still be able to use them. If .length is the same types as .start and .stop then it will allow all integral lengths up to 2^53 to be represented exactly, and then only even-length ranges up to 2^54, and so on. We could allow it to be a type parameter still, however, so that if you want to directly construct one, you can explicitly choose the type of the length.

@mschauer
Copy link
Contributor

Will the set of UnitRange start and stop be restricted to those who have a (Int-) length?

@JeffBezanson
Copy link
Member

My current idea there would be to allow any start and stop, but compute the length with overflow checking.

@StefanKarpinski
Copy link
Member Author

@mschauer, no, that's the whole point – the length representation is lousy for ordinal range types. This way you don't need to pick a length type since it's not needed for iterating the object. Computing the length and indexing into it need some thought. For length, I think that we could just do this:

length(r::OrdinalRange) = r.stop - r.start + 1

@StefanKarpinski
Copy link
Member Author

Computing the length with overflow checking is a good idea.

@JeffBezanson
Copy link
Member

The divisor representation has the problem that the intermediate result k*r.stop can overflow.

@kmsquire
Copy link
Member

view(a:b, :) and view(a:s:b, :) are very different things, for the former you may directly apply a BLAS function without making a copy.

It's been a while since I've attempted interacting directly with BLAS, but I was under the impression that for some functions, at least, one could provide a stride for both x and y direction of matrices, which in theory could allow direct operations on (some) views.

@lindahua
Copy link
Contributor

BLAS level 2 & 3 functions only work with matrices with (contiguous rank 1), i.e. each column needs to be contiguous. But this is not related to the discussion here though.

@quinnj
Copy link
Member

quinnj commented Jan 28, 2014

I don't think there's much that's particularly fancy about date ranges. Something along the following

# Creates Range{Date} with default step of Day(1)
daterange = Date(2013,1,1):Date(2014,1,1)
# Date range with step of Week(1) from 2013-01-01 to 2014-01-01
daterange = Date(2013,1,1):Week(1):Date(2014,1,1)
# Datetime works similarly, but can also have TimePeriod steps
# Datetime range for every second of the day 2014-01-28
datetimerange = Datetime(2014,1,28):Second(1):Datetime(2014,1,29)

I think it's pretty simple and straightforward and provides the expected functionality. I originally thought of trying to incorporate the recur function into ranges, but as Jeff and I discussed, you're not really looking for a range type in that case, but the actual resulting array of dates. But for creating regular-stepped date ranges, this is simple and convenient.

@milktrader may have more ideas on anything else that we may want consider with date ranges specifically.

@StefanKarpinski
Copy link
Member Author

I don't think they're crazy exotic or anything, but they are just exotic enough that they make a great example that we need to make sure we can support. It would be nice not to have to create a new, separate set of range types every time someone wants to make something rangelike with their own custom types.

@EyeOfPython
Copy link

Are we already supporting multidimensional Ranges ? Range in cartesian space could be useful, representin e.g. AABBs. At least it should be easy to adapt behaving like Range.

@simonster
Copy link
Member

@EyeOfPython There is an NDRange example included with Julia, although it's broken and could be improved considerably with the new language features introduced since it was last updated 2 years ago.

@lindahua
Copy link
Contributor

NDRange could be a great idea, but best discussed in a separate thread. The pressing problem here is to resolve the issues related to current 1D range systems as listed above.

@JeffBezanson
Copy link
Member

I started prototyping the candidate new Range1 representation. The trickiest thing is that if we want to allow ranges like typemin(Int):typemax(Int), iterating becomes difficult since there is no invalid value available to mean "end". Therefore we need more state. I tried the following:

start{T<:Integer}(r::Range1{T}) = (false, r.start)
next{T<:Integer}(r::Range1{T}, i) = (tupleref(i,2), (true, oftype(T, tupleref(i,2)+1)))
done{T<:Integer}(r::Range1{T}, i) = (r.stop<r.start) | (tupleref(i,1) & (tupleref(i,2)==oftype(T, r.stop+1)))

That is kind of crazy, but amazingly enough LLVM is able to optimize everything away and we get loops that are sometimes even faster than before.

@JeffBezanson
Copy link
Member

This seems to be CPU-dependent. On some machines that implementation is indeed significantly slower.

@StefanKarpinski
Copy link
Member Author

Man, that's a real bummer. I considered starting the state at the value before the first one:

start{T<:Integer}(r::Range1{T}) = oftype(T,r.start-1)
next{T<:Integer}(r::Range1{T}, i) = (oftype(T,i+1), oftype(T,i+1))
done{T<:Integer}(r::Range1{T}, i) = i == r.stop

Of course, that considers typemin(Int):typemax(Int) to be empty, so that's no good.

@simonster
Copy link
Member

This is a problem for 32-bit ranges, but on a 64-bit system I think you'd be dead before you could finish iterating over typemin(Int):typemax(Int).

@lindahua
Copy link
Contributor

What about we simply reject (i.e. throw an error with clear message) upon the construction of a:b when b is typmax(Int)?

Why should we need to support a use case which is almost never used in real practice when supporting it requires disproportionate amount of efforts?

@milktrader
Copy link
Contributor

I would second @karbarcca that Datetime ranges are sufficient as currently implemented. I haven't run into any issues yet.

julia> dates[1:2]
2-element Array{Date{ISOCalendar},1}:
 1980-01-03
 1980-01-04

julia> length(dates)
505

julia> tuesdays = x->dayofweek(x)==Tuesday
(anonymous function)

julia> tue = dates[1]:tuesdays:dates[505]
1980-01-08:(anonymous function):1981-12-29

julia> typeof(tue)
DateRange1{ISOCalendar} (constructor with 1 method)

julia> tue.step
(anonymous function)

julia> length(tue)
104

julia> everythirdweek = [tue][1]:weeks(3):[tue][104]
1980-01-08:3 weeks:1981-12-22

julia> length(everythirdweek)
35

julia> [everythirdweek][1:2]
2-element Array{Date{ISOCalendar},1}:
 1980-01-08
 1980-01-29

added note: the dates array above is just something I had lying around. To make something similar you would do something like this:

julia> dates = [date(1980,1,1):days(1):date(1981,12,31)];

@StefanKarpinski
Copy link
Member Author

That's not actually what I interpreted @karbarcca as saying. The date/time range functionality is there but it's a shares almost no code with Base's range implementations. You shouldn't need to rewrite all the tricky functionality and logic for ranges because you have a new type of thing that you want to make ranges of. That's a complete failure of generic programming and composability.

@milktrader
Copy link
Contributor

Aha, I figured I was missing some important point.

@StefanKarpinski
Copy link
Member Author

See #5636 – a mind-reading float-range implementation.

@JeffBezanson
Copy link
Member

Take a look at my latest attempt. Not done yet, but the basic type definitions can be examined and debated.
Boy implementing ranges is a pain in the ass!

@StefanKarpinski
Copy link
Member Author

Yeah, I've been following that – looks pretty good. How do you feel about merging soon?

@JeffBezanson
Copy link
Member

I'd like to as soon as we can get the rest of ranges.jl implemented.

Division is a minor wrinkle --- datetimes and periods support /, but maybe ranges should be defined in terms of div.

@JeffBezanson
Copy link
Member

A big question is what to do with Range. Probably best to just reclaim it as the abstract type name since that's what we really want anyway.

Also maybe StepRange should be called StridedRange, to be more evocative to people used to arrays.

@StefanKarpinski
Copy link
Member Author

I rather like StepRange, but I guess StridedRange would be ok. The idea of StridedArray is that it stored in memory with fixed strides, which doesn't really make sense for a range, which isn't stored. I agree that we should just reclaim the Range name.

@JeffBezanson JeffBezanson added this to the 0.3 milestone Mar 20, 2014
@JeffBezanson
Copy link
Member

I looked at a few other libraries and languages, and people tend to either accept a length argument, or accept a value one beyond the last actual value. This cleverly sidesteps the typemin(Int):typemax(Int) problem by making it impossible to express. I'm not sure there is any other solution. Throwing an error requires, at least, an extra branch at the start of every loop.

@StefanKarpinski StefanKarpinski modified the milestones: 0.3-rc1, 0.3 Mar 28, 2014
@JeffBezanson
Copy link
Member

An interesting option would be to use the lifted (Bool,Int) state for StepRange since that is less performance-critical.

@simonster
Copy link
Member

So, I tried this:

start(r::Range1{Int,Int}) = uint(0)
next(r::Range1{Int,Int}, i) = (int(r.start+i), box(Uint64,add_int_nuw(unbox(Uint64,i),unbox(Uint64,1))))
done(r::Range1{Int,Int}, i) = (r.sentinel-1 < r.start) | (i > (uint(r.sentinel)-1)-r.start)

where add_int_nuw is a new no unsigned wrap add intrinsic. i eventually wraps and gives an infinite loop for typemin(Int):typemax(Int), but if Int == Int64 that takes several centuries, so I think we can safely ignore that. (Not sure what to do for Int == Int32, where this behavior is observable and potentially problematic.) This seems to work with the loop vectorizer and generate reasonable looking IR, and the performance difference from master is within the margin of error on my system, although I can't promise this is true elsewhere. I'm sure I've missed an edge case, though...

@simonster
Copy link
Member

Unfortunately this only seems to partly work with the loop vectorizer. The first function I define in a session gets vectorized, but not subsequent functions. There is no problem with the old iterator.

@JeffBezanson
Copy link
Member

This is quite interesting.

That vectorizer behavior sounds like a bug; I don't see why it should be stateful like that.

@JeffBezanson
Copy link
Member

Another idea I want to discuss: in some cases it is much more convenient to construct a range by length. We could add a function range(start, [step], length) for doing that. Then range and colon would be the two ways to make ranges, nice because neither depends on the implementation (e.g. we can change what types they return in the future).

@StefanKarpinski
Copy link
Member Author

I like that idea a lot. Those are the two standard ways to do this. One wrinkle is that while range(0.3, 1.1, 9) is very similar to linspace(0.3, 1.1, 9), since the former constructs a FloatRange object, it gives this:

Base.showcompact(io::IO, x::Float64) = show(io,x)

julia> [0.3:0.1:1.1]
9-element Array{Float64,1}:
 0.3
 0.4
 0.5
 0.6
 0.7
 0.8
 0.9
 1.0
 1.1

while linspace currently doesn't:

julia> linspace(0.3,1.1,9)
9-element Array{Float64,1}:
 0.3
 0.4
 0.5
 0.6000000000000001
 0.7000000000000001
 0.8
 0.9
 1.0000000000000002
 1.1

This is probably an argument for linspace doing the same endpoint lifting as FloatRange.

@simonster
Copy link
Member

I realized I was benchmarking with a function where I wasn't actually using the iteration variable. If I use the iteration variable, then I get an extra add in the loop (two extra adds for summing the yielded values) and this is much slower, so it's not really viable.

@StefanKarpinski
Copy link
Member Author

Ah, that makes sense. Tricksy compilers.

JeffBezanson added a commit that referenced this issue Mar 31, 2014
…t,len

this Range1 could be the UnitRange of #5585, with Range1 deprecated

also intended to address #5469 (performance)
JeffBezanson added a commit that referenced this issue Apr 1, 2014
…t,len

this Range1 could be the UnitRange of #5585, with Range1 deprecated

also intended to address #5469 (performance)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants