Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move to a factory pattern for array creation #22630

Open
davidanthoff opened this issue Jun 30, 2017 · 18 comments
Open

Move to a factory pattern for array creation #22630

davidanthoff opened this issue Jun 30, 2017 · 18 comments
Labels
arrays [a, r, r, a, y, s] breaking This change will break code design Design of APIs or of the language itself speculative Whether the change will be implemented is speculative

Comments

@davidanthoff
Copy link
Contributor

This might well be a terrible idea, but I can't figure out why, so here we go with an idea about a revamp of the array type hierarchy:

Rename AbstractArray --> Array, DenseArray --> AbstractDenseArray and Array --> DenseArray. The new type hierarchy would look like this:

Array
  |
  |- AbstractDenseArray
  |       |
  |       |- DenseArray
  |
  ...

The final piece is to keep the Array constructor methods around, but now they would be methods on the abstract type Array. The default methods would return a DenseArray, but one could add methods in packages that return a different array type.

There would be two benefits:

  1. It would in a not very subtle way push folks to use type signatures in method dispatch that are better than what lots of folks use today. Right now it seems quite common to define a method like foo(a::Array{Int,1}), but I'm pretty sure in almost all cases that should really have been foo(a::AbstractArray{<:Int,1}) (at least in julia 0.6+ with the new type system). With this new proposal foo(a::Array{Int,1}) won't work at all, and maybe that would be a good thing.
  2. There would be a clean way to hook into the array creation mechanism and return custom array types depending on the element type of the array. For example, a package author could add a method Array{T,N}(d::NTuple{N,Int}) where {T<:Foo,N} (actually, a few more would be needed) that would return a custom array type instead of DenseArray if one creates an array of Foos. In some sense this idea was inspired by Support isbits Union array elements and struct fields #22441, which would change how arrays of a certain element type are stored in memory. Having Array be a kind of factory method for custom array types would allow a similar design for other element types, i.e. any package could decide that one should get a custom array type if one creates an array of those elements via the Array method. I'm currently playing with that kind of design in WIP DataValueArray queryverse/DataValues.jl#12 for a custom type, so that would be one use case.

I haven't even started to think about how this could be phased in, and I can well imagine that the problems around that might be enough to make this a moot suggestion. I guess I'm not even sure enough whether this would be a good idea on its own (if we didn't have to worry about backwards compat), so I deferred thinking about a deprecation strategy for now :)

@yuyichao yuyichao added breaking This change will break code speculative Whether the change will be implemented is speculative labels Jun 30, 2017
@mbauman mbauman changed the title Array type hierarchy Rename AbstractArray to Array; use a more specific name for Array? Jun 30, 2017
@mbauman
Copy link
Member

mbauman commented Jun 30, 2017

With regards to benefit 1: foo(a::AbstractArray{Int,1}) currently works just fine to match all AbstractArray subtypes with Int elements. You only need the S{<:T} construction if T is abstract; S doesn't really matter.

Currently there aren't any constructors on AbstractArray, so benefit 2 could be possible right now. It's just more of a mouthful than you probably want. If you're interested in that, I'd propose it separately.

As a renaming, the major benefit is in changing how we talk about things. Sometimes that can be worth it. In this case, the manual makes careful distinction between vector, matrix, and array vs. Vector, Matrix, and Array within its prose. I'm not sure how confusing that is to newcomers to the language, but to me that'd be the biggest pro here.

@nalimilan
Copy link
Member

If we did that, we would have to rename String to UTF8String and Set to TheDefaultJuliaSet (or a better name, you get the idea).

@andyferris
Copy link
Member

I'm not really sure such a large change is worth any of the benefits that follow.

First, I'd say the AbstractT supertype with prototypical concrete type T is actually a pretty common and I dare say useful idiom. Teaching people to be relaxed about type signatures will be ongoing regardless - it takes a little while to appreciate Julia's genericism.

I would like to see more constructors, but I guess we can add methods to AbstractArray or collect for that?

If it were a clearer win-win, this might make sense to sneak in before 1.0, but I'd argue if we can't follow "if it ain't broke, don't fix it", then it might turn off potential users worried about ongoing language instability.

@ChrisRackauckas
Copy link
Member

First, I'd say the AbstractT supertype with prototypical concrete type T is actually a pretty common and I dare say useful idiom. Teaching people to be relaxed about type signatures will be ongoing regardless - it takes a little while to appreciate Julia's genericism.

I agree with both of these. I think people use ::Array{Int,1} because they think it matters for performance, not because they don't know what AbstractArray is. And naming abstract types with Abstract in the name is commonplace and very helpful IMO.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Jul 11, 2017

You may be interested in this ancient series of commits: 38d8b8b6678d5a1ccca5d8aa45c8af19a99df6d6^...eafcbd8. We used to use the names Vector, Matrix and Tensor for what are now called AbstractVector, AbstractMatrix and AbstractArray. Back then Vector was called DenseVector and Matrix was called DenseMatrix. Confusingly Array was just Array, not DenseArray – but that was not the biggest problem. The situation was confusing largely because it suggested the wrong defaults. Normal dense arrays are what you want most of the time, so they should have the simplest names that are the easiest to "reach for". If you're writing generic library code, it requires a bit more thought so it's ok for those names to be longer and less easy to reach. Giving the most abstract, generic array type the shortest, most obvious name is precisely the wrong thing to do.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Jul 11, 2017

The commit message of 26b8497 explained it fairly well, if somewhat simplistically:

Since this seems really confusing to various people, vector and matrix now refer to the dense array types, not the abstract tensor type that they previously referred to. This makes it easier to "reach" for a Vector or Matrix that are dense and harder to reach for an AbstractVector or AbstractMatrix, which is good because it is usually the former that you want, and not the latter.

It's kind of funny that I used the same "reach" metaphor back then to explain it.

@davidanthoff
Copy link
Contributor Author

@StefanKarpinski I completely agree with your point if you look at only half of the proposal I made, namely the renaming part (and I also have to say that my original write-up was not very clear...). That, in isolation, makes clearly no sense. But if you look at the whole proposal and apply the criterion "the simplest name should do the right thing in most instances", my proposal fares well (I think), if not better, than what we have right now. Of course there might be other arguments against it.

The core idea here is to move to a factory pattern for array creation, where the default method for creating an array would move from specifying the precise container (array) type, to a model where you say "give me an array that is really good for storing elements of type T". That factory method would be Array, i.e. Array would both be the abstract super type and the factory-pattern-like constructor for arrays. One would still typically create an array by writing say a = Array{T,2}(2,4), but the type returned here could differ based on the type of T. In most cases this would return an DenseArray{T,2}, but one could add constructors to Array that return something else for a given element type. For example, Array{Bool,1}(2) could return a BitArray, Array{DataValue{Int},1}(3) could return a DataValueArray{Int,1}, Array{Nullable{CategoricalValue}}(2) could return a NullableCategoricalArray etc. If one wants to circumvent the factory pattern, one could still easily specify the precise container type by just writing e.g. DenseArray{Bool,2}(2). So, for array instantiation, this proposal still sticks with the mantra of "Normal dense arrays are what you want most of the time, so they should have the simplest names that are the easiest to 'reach for'", but modifies it by recognizing that for some element types, the most common thing one wants is some custom array type.

The other use case for the type name is in situations where one "consumes" an array, i.e. in dispatch and in field declarations in types. In my mind, in both cases, the most common correct thing in those situations is to use the abstract super type for arrays: in general a function that takes an array should work with any array, not just dense arrays, similar if I declare a type that holds an array, things are much more general if it can hold any array type, not just dense arrays. Currently, one has to use the longer name AbstractArray in function argument declarations and composite field types to achieve the generality that we would like to see in most cases. Under the proposal in this issue, the shortest name (Array) would instead do "the right thing in most cases", i.e. if you declare function foo(a::Array) it would work with any array, not just dense arrays.

So I think just based on the criterion "the shortest name should do the right thing in most cases", my proposal is at least as good as what we have currently: array creation still would use Array, but in dispatch and field declarations Array would also do the most useful thing.

What are the benefits of all this? I see these right now:

  1. We already have kind of a factory pattern for array creation in some parts of base, but not in others. This proposal would make the factory pattern the default throughout. For example, broadcast picks the array type based on the element type (and returns BitArray if the element type is Bool). Functions like trues also pick an array type based on the element type. But other functions in base don't: map won't return a BitArray, fill(true, 2) won't return a BitArray etc. Under this proposal, the behavior would be simply this: any function that creates an array and takes an element type as an argument in some sense will return whatever the array factory for that element type returns.
  2. Packages could make special array types like DataValueArray integrate very smoothly into the whole array creation story. For example, in the case of DataValueArray, it would be very simple to make A = [3,5,2,NA,9] return a DataValueArray. I think in general that this would allow us to get rid of many of the awkwardnesses that made using something like NullableArray cumbersome (just to be clear, that alone would not cut it, but it would be an important piece in the puzzle).
  3. As pointed out above, the shortest name Array would be the sensible default in function argument declarations and field types in composite types.

I've changed the title of this issue to something more descriptive. My original title was really lame, but the new name picked by @mbauman really only highlights one part of the overall proposal.

@davidanthoff davidanthoff changed the title Rename AbstractArray to Array; use a more specific name for Array? Move to a factory pattern for array creation Jul 12, 2017
@mbauman
Copy link
Member

mbauman commented Jul 12, 2017

You still seem tied to a rename as part of this proposal. That's not necessary to achieve the goals you want. In fact, a rename won't even achieve those goals — it seems like you just want all code using Array{MyCustomType}(m,n) to start returning instances of MyCustomArray{2}. But that will not happen, even with a rename. Because such a rename would be so terribly breaking that we'd do it in stages. By the time we can finally start making Array return custom array types, absolutely no code will be using the name Array anymore because it would have been deprecated away beforehand. So folks will still have to change their code to opt into the new abstract factory method.

Like I said, there's definitely value in rethinking our abstract array constructors. The name itself is the most superficial part of this proposal; let's nail down the functionality. Currently similar has been the workhorse of the abstract "factory". But that hasn't been sufficient in many cases. We could add an AbstractArray constructor or I'm sure there are other clever ideas here. See #18161.

Predicating this on an intensive renaming that'd require two-four major release cycles to deprecate is really a non-starter, IMO.

@StefanKarpinski
Copy link
Member

I second what @mbauman said, and would add that the entire motivation for this seems to stem from wanting the ability to write [1, 2, null] and get a NullableArray instead of an Array. I get where that desire is coming from and have contemplated it myself from time to time. However, problem would be simply eliminated by the approach I outlined here since the type of [1, 2, null] would be Vector{Int?} and it would have a representation much like the current NullableArray. The same would apply to other constructors, e.g. if you write Dict("foo" => 1.5, "bar" => null, null => Inf) you would get a Dict{String?,Float64?}. That's why the approach is so appealing – it solves the nullability issue without resorting to factory shenanigans.

@vtjnash
Copy link
Member

vtjnash commented Jul 13, 2017

I think we might want something like [1, 2, null]? to give you a null-aware vcat, and use Dict?("foo" => 1.5, "bar" => null) for a null-aware Dict (null => Inf is likely malformed - it's not generally sensible for keys to be null). Otherwise, I'm not sure how you would describe a sensible algorithm for computing the intended eltype of the array in a predictable way. Perhaps promote_type would return some form of sentinel value?

@JeffBezanson
Copy link
Member

We can use whatever algorithm we want to determine the element type, though I concede that one could debate whether it is "sensible". promote_type could in fact return a union type, as could typejoin (which is already kind of an arbitrary join algorithm that just joins things however we feel like).

@andyferris
Copy link
Member

Yes, I think typejoin(Int, Null) should be Union{Int, Null}, and this would flow into promote_type and array construction and wherever else promotion is desired...

@vtjnash
Copy link
Member

vtjnash commented Jul 13, 2017

OK, that actually does work surprisingly well. I was expecting to run into lots of issues trying to define that.

julia> struct Null end
julia> Base.promote_rule(::Type{Null}, T::Type) = Union{Null, T}
julia> Base.promote_rule(::Type{Union{T, Null}}, S::Type) where {T} = Union{Null, Base.promote_type(T, S)}
julia> Base.convert(::Type{Union{T, Null}}, x) where {T} = convert(T, x)

julia> [1,  Null()] |> typeof
Array{Union{Int64, Null},1}

julia> [1, 3.0, 4.0, Null()] |> typeof
Array{Union{Float64, Null},1}

julia> [Null(), 1, 3.0, 4.0] |> typeof
Array{Union{Float64, Null},1}

Inference uses typejoin to eliminate Unions, so it would seem that its definition should be something like "doesn't return Unions". But of course, that says nothing about what vcat should do (although sometimes generating Union doesn't seem "sensible" to me).

@quinnj
Copy link
Member

quinnj commented Jul 13, 2017

FWIW, that's exactly the approach taken in Nulls.jl and it's been working out quite well.

@vtjnash
Copy link
Member

vtjnash commented Jul 13, 2017

Ah, good to know. Looks like you're missing the second rule I added (needed to give transitivity, such that promote_type(A, promote_type(B, C)) == promote_type(promote_type(A, B), C), such that vcat example 2 above works correctly).

@andyferris
Copy link
Member

Shouldn't the second example also become Array{Union{Float64, Null},1}?

@vtjnash
Copy link
Member

vtjnash commented Jul 13, 2017

Yes, it does. (I fixed the text above when I noticed my copy-paste error, but I assume you noticed the error in the notification email?)

@andyferris
Copy link
Member

No, I think the GitHub webpage became inconsistent. Odd - carry on.

@mbauman mbauman added the arrays [a, r, r, a, y, s] label Aug 30, 2017
@nsajko nsajko added the design Design of APIs or of the language itself label Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s] breaking This change will break code design Design of APIs or of the language itself speculative Whether the change will be implemented is speculative
Projects
None yet
Development

No branches or pull requests