NamedTuple covariance #32825

bramtayl · 2019-08-08T00:29:15Z

I thought there was already an issue open for this but I couldn't find it. I don't know if people saw my JuliaCon talk but I implemented covariant namedtuples for LightQuery. If that would be useful, maybe we could get it in Base, and if its not useful, I'd like to understand why, and I can switch LightQuery over to using DataValues and invariant NamedTuples.

JeffBezanson · 2019-08-08T01:55:14Z

See also #24614 😂

I think the main reason invariant named tuples are useful is that a named tuple with Union{Int,Missing} as a field type can be a concrete type with a known layout.

The main issue I see with covariance is that a tuple of Int or Missing must have either the type Tuple{Int} or Tuple{Missing}. If you have N elements there are 2^N possible types. Even if there is an enclosing container with element type Tuple{Union{Int,Missing},...} the types of the elements are still all taken from the set of 2^N. With invariance, every element can have the same type InvariantTuple{Union{Int,Missing}, ...}. As far as I recall that was our reasoning.

quinnj · 2019-08-08T04:09:07Z

In addition to Jeff's great points, I've also been diving a lot into the remaining inconveniences of working w/ Union{T, Missing} and I think we have a pretty good handle on things we can do to make them better:

With #32448, NamedTuples (and all structs) w/ isbits Union fields will be stored inline in Arrays.

With #32699, we actually get into a scenario where you can do a lazy map (a la Query.EnumerableMap, or like this simple Tables.jl version that doesn't rely on inference at all), and the compiler is able to inline/infer everything all the way down, which == great performance and no extra allocations. Now, that does depend on who's collecting the lazy map, and this ideal scenario works for most cases: columnar outputs like DataFrame/IndexedTable/TypedTable, file-based outputs like CSV/Feather, and databases, like ODBC/MySQL/SQLite. The one case that is inconvenient is collecting into an Array of NamedTuples, since the resulting type would be Vector{NamedTuple{names, T} where T <: Tuple{Int, Union{Float64, Missing}}, which wouldn't have an inlined/concrete layout. Now, you can still get the nice case by manually specifying the output type of your NamedTuple mapping function (like return NamedTuple{names, Tuple{Int, Union{Float64, Missing}}}((1, missing)), and I think it'd be worth exploring some syntax ideas to make that simpler/nicer.

And honestly, by having a lazy map that can be inferred when collecting, we do away with the main reason I'm aware of for the DataValue type (since, on a technical basis, the same inference improvement would apply to inferring join key mapping functions, etc.). (the other differences I'm aware of with Missing vs. DataValue are orthogonal to inference issues, like 3-valued logic, etc.). That to me seems like a huge win!

At this point, I think there's strong reasoning and momentum to not further complicate the data ecosystem by hanging on to DataValues. There are so many great developments going on between machine learning frameworks, core statistics, applied math and optimization techniques, and they've all adopted and moved forward with the Missing approach. It would just be unfortunate if a subset of packages kept themselves isolated, needing to always wrap/unwrap themselves to go in/out w/ the rest of the ecosystem instead of "just working" out of the box.

I think sometimes we also get really caught up on temporary inconveniences and miss the forest for the trees; Jeff and I sat for maybe 20 minutes at the JuliaCon hackathon and were able to hammer out a bunch of concrete issues to improve various scenarios. I think it'd be much more productive for everyone to highlight the inconveniences and then come together to figure out solutions; having dug so deep into all of this over the last month or two, I really think it was the right choice to go w/ the Missing representation, and I don't think it fundamentally limits things in any way. Do we need to keep evolving APIs, improving inference, and make things easier/simpler? Definitely. But I, for one, think it would be a huge win to move everything off of DataValue, avoid the need to maintain all these extra data structures, and collectively collaborate on a more cohesive data ecosystem, even if that means dealing with some short-term inconveniences.

bramtayl · 2019-08-08T10:08:46Z

If you have N elements there are 2^N possible types

I came to the opposite conclusion by the same reasoning, e.g.

using LightQuery: @name
using InteractiveUtils: @code_warntype

unstable() = rand(Bool) ? 1 : 1.0

test1() = (a = unstable(), b = unstable(), c = unstable())
test2() = @name (a = unstable(), b = unstable(), c = unstable())

@code_warntype test1()
@code_warntype test2()

bramtayl · 2019-08-08T10:11:36Z

Although maybe #32699 would solve that?

bramtayl · 2019-08-08T12:05:46Z

Well anyways once #32699 comes out I'll try switching LightQuery over to regular NamedTuples and see what happens to performance

JeffBezanson · 2019-08-08T15:09:19Z

If you have N elements there are 2^N possible types

I came to the opposite conclusion by the same reasoning, e.g.

The "2^N types" is not an inference issue; it's more fundamental. We have no way to efficiently store such a structure. We'd like to keep a bit for each element saying whether it's Int or Missing, but we can't because neither of the types Tuple{Int} or Tuple{Missing} provides for such a bit, and Tuple{Union{Int,Missing}} is abstract so it doesn't have a layout at all.

We could fix that by having the outer structure (an array of tuples) store all the tuples inline, with N bits next to each specifying which tuple type it has. But we don't have that implemented, and it doesn't really seem worth it --- we would still only infer an abstract type when you index the array, and code_warntype would still be red.

bramtayl · 2019-08-08T15:13:22Z

We have no way to efficiently store such a structure

That's not a problem for me; I store data column-wise in LightQuery. What is important is to get efficient iterators

quinnj · 2019-08-08T15:26:04Z

We have no way to efficiently store such a structure

That's not a problem for me; I store data column-wise in LightQuery. What is important is to get efficient iterators

I really think #32699 is the answer here; it should give inference just enough information to realize it can avoid materializing the temporary NamedTuple all together and efficiently store the computed values into output columns directly.

vtjnash · 2019-08-08T15:34:29Z

We have no way to efficiently store such a structure

This might not have been the best choice of word. As Jeff said later, we know how to efficiently store this structure without too much difficulty, the problem is that we can't compute the value of the NamedTuple.

JeffBezanson · 2019-08-08T15:38:25Z

That's not a problem for me; I store data column-wise in LightQuery. What is important is to get efficient iterators

Right; the problem can be avoided by never materializing the tuples. For example you can return a lazy Row{Tuple{Union{Int,Missing}}} type that just contains the index of a row --- but that type is invariant! So I still don't see what work the covariant named tuples are doing here; maybe spell it out more?

bramtayl · 2019-08-08T15:46:23Z

The covariant named tuples came out of me experimenting with a way to write named tuples before NamedTuples came about, and realizing that my version of named tuples didn't suffer from #32699 . I thought it was because of covariance but I guess its just a missing optimization?

bramtayl · 2019-08-08T22:56:20Z

Reopen if no I guess

o314 · 2019-10-31T03:13:45Z

Missing case apart,
The structdiff @ https://github.com/JuliaLang/julia/blob/v1.3.0-rc4/base/namedtuple.jl#L303 catch an instance or a type by following the generated function way. I wonder if there is a workaround playing with this to bring nametuple covariance.

My own case

@test NamedTuple{(:a,),Tuple{Int64}} <: NamedTuple{NS,T} where {NS,T<:Tuple}
@test_broken Type{NamedTuple{(:a,),Tuple{Int64}}} <: Type{NamedTuple{NS,T} where {NS,T<:Tuple}}

Edited: may be rather due to a Type covariance issue than a namedtuple one

bramtayl closed this as completed Aug 8, 2019

yuyichao mentioned this issue Aug 29, 2020

NamedTuple should inherit the covariance of Tuple #37271

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NamedTuple covariance #32825

NamedTuple covariance #32825

bramtayl commented Aug 8, 2019

JeffBezanson commented Aug 8, 2019

quinnj commented Aug 8, 2019

bramtayl commented Aug 8, 2019 •

edited

Loading

bramtayl commented Aug 8, 2019 •

edited

Loading

bramtayl commented Aug 8, 2019

JeffBezanson commented Aug 8, 2019

bramtayl commented Aug 8, 2019

quinnj commented Aug 8, 2019

vtjnash commented Aug 8, 2019

JeffBezanson commented Aug 8, 2019

bramtayl commented Aug 8, 2019

bramtayl commented Aug 8, 2019

o314 commented Oct 31, 2019 •

edited

Loading

NamedTuple covariance #32825

NamedTuple covariance #32825

Comments

bramtayl commented Aug 8, 2019

JeffBezanson commented Aug 8, 2019

quinnj commented Aug 8, 2019

bramtayl commented Aug 8, 2019 • edited Loading

bramtayl commented Aug 8, 2019 • edited Loading

bramtayl commented Aug 8, 2019

JeffBezanson commented Aug 8, 2019

bramtayl commented Aug 8, 2019

quinnj commented Aug 8, 2019

vtjnash commented Aug 8, 2019

JeffBezanson commented Aug 8, 2019

bramtayl commented Aug 8, 2019

bramtayl commented Aug 8, 2019

o314 commented Oct 31, 2019 • edited Loading

bramtayl commented Aug 8, 2019 •

edited

Loading

bramtayl commented Aug 8, 2019 •

edited

Loading

o314 commented Oct 31, 2019 •

edited

Loading