`map(DataFrame, groups)` does not return a collection of DataFrames #2126

oxinabox · 2020-02-19T13:37:42Z

This is odd.
If I have done a groupby
I would expect that map(DataFrame, groups) would return a collection of independent DataFrames.
Afterall that is what DataFrame does to a single singleSubDataFrame.
But on a group it doesn't.
It map leaves things as SubDataFrame

MVE:

julia> df = DataFrame(name = string.(repeat('a':'c', 3)), value=rand(9));

julia> groups = groupby(df, :name);

julia> grp1 = first(group
groupby      groupindices  groups        groupvars
julia> grp1 = first(groups)
3×2 SubDataFrame
│ Row │ name   │ value    │
│     │ String │ Float64  │
├─────┼────────┼──────────┤
│ 1   │ a      │ 0.966107 │
│ 2   │ a      │ 0.12078  │
│ 3   │ a      │ 0.466016 │

julia> DataFrame(grp1)
3×2 DataFrame
│ Row │ name   │ value    │
│     │ String │ Float64  │
├─────┼────────┼──────────┤
│ 1   │ a      │ 0.966107 │
│ 2   │ a      │ 0.12078  │
│ 3   │ a      │ 0.466016 │

julia> map(DataFrame, groups)[1]
3×2 SubDataFrame
│ Row │ name   │ value    │
│     │ String │ Float64  │
├─────┼────────┼──────────┤
│ 1   │ a      │ 0.966107 │
│ 2   │ a      │ 0.12078  │
│ 3   │ a      │ 0.466016 │

Workaround:

Wrap it in a Tuple: (collect also works).

julia> map(DataFrame, Tuple(groups))[1]
3×2 DataFrame
│ Row │ name   │ value    │
│     │ String │ Float64  │
├─────┼────────┼──────────┤
│ 1   │ a      │ 0.966107 │
│ 2   │ a      │ 0.12078  │
│ 3   │ a      │ 0.466016 │

This is on DataFrames v0.20.2

The text was updated successfully, but these errors were encountered:

nalimilan · 2020-02-19T14:18:12Z

What happens is that map on a GroupedDataFrame returns a GroupedDataFrame. So it calls DataFrame on each SubDataFrame, and then concatenates the result into a single DataFrame, which is wrapped in a GroupedDataFrame. Since indexing a GDF gives a SDF, you get the observed result.

This may be counter-intuitive, but map generally preserves the type of the input collection (e.g. with tuples). The idea is that you can switch to a comprehension if you want a vector result.

That said, maybe the current behavior isn't very useful. We could also reserve it like broadcast until we have a more systematic plan.

bkamins · 2020-02-19T15:03:52Z

map and broadcasting are in general a hard case.

For DataFrame we have worked out what broadcasting should do and I think what we have is nice (well - I am a bit skewed here 😄), but we do not define map for DataFrame exactly because we are not sure what it should produce.

For GroupedDataFrame we are also unsure what should broadcasting do so we preferred to leave it as "reserved".

As for map the current thinking is that map is like combine but it does not combine the result into a single DataFrame. In particular map preserves the grouping columns, so when you write:

julia> df = DataFrame(rand(3,4))
3×4 DataFrame
│ Row │ x1       │ x2       │ x3       │ x4       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.903949 │ 0.878733 │ 0.227235 │ 0.419861 │
│ 2   │ 0.11941  │ 0.9246   │ 0.18599  │ 0.483359 │
│ 3   │ 0.633959 │ 0.76821  │ 0.946462 │ 0.298769 │

julia> map(x -> (a="something", b="something else"), groupby(df, [:x1, :x2]))
GroupedDataFrame with 3 groups based on keys: x1, x2
First Group (1 row): x1 = 0.9039492232004707, x2 = 0.8787325687356251
│ Row │ x1       │ x2       │ a         │ b              │
│     │ Float64  │ Float64  │ String    │ String         │
├─────┼──────────┼──────────┼───────────┼────────────────┤
│ 1   │ 0.903949 │ 0.878733 │ something │ something else │
⋮
Last Group (1 row): x1 = 0.6339592041863114, x2 = 0.7682104273682182
│ Row │ x1       │ x2      │ a         │ b              │
│     │ Float64  │ Float64 │ String    │ String         │
├─────┼──────────┼─────────┼───────────┼────────────────┤
│ 1   │ 0.633959 │ 0.76821 │ something │ something else │

you see that the grouping columns are retained and only new columns a and b are added (not really useful example but I wanted to show what is going on).

So essentially now map serves as a method to transform GroupedDataFrame into another GroupedDataFrame retaining the grouping columns.

I think such a functionality is useful (although it is essentially the same as doing groupby on the result of combine). The question is if:

we want to retain map as a function that does it;
or we deprecate this meaning of map and in the future (after 1.0 release) have free hands to decide what we want map to do?

Comments are welcome!

bkamins · 2020-02-19T15:07:38Z

Also - related - we are unsure if GroupedDataFrame <: AbstractVector or GroupedDataFrame <: AbstractDataFrame or neither. I am leaning towards GroupedDataFrame <: AbstractVector but again it is a very committing decision so for now it is just a subtype of Any so that we have free hands to decide later.

Note that very soon (#2095) GroupedDataFrame will be a kind of indexed-data frame, which I think will be super useful (and open completely new ways how GroupedDataFrame can be used), and this also might impact the decision how we want to treat it.

bkamins · 2020-03-07T18:00:59Z

@nalimilan - given we want to go for 1.0 release and do not break too much I would close this and leave the functionality of map for GroupedDataFrame as it is now.

bkamins added breaking The proposed change is breaking. decision labels Feb 19, 2020

bkamins added this to the 1.0 milestone Feb 19, 2020

nalimilan closed this as completed Mar 12, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`map(DataFrame, groups)` does not return a collection of DataFrames #2126

`map(DataFrame, groups)` does not return a collection of DataFrames #2126

oxinabox commented Feb 19, 2020

nalimilan commented Feb 19, 2020

bkamins commented Feb 19, 2020

bkamins commented Feb 19, 2020

bkamins commented Mar 7, 2020

map(DataFrame, groups) does not return a collection of DataFrames #2126

map(DataFrame, groups) does not return a collection of DataFrames #2126

Comments

oxinabox commented Feb 19, 2020

MVE:

Workaround:

nalimilan commented Feb 19, 2020

bkamins commented Feb 19, 2020

bkamins commented Feb 19, 2020

bkamins commented Mar 7, 2020

`map(DataFrame, groups)` does not return a collection of DataFrames #2126

`map(DataFrame, groups)` does not return a collection of DataFrames #2126