Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

map(DataFrame, groups) does not return a collection of DataFrames #2126

Closed
oxinabox opened this issue Feb 19, 2020 · 4 comments
Closed

map(DataFrame, groups) does not return a collection of DataFrames #2126

oxinabox opened this issue Feb 19, 2020 · 4 comments
Labels
breaking The proposed change is breaking. decision
Milestone

Comments

@oxinabox
Copy link
Contributor

This is odd.
If I have done a groupby
I would expect that map(DataFrame, groups) would return a collection of independent DataFrames.
Afterall that is what DataFrame does to a single singleSubDataFrame.
But on a group it doesn't.
It map leaves things as SubDataFrame

MVE:

julia> df = DataFrame(name = string.(repeat('a':'c', 3)), value=rand(9));

julia> groups = groupby(df, :name);

julia> grp1 = first(group
groupby      groupindices  groups        groupvars
julia> grp1 = first(groups)
3×2 SubDataFrame
│ Row │ name   │ value    │
│     │ String │ Float64  │
├─────┼────────┼──────────┤
│ 1   │ a      │ 0.966107 │
│ 2   │ a      │ 0.12078  │
│ 3   │ a      │ 0.466016 │

julia> DataFrame(grp1)
3×2 DataFrame
│ Row │ name   │ value    │
│     │ String │ Float64  │
├─────┼────────┼──────────┤
│ 1   │ a      │ 0.966107 │
│ 2   │ a      │ 0.12078  │
│ 3   │ a      │ 0.466016 │

julia> map(DataFrame, groups)[1]
3×2 SubDataFrame
│ Row │ name   │ value    │
│     │ String │ Float64  │
├─────┼────────┼──────────┤
│ 1   │ a      │ 0.966107 │
│ 2   │ a      │ 0.12078  │
│ 3   │ a      │ 0.466016 │

Workaround:

Wrap it in a Tuple: (collect also works).

julia> map(DataFrame, Tuple(groups))[1]
3×2 DataFrame
│ Row │ name   │ value    │
│     │ String │ Float64  │
├─────┼────────┼──────────┤
│ 1   │ a      │ 0.966107 │
│ 2   │ a      │ 0.12078  │
│ 3   │ a      │ 0.466016 │

This is on DataFrames v0.20.2

@nalimilan
Copy link
Member

What happens is that map on a GroupedDataFrame returns a GroupedDataFrame. So it calls DataFrame on each SubDataFrame, and then concatenates the result into a single DataFrame, which is wrapped in a GroupedDataFrame. Since indexing a GDF gives a SDF, you get the observed result.

This may be counter-intuitive, but map generally preserves the type of the input collection (e.g. with tuples). The idea is that you can switch to a comprehension if you want a vector result.

That said, maybe the current behavior isn't very useful. We could also reserve it like broadcast until we have a more systematic plan.

@bkamins bkamins added breaking The proposed change is breaking. decision labels Feb 19, 2020
@bkamins bkamins added this to the 1.0 milestone Feb 19, 2020
@bkamins
Copy link
Member

bkamins commented Feb 19, 2020

map and broadcasting are in general a hard case.

For DataFrame we have worked out what broadcasting should do and I think what we have is nice (well - I am a bit skewed here 😄), but we do not define map for DataFrame exactly because we are not sure what it should produce.

For GroupedDataFrame we are also unsure what should broadcasting do so we preferred to leave it as "reserved".

As for map the current thinking is that map is like combine but it does not combine the result into a single DataFrame. In particular map preserves the grouping columns, so when you write:

julia> df = DataFrame(rand(3,4))
3×4 DataFrame
│ Row │ x1       │ x2       │ x3       │ x4       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.903949 │ 0.878733 │ 0.227235 │ 0.419861 │
│ 2   │ 0.11941  │ 0.9246   │ 0.18599  │ 0.483359 │
│ 3   │ 0.633959 │ 0.76821  │ 0.946462 │ 0.298769 │

julia> map(x -> (a="something", b="something else"), groupby(df, [:x1, :x2]))
GroupedDataFrame with 3 groups based on keys: x1, x2
First Group (1 row): x1 = 0.9039492232004707, x2 = 0.8787325687356251
│ Row │ x1       │ x2       │ a         │ b              │
│     │ Float64  │ Float64  │ String    │ String         │
├─────┼──────────┼──────────┼───────────┼────────────────┤
│ 1   │ 0.903949 │ 0.878733 │ something │ something else │
⋮
Last Group (1 row): x1 = 0.6339592041863114, x2 = 0.7682104273682182
│ Row │ x1       │ x2      │ a         │ b              │
│     │ Float64  │ Float64 │ String    │ String         │
├─────┼──────────┼─────────┼───────────┼────────────────┤
│ 1   │ 0.633959 │ 0.76821 │ something │ something else │

you see that the grouping columns are retained and only new columns a and b are added (not really useful example but I wanted to show what is going on).

So essentially now map serves as a method to transform GroupedDataFrame into another GroupedDataFrame retaining the grouping columns.

I think such a functionality is useful (although it is essentially the same as doing groupby on the result of combine). The question is if:

  • we want to retain map as a function that does it;
  • or we deprecate this meaning of map and in the future (after 1.0 release) have free hands to decide what we want map to do?

Comments are welcome!

@bkamins
Copy link
Member

bkamins commented Feb 19, 2020

Also - related - we are unsure if GroupedDataFrame <: AbstractVector or GroupedDataFrame <: AbstractDataFrame or neither. I am leaning towards GroupedDataFrame <: AbstractVector but again it is a very committing decision so for now it is just a subtype of Any so that we have free hands to decide later.

Note that very soon (#2095) GroupedDataFrame will be a kind of indexed-data frame, which I think will be super useful (and open completely new ways how GroupedDataFrame can be used), and this also might impact the decision how we want to treat it.

@bkamins
Copy link
Member

bkamins commented Mar 7, 2020

@nalimilan - given we want to go for 1.0 release and do not break too much I would close this and leave the functionality of map for GroupedDataFrame as it is now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking The proposed change is breaking. decision
Projects
None yet
Development

No branches or pull requests

3 participants