Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

describe doesn't follow the source => function => name convention #2385

Closed
nalimilan opened this issue Aug 26, 2020 · 28 comments · Fixed by #2401
Closed

describe doesn't follow the source => function => name convention #2385

nalimilan opened this issue Aug 26, 2020 · 28 comments · Fixed by #2401
Labels
breaking The proposed change is breaking. decision
Milestone

Comments

@nalimilan
Copy link
Member

I've just realized that describe takes name => function pairs, which is totally inconsistent with select, transform and combine. I think we should deprecate the current syntax and move to function => name.

@nalimilan nalimilan added the breaking The proposed change is breaking. label Aug 26, 2020
@nalimilan nalimilan added this to the 1.0 milestone Aug 26, 2020
@bkamins
Copy link
Member

bkamins commented Aug 26, 2020

Well - we were aware that we left this case.

describe is consistent with DataFrame and insertcols! syntax. Also see https://bkamins.github.io/julialang/2020/07/17/pair.html for a review of all uses of => we have right now.

Having said that I would not oppose to change it prior to 1.0 release as this is a non-core functionality.

@nalimilan
Copy link
Member Author

OK, I must have forgotten that then. I'd be inclined to change this, as it's quite similar to select/transform/combine (function associated with output column name). insertcols! is a bit different as there's no function. Also, the fewer exceptions, the better.

@bkamins
Copy link
Member

bkamins commented Aug 27, 2020

As I have said - I would not oppose it.

@oxinabox, @quinnj, @ararslan - do you have any opinion here? (I am mostly asking about the approach to "breaking things" in cases like here - we have a case of a relatively unimportant feature that is a bit inconsistent with the rest of the design; so - it would be better to change it but it is breaking).

In general this is a last call to break it in 0.22 release, as in general I will oppose any breaking changes to ensure the stability of the API of DataFrames.jl as I believe this is what we need to start ensuring if we want to have a wider adoption.

@nalimilan
Copy link
Member Author

If we want to avoid breakage, we could also allow both orders (since there's no ambiguity), but recommend function => name.

@bkamins
Copy link
Member

bkamins commented Aug 27, 2020

Actually I would prefer to be breaking, but want to confirm the policy (i.e. allow small breakages for 0.22 like this one to clean things up; do not allow breaking changes after 1.0 release at all 😄)

@tk3369
Copy link
Contributor

tk3369 commented Aug 29, 2020

My preference is to change the syntax to make it consistent with select/transform/combine.

@matthieugomez
Copy link
Contributor

If you’re going to be breaking, would not it be nicer to have cols as the default arg, and stats as the kwarg? That would be nice to be able to do describe(df, :x) similarly to groupby(df, :x) or select(df, :x)

@bkamins
Copy link
Member

bkamins commented Aug 29, 2020

cols kwarg is a standard approach to restrict columns to which a function is applied in DataFrames.jl.

@nalimilan
Copy link
Member Author

Is that really the case? We have lots of cases where it's a positional argument (allowmissing, completecases, unique...).

@bkamins
Copy link
Member

bkamins commented Aug 30, 2020

Ah - sorry - you are right. I was thinking about these functions and I knew we called the argument cols, but mixed up the fact that we use cols as a positional argument.

So @matthieugomez is what you want this:?

describe(df, cols; stats)

but it would greatly complicate the syntax of stats. Would you want to pass the arguments as a value (Symbol or Pair) or a vector of Union{Symbol, Pair}?

@nalimilan
Copy link
Member Author

Yeah. I was hoping we could make both cols and stats positional using dispatch, but there's a conflict when passing a symbol...

@bkamins
Copy link
Member

bkamins commented Aug 30, 2020

@matthieugomez - what do you think given these considerations (or in general - what would be your proposal exactly)? Thank you!

@pdeffebach
Copy link
Contributor

I like what @matthieugomez suggested. It's much closer to stata where summarize is used very heavily in that context.

I think describe(df, cols::MultiColumnIndex) would be good. I agree that this would make the typing of describe.

My ideal behavior is probably,

describe(df, cols, fun => :myfuname, fun2 => myfunname2)
describe(df, cols, fun => :myfuname, fun2 => myfunname2; allstats = true)

Where the first one just returns the results of fun1 and fun2. The second returns mean, std, etc. in addition to fun1 and fun2

@bkamins
Copy link
Member

bkamins commented Aug 31, 2020

The problem with this is that describe(df, :min) is problematic. I know you proposed a MultiColumnIndex, but still users could be confused by it.

Also note that currently you can write:

describe(df, cols=:x, :min, :sum => sum)

do you both feel that having to write cols= is that bad?

@pdeffebach
Copy link
Contributor

That's fine. cols is not that bad. Also point 1 can just be taken care of by adding :all as a positional argument.

We should change to fun => name though.

@pdeffebach
Copy link
Contributor

I've said this before but we should also change insertcols! to insertcols!(df, x => :name) for consistency everywhere.

@bkamins
Copy link
Member

bkamins commented Aug 31, 2020

Yes - I will make a PR changing this order.

@bkamins
Copy link
Member

bkamins commented Aug 31, 2020

see #2401

@nalimilan
Copy link
Member Author

What we could do is require wrapping symbols in a vector or All, e.g. describe(df, [:x]). Not clearly better or worse than writing cols=:x (the plural is weird when you pass a single column too). Do we have any other place where we use a cols keyword argument in the end?

@matthieugomez
Copy link
Contributor

matthieugomez commented Aug 31, 2020

@bkamins As you suggest, I was thinking of a kwarg stats that would be either a Union{Symbol, Pair}, or a vector of Union{Symbol, Pair}, as you suggest.

Beyond consistency, my impression is that 99% of the time, people will use describe to output all the default function for one or two columns (let me know if you disagree). So it makes sense to make it I easier to specify columns rather than specifying stats.

@bkamins
Copy link
Member

bkamins commented Aug 31, 2020

people will use describe to output all the default function for one or two columns (let me know if you disagree)

Interesting - I do exactly the opposite - rather collecting data for all columns but custom functions. My major use case of describe is to quickly check if I read-in a DataFrame correctly. If I wanted to investigate one or two columns I would use combine probably.

@bkamins
Copy link
Member

bkamins commented Aug 31, 2020

Do we have any other place where we use a cols keyword argument in the end?

No

@nalimilan
Copy link
Member Author

OK. That could be a reason to change that argument then. OTOH we made stats positional for consistency with select/transform/combine (funny fact: before #1664 it was a keyword argument!). It seems we can't be consistent with everything here...

@bkamins
Copy link
Member

bkamins commented Sep 1, 2020

This is indeed a coincidence 😄. Let us wait a bit as maybe other people will voice their opinion (or @matthieugomez changes his opinion given this issue 😄)

@pdeffebach
Copy link
Contributor

I would like to revive this, I think describe(df, :x1, :x2; stats = ...) is a much more useful syntax and think we should get it in before 1.0.

@bkamins bkamins reopened this Oct 1, 2020
@bkamins
Copy link
Member

bkamins commented Oct 1, 2020

The only issue is that you can write describe(df[!, [:x1, :x2]], ...stats here...) if you really do not like to use cols. Is that that problematic?

(I do not have a strong opinion here, but just want to understand the rationale)

@pdeffebach
Copy link
Contributor

I think mostly, those are a lot of brackets and a lot of characters to type. If you use stata you are using

summ x y z

all the time interacitvely, it's second nature. So making it super easy to type is important.

@bkamins
Copy link
Member

bkamins commented Jan 2, 2021

I am closing this as for 1.0 we will not introduce breaking changes with positional and kwargs (still we implemented the original proposal to use fun => col convention)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking The proposed change is breaking. decision
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants