Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Order of arguments in data frame methods? #128

Closed
strengejacke opened this issue Mar 20, 2022 · 12 comments
Closed

Order of arguments in data frame methods? #128

strengejacke opened this issue Mar 20, 2022 · 12 comments

Comments

@strengejacke
Copy link
Member

Should we harmonize the order of arguments for our data frame methods, so that we have data - select - exclude - the others?

E.g.
https://easystats.github.io/datawizard/reference/data_cut.html
https://easystats.github.io/datawizard/reference/data_rename.html

We could maybe make select always the 2nd argument.

@strengejacke
Copy link
Member Author

or we make is exact the opposite, and put select/exclude to the end.
Then we could do, without naming arguments,

data |>
  find_colums(starts_with("xy")) |>
  data_cut("mean")

or

data |>
  find_colums(starts_with("xy")) |>
  data_addprefix("NEW_")

@DominiqueMakowski
Copy link
Member

We should definitely do something about consistency

Option 1 is the most straightforward, Option 2 is interesting...

And maybe option 2 makes sense, because when you think about it in most function the selectors are not the "main" arguments (e.g. in add prefix), and people expect the main arguments to be described first and then the more additional ones at the end?

@strengejacke
Copy link
Member Author

@mattansb @bwiernik @IndrajeetPatil Thoughts?

@bwiernik
Copy link
Contributor

bwiernik commented Mar 21, 2022

I like the idea of consistency. I find option 2 sort of confusing--I don't have an intuition of what that pipe chain is supposed to do. I would prefer to hew as close to the tidyverse-style API here as we can. So I would prefer select/exclude be the second/third arguments

@IndrajeetPatil
Copy link
Member

I second Brenton.

Designing API that mimics tidyverse API also means that the users familiar with that API won't need to change their habits, and can easily adapt to easystats API.

@strengejacke
Copy link
Member Author

strengejacke commented Mar 21, 2022

But there's one difference to the tidyverse API, namely that select and exclude are arguments to select columns, which we have in many of our functions, while tidyverse (or dplyr) functions have their "operational" arguments after the first data argument. tidyverse has no arguments to select columns, it "only" has select().

Functions like slice(), muate(), summarize(), arrange() etc. all have a data argument first, followed by arguments that are specific to the function and their intended operation. We have data argument, then column selection, then the arguments specific to the function:

tidyverse: foo(data, what_to_do)
datawizard: foo(data, select_column, what_to_do)

We could stick to this, or change it to: foo(data, what_to_do, select_column)

But for the pipe-workflow, find_columns() should have an option to return the data (either by having an argument, or we have a separate function, get_columns(), that returns the data instead of the names).

@bwiernik
Copy link
Contributor

I’m not sure without some context. Can you give some examples with our functions?

@strengejacke
Copy link
Member Author

These functions from both packages work in a similar fashion:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(datawizard)

iris |> 
  select(starts_with("Sep")) |> 
  summarise(m = mean(Sepal.Length)) # 2nd argument does not select data
#>          m
#> 1 5.843333

iris |> 
  select(starts_with("Sep")) |> 
  slice(1:5) # 2nd argument does not select data
#>   Sepal.Length Sepal.Width
#> 1          5.1         3.5
#> 2          4.9         3.0
#> 3          4.7         3.2
#> 4          4.6         3.1
#> 5          5.0         3.6

iris |> 
  select(starts_with("Sep")) |> 
  data_cut("median") |> # 2nd argument does not select data
  head()
#>   Sepal.Length Sepal.Width
#> 1            1           2
#> 2            1           2
#> 3            1           2
#> 4            1           2
#> 5            1           2
#> 6            1           2

iris |> 
  select(starts_with("Sep")) |> 
  data_addprefix("HU") |> # 2nd argument does not select data
  head()
#>   HUSepal.Length HUSepal.Width
#> 1            5.1           3.5
#> 2            4.9           3.0
#> 3            4.7           3.2
#> 4            4.6           3.1
#> 5            5.0           3.6
#> 6            5.4           3.9

When we would use select as 2nd argument in all those functions in datawizard, they would no longer behave similar to dplyr. select and exclude are convenient "bonus" arguments, that are not necessarily related to the core task of the functions. - like all the other "2nd" arguments in the tidyverse functions (and like in some of our datawizard's functions).

@strengejacke
Copy link
Member Author

I just saw, for most/all(?) functions, where applicable, select/ exclude are not the 2nd and 3rd argument. There are exceptions, like data_extract() or reshape_longer(), but in these cases, the 2nd argument must select data (like dplyr::pull()). So probably we're already ok with the current function design...

@strengejacke
Copy link
Member Author

I think Dom summarised it well:

And maybe option 2 makes sense, because when you think about it in most function the selectors are not the "main" arguments (e.g. in add prefix), and people expect the main arguments to be described first and then the more additional ones at the end?

@bwiernik
Copy link
Contributor

Okay, I agree. We should move the secondary select/exclude arguments after the main function arguments

@strengejacke
Copy link
Member Author

closing in favor of #133

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants