Two features for data.table one liners #3681

Closed
moodymudskipper opened this issue Jul 5, 2019 · 12 comments

@moodymudskipper

In the context of the recent twitter conversation on data.table's relevance, I've been thinking that one of the intimidating aspects of data.table is the feeling that we have to opt in 100% in the data.table paradigm to make good use of it. It's something that tibbles mostly avoid by having [.tbl_df behave much like [.data.frame.

I believe it would be to data.table's benefit to promote single-line calls that don't force the user to commit much of their code to the data.table paradigm and don't entail explicit, verbose conversions. I propose two features.

1) convert to data.frame easily from data.table using DT[]

DT[] is currently the same as DT and is unlikely ever to be used in practice, so this shouldn't break anything. I propose:

as.data.table(iris)[,mean(Sepal.Length), by = Species][]
#>      Species    V1
#> 1     setosa 5.006
#> 2 versicolor 5.936
#> 3  virginica 6.588

# instead of
# as.data.frame(as.data.table(iris)[,mean(Sepal.Length), by = Species])

2) convert to data.table easily from data.frame using dt$mydf instead of as.data.table(mydf)

This follows the approach that Gabor Grothendieck used to design gsubfn::fn, and that I followed to an extent with my own work on my package tag.

Dollar notation is a way around bracket overload and makes explicit that we have a single argument; both bracket overload and an unclear number of arguments add to cognitive load, especially in complex calls. It can confuse users who are used to employing $ only to select list elements, but I believe its use by the packages R6, crayon and gsubfn shows that it can be intuitive nonetheless.

Its implementation would be as simple as :

dt <- structure(NA, class="data.table.dt")
`$.data.table.dt` <- function(e1,e2) eval(bquote(as.data.table(.(as.symbol(e2)))))

Using both this feature and the one proposed in 1), we can easily introduce a single data.table operation in our code:

 dt$iris[,mean(Sepal.Length), by = Species][]
#>      Species    V1
#> 1     setosa 5.006
#> 2 versicolor 5.936
#> 3  virginica 6.588
# instead of
# as.data.frame(as.data.table(iris)[,mean(Sepal.Length), by = Species])

We can thus benefit from data.table's compact syntax and efficient code, for the price of two discrete conversions, in a neat dt<...>[] sandwich which will protect the user from worries about having a confusing mix of data.table and data.frame/tibbles in their workspace, or about unexpected modifications by reference that can happen when feeding a data.table to a function. It could potentially constitute a foot in the door for more standard data.table usage.

@franknarf1
Contributor

Fyi, setDT and setDF exist for making these conversions in place. While those can be used for a long one-liner, I would put them on separate lines for clarity. While the functions are instant, dropping DT class is not costless (there might be a key, indices or other attributes stored to make calls more efficient).

DT[] is idiomatic for printing (I think the FAQ explains) and DT$column for extracting columns. I use both all the time, and they are consistent with data.frame.
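A minimal sketch of the in-place conversions mentioned above, using a toy data frame that is not from this thread. It also illustrates the point about attributes: once setDF has run, the key that setkey stored is gone along with the class.

```r
library(data.table)

# Toy data, invented for illustration only
df <- data.frame(g = c("b", "a", "b"), v = 1:3, stringsAsFactors = FALSE)

setDT(df)        # df becomes a data.table, converted by reference
setkey(df, g)    # sorts by g and stores the key as an attribute

res <- df[, .(total = sum(v)), by = g]   # the single data.table operation

setDF(res)       # result back to a plain data.frame, by reference
setDF(df)        # input back to a plain data.frame; the key attribute is dropped

class(res)
key_after <- attr(df, "sorted")          # NULL once the class has been dropped
```

Keeping the three steps on separate lines, as suggested, makes it visible where the object changes class.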

@moodymudskipper
Author

ah... I didn't know that [] was used to print already, as in the console it didn't make a difference :

it's in the wiki indeed :

DT[, m:=mean(v), by=x][]   

add new column by reference by group
postfix [] is shortcut to print()

I understand your arguments about efficiency and clarity, but I think they don't invalidate the point of trying to make data.table accessible in base or tidy code for a quick instruction. I personally rarely care about efficiency but I care about not using setDT then a data.table call, then setDF if I need to do a quick join or aggregation.

@MichaelChirico
Member

MichaelChirico commented Jul 5, 2019 via email

@moodymudskipper
Author

moodymudskipper commented Jul 5, 2019

I don't think $ is "supposed" to be for OOP; I think @ is. OOP is its common use though, but gsubfn (or my tag) uses it in the context of decorators and I think it's intuitive enough. crayon uses it just for short syntax and I think it's reasonably intuitive as well. I admit that it's a bit weird, but my point is that if one of the key selling points of data.table is to be compact, let's make it more compact.

library(crayon)
alert <- combine_styles(bold, red, bgCyan)
cat(alert("Warning!"), "\n")
alert <- bold $ red $ bgCyan
cat(alert("Warning!"), "\n")

I wasn't aware that dt was already used by the package; any name can work. Technically it can still work with dt, by just setting a class on it, as all the work is done in the $ method.

Is there some data.frame behavior you are accustomed to that doesn't work on data.table?

Well, in #3672 I learnt that it's expected that sometimes a data.table should be printed twice to be printed once. I'm sure there are smart reasons for that (I haven't gone through them yet), but as a user I'd feel safer if I could make sure I keep this arguably unintuitive behavior away until I understand it, and I might do that by ensuring my workspace is data.table free.

Another example would be :

irisDT <- as.data.table(iris)
x <- "Species"
irisDT[, x]
# Error in `[.data.table`(irisDT, , x) : 
#  j (the 2nd argument inside [...]) is a single symbol but column name 'x' is not found. 
# Perhaps you intended DT[, ..x]. This difference to data.frame is deliberate and explained in FAQ 1.1.

I just read today in the vignettes that there's an option that can be triggered to make the above work, but it's not a default for backward compatibility, and that's perfectly fine by me, but still can be confusing.

Yet another example :

irisDT[,"Species", TRUE]
# Error in `[.data.table`(irisDT, , "Species", TRUE) : 
#  The items in the 'by' or 'keyby' list are length (1). Each must be length 150; the same length as there are rows in x (after subsetting if i is provided).

If my workspace is a bit messy and I'm not super comfortable with data.table, or R in general, I get 2 different classes of objects that look very alike behaving differently. converting to tibble doesn't affect the cases above and in general when using tibbles one doesn't have to care too much if an object is a tibble or a data.frame (in my practice I often don't know, the use of a tidy function at some point will change it to tibble, which I prefer, and that's it).

This was to answer your question, but let's be clear that I'm not implying flaws in data.table here nor interested in claiming which approach is best, just trying to see if we can cross the bridge for some users by providing features that make the timid ones feel safe about using this package.

I for one, when doing fast data wrangling in front of my team, would be happy to be able to use data.table syntax just for a quick operation without feeling like I'm committing to a new paradigm or having to think about converting my tables twice, hence this suggestion :).


Maybe the data.table to data.frame conversion feature could be a parameter unclass (or strip.class, to.df, data.frame...) which would remove the data.table class from the output, this way it can be done in an efficient way:

as.data.table(iris)[,mean(Sepal.Length), by = Species, unclass = TRUE]

@MichaelChirico
Member

sometimes a data.table should be printed twice to be printed once

In fact that's an FAQ:

https://cloud.r-project.org/web/packages/data.table/vignettes/datatable-faq.html#why-do-i-have-to-type-dt-sometimes-twice-after-using-to-print-the-result-to-console

Basically we're facing a compromise -- how to prevent the table from printing every time a column is added with := (since it's done within [, we have to take efforts to distinguish the :=-not-print & non-:=-print cases of [); see #869 . As Frank pointed out, the workarounds to guarantee printing are: explicitly call print(DT) or force-print by adding an empty [ call: DT[].
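As a rough illustration of the two workarounds just described, here is a small sketch on a toy table (the data are invented, and the column names simply echo the wiki example quoted earlier in the thread):

```r
library(data.table)

# Toy table, for illustration only
DT <- data.table(x = c("a", "a", "b"), v = c(1, 3, 5))

DT[, m := mean(v), by = x]     # adds m by reference; at the console nothing is printed

DT[, m := mean(v), by = x][]   # same update; the trailing empty [] forces printing

print(DT)                      # explicit alternative that always prints
```

Either form leaves DT with the same contents; the only difference is whether the result is shown.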

For x <- "Species"; irisDT[, x], we hope the error message is as helpful as possible, and explicitly spells out for you something that would have worked: DT[ , ..x]:

j (the 2nd argument inside [...]) is a single symbol but column name 'x' is not found. Perhaps you intended DT[, ..x]. This difference to data.frame is deliberate and explained in FAQ 1.1.
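For reference, a short sketch of ways to select a column whose name is held in a variable. The object names a, b and v are only for illustration:

```r
library(data.table)

irisDT <- as.data.table(iris)
x <- "Species"

a <- irisDT[, ..x]              # the .. prefix looks x up in the calling scope
b <- irisDT[, x, with = FALSE]  # with = FALSE treats j as column names
v <- irisDT[[x]]                # plain vector, like iris[, x] on a data.frame
```

The first two return a one-column data.table, while [[ returns the column itself.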

We're always striving to improve our error messages, so it would be helpful if you could comment on how this could be improved (it's always harder to write error messages as experienced users)

Re: unclass argument to [, I think we're a bit averse to adding new arguments there since there are already 15 (and more suggested)! And as we're trying to show here, we don't think the differences for common operations are so different from data.frame that we expect users to morph in-and-out of data.table all the time -- mainly setDF is around to facilitate working with data.table-unfriendly packages that expect data.frame-identical behavior (usually around row names).

iris is a bit weird since it's a built-in data set-as-promise, it's rare to need as.data.table to get going; much more common approach would look something like:

iris = copy(iris) # idiosyncratic step for built-in data set
setDT(iris)
species = iris[ , mean(Sepal.Length), by = Species]
setDF(species)

Personal preference but I find this much cleaner (and if you dig up some of my oldest StackOverflow answers, you'll see I used to be a strict adherent to the everything-in-one-line school).

Let's be clear that I'm not implying flaws in data.table here nor interested in claiming which approach is best

Hopefully we don't come across as defensive! In fact we're very interested in hearing the perspectives of users from new or different backgrounds & take seriously your input, my probing is trying to get at what could be improved in the documentation/messaging (errors, warnings, etc).

@moodymudskipper
Author

Michael you don't come across as defensive at all, quite the opposite in fact and thanks for that. I wanted to put emphasis on the fact that I'm focused on building a bridge between base or tidy technologies and data tables, and not on a criticism of data.table itself.

These error messages are very good and very helpful I think, I don't really have any idea how to improve them. Nevertheless, I remember that younger me was distressed by having several objects acting differently despite similar syntax, as a user that didn't know much about classes, methods, operator overloading, assignment by reference... I now understand more or less the quirk about why to print twice and the existing workarounds, but I think we can all accept that any user encountering it for the first time will find it surprising.

My idea of these features was to create a quirk free environment, protecting its user from data.table's less intuitive features, and proposing a quick benefit (compact syntax and a part of the performance improvement).

I'm not sure my proposed implementation is all that smart or elegant, but I think the RTFM-first approach might turn off some users, as might the idea that one paradigm HAS to be chosen over another, so I was thinking: "How could I quickly do an operation with data.table and not commit further to the package?".

About the extra arguments, I get that we want to limit them so as not to be confusing, but if the only reason is that too many arguments make the documentation difficult, shouldn't we find a way to document features on several pages, with feature-centred docs rather than argument-centred or example-centred docs, and let the arguments grow and multiply? Is there another reason apart from docs, and maybe code complexity, that we can't have twice as many arguments (being provocative on purpose)?

@moodymudskipper
Author

@MichaelChirico I think this new idea works better :

https://twitter.com/antoine_fabri/status/1153444109727739904

dt_brackets <- list(
  `[.data.frame` = data.table:::`[.data.table`,
  `[<-.data.frame` = data.table:::`[<-.data.table`,
  `[.tbl_df` = data.table:::`[.data.table`,
  `[<-.tbl_df` = data.table:::`[<-.data.table` )
iris2 <- with(dt_brackets, iris[, .(meanSW = mean(Sepal.Width)), by = Species])
iris2
#>      Species meanSW
#> 1     setosa  3.428
#> 2 versicolor  2.770
#> 3  virginica  2.974
class(iris2)
#> [1] "data.frame"
iris3 <- with(dt_brackets, tibble::as_tibble(iris)[, .(meanSW = mean(Sepal.Width)), by = Species])
iris3
#> # A tibble: 3 x 2
#>   Species    meanSW
#>   <fct>       <dbl>
#> 1 setosa       3.43
#> 2 versicolor   2.77
#> 3 virginica    2.97
class(iris3)
#> [1] "tbl_df"     "tbl"        "data.frame"

with_DT <- function(expr){
  dt_brackets <- list(
    `[.data.frame` = data.table:::`[.data.table`,
    `[<-.data.frame` = data.table:::`[<-.data.table`,
    `[.tbl_df` = data.table:::`[.data.table`,
    `[<-.tbl_df` = data.table:::`[<-.data.table` )
  eval(substitute(with(dt_brackets, expr)))
}
iris2 <- with_DT(iris[, .(meanSW = mean(Sepal.Width)), by = Species])
class(iris2)
#> [1] "data.frame"

Created on 2019-07-23 by the reprex package (v0.3.0)

@franknarf1
Contributor

@moodymudskipper There are data.table object innards missing that mean it doesn't work in some common cases, I guess, eg with_DT(iris2[order(Species)]) yields an error.

And some operations will leave you with a franken-frame (not a data.frame, not a data.table), like

with_DT(iris2[, c(letters, LETTERS) := as.list(c(letters, LETTERS))][])

... creates an internal self ref attribute on the data.frame. Not sure whether that means it meets your expectations/goals here or not.
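One way to check whether an object carries that attribute is sketched below. The helper has_selfref is hypothetical (it is not part of data.table); it only looks for the ".internal.selfref" attribute name.

```r
library(data.table)

# Hypothetical helper, not part of data.table:
# TRUE if x carries data.table's internal self reference attribute
has_selfref <- function(x) ".internal.selfref" %in% names(attributes(x))

DT <- as.data.table(iris)

has_selfref(iris)               # FALSE: a plain data.frame
has_selfref(DT)                 # TRUE: a genuine data.table
has_selfref(as.data.frame(DT))  # FALSE: the regular conversion strips it
```

A data.frame for which this returns TRUE would be in the in-between state described above.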

Making [.data.table portable to non-data.tables, if that's the goal, seems like a big departure and a documentation headache (eg, need to figure out which parts of DT NSE will throw errors like order(Species) above in a particular reimplementation of NSE like with_DT; need to figure out how interactions with tidy NSE work or don't work) for a benefit that is not clear to me.

Outside of [.data.table and set*, many features are portable, anyways: fread/fwrite, rleid/rowid, IDateTime.
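For instance, a small sketch of rleid and rowid applied to a plain data.frame (toy data, invented for illustration):

```r
library(data.table)

# Plain data.frame, never converted to data.table
df <- data.frame(g = c("a", "a", "b", "b", "a"), stringsAsFactors = FALSE)

df$run <- rleid(df$g)   # run-length id: 1 1 2 2 3
df$nth <- rowid(df$g)   # counter within each value of g: 1 2 1 2 3

class(df)               # still "data.frame"
```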

@moodymudskipper
Author

moodymudskipper commented Jul 23, 2019

Can you explain what you mean by franken-frame ? In that case it seems to me iris2 is a regular data frame. The fact that it's changed by reference might or might not be expected, but both would make sense and would be easy to document I think.

The following is less efficient (maybe it can be improved?) and would need more work to pass all arguments and deal with missing values instead of just forwarding ..., but it works with your order() example and never changes by reference (as far as the user can see):

with_DT <- function(expr){
  dt_brackets <- list(
    `[.data.frame` = function(x, ...) {
      if(data.table::is.data.table(x)) {
        x[...] 
      } else {
        class_ <- class(x)
        x <- data.table::as.data.table(x)
        x <- x[...]
        class(x) <- class_
        x
      } 
    })
  eval(substitute(with(dt_brackets, expr)))
}

iris2 <- with_DT(iris[, .(meanSW = mean(Sepal.Width)), by = Species])
class(iris2)
#> [1] "data.frame"
class(iris)
#> [1] "data.frame"

iris2 <- with_DT(iris2[, c(letters[1:3], LETTERS[1:3]) := as.list(c(letters[1:3], LETTERS[1:3]))][])
class(iris2)
#> [1] "data.frame"
with_DT(iris2[order(Species)])
#>      Species meanSW a b c A B C
#> 1     setosa  3.428 a b c A B C
#> 2 versicolor  2.770 a b c A B C
#> 3  virginica  2.974 a b c A B C

Created on 2019-07-23 by the reprex package (v0.3.0)

About the benefit: there is little if your workflow includes a lot of data.table code anyway, but:

  • it is convenient if you work interactively and want to use data.table's syntax (and speed to a degree) to do some quick operation that would be more verbose if done otherwise
  • It makes it really obvious that your chunk of code is data.table code, so the fact that your code might have hybrid syntax is much less confusing, especially if you're sharing your work, F1 on with_DT will tell you all you need to know (or give you the necessary pointers) if you're not familiar with data.table.
  • your workspace is guaranteed not to contain data frames of several classes, which I believe can be a source of confusion.
  • it keeps your tibbles as tibbles, whereas with idiomatic data.table code, to get back to your former class you'd have to call setDF on data.frames in the end but as_tibble on tibbles (or maybe setattr(df, "class", c("tbl_df", "tbl", "data.frame")) to be more efficient)
  • the fact that it never modifies by reference can be seen as a feature by users who feel strongly that this should never be done, or who are confused by it.
  • Stack Overflow points, thanks to quick one-liners that don't change class :). It makes data.table less scary to new users and would be a bridge to a more extensive data.table workflow.

@franknarf1
Contributor

franknarf1 commented Jul 23, 2019

@moodymudskipper Thanks for explaining. Re the benefit "the fact that it never modifies by reference can be seen as a feature", that was what I was getting at with "franken frame" -- It has unexpected attributes (the selfref). I suspect similar issues might come up with other DF-derived classes like grouped_df -- will the grouping still be correct after a run through with_DT? Besides attributes, to ensure that columns are not modified by reference, you'll need to "lock" the data somehow, maybe related to #2277 (comment) #778

Anyway, I hope my impression (that the task is nontrivial) is clear enough, but I don't want to crowd the thread trying to explain it in further replies since it's really not a deep point and I think others understand what you're trying to do better than I do.

@moodymudskipper
Author

Thanks @franknarf1 , it all makes sense.

I worked a bit on my idea, taking into account the caveats you mentioned, and built it into a one function package: https://github.com/moodymudskipper/withDT

I'm not sure how robust it is, and it can certainly be optimised, but it serves its purpose for now.

As far as I'm concerned this can be closed, but if you deem it useful to integrate a similar feature I'll be happy to retire my package.

@jangorecki
Member

It is generally a good idea to put such functionality into a new package; then eventually we could think about pulling it into DT. Personally I don't find it very useful to mix data.frame and data.table, but I might have been using data.table long enough not to look back. Closing for now.
