Two features for data.table one liners #3681
Comments
Fyi, setDT and setDF exist for making these conversions in place. While those can be used for a long one-liner, I would put them on separate lines for clarity. While the functions are instant, dropping DT class is not costless (there might be a key, indices or other attributes stored to make calls more efficient). DT[] is idiomatic for printing (I think the FAQ explains) and DT$column for extracting columns. I use both all the time, and they are consistent with data.frame. |
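For readers who have not used them, here is a minimal sketch of the in-place conversions mentioned above, with setDT and setDF on separate lines as suggested. The table and column names (df, g, v, res) are made up for illustration.

```r
library(data.table)

# Toy data.frame; the names are illustrative only.
df <- data.frame(g = c("a", "a", "b"), v = 1:3)

setDT(df)                                # df becomes a data.table in place, no copy
res <- df[, .(total = sum(v)), by = g]   # one grouped data.table operation
setDF(res)                               # the result is a plain data.frame again
setDF(df)                                # the input is restored to a data.frame as well

class(df)
class(res)
res$total
```

Because both conversions work by reference, df is never copied. Any key or index set while it was a data.table is lost on the way back, which is the cost mentioned above.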
ah... I didn't know that it's in the wiki indeed :
I understand your arguments about efficiency and clarity, but I think they don't invalidate the point of trying to make data.table accessible in base or tidy code for a quick instruction. I personally rarely care about efficiency but I care about not using |
IIUC the dt$ syntax is supposed to be object-oriented-like approach -- dt
is exported by data.table as a class creator, so dt$my_df would be like
new(data.table).create(my_df)
personally I find it a bit odd/jarring but maybe that's just because I've
been staring at data.table for too long.
Antoine, could you elaborate what you mean by behaving "mostly like
data.frame"? my experience with tbl is a bit limited but my instincts said
[.data.table is much more similar to [.data.frame.
Is there some data.frame behavior you are accustomed to that doesn't work
on data.table?
Thank you for the feedback & suggested features!
|
I don't think
I wasn't aware that
Well in #3672 I learnt that it's expected that sometimes a data.table should be printed twice to be printed once, I'm sure there are smart reasons for that (I haven't gone through them yet) but as a user I'd feel safer if I could make sure I keep this arguably unintuitive behavior away until I understand it, and I might do it by ensuring my workspace is data.table free. Another example would be :
I just read today in the vignettes that there's an option that can be triggered to make the above work, but it's not a default for backward compatibility, and that's perfectly fine by me, but still can be confusing. Yet another example :
If my workspace is a bit messy and I'm not super comfortable with data.table, or R in general, I get 2 different classes of objects that look very alike but behave differently. Converting to tibble doesn't affect the cases above, and in general when using tibbles one doesn't have to care too much whether an object is a tibble or a data.frame (in my practice I often don't know, the use of a tidy function at some point will change it to tibble, which I prefer, and that's it). This was to answer your question, but let's be clear that I'm not implying flaws in data.table here nor interested in claiming which approach is best, just trying to see if we can cross the bridge for some users by providing features that make the timid ones feel safe about using this package. I for one when doing fast data wrangling in front of my team would be happy to be able to use data.table syntax just for a quick operation without feeling like I'm committing to a new paradigm or have to think about converting my tables 2 times, hence this suggestion :). Maybe the data.table to data.frame conversion feature could be a parameter
|
In fact that's an FAQ: Basically we're facing a compromise -- how to prevent the table from printing every time a column is added with For
We're always striving to improve our error messages, so it would be helpful if you could comment on how this could be improved (it's always harder to write error messages as experienced users) Re:
Personal preference but I find this much cleaner (and if you dig up some of my oldest StackOverflow answers, you'll see I used to be a strict adherent to the everything-in-one-line school).
Hopefully we don't come across as defensive! In fact we're very interested in hearing the perspectives of users from new or different backgrounds & take seriously your input, my probing is trying to get at what could be improved in the documentation/messaging (errors, warnings, etc). |
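A minimal sketch of the printing behavior discussed above, assuming a toy table DT with made-up columns x, y and z. Adding a column with := does not print at the console, and appending [] is the idiom for printing the updated table in the same call.

```r
library(data.table)

DT <- data.table(x = 1:3)

DT[, y := x * 2L]      # adds y by reference; nothing is printed at the console
DT[, z := x + y][]     # the trailing [] prints the updated table

names(DT)
```

Both calls modify DT itself, so no reassignment is needed.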
Michael you don't come across as defensive at all, quite the opposite in fact and thanks for that. I wanted to put emphasis on the fact that I'm focused on building a bridge between base or tidy technologies and data tables, and not on a criticism of data.table itself. These error messages are very good and very helpful I think, I don't really have any idea how to improve them. Nevertheless, I remember that younger me was distressed by having several objects acting differently despite similar syntax, as a user that didn't know much about classes, methods, operator overloading, assignment by reference... I now understand more or less the quirk about why to print twice and the existing workarounds, but I think we can all accept that any user encountering it for the first time will find it surprising. My idea of these features was to create a quirk-free environment, protecting its user from data.table's less intuitive features, and offering a quick benefit (compact syntax and a part of the performance improvement). I'm not sure if my proposed implementation is so smart or elegant, but I think the RTFM-first approach might turn off some users, as might the idea that one paradigm HAS to be chosen over another, so I was thinking : "How could I do really quickly an operation with data.table and not commit further to the package ?". About the extra arguments, I get that we want to limit them so as not to be confusing, but if the only reason is that too many arguments make the documentation difficult, shouldn't we find a way to document features on several pages, with feature-centred doc rather than argument-centred or example-centred doc, and let the arguments grow and multiply ? Is there another reason apart from doc, and maybe code complexity, that we can't have twice as many arguments (being provocative on purpose)? |
@MichaelChirico I think this new idea works better : https://twitter.com/antoine_fabri/status/1153444109727739904
dt_brackets <- list(
  `[.data.frame` = data.table:::`[.data.table`,
  `[<-.data.frame` = data.table:::`[<-.data.table`,
  `[.tbl_df` = data.table:::`[.data.table`,
  `[<-.tbl_df` = data.table:::`[<-.data.table` )
iris2 <- with(dt_brackets, iris[, .(meanSW = mean(Sepal.Width)), by = Species])
iris2
#> Species meanSW
#> 1 setosa 3.428
#> 2 versicolor 2.770
#> 3 virginica 2.974
class(iris2)
#> [1] "data.frame"
iris3 <- with(dt_brackets, tibble::as_tibble(iris)[, .(meanSW = mean(Sepal.Width)), by = Species])
iris3
#> # A tibble: 3 x 2
#> Species meanSW
#> <fct> <dbl>
#> 1 setosa 3.43
#> 2 versicolor 2.77
#> 3 virginica 2.97
class(iris3)
#> [1] "tbl_df" "tbl" "data.frame"
with_DT <- function(expr){
  dt_brackets <- list(
    `[.data.frame` = data.table:::`[.data.table`,
    `[<-.data.frame` = data.table:::`[<-.data.table`,
    `[.tbl_df` = data.table:::`[.data.table`,
    `[<-.tbl_df` = data.table:::`[<-.data.table` )
  eval(substitute(with(dt_brackets, expr)))
}
iris2 <- with_DT(iris[, .(meanSW = mean(Sepal.Width)), by = Species])
class(iris2)
#> [1] "data.frame"
Created on 2019-07-23 by the reprex package (v0.3.0) |
@moodymudskipper There are data.table object innards missing that mean it doesn't work in some common cases, I guess, eg And some operations will leave you with a franken-frame (not a data.frame, not a data.table), like
... creates an internal self ref attribute on the data.frame. Not sure whether that means it meets your expectations/goals here or not. Making [.data.table portable to non-data.tables, if that's the goal, seems like a big departure and a documentation headache (eg, need to figure out which parts of DT NSE will throw errors like Outside of [.data.table and set*, many features are portable, anyways: fread/fwrite, rleid/rowid, IDateTime. |
Can you explain what you mean by franken-frame ? In that case it seems to me iris2 is a regular data frame. The fact that it's changed by reference might or might not be expected, but both would make sense and would be easy to document I think. The following is less efficient (maybe can be improved ?) and would need more work to pass all arguments and deal with missing values instead of just forwarding
with_DT <- function(expr){
  dt_brackets <- list(
    `[.data.frame` = function(x, ...) {
      if(data.table::is.data.table(x)) {
        x[...]
      } else {
        class_ <- class(x)
        x <- data.table::as.data.table(x)
        x <- x[...]
        class(x) <- class_
        x
      }
    })
  eval(substitute(with(dt_brackets, expr)))
}
iris2 <- with_DT(iris[, .(meanSW = mean(Sepal.Width)), by = Species])
class(iris2)
#> [1] "data.frame"
class(iris)
#> [1] "data.frame"
iris2 <- with_DT(iris2[, c(letters[1:3], LETTERS[1:3]) := as.list(c(letters[1:3], LETTERS[1:3]))][])
class(iris2)
#> [1] "data.frame"
with_DT(iris2[order(Species)])
#> Species meanSW a b c A B C
#> 1 setosa 3.428 a b c A B C
#> 2 versicolor 2.770 a b c A B C
#> 3 virginica 2.974 a b c A B C
Created on 2019-07-23 by the reprex package (v0.3.0)
About the benefit, it has few if your workflow includes a lot of data.table code anyway, but :
|
@moodymudskipper Thanks for explaining. Re the benefit "the fact that it never modifies by reference can be seen as a feature", that was what I was getting at with "franken frame" -- It has unexpected attributes (the selfref). I suspect similar issues might come up with other DF-derived classes like grouped_df -- will the grouping still be correct after a run through with_DT? Besides attributes, to ensure that columns are not modified by reference, you'll need to "lock" the data somehow, maybe related to #2277 (comment) #778 Anyway, I hope my impression (that the task is nontrivial) is clear enough, but I don't want to crowd the thread trying to explain it in further replies since it's really not a deep point and I think others understand what you're trying to do better than I do. |
Thanks @franknarf1 , it all makes sense. I worked a bit on my idea, taking into account the caveats you mentioned, and built it into a one function package: https://github.com/moodymudskipper/withDT I'm not sure how robust it is and can certainly be optimised but it serves its purpose for now. As far as I'm concerned this can be closed, but if you deem useful to integrate a similar feature I'll be happy to retire my package. |
It is generally a good idea to put such functionality into a new package, then eventually we could think about pulling it into DT. Personally I don't find it very useful to mix data.frame and data.table, but I might have been using data.table long enough not to look back. Closing for now. |
In the context of the recent Twitter conversation on data.table's relevance, I've been thinking that one of the intimidating aspects of data.table is the feeling that we have to opt in 100% in the data.table paradigm to make good use of it. It's something that tibbles mostly avoid by having [.tbl_df behave much like [.data.frame.
I believe it would be to data.table's benefit to promote single line calls that won't force the user to commit much of their code to the data.table paradigm and won't entail explicit verbose conversions, I propose two features.
1) convert to data.frame easily from data.table using DT[]
DT[] is the same as DT now, and is unlikely to be ever used in practice so this shouldn't break anything, I propose :
2) convert to data.table easily from data.frame using dt$mydf instead of as.data.table(mydf)
This follows the approach that Gabor Grothendieck used to design gsubfn::fn, and that I followed to an extent with my own work on my package tag.
Dollar notation is a way around bracket overload and makes explicit that we have a single argument, two things that add up to cognitive load especially in complex calls. It can confuse the user who is used to employing $ to select list elements only but I believe its use by packages R6, crayon and gsubfn shows that it can be intuitive nonetheless.
Its implementation would be as simple as :
using both this feature and the one proposed in 1 we can introduce easily a single data.table operation in our code :
We thus can benefit from data.table's compact syntax and efficient code, for the price of two discrete conversions, in a neat dt<...>[] sandwich which will protect the user from worries about having a confusing mix of data.table and data.frame/tibbles in their workspace, or unexpected modifications by reference that can happen when feeding a data.table to a function. It could potentially constitute a foot in the door for more standard data.table usage.
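The snippets from the original post are not reproduced here. As a rough illustration only, and not the author's implementation, the dollar-sign shortcut could be sketched with an S3 method for $. The object dt and the class name dt_shortcut are hypothetical, nothing of the kind is exported by data.table, and the conversion back uses as.data.frame() rather than the proposed DT[] behavior.

```r
library(data.table)

# Hypothetical helper object; "dt_shortcut" is an illustrative class name.
# Note that defining `dt` here masks stats::dt in the current environment.
dt <- structure(list(), class = "dt_shortcut")

# `$` passes the name after the dollar sign as a string; look the object up
# in the caller's environment and return a data.table copy of it.
`$.dt_shortcut` <- function(x, name) {
  as.data.table(get(name, envir = parent.frame()))
}

mydf <- data.frame(g = c("a", "a", "b"), v = 1:3)

# One data.table operation, converted back to a data.frame explicitly.
res <- as.data.frame(dt$mydf[, .(total = sum(v)), by = g])
class(res)
```

Since as.data.table() copies its input, mydf stays an ordinary data.frame and cannot be modified by reference through the shortcut.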