Inconsistency for sanity, being data.frame like for easy transition #1188

arunsrinivasan · 2015-06-21T08:55:13Z

(After a brief discussion with Matt)

The behaviour with=FALSE:

require(data.table)
DT = data.table(x=1:5, y=6:10, z=11:15)

DT[, c("y", "z"), with=FALSE]

In talking to colleagues, and at meetings or over emails, it seems that restoring the data.frame behaviour only for those cases where j is integer/character vector can only bring more sanity (trading inconsistency).

The issue is that data.table usage revolves around [ a lot, and therefore users are confronted with having to learn this difference quite early, and having to learn new syntax for a known basic operation doesn't sit well. It also doesn't seem to help in explaining how a data.table is a data.frame with this basic operation.

AFAICT, there's no real use for having just character/integer vectors in j. Therefore, it'd be great to make with=FALSE unnecessary and be able to subset columns the data.frame way:

DT[, c("y", "z")]
DT[, 2:3]

The default return of vector in case of only one column and use of drop=FALSE should also be restored. This'll help get over the basic data.frame like usage very quickly without having to wonder "why", and start learning the actual essential enhanced-ness data.table provides.
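For reference, the base data.frame behaviour that this proposal wants to restore can be sketched as follows. The object name DF and the result names are illustrative only; the example uses base R and does not involve data.table.

```r
# Same columns as the DT above, but as a plain data.frame.
DF <- data.frame(x = 1:5, y = 6:10, z = 11:15)

# Character and integer vectors in j select columns.
by_name  <- DF[, c("y", "z")]
by_index <- DF[, 2:3]

# A single column is dropped to a vector by default...
as_vector <- DF[, "y"]

# ...unless drop = FALSE is given, which keeps a one-column data.frame.
as_frame <- DF[, "y", drop = FALSE]
```

The last two lines are the drop behaviour referred to above: the return type depends on how many columns happen to be selected.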

It'd be great to hear thoughts from other users as well.

This has come up before (raised by Matt) : http://r.789695.n4.nabble.com/with-FALSE-td4589266.html but 'leave it as it is' was the response more or less.


DavidArenburg · 2015-06-21T11:17:42Z

+1MM

jrowen · 2015-06-21T13:24:27Z

I like this idea too. From the earlier responses, it seems the biggest drawback could be introducing inconsistency, as a new user would expect the two approaches below to return the same result.

DT[,c("colA","colB")] 

colvars = c("colA","colB") 
DT[,colvars]

Is there a way they both could return the same result?

arunsrinivasan · 2015-06-21T14:51:44Z

@jrowen thanks. Yes they both should return the same result (as data.frame would).
Will have to think a bit more about this though.

x <- "z"
DT[, x]

It'd be ambiguous in this case, wouldn't it?
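A small sketch of that ambiguity, assuming the data.table package is installed. DT is the table defined at the top of the thread; the result names are illustrative.

```r
library(data.table)

DT <- data.table(x = 1:5, y = 6:10, z = 11:15)
x <- "z"   # a variable in the calling scope that shares its name with a column

# data.table evaluates j within DT, so the symbol x means the column x.
as_column <- DT[, x]

# With with = FALSE, j is taken from the calling scope, so x means "z".
as_selector <- DT[, x, with = FALSE]

# data.frame always uses the variable, returning column z as a vector.
DF <- as.data.frame(DT)
df_result <- DF[, x]
```

The same expression, DT[, x], therefore has two plausible readings, and any automatic guess has to pick one of them.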

One way off the top of my head is for the enhanced-ness to kick in j, only when it is wrapped with .() or list(), but perhaps that's too big a design change...

Hm, now I'm thinking if this'd only create more problems instead :-(

markdanese · 2015-06-21T15:16:19Z

Perhaps you could make the error message more friendly and help the user. Or even find the cases and add "with = FALSE" and advise the user that the change was made (like with setting column names the "old" way). I have been using data.table for a year and a half, and I periodically want to use column numbers for some quick interactive work and get an error. Not a big deal to type with = FALSE, but a nice reminder would be welcome. This would serve to teach new users as well.

franknarf1 · 2015-06-21T15:37:32Z

I don't know. It might just make it harder for people to learn. I agree with Mark that adding a discouraging warning would help with that.

If you allow too much prominence to this way of accessing columns, it may prove something of a slippery slope. Can you really do this without also doing these?

allowing numeric vectors (which are truncated to floor(j) in data.frames)
making DT[int_or_char] match the data.frame analogue (where it subsets DT like a list)
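The two data.frame behaviours listed above can be checked in base R. This is only a sketch of the data.frame side; the names DF, truncated, list_style and by_name are illustrative.

```r
DF <- data.frame(x = 1:5, y = 6:10, z = 11:15)

# A non-integer numeric index is truncated, so 2.7 selects column 2 (y).
truncated <- DF[, 2.7, drop = FALSE]

# With a single index and no comma, a data.frame is subset like a list,
# so 2:3 selects columns y and z rather than rows 2 and 3.
list_style <- DF[2:3]

# The same list-style subsetting works with a character vector.
by_name <- DF["z"]
```

In data.table the single-index form refers to rows instead, which is the DT[int_or_char] difference mentioned in the second item.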

Aside: If you do this, perhaps you could add some faster accessor for j (in terms of shorter code), analogous to the list subsetting in my last bullet point. I find with=FALSE awkward and verbose and so had been doing workarounds like `[.listof`(DT,int_or_char) (broken in R 3.2.0 onward) and `[.noquote`(DT,int_or_char). A function like this would allow experienced users of the new functionality to sidestep the warning Mark suggested and to write clearer, more readable code (since, on reviewing their code, they wouldn't have to wonder whether they were looking at data.table- or data.frame-style j).

EDIT: I'm trying to explain what I mean over here: http://chat.stackoverflow.com/transcript/message/24012297#24012297

eantonya · 2015-06-22T03:22:18Z

I quite like the automatic with=FALSE guessing, but not the drop
reinstatement - I don't want to see that terrible option resurrected and
muddying the waters of data.table.

raubreywhite · 2015-06-22T05:09:05Z

I agree with eduard: drop=TRUE is one of the worst parts of data.frame. I think it makes sense to implement the with=FALSE guessing, as this improves consistency and doesn't materially degrade the quality of data.table, but drop=TRUE would just be implementing a bad idea for the sake of consistency.


franknarf1 · 2015-06-22T17:03:47Z

I think it defeats the purpose of the change if you don't use drop=TRUE for these character-or-integer cases. If using data.table syntax DT[,.(mycol)], retain drop=FALSE, sure; I don't think changing that case would help anything.

eantonya · 2015-06-22T17:23:38Z

@franknarf1 I disagree. drop argument is only relevant for single column retrievals, so not having it only affects part of the cases, and the effect it has on those cases is one of consistency, and not the strange sometimes this sometimes that behavior of data.frame.

franknarf1 · 2015-06-22T17:31:50Z

@eantonya Yeah, I guess we do disagree; sorry if I'm repeating myself, but I'll try to clarify. I'm not crazy about the sometimes-this-sometimes-that behavior of data.frame either, but the premise of this proposed enhancement is that data.frame syntax should be supported to some limited extent.

Within that limited scope (when j (1) does not use any columns of DT and (2) evaluates to character or integer... or something like that), we should give people what they expect. It's not like you or I are going to use it, so what harm? And if we don't give them what they expect, why bother giving them the concession to begin with? They'll still have grounds to complain about inconsistency. (I won't use it because I want to be able to read my code without the mental overhead of figuring out whether data.frame syntax is being used.)

arunsrinivasan · 2015-06-22T18:41:45Z

@franknarf1 perhaps I should clarify.

Ideally, what I'd like is for data.frame syntax in j to do everything that data.frame syntax does as shown below:

DT[, 1:2]
DT[, c("x", "y")]

cols = c("x", "y")
DT[, cols]

All of these should return a two-column data.table.

However, as @jrowen pointed out from the old post, the last case is tricky (for cases like the one I've shown in the previous post). Unless this case can be taken care of quite nicely, I personally don't see a huge advantage of implementing this feature. I can imagine myself explaining the behaviour to beginners (or in a talk) with too many ifs-and-buts.. and that's not helping.

So, what would be great is to figure out whether there's a way around the last scenario without breaking too many things. And whether it's worth it.

I don't feel strongly about drop = . being present or not. And IMO that's not the main part of this discussion, at least until it's clear that we are going to implement this functionality.

I'm also fully aware of the case DT[3:4] vs DF[3:4], but this doesn't seem to come up at all as an issue.. (on SO, or r-help or here or data.table-help) AFAICT.

franknarf1 · 2015-06-22T19:01:43Z

@arunsrinivasan Yeah, I also don't see a benefit from the feature change. As you say, it seems like it would make explaining the syntax harder and lead to messier code everywhere (as people start using data.frame syntax as a crutch).

Back to my aside (mentioned in your last sentence). Yeah, I've never seen anyone else complain about DT[1:3] vs DF[1:3], but maybe they should! Really, if we had the functionality mentioned in this thread so that DT[.SDcols=1:3] and DT[.SDcols=c("a","b")] worked as my intuition suggests they should, it would be really handy. It's off-topic here, because that change wouldn't be any sort of crutch for people who don't want to learn data.table syntax, though. Not sure if that's already a FR... Oh, just found it: #1149

eantonya · 2015-06-22T19:04:53Z

@arunsrinivasan I actually don't see a big problem with some cases not working. I see this as guessing with=FALSE, and it's ok to guess incorrectly sometimes. Maybe a warning message can be printed accompanying the guess, similar to the guesses melt/dcast make.

@franknarf1 I'm not sure what you mean - of course I'd use this feature myself - I use with=FALSE reasonably frequently, and would love to not have to type it.

The framework from which I see this change is that of enhancing data.table usage for everyone, and emphatically not one of trying to mimic what data.frame does. From that viewpoint adding drop back disintegrates usage for advanced users for what I see as a very minor short-term gain and long-term loss for beginners. Whereas the with=FALSE guess is a short- and long-term enhancement for everyone.

franknarf1 · 2015-06-22T19:07:22Z

@eantonya My mistake. I'd find the use of the feature in my code very hard to parse (by eye).

As far as the enhancement goes (excluding the mimicry), doesn't Richardo's DT[.SDcols=1:3] pull that off better (linked above, issue 1149)?

eantonya · 2015-06-22T19:14:52Z

I don't have anything against that option (and I think that should work regardless of this one going in), but would prefer typing DT[, 1:3] since it's less typing.

As far as how to guess - I would propose the following - if any of the names in j contain a column name or any of the special dot-symbols (.SD, etc), then don't guess. Otherwise attempt to evaluate the expression in outside environment - if that succeeds and returns a character/int/numeric vector - then guess with=FALSE. Otherwise go back to what we do now.

I think this takes care of the cases above and a few more I can think of right now.

Thinking some more - evaluating something twice is fairly dangerous, so perhaps it's ok to live with the evaluation result no matter what it is (so return columns for character/int/numeric and the actual result otherwise).
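The guessing rule described above could be sketched in base R roughly as follows. This is a hypothetical helper for illustration, not data.table's implementation: the function name guess_with_false, its arguments, and the exact lists of dot-symbols and accessor functions are all assumptions.

```r
# Hypothetical helper: should j be treated as a column selector
# (as if with = FALSE had been given)?
guess_with_false <- function(jsub, col_names, env = parent.frame()) {
  dot_symbols <- c(".SD", ".N", ".I", ".GRP", ".BY")  # assumed list
  accessors   <- c("get", "mget", "eval")             # assumed list

  # Don't guess if j mentions a column name or a special dot-symbol.
  if (any(all.vars(jsub) %in% c(col_names, dot_symbols))) return(FALSE)

  # all.names() also returns function names, so calls to get() etc. are seen.
  if (any(all.names(jsub) %in% accessors)) return(FALSE)

  # Evaluate once in the calling environment; an error means "don't guess".
  val <- tryCatch(eval(jsub, env), error = function(e) NULL)
  is.character(val) || is.numeric(val)
}

cols <- c("x", "y", "z")
keep <- c("y", "z")
suff <- 2

r_literal  <- guess_with_false(quote(c("y", "z")), cols)   # TRUE
r_variable <- guess_with_false(quote(keep), cols)          # TRUE
r_column   <- guess_with_false(quote(mean(y)), cols)       # FALSE
r_get      <- guess_with_false(quote(mean(get(paste0("a", suff)))),
                               c("a1", "a2", "a3"))        # FALSE
```

The sketch evaluates j only once to decide; a real implementation would need to reuse that value rather than evaluate j a second time, which is the danger raised in the last paragraph.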

franknarf1 · 2015-06-22T19:34:14Z

@eantonya I'm not really familiar with parsing R calls, but it sounds like cases like this:

DT <- data.table(a1=1:2, a2=3:4, a3=5:6)
suff = 2
DT[,mean(get(paste0("a",suff)))] # 3.5

suffy = 3
DT[,plot(get(paste0("a",suff)),get(paste0("a",suffy)))] # plots a2 v a3

would no longer work, since j does not find any names...?

If some guesswork way were implemented, maybe it could be made into an option, datatable.guesswith, off by default but recommended for folks strongly tied to data.frame syntax.

eantonya · 2015-06-22T19:45:42Z

Ok, let's add get to the list that includes .SD and friends. What other cases would it not work for? Let's see if it's easy to classify the expressions.

franknarf1 · 2015-06-22T19:55:08Z

Okay, I'll see if I think of or come across any others. Nothing comes to mind beyond mget (which I can't figure out how to actually use here) and eval, like

str  = paste0("a",suff)
expr = parse(text=str)
DT[,eval(expr)]

mattdowle · 2015-08-05T02:46:53Z

Great comments above. In an attempt to draw it all together, I'm thinking we should make the following changes. If I've read correctly, I think (hope!) this will please everyone and displease nobody.

Inspect j before evaluation (as is done anyway). If it's a single number or single string then with=FALSE will be assumed. These will then work:

DT[,1]
DT[,"someCol"]

These don't do anything useful now anyway, so won't break existing code. In both cases, a single column data.table will be returned, consistent with 'with=FALSE' and dropping 'drop'. The possible surprise of getting a single column data.table (unlike data.frame) is unlikely to upset, especially since the column will print nicely (top and bottom 5 rows) rather than a long vector filling up the console.

If j is a single symbol, it'll return that column as a vector, as it has always done. If however that column name is missing, raise a new error (wouldn't do anything useful now anyway so new error won't break existing code).

DT[, existingCol] # return the column as a vector as before
DT[, missingCol]
Error: j (the 2nd argument inside [...]) is a single symbol that isn't a column name. In data.table, j is evaluated within its scope. If missingCol is a variable in calling scope that contains column names or numbers, then add with=FALSE; i.e. DT[, missingCol, with=FALSE]. This difference to data.frame is deliberate and explained in FAQ 1.1. It allows more advanced usage: see example in ?data.table.

If j contains no symbols (e.g. calls to c(), :, paste(), paste0() and only column numbers or strings), evaluate it and expect the result to be a number or character vector. Then set with=FALSE. These would then work:

DT[, c(1:10, 50)]
DT[, c("ColA","ColB")]
DT[, paste0("V",20:25)]

Otherwise, current behaviour.
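As a rough illustration of the cases listed above, here is how they would look on the small DT from the top of the thread. This assumes a data.table release that includes the change (the issue was milestoned for v1.9.8); the result names are illustrative.

```r
library(data.table)

DT <- data.table(x = 1:5, y = 6:10, z = 11:15)

# Single number or single string in j: a one-column data.table.
one_num <- DT[, 1]
one_str <- DT[, "y"]

# No symbols in j: treated as a column selection.
two_cols <- DT[, c("y", "z")]
rng      <- DT[, 2:3]

# A single symbol that is a column name: the column as a plain vector.
vec <- DT[, y]
```

Note that the single-column results stay data.tables, in line with the decision above not to bring back drop.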

We can always go further later depending on how it goes. We'll wait for everyone who's commented so far to confirm before going ahead (and only then after 1.9.6 is (finally) on CRAN!)

markdanese · 2015-08-05T03:34:17Z

Seems good to me. As always, thanks to you and Arun for doing the hard work.

DavidArenburg · 2015-08-05T10:11:01Z

Sounds great to me.

jangorecki · 2015-08-05T11:08:26Z

To reduce inconsistency it might be good to remove the default value for the with argument, or make it NULL/logical(0)/NA, which would correspond to guessing. Then explicitly using with=TRUE would still be able to override the newly proposed behaviour. So the change would be focused on guessing the with argument only when it is not provided.

mattdowle · 2015-08-05T11:15:27Z

@jangorecki Yes nice idea - agree.

jrowen · 2015-08-05T12:25:36Z

I too am in favor of the revised proposal.

ronhylton · 2015-08-06T03:55:53Z

Here's a slightly different viewpoint. There are places where I'd dearly love to dispatch a DT into some old code expecting a DF and automagically pick up a big improvement in merge() speed (and conceivably for other operations involving grouping). Unfortunately with the non-DF behavior of [ that often won't work, and sometimes I end up basically as.x'ing back and forth between DT and DF in order to keep old code happy.

One clean solution to this is setcompatibility(c("on","off","?")).

off provides "native DT" behavior for those who want to fully exploit DT capabilities.

on provides "native DF" behavior unless the operation clearly doesn't make sense for DF. E.g. I don't think you can have DF[DF,] so something like this would clearly be invoking a DT-style join.

Conceivably there could be other compatibility levels, e.g. almost-DF without drop.

Since this would be a setxxx by reference it also hopefully has very little performance cost.

jangorecki · 2015-08-06T10:00:07Z

@ronhylton your comment has a much wider scope than the topic discussed here. It might be better to isolate it as a new FR. The detection of j compatibility discussed here could be managed, for example, by implementing with's default value as getOption("datatable.with") etc.

mattdowle · 2015-08-06T23:43:42Z

@ronhylton Agree with Jan - best raise a new issue. One option is to place your old code in a package, then it would automatically divert to base syntax when passed a data.table.

ptoche · 2016-02-17T21:59:34Z

I have reached this thread 3 times over the last year or so, which is when I started using data.table. Every time it was because I had forgotten about the with = FALSE option. Every time I've read this thread, I've thought "Ah yep, true, must remember that," but somehow I don't.

data.table is a fantastic package. My 2 cents on the topic of this thread: I do not care much about compatibility/consistency with data.frame. If it's there, great (1 stone, 2 birds), but consistency shouldn't be there just for the sake of it, it should be there if the feature is desirable. And the bottom line for me is that dt[i,j] is a very, very intuitive way to access data, it's pretty much standard notation that's been around for centuries (or at least one century). Intuitive and natural, that's what I think matters.

One of the top search hits for "r data.table subset by column and row" is this page: http://personal.colby.edu/personal/m/mgimond/RIntro/04_Manipulating_data_tables.html, which states "For example, to access one dat cell value at row 23 and column 4, type dat[23, 4]" because that's what everyone expects.

JoshOBrien · 2016-10-18T15:47:54Z

This is such a nice fix to what has been a real stumbling block for data.table users. Superb! Leaving a note here as I referenced this in comments following a related SO answer that should itself eventually be edited to reflect the change.


priyak1917 · 2017-08-30T20:50:29Z

Error in `[.data.table`(mba, , i) :
j (the 2nd argument inside [...]) is a single symbol but column name 'i' is not found. Perhaps you intended DT[,..i] or DT[,i,with=FALSE]. This difference to data.frame is deliberate and explained in FAQ 1.1.

so annoying error, not letting me knit the file, i'm so very poor in coding, this is daunting me more.
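For readers who hit the same error, the options named in the message can be sketched like this. The contents of mba are made up for illustration, and the .. prefix assumes a data.table version that supports it (it was added in the commit referenced in this issue's timeline).

```r
library(data.table)

# Hypothetical stand-in for the mba table in the error message.
mba <- data.table(gpa = c(3.1, 3.6, 3.9), salary = c(70, 85, 90))

for (i in names(mba)) {
  # i is a variable in the calling scope, not a column of mba.
  col_dt  <- mba[, ..i]               # one-column data.table
  col_dt2 <- mba[, i, with = FALSE]   # same selection, older spelling
  col_vec <- mba[[i]]                 # the column as a plain vector
}
```

Writing mba[, i] inside the loop raises the error above because i is looked up as a column name first.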

skanskan · 2019-01-11T16:50:22Z

Hello.

I've noticed that some of my old code doesn't work anymore because of this change.

For example this code was scaling the values contained in the columns defined by mycols.

mycols <- c("A", "B", "C")
myDT[,scale(mycols)]
or
myDT[paste0("z",mycols) := scale(mycols) ]

But now it doesn't work.
And the following line doesn't work either.
myDT[,scale(..mycols)]
adding with=FALSE doesn't solve the problem.

I need to do something like this:
myDT[,scale(.SD), .SDcols=mycols ]
myDT[,lapply(.SD[,..mycols], scale) ]
myDT[, paste0("z",names(.SD)) := scale(.SD) , .SDcols=mycols]

Is there a better way?
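One pattern that may fit this use case is to build the new column names first and assign by reference with .SDcols. This is only a sketch: the contents of myDT and the name zcols are made up, and as.numeric() is used because scale() returns a one-column matrix when given a vector.

```r
library(data.table)

# Hypothetical data; only the columns named in mycols are scaled.
myDT <- data.table(A = c(1, 2, 3), B = c(10, 20, 60), C = c(5, 5, 8), id = 1:3)

mycols <- c("A", "B", "C")
zcols  <- paste0("z", mycols)   # names for the scaled copies

# Wrapping zcols in parentheses makes := use its value as the column names.
myDT[, (zcols) := lapply(.SD, function(v) as.numeric(scale(v))), .SDcols = mycols]
```

After this call myDT holds zA, zB and zC next to the original columns, and id is left untouched.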

mattdowle · 2019-01-11T21:45:50Z

Hi @skanskan,
Please start a new issue and link to this issue. It's very hard to track comments in closed issues. Please include the data to create myDT and the output. I've tried to follow what you've written but I can't without the data, the input and output shown: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
Thanks, Matt

arunsrinivasan added internals enhancement labels Jun 21, 2015

mattdowle added this to the v1.9.8 milestone Jun 22, 2015

arunsrinivasan mentioned this issue Aug 11, 2015

Two dimensional column selector gives different result than data.frame's selector #1259

Closed

arunsrinivasan modified the milestones: v2.0.0, v1.9.8 Mar 8, 2016

mattdowle modified the milestones: v1.9.8, v2.0.0 Apr 22, 2016

MichaelChirico mentioned this issue Apr 22, 2016

[R-Forge #1757] Add drop to [.data.table, and "dontmove" to not put group columns first #648

Open

arunsrinivasan removed this from the v1.9.8 milestone May 13, 2016

mattdowle added this to the v1.9.8 milestone Sep 29, 2016

mattdowle closed this as completed in f78d790 Sep 30, 2016

shrektan mentioned this issue Dec 1, 2016

j not evaluated in data.table environment #1946

Closed

MichaelChirico mentioned this issue Dec 2, 2016

further data.frame-like consistency in j: logical vector #1950

Open

mattdowle added a commit that referenced this issue Dec 8, 2016

Added .. prefix for j, #633. Removed datatable.WhenJisSymbolThenCalli…

dd24120

…ngScope. Closes #1952. #1188

ToeKneeFan mentioned this issue Jun 27, 2019

How do .I and .SD work? #3668

Closed
