Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Naming conflict and unexpected behavior with the functional form of data.table DT() #5129

Open
Kamgang-B opened this issue Sep 2, 2021 · 6 comments

Comments

@Kamgang-B
Copy link
Contributor

Issue 1: When working with a data.frame, assignment to a new variable does affect the original dataset while assignment to an existing variable (modification/update) does.

A = setDF(list(mpg = c(21, 21, 22.8, 21.4, 18.7), 
               cyl = c(6, 6, 4, 6, 8), 
               disp = c(160, 160, 108, 258, 360)))

# create a count column (output not shown)
A |> DT(, count := .N, by=cyl)   

print(A)  # does not contain the count column
#    mpg cyl disp
# 1 21.0   6  160
# 2 21.0   6  160
# 3 22.8   4  108
# 4 21.4   6  258
# 5 18.7   8  360

# modify an existing column
A |> DT(, disp := disp %% 100)

print(A) # disp has been modified
#    mpg cyl disp
# 1 21.0   6   60
# 2 21.0   6   60
# 3 22.8   4    8
# 4 21.4   6   58
# 5 18.7   8   60

I think that the expectation when calling a data.table query on a data.frame is that it should behave in a similar way; that is,
assignments and modifications should affect the original data.frame. This is partially useful to avoid to reassign the data back every time we use DT on a data.frame. This will also make it consistent with what would happen when using a data.table and not a data.frame.

Issue 2: Naming a data.frame or data.table D leads to errors when used with DT function:

D <- copy(A)

D[1:3,] |> DT(D[4:5,], on="cyl")             # error
Error: object of type 'closure' is not subsettable

A[1:3,] |> DT(A[4:5,], on="cyl")             # works

D |> DT(, names(D) := lapply(.SD, sort))      # error
Error: LHS of := isn't column names ('character') or positions ('integer' or 'numeric')

A |> DT(, names(A) := lapply(.SD, sort))      # works

Info session

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.1

loaded via a namespace (and not attached):
 [1] zoo_1.8-9         compiler_4.1.0    htmltools_0.5.1.1 tools_4.1.0       xts_0.12.1        yaml_2.2.1       
 [7] rmarkdown_2.10    grid_4.1.0        knitr_1.33        xfun_0.23         digest_0.6.27     rlang_0.4.11     
[13] lattice_0.20-44   evaluate_0.14    
@mattdowle mattdowle added this to the 1.14.1 milestone Sep 2, 2021
@mattdowle
Copy link
Member

mattdowle commented Sep 2, 2021

Great find. I can reproduce the first error :

> D[1:3,] |> DT(D[4:5,], on="cyl")
Error: object of type 'closure' is not subsettable
> debugger()
Message:  Error: object of type 'closure' is not subsettable
Available environments had calls:
1: DT(D[1:3, ], D[4:5, ], on = "cyl")
2: `[.data.table`(x, ...)
3: tryCatch(eval(.massagei(isub), x, ienv), error = function(e) {
    if (grepl(":=.*defined for use in j.*only", e$message)) 
        stopf("Operator := detected in i, the first argument inside DT[..
4: tryCatchList(expr, classes, parentenv, handlers)
5: tryCatchOne(expr, names, parentenv, handlers[[1]])
6: value[[3]](cond)
7: .checkTypos(e, names_x)
8: stopf(err$message)
9: stop(gettextf(fmt, ..., domain = domain), domain = NA, call. = FALSE)

Seems like that line is finding stats::D function rather than the D object in calling scope. Hopefully that's just an enclos= to fix up. Or maybe the .massagei doesn't need to happen when i is data.frame.

I think that the expectation when calling a data.table query on a data.frame is that it should behave in a similar way; that is,
assignments and modifications should affect the original data.frame.

I'm glad you brought this up. I was thinking about that too. I'm not sure about expectations, but I fear the fear. Do we feel the fear and do it anyway, or do we be more cautious and fit into dplyr and R-style chains which copy-on-write?

Consider the following. It is currently possible in data.table-dev to modify mtcars by reference using DT().

# fresh R session
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> find("mtcars")
[1] "package:datasets"
> require(data.table)
Loading required package: data.table
data.table 1.14.1 IN DEVELOPMENT built 2021-09-02 03:56:34 UTC; mdowle using 6 threads (see ?getDTthreads).  Latest news: r-datatable.com
> mtcars |> DT(2,cyl:=NA)
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0  NA 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
[ snip data.frame print method output ]
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0  NA  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> find("mtcars")
[1] "package:datasets"
> head(datasets::mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0  NA  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

So it did actually change mtcars by reference in dataset's namespace. I'm thinking that leaving that door open is asking for trouble. In complicated dependency chains, and packages in companies that accept all kinds of input from different users at different times (i.e. hard to monitor and track), we can't have data.frame being changed by reference in other places, inadvertently. The idea was that := only works on data.table. That you have to opt-in to data.table to get modify-by-reference. You have to use a different operator (i.e. :=), and := has to be used on a data.table. In other words, there were deliberate hurdles/protections in place to achieve modify-by-reference. If we allow DT(,:=) to modify a data.frame by reference, then a package can be created which uses DT(,:=) to do so. Now, if that package accepts data input from the user, and a non-data.table-aware user passes a data.frame to that package, their data.frame might be modified by reference by that package. That crosses a line where the user didn't opt-in to that. We can't just trust all packages using DT(,:=) on a data.frame to copy() the data.frame first ... it needs to be prevented for safety. WDYT?

@grantmcdermott
Copy link
Contributor

My 2 cents:

it needs to be prevented for safety

Agree. I suppose the question is whether to disable := completely when passed a DF, or simply fall back to regular copy-on-modify assignment (maybe invoking copy, maybe passing to within).

My strong preference would be for the latter, possibly with a one-time warning.

@MichaelChirico
Copy link
Member

Agree with Grant here -- backing up to do what was asked, but safely, seems like the user-friendliest choice. I would hope this is most likely to happen in use cases where a copy is not too expensive

@Kamgang-B
Copy link
Contributor Author

Kamgang-B commented Sep 23, 2021

1-
From @mattdowle's first comment

Or maybe the .massagei doesn't need to happen when i is data.frame.

The issue does not happen only with a data.frame but also with a data.table.

2-
I thought that this issue (name conflict) was only related to the special case where the dataset is named D but I've just realized that it is much more general: When a dataset has the same name as a built-in function (at least those that I randomly chose), then this generates an error if the i argument starts with the dataset name (like in DT$... or DT[...]... etc.). The examples below explain what I mean in a better way.

# all dataset names below conflict with built-in functions
filter = choose = beta = mtcars
dt = df = D = as.data.table(filter)

df |> DT(df[, .I[which.max(mpg)], by=cyl]$V1)           # error 
dt |> DT(dt[, .I[which.max(mpg)], by=cyl]$V1)           # error
D  |> DT(D[, .I[which.max(mpg)], by=cyl]$V1)            # error
choose |> DT(choose[, .I[which.max(mpg)], by=cyl]$V1)   # error
filter |> DT(filter[, .I[which.max(mpg)], by=cyl]$V1)   # error 
beta   |> DT(beta[, .I[which.max(mpg)], by=cyl]$V1)     # error

I think that the issue is caused by the line .global$print = "" in the current implementation of DT function. When this line is removed, then everything works fine.

DT2 = function (x, ...) {
  old = getOption("datatable.optimize")
  if (!is.data.table(x) && old > 2L) {
	options(datatable.optimize = 2L)
  }
  ans = data.table:::`[.data.table`(x, ...)
  options(datatable.optimize = old)
  # data.table:::.global$print = ""                     # REMOVING .global$print = ""
  ans
}

df |> DT2(df[, .I[which.max(mpg)], by=cyl]$V1)          # works
dt |> DT2(dt[, .I[which.max(mpg)], by=cyl]$V1)          # works
D  |> DT2(D[, .I[which.max(mpg)], by=cyl]$V1)           # works
choose |> DT2(choose[, .I[which.max(mpg)], by=cyl]$V1)  # works
filter |> DT2(filter[, .I[which.max(mpg)], by=cyl]$V1)  # works
beta   |> DT2(beta[, .I[which.max(mpg)], by=cyl]$V1)    # works

Side effect of removing .global$print = "" : assignment at the end of a chain (when piping) does not print anything (behave like DT[, col:=value]) unless |> DT() is added. example: as.data.table(mtcars) |> DT(, halfmpg := mpg/2) does print anything while as.data.table(mtcars) |> DT(, halfmpg := mpg/2) |> DT() does print the data. This is discussed in #5106.

3-
About modification in place vs copy-on-modify
I understand the risk and agree that := should not modify data.frame by reference (and so agree that copy-on-modify is better). But I wonder if DT function could not gain a new argument inplace that defaults to FALSE. This would be useful in the special case where copying is expensive.

@mattdowle
Copy link
Member

mattdowle commented Sep 23, 2021

Thanks a lot for this investigation. Saves me a lot of time investigating that naming conflict. I'll take a look.

Just to reply on inplace in point 3 and the comments about copying being expensive. I've been working on making copy() be a shallow copy. Then it won't be expensive. There was talk of exporting shallow() but I always wanted to make copy() be shallow and just have copy(). That way, existing usage of copy() becomes faster too. After a copy(), the first := on a column would need to copy that column. It's a bit tricky for multiple reasons but that's the direction I'm working on at the moment. Let's reconsider inplace after copy() is shallow. I can see that inplace might still be useful after that so long as it didn't open the door to allowing an inplace update on a data.frame that was referenced by other objects.

@mattdowle
Copy link
Member

mattdowle commented Sep 24, 2021

@Kamgang-B PR #5176 now merged should solve issue 2 here. Thanks again for your investigation and tests. Still working on copy() and that will take longer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants