Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing bug in 'on' with complex variable names #3094

Merged
merged 14 commits into from
Oct 7, 2018
Merged
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,8 @@

1. Providing an `i` subset expression when attempting to delete a column correctly failed with helpful error, but when the column was missing too created a new column full of `NULL` values, [#3089](https://github.com/Rdatatable/data.table/issues/3089). Thanks to Michael Chirico for reporting.

2. Column names that look like expressions (e.g. `"a<=colB"`) caused an error when used in `on=` even when wrapped with backticks, [#3092](https://github.com/Rdatatable/data.table/issues/3092). Additionally, `on=` now supports white spaces around operators; e.g. `on = "colA == colB"`. Thanks to @mt1022 for reporting and to @MarkusBonsch for fixing.

#### NOTES

1. When data.table first loads it now checks the DLL's MD5. This is to detect installation issues on Windows when you upgrade and i) the DLL is in use by another R session and ii) the CRAN source version > CRAN binary binary which happens just after a new release (R prompts users to install from source until the CRAN binary is available). This situation can lead to a state where the package's new R code calls old C code in the old DLL; [R#17478](https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17478), [#3056](https://github.com/Rdatatable/data.table/issues/3056). This broken state can persist until, hopefully, you experience a strange error caused by the mismatch. Otherwise, wrong results may occur silently. This situation applies to any R package with compiled code not just data.table, is Windows-only, and is long-standing. It has only recently been understood as it typically only occurs during the few days after each new release until binaries are available on CRAN. Thanks to Gabor Csardi for the suggestion to use `tools::checkMD5sums()`.
Expand Down
135 changes: 101 additions & 34 deletions R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -488,40 +488,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
}
if (!missing(on)) {
# on = .() is now possible, #1257
parse_on <- function(onsub) {
ops = c("==", "<=", "<", ">=", ">", "!=")
pat = paste0("(", ops, ")", collapse="|")
if (is.call(onsub) && onsub[[1L]] == "eval") {
onsub = eval(onsub[[2L]], parent.frame(2L), parent.frame(2L))
if (is.call(onsub) && onsub[[1L]] == "eval") onsub = onsub[[2L]]
}
if (is.call(onsub) && as.character(onsub[[1L]]) %in% c("list", ".")) {
spat = paste0("[ ]+(", pat, ")[ ]+")
onsub = lapply(as.list(onsub)[-1L], function(x) gsub(spat, "\\1", deparse(x, width.cutoff=500L)))
onsub = as.call(c(quote(c), onsub))
}
on = eval(onsub, parent.frame(2L), parent.frame(2L))
if (!is.character(on))
stop("'on' argument should be a named atomic vector of column names indicating which columns in 'i' should be joined with which columns in 'x'.")
this_op = regmatches(on, gregexpr(pat, on))
idx = (vapply(this_op, length, 0L) == 0L)
this_op[idx] = "=="
this_op = unlist(this_op, use.names=FALSE)
idx_op = match(this_op, ops, nomatch=0L)
if (any(idx_op %in% c(0L, 6L)))
stop("Invalid operators ", paste(this_op[idx_op==0L], collapse=","), ". Only allowed operators are ", paste(ops[1:5], collapse=""), ".")
if (is.null(names(on))) {
on[idx] = if (isnull_inames) paste(on[idx], paste0("V", seq_len(sum(idx))), sep="==") else paste(on[idx], on[idx], sep="==")
} else {
on[idx] = paste(names(on)[idx], on[idx], sep="==")
}
split = tstrsplit(on, paste0("[ ]*", pat, "[ ]*"))
on = setattr(split[[2L]], 'names', split[[1L]])
if (length(empty_idx <- which(names(on) == "")))
names(on)[empty_idx] = on[empty_idx]
list(on = on, ops = idx_op)
}
on_ops = parse_on(substitute(on))
on_ops = .parse_on(substitute(on), isnull_inames)
on = on_ops[[1L]]
ops = on_ops[[2L]]
# TODO: collect all '==' ops first to speeden up Cnestedid
Expand Down Expand Up @@ -3056,3 +3023,103 @@ isReallyReal <- function(x) {
)
)
}


.parse_on <- function(onsub, isnull_inames) {
## helper that takes the 'on' string(s) and extracts comparison operators and column names from it.
#' @param onsub the substituted on
#' @param isnull_inames bool; TRUE if i has no names.
#' @return List with two entries:
#' 'on' : character vector providing the column names for the join.
#' Names correspond to columns in x, entries correspond to columns in i
#' 'ops': integer vector. Gives the indices of the operators that connect the columns in x and i.
ops = c("==", "<=", "<", ">=", ">", "!=")
pat = paste0("(", ops, ")", collapse="|")
if (is.call(onsub) && onsub[[1L]] == "eval") {
onsub = eval(onsub[[2L]], parent.frame(2L), parent.frame(2L))
if (is.call(onsub) && onsub[[1L]] == "eval") { onsub = onsub[[2L]] }
}
if (is.call(onsub) && as.character(onsub[[1L]]) %in% c("list", ".")) {
spat = paste0("[ ]+(", pat, ")[ ]+")
onsub = lapply(as.list(onsub)[-1L], function(x) gsub(spat, "\\1", deparse(x, width.cutoff=500L)))
onsub = as.call(c(quote(c), onsub))
}
on = eval(onsub, parent.frame(2L), parent.frame(2L))
if (!is.character(on))
stop("'on' argument should be a named atomic vector of column names indicating which columns in 'i' should be joined with which columns in 'x'.")
## extract the operators and potential variable names from 'on'.
## split at backticks to take care about variable names like `col1<=`.
pieces <- strsplit(on, "(?=[`])", perl = TRUE)
xCols <- character(length(on))
## if 'on' is named, the names are the xCols for sure
if(!is.null(names(on))){
xCols <- names(on)
}
iCols <- character(length(on))
operators <- character(length(on))
## loop over the elements and extract operators and column names.
for(i in seq_along(pieces)){
thisCols <- character(0)
thisOperators <- character(0)
j <- 1
while(j <= length(pieces[[i]])){
if(pieces[[i]][j] == "`"){
## start of a variable name with backtick.
thisCols <- c(thisCols, pieces[[i]][j+1])
j <- j+3 # +1 is the column name, +2 is delimiting "`", +3 is next relevant entry.`
} else {
## no backtick
## search for operators
thisOperators <- c(thisOperators,
unlist(regmatches(pieces[[i]][j], gregexpr(pat, pieces[[i]][j])),
use.names = FALSE))
## search for column names
thisCols <- c(thisCols, trimws(strsplit(pieces[[i]][j], pat)[[1]]))
## there can be empty string column names because of trimws, remove them
thisCols <- thisCols[thisCols != ""]
j <- j+1
}
}
if (length(thisOperators) == 0) {
## if no operator is given, it must be ==
operators[i] <- "=="
} else if (length(thisOperators) == 1) {
operators[i] <- thisOperators
} else {
## multiple operators found in one 'on' part. Something is wrong.
stop("Found more than one operator in one 'on' statement: ", on[i], ". Please specify a single operator.")
}
if (length(thisCols) == 2){
## two column names found, first is xCol, second is iCol for sure
xCols[i] <- thisCols[1]
iCols[i] <- thisCols[2]
} else if (length(thisCols) == 1){
## a single column name found. Can mean different things
if(xCols[i] != ""){
## xCol is given by names(on). thisCols must be iCol
iCols[i] <- thisCols[1]
} else if (isnull_inames){
## i has no names. It will be given the names V1, V2, ... automatically.
## The single column name is the x column. It will match to the ith column in i.
xCols[i] <- thisCols[1]
iCols[i] <- paste0("V", i)
} else {
## i has names and one single column name is given by on.
## This means that xCol and iCol have the same name.
xCols[i] <- thisCols[1]
iCols[i] <- thisCols[1]
}
} else if (length(thisCols) == 0){
stop("'on' contains no column name: ", on[i], ". Each 'on' clause must contain one or two column names.")
} else {
stop("'on' contains more than 2 column names: ", on[i], ". Each 'on' clause must contain one or two column names.")
}
}
idx_op = match(operators, ops, nomatch=0L)
if (any(idx_op %in% c(0L, 6L)))
stop("Invalid operators ", paste(operators[idx_op %in% c(0L, 6L)], collapse=","), ". Only allowed operators are ", paste(ops[1:5], collapse=""), ".")
## the final on will contain the xCol as name, the iCol as value
on <- iCols
names(on) <- xCols
return(list(on = on, ops = idx_op))
}
47 changes: 37 additions & 10 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -12278,28 +12278,55 @@ DT = data.table(A=1:5)
test(1947.1, DT[A<0, c('A','B'):=.(NULL, A)], error="When deleting columns, i should not be provided")
test(1947.2, DT, data.table(A=1:5))

## tests for backticks and spaces in column names of on=, #2931
DT <- data.table(id = 1:3, `counts(a>=0)` = 1:3, sameName = 1:3)
i <- data.table(idi = 1:3, ` weirdName>=` = 1:3, sameName = 1:3)
## test white spaces around operator
test(1948.1, DT[i, on = "id >= idi"], DT[i, on = "id>=idi"])
test(1948.2, DT[i, on = "id>= idi"], DT[i, on = "id>=idi"])
test(1948.3, DT[i, on = "id >=idi"], DT[i, on = "id>=idi"])
## test column names containing operators
test(1948.4, setnames(DT[i, on = "id>=` weirdName>=`"], c("id","counts(a>=0)", "sameName", " weirdName>=", "i.sameName")),
DT[i, on = "id>=idi"])
test(1948.5, setnames(DT[i, on = "id>=` weirdName>=`"], c("id","counts(a>=0)", "sameName", " weirdName>=", "i.sameName")),
DT[i, on = "id>=idi"])
test(1948.6, setnames(DT[i, on = "id >= ` weirdName>=`"], c("id","counts(a>=0)", "sameName", " weirdName>=", "i.sameName")),
DT[i, on = "id>=idi"])
test(1948.7, setnames(DT[i, on = "`counts(a>=0)`==` weirdName>=`"], c("id","counts(a>=0)", "sameName", " weirdName>=", "i.sameName")),
DT[i, on = "id==idi"])
## mixed example
test(1948.8, DT[i, on = c( id = "idi", "sameName", "`counts(a>=0)`==` weirdName>=`")], DT[i, on = "id==idi", c("id", "counts(a>=0)", "sameName")])
## testing 'eval' in on clause
test(1948.9, DT[i, on = eval(eval("id<=idi"))], DT[i, on = "id<=idi"])
## testing for errors
test(1948.11, DT[i, on = ""], error = "'on' contains no column name: . Each 'on' clause must contain one or two column names.")
test(1948.12, DT[i, on = "id>=idi>=1"], error = "Found more than one operator in one 'on' statement: id>=idi>=1. Please specify a single operator.")
test(1948.13, DT[i, on = "`id``idi`<=id"], error = "'on' contains more than 2 column names: `id``idi`<=id. Each 'on' clause must contain one or two column names.")
test(1948.14, DT[i, on = "id != idi"], error = "Invalid operators !=. Only allowed operators are ==<=<>=>.")
test(1948.15, DT[i, on = 1L], error = "'on' argument should be a named atomic vector of column names indicating which columns in 'i' should be joined with which columns in 'x'.")

# helpful error when on= is provided but not i, rather than silently ignoring on=
test(1948.1, DT[,,on=A], error="When i and j are both missing, no other argument should be used.")
test(1948.2, DT[,1,on=A], error="i must be provided when on= is provided")
test(1948.3, DT[1,,with=FALSE], error="j must be provided when with=FALSE")
test(1949.1, DT[,,on=A], error="When i and j are both missing, no other argument should be used.")
test(1949.2, DT[,1,on=A], error="i must be provided when on= is provided")
test(1949.3, DT[1,,with=FALSE], error="j must be provided when with=FALSE")

if (test_bit64) {
# explicit coverage of 2-column real case in uniqlist. Keeps coming up in codecov checks in PRs that don't touch uniqlist.c
DT = data.table(id=c("A","A","B","B","B"), v=as.integer64(c(1,2,3,3,4)))
test(1949, uniqlist(DT), INT(1,2,3,5))
test(1950, uniqlist(DT), INT(1,2,3,5))
}

# allow nomatch=NULL to work same as nomatch=0L, #857
d1 = data.table(a=1:3, b=2:4)
d2 = data.table(a=2:4, b=3:5)
test(1950.1, d1[d2, on="a", nomatch=NULL], d1[d2, on="a", nomatch=0L])
test(1950.2, d1[d2, on="b", nomatch=NULL], d1[d2, on="b", nomatch=0L])
test(1950.3, d1[d2, on=c("a","b"), nomatch=NULL], d1[d2, on=c("a","b"), nomatch=0L])
test(1950.4, d1[d2, nomatch=3], error="nomatch= must be either NA or NULL .or 0 for backwards compatibility")
test(1951.1, d1[d2, on="a", nomatch=NULL], d1[d2, on="a", nomatch=0L])
test(1951.2, d1[d2, on="b", nomatch=NULL], d1[d2, on="b", nomatch=0L])
test(1951.3, d1[d2, on=c("a","b"), nomatch=NULL], d1[d2, on=c("a","b"), nomatch=0L])
test(1951.4, d1[d2, nomatch=3], error="nomatch= must be either NA or NULL .or 0 for backwards compatibility")

# coverage of which= checks
test(1951.1, d1[a==2, which=3], error="which= must be a logical vector length 1. Either FALSE, TRUE or NA.")
test(1951.2, d1[a==2, 2, which=TRUE], error="which==TRUE.*but j is also supplied")
test(1952.1, d1[a==2, which=3], error="which= must be a logical vector length 1. Either FALSE, TRUE or NA.")
test(1952.2, d1[a==2, 2, which=TRUE], error="which==TRUE.*but j is also supplied")


###################################
Expand Down