Skip to content

Commit

Permalink
Move some setDT validation checks to C (#5427)
Browse files Browse the repository at this point in the history
* move setDT validation checks to C

* fix some tests

* restore prototype

* Rename to reflect return value

* stab at sharing message at R+C levels

* Compiles & basically runs

* Run POSIXlt test first; save LENGTH(dim) to reuse

* new test

* warning, not error

* correct comment

* setDT faster

---------

Co-authored-by: Toby Dylan Hocking <tdhock5@gmail.com>
Co-authored-by: Toby Dylan Hocking <Toby.Hocking@nau.edu>
  • Loading branch information
3 people authored Apr 24, 2024
1 parent 29a0f13 commit 2487c61
Show file tree
Hide file tree
Showing 7 changed files with 75 additions and 25 deletions.
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,8 @@
11. `split.data.table` recognizes `sep=` when splitting with `by=`, just like the default and data.frame methods [#5417](https://github.com/Rdatatable/data.table/issues/5417).
12. `setDT` is faster for data with many columns, thanks @MichaelChirico for reporting and fixing the issue, [#5426](https://github.com/Rdatatable/data.table/issues/5426).
## BUG FIXES
1. `unique()` returns a copy the case when `nrows(x) <= 1` instead of a mutable alias, [#5932](https://github.com/Rdatatable/data.table/pull/5932). This is consistent with existing `unique()` behavior when the input has no duplicates but more than one row. Thanks to @brookslogan for the report and @dshemetov for the fix.
Expand Down
35 changes: 13 additions & 22 deletions R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -2846,19 +2846,21 @@ setDT = function(x, keep.rownames=FALSE, key=NULL, check.names=FALSE) {
stopf("Cannot convert '%1$s' to data.table by reference because binding is locked. It is very likely that '%1$s' resides within a package (or an environment) that is locked to prevent modifying its variable bindings. Try copying the object to your current environment, ex: var <- copy(var) and then using setDT again.", cname)
}
}
# check no matrix-like columns, #3760. Other than a single list(matrix) is unambiguous and depended on by some revdeps, #3581
if (length(x)>1L) {
idx = vapply_1i(x, function(xi) length(dim(xi)))>1L
if (any(idx))
warningf("Some columns are a multi-column type (such as a matrix column): %s. setDT will retain these columns as-is but subsequent operations like grouping and joining may fail. Please consider as.data.table() instead which will create a new column for each embedded column.", brackify(which(idx)))
}
if (is.data.table(x)) {
# fix for #1078 and #1128, see .resetclass() for explanation.
setattr(x, 'class', .resetclass(x, 'data.table'))
if (!missing(key)) setkeyv(x, key) # fix for #1169
if (check.names) setattr(x, "names", make.names(names(x), unique=TRUE))
if (selfrefok(x) > 0L) return(invisible(x)) else setalloccol(x)
} else if (is.data.frame(x)) {
# check no matrix-like columns, #3760. Allow a single list(matrix) is unambiguous and depended on by some revdeps, #3581
# for performance, only warn on the first such column, #5426
for (jj in seq_along(x)) {
if (length(dim(x[[jj]])) > 1L) {
.Call(Cwarn_matrix_column_r, jj)
break
}
}
rn = if (!identical(keep.rownames, FALSE)) rownames(x) else NULL
setattr(x, "row.names", .set_row_names(nrow(x)))
if (check.names) setattr(x, "names", make.names(names(x), unique=TRUE))
Expand All @@ -2877,22 +2879,11 @@ setDT = function(x, keep.rownames=FALSE, key=NULL, check.names=FALSE) {
x = null.data.table()
} else if (is.list(x)) {
# copied from as.data.table.list - except removed the copy
for (i in seq_along(x)) {
if (is.null(x[[i]])) next # allow NULL columns to be created by setDT(list) even though they are not really allowed
# many operations still work in the presence of NULL columns and it might be convenient
# e.g. in package eplusr which calls setDT on a list when parsing JSON. Operations which
# fail for NULL columns will give helpful error at that point, #3480 and #3471
if (inherits(x[[i]], "POSIXlt")) stopf("Column %d is of POSIXlt type. Please convert it to POSIXct using as.POSIXct and run setDT again. We do not recommend use of POSIXlt at all because it uses 40 bytes to store one date.", i)
}
n = vapply_1i(x, length)
n_range = range(n)
if (n_range[1L] != n_range[2L]) {
tbl = sort(table(n))
stopf("All elements in argument 'x' to 'setDT' must be of same length, but the profile of input lengths (length:frequency) is: %s\nThe first entry with fewer than %d entries is %d.", brackify(sprintf('%s:%d', names(tbl), tbl)), n_range[2L], which.max(n<n_range[2L]))
}
nrow = .Call(Csetdt_nrows, x)

xn = names(x)
if (is.null(xn)) {
setattr(x, "names", paste0("V",seq_len(length(x))))
setattr(x, "names", paste0("V", seq_along(x)))
} else {
idx = xn %chin% "" # names can be NA - test 1006 caught that!
if (any(idx)) {
Expand All @@ -2901,8 +2892,8 @@ setDT = function(x, keep.rownames=FALSE, key=NULL, check.names=FALSE) {
}
if (check.names) setattr(x, "names", make.names(xn, unique=TRUE))
}
setattr(x,"row.names",.set_row_names(n_range[2L]))
setattr(x,"class",c("data.table","data.frame"))
setattr(x, "row.names", .set_row_names(nrow))
setattr(x, "class", c("data.table", "data.frame"))
setalloccol(x)
} else {
stopf("Argument 'x' to 'setDT' should be a 'list', 'data.frame' or 'data.table'")
Expand Down
12 changes: 9 additions & 3 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -8591,7 +8591,7 @@ dt = data.table(d1="1984-03-17")
ans = data.table(d1="1984-03-17", d2=as.POSIXct("1984-03-17", tz='UTC'))
test(1612.2, dt[, d2 := strptime(d1, "%Y-%m-%d", tz='UTC')], ans, warning="strptime() usage detected and wrapped with as.POSIXct()")
ll = list(a=as.POSIXlt("2015-01-01"), b=2L)
test(1612.3, setDT(ll), error="Column 1 is of POSIXlt type")
test(1612.3, setDT(ll), error="Column 1 has class 'POSIXlt'")

# tests for all.equal.data.table #1106
# diff nrow
Expand Down Expand Up @@ -13723,7 +13723,13 @@ DT = as.data.table(M)
test(1964.3, rownames(DT), character(0))
test(1964.4, colnames(DT), c("v1","v2"))

test(1965, setDT(list(1, 1:2)), error = 'profile of input lengths') #3121
test(1965.1, setDT(list(1, 1:2)), error='input 2 has length 2.*length 1') #3121
# setDT() skips NULL elements, so get the length from the first non-NULL
test(1965.2, setDT(list(NULL, 1, 1:2)), error='input 3 has length 2.*length 1')
M = matrix(nrow=2L, ncol=2L)
ans = structure(list(V1=1:2, V2=M), class=c('data.table', 'data.frame'))
alloc.col(ans)
test(1965.3, setDT(list(1:2, M)), ans, warning='Some columns are a multi-column type.*for example column 2')

# fread/fwrite file name in native and utf-8 encoding, #3078
if (.Platform$OS.type=="windows") {
Expand Down Expand Up @@ -16050,7 +16056,7 @@ test(2088, x[, round(.SD, 1)][, c:=8.88], data.table(a=c(.8, -.4, 1.2), b=c(.6,.
DF = data.frame(a=1:5)
DF$m = matrix(6:15, ncol=2L)
test(2089.1, names(setDT(DF)), c("a","m"),
warning="Some columns are a multi-column type.*[[]2[]].*setDT will retain these columns as-is but.*Please consider as.data.table")
warning="Some columns are a multi-column type.*for example column 2.*setDT will retain these columns as-is but.*Please consider as.data.table")
test(2089.2, as.data.table(DF), data.table(a=1:5, m.V1=6:10, m.V2=11:15)) # the DF here is now a data.table, so as.data.table.data.table() is matrix-column-aware
test(2089.3, print(DF), output="<multi-column>")
DF = data.frame(a=1:5)
Expand Down
42 changes: 42 additions & 0 deletions src/assign.c
Original file line number Diff line number Diff line change
Expand Up @@ -200,6 +200,48 @@ static SEXP shallow(SEXP dt, SEXP cols, R_len_t n)
return(newdt);
}

// Wrapped in a function so the same message is issued for the data.frame case at the R level
void warn_matrix_column(/* 1-indexed */ int i) {
warning(_("Some columns are a multi-column type (such as a matrix column), for example column %d. setDT will retain these columns as-is but subsequent operations like grouping and joining may fail. Please consider as.data.table() instead which will create a new column for each embedded column."), i);
}

// input validation for setDT() list input; assume is.list(x) was tested in R
SEXP setdt_nrows(SEXP x)
{
int base_length = 0;
bool test_matrix_cols = true;

for (R_len_t i = 0; i < LENGTH(x); ++i) {
SEXP xi = VECTOR_ELT(x, i);
/* allow NULL columns to be created by setDT(list) even though they are not really allowed
* many operations still work in the presence of NULL columns and it might be convenient
* e.g. in package eplusr which calls setDT on a list when parsing JSON. Operations which
* fail for NULL columns will give helpful error at that point, #3480 and #3471 */
if (Rf_isNull(xi)) continue;
if (Rf_inherits(xi, "POSIXlt")) {
error(_("Column %d has class 'POSIXlt'. Please convert it to POSIXct (using as.POSIXct) and run setDT() again. We do not recommend the use of POSIXlt at all because it uses 40 bytes to store one date."), i+1);
}
SEXP dim_xi = getAttrib(xi, R_DimSymbol);
R_len_t len_xi;
R_len_t n_dim = LENGTH(dim_xi);
if (n_dim) {
if (test_matrix_cols && n_dim > 1) {
warn_matrix_column(i+1);
test_matrix_cols = false;
}
len_xi = INTEGER(dim_xi)[0];
} else {
len_xi = LENGTH(xi);
}
if (!base_length) {
base_length = len_xi;
} else if (len_xi != base_length) {
error(_("All elements in argument 'x' to 'setDT' must be of equal length, but input %d has length %d whereas the first non-empty input had length %d"), i+1, len_xi, base_length);
}
}
return ScalarInteger(base_length);
}

SEXP alloccol(SEXP dt, R_len_t n, Rboolean verbose)
{
SEXP names, klass; // klass not class at request of pydatatable because class is reserved word in C++, PR #3129
Expand Down
3 changes: 3 additions & 0 deletions src/data.table.h
Original file line number Diff line number Diff line change
Expand Up @@ -175,6 +175,7 @@ SEXP dt_na(SEXP x, SEXP cols);
SEXP alloccol(SEXP dt, R_len_t n, Rboolean verbose);
const char *memrecycle(const SEXP target, const SEXP where, const int start, const int len, SEXP source, const int sourceStart, const int sourceLen, const int colnum, const char *colname);
SEXP shallowwrapper(SEXP dt, SEXP cols);
void warn_matrix_column(int i);

SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols,
SEXP xjiscols, SEXP grporder, SEXP order, SEXP starts,
Expand Down Expand Up @@ -275,6 +276,7 @@ SEXP notchin(SEXP x, SEXP table);
SEXP setattrib(SEXP, SEXP, SEXP);
SEXP assign(SEXP, SEXP, SEXP, SEXP, SEXP);
SEXP copy(SEXP);
SEXP setdt_nrows(SEXP);
SEXP alloccolwrapper(SEXP, SEXP, SEXP);
SEXP selfrefokwrapper(SEXP, SEXP);
SEXP truelength(SEXP);
Expand Down Expand Up @@ -317,6 +319,7 @@ SEXP glast(SEXP);
SEXP gfirst(SEXP);
SEXP gnthvalue(SEXP, SEXP);
SEXP dim(SEXP);
SEXP warn_matrix_column_r(SEXP);
SEXP gvar(SEXP, SEXP);
SEXP gsd(SEXP, SEXP);
SEXP gprod(SEXP, SEXP);
Expand Down
2 changes: 2 additions & 0 deletions src/init.c
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,7 @@ R_CallMethodDef callMethods[] = {
{"Cdogroups", (DL_FUNC) &dogroups, -1},
{"Ccopy", (DL_FUNC) &copy, -1},
{"Cshallowwrapper", (DL_FUNC) &shallowwrapper, -1},
{"Csetdt_nrows", (DL_FUNC) &setdt_nrows, -1},
{"Calloccolwrapper", (DL_FUNC) &alloccolwrapper, -1},
{"Cselfrefokwrapper", (DL_FUNC) &selfrefokwrapper, -1},
{"Ctruelength", (DL_FUNC) &truelength, -1},
Expand Down Expand Up @@ -141,6 +142,7 @@ R_CallMethodDef callMethods[] = {
{"CstartsWithAny", (DL_FUNC)&startsWithAny, -1},
{"CconvertDate", (DL_FUNC)&convertDate, -1},
{"Cnotchin", (DL_FUNC)&notchin, -1},
{"Cwarn_matrix_column_r", (DL_FUNC)&warn_matrix_column_r, -1},
{NULL, NULL, 0}
};

Expand Down
4 changes: 4 additions & 0 deletions src/wrappers.c
Original file line number Diff line number Diff line change
Expand Up @@ -120,3 +120,7 @@ SEXP dim(SEXP x)
return ans;
}

SEXP warn_matrix_column_r(SEXP i) {
warn_matrix_column(INTEGER(i)[0]);
return R_NilValue;
}

0 comments on commit 2487c61

Please sign in to comment.