Commit 04eeb30

as.data.table.list centralized (#3471)
1 parent be1fae5 commit 04eeb30

8 files changed, +213 -240 lines changed

NEWS.md

+20 -2
@@ -92,6 +92,8 @@
 
 15. `DT[order(col)[1:5], ...]` (i.e. where `i` is a compound expression involving `order()`) is now optimized to use `data.table`'s multithreaded `forder`, [#1921](https://github.com/Rdatatable/data.table/issues/1921). This example is not a fully optimal top-N query since the full ordering is still computed. The improvement is that the call to `order()` is computed faster for any `i` expression using `order`.
 
+16. `as.data.table` now unpacks columns in a `data.frame` which are themselves a `data.frame`. This need arises when parsing JSON, a corollary in [#3369](https://github.com/Rdatatable/data.table/issues/3369#issuecomment-462662752). `data.table` does not allow columns to be objects which themselves have columns (such as `matrix` and `data.frame`), unlike `data.frame` which does. Bug fix 19 in v1.12.2 (see below) added a helpful error (rather than segfault) to detect such invalid `data.table`, and promised that `as.data.table()` would unpack these columns in the next release (i.e. this release) so that the invalid `data.table` is not created in the first place.
+
 #### BUG FIXES
 
 1. `first`, `last`, `head` and `tail` by group no longer error in some cases, [#2030](https://github.com/Rdatatable/data.table/issues/2030) [#3462](https://github.com/Rdatatable/data.table/issues/3462). Thanks to @franknarf1 for reporting.
@@ -134,7 +136,23 @@
 
 20. `c`, `seq` and `mean` of `ITime` objects now retain the `ITime` class via new `ITime` methods, [#3628](https://github.com/Rdatatable/data.table/issues/3628). Thanks @UweBlock for reporting. The `cut` and `split` methods for `ITime` have been removed since the default methods work, [#3630](https://github.com/Rdatatable/data.table/pull/3630).
 
-20. `as.data.table.array` now handles the case when some of the array's dimension names are `NULL`, [#3636](https://github.com/Rdatatable/data.table/issues/3636).
+21. `as.data.table.array` now handles the case when some of the array's dimension names are `NULL`, [#3636](https://github.com/Rdatatable/data.table/issues/3636).
+
+22. Adding a `list` column using `cbind`, `as.data.table`, or `data.table` now works rather than treating the `list` as if it were a set of columns, [#3471](https://github.com/Rdatatable/data.table/pull/3471). However, please note that using `:=` to add columns is preferred.
+
+```R
+# cbind( data.table(1:2), list(c("a","b"),"a") )
+#       V1     V2     NA   # v1.12.2 and before
+#    <int> <char> <char>   # introduced invalid NA column name too
+# 1:     1      a      a
+# 2:     2      b      a
+#
+#       V1     V2          # v1.12.4+
+#    <int> <list>
+# 1:     1    a,b
+# 2:     2      a
+```
+
 
 #### NOTES
 
@@ -231,7 +249,7 @@
 
 18. `cbind` with a null (0-column) `data.table` now works as expected, [#3445](https://github.com/Rdatatable/data.table/issues/3445). Thanks to @mb706 for reporting.
 
-19. Subsetting does a better job of catching a malformed `data.table` with error rather than segfault. A column may not be NULL, nor may a column be an object such as a data.frame or matrix which have columns. Thanks to a comment and reproducible example in [#3369](https://github.com/Rdatatable/data.table/issues/3369) from Drew Abbot which demonstrated the issue which arose from parsing JSON. The next release will enable `as.data.table` to unpack columns which are data.frame to support this use case.
+19. Subsetting does a better job of catching a malformed `data.table` with error rather than segfault. A column may not be NULL, nor may a column be an object which has columns (such as a `data.frame` or `matrix`). Thanks to a comment and reproducible example in [#3369](https://github.com/Rdatatable/data.table/issues/3369) from Drew Abbot which demonstrated the issue which arose from parsing JSON. The next release will enable `as.data.table` to unpack columns which are `data.frame` to support this use case.
 
 #### NOTES
 
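
The unpacking described in NEWS item 16 can be illustrated with a small sketch. This is not part of the commit. It assumes a `data.table` release that includes this change, and the toy `df` and its column names are made up for illustration. The names given to the unpacked columns may differ between releases, so the sketch only checks the shape of the result.

```r
library(data.table)

# Hypothetical input: a data.frame with a column that is itself a data.frame,
# the shape that JSON parsers can produce.
df <- data.frame(id = 1:2)
df$inner <- data.frame(a = c("x", "y"), b = c(TRUE, FALSE))

dt <- as.data.table(df)

# The nested data.frame is unpacked into ordinary columns:
# 'id' plus the two columns of 'inner'.
n_cols <- ncol(dt)

# No remaining column is an object that itself has columns.
has_nested <- any(vapply(dt, function(col) !is.null(dim(col)), logical(1)))

print(n_cols)
print(has_nested)
```

Before this change, such an input could yield an invalid `data.table` whose `inner` column was still a `data.frame`.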

R/as.data.table.R

+68 -38
@@ -115,49 +115,75 @@ as.data.table.array = function(x, keep.rownames=FALSE, key=NULL, sorted=TRUE, va
   ans[]
 }
 
-as.data.table.list = function(x, keep.rownames=FALSE, key=NULL, ...) {
-  wn = sapply(x,is.null)
-  if (any(wn)) x = x[!wn]
-  if (!length(x)) return( null.data.table() )
-  # fix for #833, as.data.table.list with matrix/data.frame/data.table as a list element..
-  # TODO: move this entire logic (along with data.table() to C
-  for (i in seq_along(x)) {
-    dims = dim(x[[i]])
-    if (!is.null(dims)) {
-      ans = do.call("data.table", x)
-      setnames(ans, make.unique(names(ans)))
-      return(ans)
+as.data.table.list = function(x, keep.rownames=FALSE, key=NULL, check.names=FALSE, ...) {
+  n = length(x)
+  eachnrow = integer(n) # vector of lengths of each column. may not be equal if silent repetition is required.
+  eachncol = integer(n)
+  missing.check.names = missing(check.names)
+  for (i in seq_len(n)) {
+    xi = x[[i]]
+    if (is.null(xi)) next # eachncol already initialized to 0 by integer() above
+    if (!is.null(dim(xi)) && missing.check.names) check.names=TRUE
+    if ("POSIXlt" %chin% class(xi)) {
+      warning("POSIXlt column type detected and converted to POSIXct. We do not recommend use of POSIXlt at all because it uses 40 bytes to store one date.")
+      xi = x[[i]] = as.POSIXct(xi)
+    } else if (is.matrix(xi) || is.data.frame(xi)) {
+      if (!is.data.table(xi)) {
+        xi = x[[i]] = as.data.table(xi, keep.rownames=keep.rownames) # we will never allow a matrix to be a column; always unpack the columns
+      }
+      # else avoid dispatching to as.data.table.data.table (which exists and copies)
+    } else if (is.table(xi)) {
+      xi = x[[i]] = as.data.table.table(xi, keep.rownames=keep.rownames)
+    } else if (is.function(xi)) {
+      xi = x[[i]] = list(xi)
     }
+    eachnrow[i] = NROW(xi) # for a vector (including list() columns) returns the length
+    eachncol[i] = NCOL(xi) # for a vector returns 1
   }
-  n = vapply(x, length, 0L)
-  mn = max(n)
-  x = copy(x)
-  idx = which(n < mn)
-  if (length(idx)) {
-    for (i in idx) {
-      # any is.null(x[[i]]) were removed above, otherwise warning when a list element is NULL
-      if (inherits(x[[i]], "POSIXlt")) {
-        warning("POSIXlt column type detected and converted to POSIXct. We do not recommend use of POSIXlt at all because it uses 40 bytes to store one date.")
-        x[[i]] = as.POSIXct(x[[i]])
-      }
-      # Implementing FR #4813 - recycle with warning when nr %% nrows[i] != 0L
-      if (!n[i] && mn)
-        warning("Item ", i, " is of size 0 but maximum size is ", mn, ", therefore recycled with 'NA'")
-      else if (n[i] && mn %% n[i] != 0L)
-        warning("Item ", i, " is of size ", n[i], " but maximum size is ", mn, " (recycled leaving a remainder of ", mn%%n[i], " items)")
-      x[[i]] = rep(x[[i]], length.out=mn)
+  ncol = sum(eachncol) # hence removes NULL items silently (no error or warning), #842.
+  if (ncol==0L) return(null.data.table())
+  nrow = max(eachnrow)
+  ans = vector("list",ncol) # always return a new VECSXP
+  recycle = function(x, nrow) {
+    if (length(x)==nrow) {
+      return(copy(x))
+      # This copy used to be achieved via .Call(CcopyNamedInList,x) at the top of data.table(). It maintains pre-Rv3.1.0
+      # behavior, for now. See test 548.2. The copy() calls duplicate() at C level which (importantly) also expands ALTREP objects.
+      # TODO: port this as.data.table.list() to C and use MAYBE_REFERENCED(x) || ALTREP(x) to save some copies.
+      # That saving used to be done by CcopyNamedInList but the copies happened again as well, so removing CcopyNamedInList is
+      # not worse than before, and gets us in a better centralized place to port as.data.table.list to C and use MAYBE_REFERENCED
+      # again in future.
     }
+    if (identical(x,list())) vector("list", nrow) else rep(x, length.out=nrow) # new objects don't need copy
   }
-  # fix for #842
-  if (mn > 0L) {
-    nz = which(n > 0L)
-    xx = point(vector("list", length(nz)), seq_along(nz), x, nz)
-    if (!is.null(names(x)))
-      setattr(xx, 'names', names(x)[nz])
-    x = xx
+  vnames = character(ncol)
+  k = 1L
+  for(i in seq_len(n)) {
+    xi = x[[i]]
+    if (is.null(xi)) next
+    if (eachnrow[i]>1L && nrow%%eachnrow[i]!=0L) # in future: eachnrow[i]!=nrow
+      warning("Item ", i, " has ", eachnrow[i], " rows but longest item has ", nrow, "; recycled with remainder.")
+    if (eachnrow[i]==0L && nrow>0L && is.atomic(xi)) # is.atomic to ignore list() since list() is a common way to initialize; let's not insist on list(NULL)
+      warning("Item ", i, " has 0 rows but longest item has ", nrow, "; filled with NA") # the rep() in recycle() above creates the NA vector
+    if (is.data.table(xi)) { # matrix and data.frame were coerced to data.table above
+      # vnames[[i]] = names(xi) #if (nm!="" && n>1L) paste(nm, names(xi), sep=".") else names(xi)
+      for (j in seq_along(xi)) {
+        ans[[k]] = recycle(xi[[j]], nrow)
+        vnames[k] = names(xi)[j]
+        k = k+1L
+      }
+    } else {
+      nm = names(x)[i]
+      vnames[k] = if (length(nm) && !is.na(nm) && nm!="") nm else paste0("V",i)
+      ans[[k]] = recycle(xi, nrow)
+      k = k+1L
+    }
   }
-  setDT(x, key=key) # copy ensured above; also, setDT handles naming
-  x
+  if (any(vnames==".SD")) stop("A column may not be called .SD. That has special meaning.")
+  if (check.names) vnames = make.names(vnames, unique=TRUE)
+  setattr(ans, "names", vnames)
+  setDT(ans, key=key) # copy ensured above; also, setDT handles naming
+  ans
 }
 
 # don't retain classes before "data.frame" while converting
@@ -180,6 +206,10 @@ as.data.table.data.frame = function(x, keep.rownames=FALSE, key=NULL, ...) {
     setnames(ans, 'rn', keep.rownames[1L])
     return(ans)
   }
+  if (any(!sapply(x,is.atomic))) {
+    # a data.frame with a column that is data.frame needs to be expanded; test 2013.4
+    return(as.data.table.list(x, keep.rownames=keep.rownames, ...))
+  }
   ans = copy(x) # TO DO: change this deep copy to be shallow.
   setattr(ans, "row.names", .set_row_names(nrow(x)))
 
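
Two behaviours of the centralized `as.data.table.list` can be sketched from the caller's side: `NULL` items are dropped silently, and a length-1 item is recycled to the longest item without a warning. This is not part of the commit. The input list is made up for illustration, and the sketch assumes a `data.table` release that includes this change.

```r
library(data.table)

# Hypothetical input: one full-length item, one NULL item, one length-1 item.
x <- list(a = 1:4, b = NULL, c = "x")

dt <- as.data.table(x)

# 'b' contributes no column; 'c' is recycled to the 4 rows of 'a'.
print(names(dt))
print(nrow(dt))
print(dt$c)

# A list holding only NULL has no columns at all, so the result is a
# null (0-column) data.table.
empty <- as.data.table(list(NULL))
print(ncol(empty))
```

In the code above, the warning branch requires `eachnrow[i] > 1L`, which is why the length-1 item `c` is recycled silently. The "recycled with remainder" warning is raised only when the item's length does not divide the longest length.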
