as.data.table.array - convert multidimensional array into data.table #1418

jangorecki · 2015-10-30T21:33:35Z

Feature request for converting a multidimensional array to a data.table.
The logic behind the conversion is to look up the value in the array for each combination of dimensions. The rationale is not only the similar subsetting API of array and data.table (see the examples below) but also the underlying organization of the data. It reduces the array dimensions to a tabular structure while keeping all the relations between the dimensions and the corresponding value of a measure, so the conversion is lossless.
The solution below is likely to be inefficient because it looks up a value in the array for each group. The j argument may look scary, but it simply builds the call .(value = x[color, year, country]) to subset the x array for each group.

library(data.table)
set.seed(1)

# array
ar = array(rnorm(8,10,5), rep(2,3), dimnames = list(color = c("green","red"), year = c("2014","2015"), country = c("UK","IN")))
ar["green","2015",]
ar["green",c("2014","2015"),]

# data.table
as.data.table.array = function(x) {
    # cross join of all dimension names, then look up x[color, year, country] per group
    do.call(CJ, dimnames(x))[
        , .(value = eval(as.call(lapply(c("[", "x", names(dimnames(x))), as.symbol)))),
        keyby = c(names(dimnames(x)))
    ]
}
dt = as.data.table.array(ar)
dt[J("green","2015")]
dt[J("green", c("2014","2015"))]
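
The call construction used in j above can be inspected on its own, without data.table. The following is a minimal base R sketch, assuming a small integer array with the same dimnames as in the example; the names dim_names, lookup and value are illustrative only and do not appear in the proposal.

```r
# Same dimnames as the example above, but integer values so results are easy to check.
x <- array(1:8, rep(2, 3),
           dimnames = list(color = c("green", "red"),
                           year = c("2014", "2015"),
                           country = c("UK", "IN")))

# Turn "[", "x" and each dimension name into symbols, then into one call.
dim_names <- names(dimnames(x))
lookup <- as.call(lapply(c("[", "x", dim_names), as.symbol))
print(lookup)  # x[color, year, country]

# Evaluate the call for a single combination of dimension values,
# which is what happens once per group in the data.table version.
value <- eval(lookup, list(color = "green", year = "2015", country = "IN"))
```

Here value is the same element as x["green", "2015", "IN"].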

update after merge: http://stackoverflow.com/questions/11141406/reshaping-an-array-to-data-frame


jangorecki · 2015-11-04T18:57:28Z

I already have this well managed in a separate project.

jangorecki · 2016-03-22T19:07:51Z

Reopening, as this is worth improving. Current state:

library(data.table)
x = array(c(1, 0, 0, 2, 0, 0, 0, 3), dim=c(2, 2, 2))
as.data.frame(x)
#  V1 V2 V3 V4
#1  1  0  0  0
#2  0  2  0  3
as.data.table(x)
#   x
#1: 1
#2: 0
#3: 0
#4: 2
#5: 0
#6: 0
#7: 0
#8: 3

I would NOT aim for consistency with data.frame here, as it doesn't really provide useful output for arrays.

new.as.data.table.array = function(x) {
    d = dim(x)
    dn = dimnames(x)
    if (is.null(dn)) dn = lapply(d, seq.int)
    r = do.call(CJ, c(dn, list(sorted=TRUE, unique=TRUE)))
    dim.cols = copy(names(r))
    jj = as.call(list(
        as.name(":="),
        "value",
        as.call(lapply(c("[","x", dim.cols), as.symbol)) # lookup to 'x' array for each row
    )) # `:=`("value", x[V1, V2, V3])
    r[, eval(jj), by=c(dim.cols)][]
}
new.as.data.table.array(x)
#   V1 V2 V3 value
#1:  1  1  1     1
#2:  1  1  2     0
#3:  1  2  1     0
#4:  1  2  2     0
#5:  2  1  1     0
#6:  2  1  2     0
#7:  2  2  1     2
#8:  2  2  2     3

It would handle the use case described in the previous comments:

set.seed(1)
# array
x = array(rnorm(8,10,5), rep(2,3), dimnames = list(color = c("green","red"), year = c("2014","2015"), country = c("UK","IN")))
x["green","2015",]
#      UK       IN 
#17.55891 15.62465 
x["green",c("2014","2015"),]
#      country
#year         UK        IN
#  2014 12.87891  6.893797
#  2015 17.55891 15.624655

dt = new.as.data.table.array(x)
dt[J("green","2015")]
#   color year country    value
#1: green 2015      IN 15.62465
#2: green 2015      UK 17.55891
dt[J("green", c("2014","2015"))]
#   color year country     value
#1: green 2014      IN  6.893797
#2: green 2014      UK 12.878907
#3: green 2015      IN 15.624655
#4: green 2015      UK 17.558906

Any feedback on the draft is welcome.

MichaelChirico · 2016-03-22T21:12:49Z

Maybe better naming than V1:3? Not sure how standard i,j,k is, perhaps dim_1:3?

Like the idea though.

jangorecki · 2016-03-22T22:28:33Z

@MichaelChirico as.data.table.* needs to be a fairly low-level conversion that adds as little data.table-specific metadata as possible. If the source array doesn't have names, I think it is better to use data.table's default ones: V1:V3

I just pushed an RC version, so feedback on it or new tests are welcome. After a while I will rebase it onto master.
https://github.com/Rdatatable/data.table/compare/as.data.table.array

in summary:

Arrays don't scale to more dimensions or to sparse data, because they store the full cartesian product of the dimensions.
as.data.table.array makes it possible to keep arrays in a sparse form, modelling a multidimensional array as a tabular structure, which is how I believe it should be modelled.
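
One way to read the sparsity point above is that a long table only needs rows for the cells that actually carry a value. The following base R sketch assumes that zeros are the cells to omit, which the thread does not specify; the names idx and sparse are illustrative.

```r
# The 2 x 2 x 2 example array from this thread: 8 cells, only 3 non-zero.
x <- array(c(1, 0, 0, 2, 0, 0, 0, 3), dim = c(2, 2, 2))

# One row of indices per non-zero cell.
idx <- which(x != 0, arr.ind = TRUE)
colnames(idx) <- paste0("V", seq_len(ncol(idx)))

# Matrix indexing x[idx] picks the value at each index row.
sparse <- data.frame(idx, value = x[idx])
print(sparse)
```

The dense array stores all 8 cells, while the long table keeps 3 rows.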

@arunsrinivasan FYI:
The worst thing about that PR is that it is not performance focused: for each combination of dimensions in the cartesian product we look up a value in the input array x, roughly like dt[, value := x[a, b], by = c("a","b")]. It may not even need to scale better, given how poorly arrays themselves scale in memory.
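
For comparison, the per-group lookup can be avoided entirely, because an array is stored with its first dimension varying fastest and expand.grid generates combinations in the same order. The following is a base R sketch under stated assumptions, not the implementation from the PR: array_to_long is a hypothetical name, and dimnames are assumed to be either fully present or entirely absent.

```r
array_to_long <- function(x) {
  stopifnot(is.array(x))
  d <- dim(x)
  dn <- dimnames(x)
  if (is.null(dn)) dn <- lapply(d, seq_len)  # integer indices when unnamed
  if (is.null(names(dn))) names(dn) <- paste0("V", seq_along(d))
  # expand.grid varies its first factor fastest, matching array storage order,
  # so the flattened array lines up with the grid row by row.
  grid <- expand.grid(dn, KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
  grid$value <- as.vector(x)
  grid
}

x <- array(c(1, 0, 0, 2, 0, 0, 0, 3), dim = c(2, 2, 2))
res <- array_to_long(x)
```

The result is a plain data.frame in storage order rather than a keyed data.table; it could be converted with data.table::setDT and keyed on the dimension columns afterwards.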

MichaelChirico · 2016-03-23T13:00:44Z

I guess we're saving the key argument for #890?

Also, it might be nice to have an option to generate more than one variable from this, e.g. for an M x N x P x Q array, generating Q columns, akin to margin.

This would make it more parallel to as.data.table.matrix.

I think implementation is just a prudent use of dcast.

jangorecki · 2016-03-23T13:32:38Z

@MichaelChirico the use case you are describing is simply the case where the Q dimension is a measure-type dimension. I'm not sure we really need it; dcast can of course handle that as a post-processing step.
I'm not sure about the key. I cannot find a rationale for a default other than the current setkey on all dimensions, which is done in the CJ call and is unavoidable.
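
The post-processing step mentioned above could look like the following sketch, which spreads the last dimension of a long table into one column per level. The table long is built by hand here with made-up integer values, standing in for the output of the proposed conversion.

```r
library(data.table)

# Long table keyed on all dimensions, as CJ would produce it.
long <- CJ(color = c("green", "red"),
           year = c("2014", "2015"),
           country = c("IN", "UK"))
long[, value := seq_len(.N)]  # placeholder measure values 1..8

# Spread the last dimension (country) into one column per level.
wide <- dcast(long, color + year ~ country, value.var = "value")
print(wide)
```

Each row of wide is one color/year combination, with the IN and UK values side by side.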

mrdwab · 2016-03-24T02:56:28Z

@jangorecki Perhaps you also want to consider a "wide" representation, as you would get if you did ftable(x). That's what I make use of in ftable2dt().

Also, with ftable2dt, if one wanted the long skinny version, they could use the "wide" option (ftable2dt(x, "wide")).

jangorecki · 2016-03-25T11:57:49Z

I will hold off on that. I don't see a big problem with dcast'ing measures as a post-processing step, but...
The last dimension could optionally be kept in columns, forming multiple measures, so that it would be consistent with as.data.table.matrix.

mrdwab · 2016-03-26T05:19:22Z

@jangorecki I was also sharing it because it might be faster on larger arrays.

Here's the rough version of the function I'm proposing:

am_adt <- function(inarray) {
  if (!is.array(inarray)) stop("input must be an array")
  dims <- dim(inarray)
  if (is.null(dimnames(inarray))) {
    inarray <- provideDimnames(inarray, base = list(as.character(seq_len(max(dims)))))
  }
  FT <- if (any(class(inarray) %in% "ftable")) inarray else ftable(inarray) 
  out <- data.table(as.table(ftable(FT)))
  nam <- names(out)[seq_along(dims)]
  setorderv(out[, (nam) := lapply(.SD, type.convert), .SDcols = nam], nam)[]
}

Here are a couple of large-ish arrays to test against. "M" has no names, and "N" does. It's just 1 million values, put into a 5D array.

dims <- c(10, 20, 50, 10, 10)
set.seed(1)
M <- `dim<-`(sample(100, prod(dims), TRUE), dims)
N <- `dimnames<-`(M, lapply(dims, function(x) c(letters, LETTERS)[seq_len(x)]))

Wrapping your approach in funDT and mine in funAM and running benchmarks, I get:

Certainly, there is room for improvement. I'm not sure that I should always use type.convert for example. It might be better to only use that if the data does not have dimnames, and keep them as characters otherwise.

By the way, regarding your comment about dcasting after getting a wide dataset, I guess the question would be whether dcast is in general faster than melt or not. If melt is faster, then it would make more sense to go to a wide format first, and then get skinny, rather than the other way around. Getting the wide format is certainly fast:

jangorecki · 2016-12-05T00:10:42Z

@mrdwab the previous implementation was unnecessarily complex and inefficient. The one just pushed is much faster. Looking at the as.data.frame.table method, I think it makes sense for this PR to actually update the table method instead of adding an array method.
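
For reference, the base R table method mentioned above already produces a long layout with one row per cell. A minimal sketch, assuming an integer array with the dimnames used earlier in the thread; df is an illustrative name.

```r
x <- array(1:8, rep(2, 3),
           dimnames = list(color = c("green", "red"),
                           year = c("2014", "2015"),
                           country = c("UK", "IN")))

# as.table() adds the "table" class, so as.data.frame() dispatches to
# as.data.frame.table, which returns one row per cell.
df <- as.data.frame(as.table(x), responseName = "value",
                    stringsAsFactors = FALSE)
print(df)
```

The dimension columns take their names from names(dimnames(x)), and responseName sets the name of the value column.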

new as.data.table.array, closes #1418

jangorecki closed this as completed Nov 4, 2015

jangorecki reopened this Mar 22, 2016

jangorecki self-assigned this Mar 22, 2016

jangorecki added the feature request label Mar 22, 2016

jangorecki added a commit that referenced this issue Mar 22, 2016

new as.data.table.array, closes #1418

deb3eba

jangorecki mentioned this issue Mar 23, 2016

new as.data.table.array, closes #1418 #1606

Merged

jangorecki added this to the v1.9.10 milestone Nov 23, 2016

mattdowle modified the milestones: v1.10.6, Candidate Aug 7, 2017

mattdowle closed this as completed in 0ecb60b Aug 7, 2017

mattdowle added a commit that referenced this issue Aug 7, 2017

Merge pull request #1606 from Rdatatable/as.data.table.array

4da1ed5

new as.data.table.array, closes #1418
