-
Notifications
You must be signed in to change notification settings - Fork 991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
collect more statistics about the data #2879
Comments
I'd also be interested in collection of statistics related to a join (eg, does each row of |
@franknarf1 not exactly what you've asked but related. |
In addition to the stats you mentioned, one of the most important ones is |
We need to plug collections of those statistics into existing functions, so explicitly calling analyze <- function(x, ...) {
stopifnot(is.data.table(x))
d <- dim(x)
nr <- d[1L]
nc <- d[2L]
cols <- names(x)
ans <- list(is.sorted=NA, is.indexed=NA, is.unique=NA, is.ascii=NA,
unique.n=NA_integer_, maxgrp.n=NA_integer_, na.count=NA_integer_,
has.na=NA, all.na=NA, has.nan=NA,
min=NA_real_, max=NA_real_)
ans <- rbindlist(sapply(cols, function(x) ans, simplify=FALSE), idcol="column")
o <- sapply(simplify=FALSE, cols, data.table:::forderv, x=x, sort=TRUE, retGrp=TRUE)
if (is.null(attr(x, "index", exact = TRUE))) setattr(x, "index", integer())
idx <- attr(x, "index", TRUE)
is.ascii <- function(x) {
if (!is.character(x)) return(NA)
asc <- iconv(x, "latin1", "ASCII")
!any(is.na(asc) | asc != x)
}
for (i in 1:nc) {
setattr(idx, paste0("__", cols[i]), c(o[[i]]))
ans[i, `:=`(
is.sorted = !length(o[[i]]),
is.indexed = TRUE,
is.ascii = is.ascii(x[[i]]),
unique.n = length(attr(o[[i]], "starts", TRUE)),
maxgrp.n = attr(o[[i]], "maxgrpn", TRUE),
na.count = sum(is.na(x[[i]]))
)][i, `:=`(is.unique = unique.n==nr,
has.na = na.count>0L,
all.na = na.count==nr,
has.nan = NA,
min = NA_real_,
max = NA_real_)]
}
setattr(x, "stats", ans)
}
stats <- function(x) attr(x, "stats", TRUE)
set.seed(108)
x <- data.table(v1=sample(8L, 10L, TRUE), v2=as.factor(sample(letters, 5L)),
v3=rnorm(10L), v4=sample(c("\xfcasd", sample(letters, 4L))),
v5=sample(c(rnorm(5), rep(NA, 5))), v6=NA)
x
# v1 v2 v3 v4 v5 v6
# <int> <fctr> <num> <char> <num> <lgcl>
# 1: 4 x -0.1389979 o NA NA
# 2: 4 j -0.4059470 g 0.4747667 NA
# 3: 3 r -1.6771308 \374asd 0.6404341 NA
# 4: 6 s 0.4459993 y 0.5706510 NA
# 5: 4 n -0.6954863 i NA NA
# 6: 1 x 0.6769990 o -0.6944517 NA
# 7: 5 j 0.9524670 g NA NA
# 8: 2 r -2.2123936 \374asd NA NA
# 9: 5 s 0.9949963 y -0.7555400 NA
#10: 4 n -0.1515556 i NA NA
analyze(x)
stats(x)
# column is.sorted is.indexed is.unique is.ascii unique.n maxgrp.n na.count
# <char> <lgcl> <lgcl> <lgcl> <lgcl> <int> <int> <int>
#1: v1 FALSE TRUE FALSE NA 6 4 0
#2: v2 FALSE TRUE FALSE NA 5 2 0
#3: v3 FALSE TRUE TRUE NA 10 1 0
#4: v4 FALSE TRUE FALSE FALSE 5 2 0
#5: v5 FALSE TRUE FALSE NA 6 5 5
#6: v6 TRUE TRUE FALSE NA 1 10 10
# has.na all.na has.nan min max
# <lgcl> <lgcl> <lgcl> <num> <num>
#1: FALSE FALSE NA NA NA
#2: FALSE FALSE NA NA NA
#3: FALSE FALSE NA NA NA
#4: FALSE FALSE NA NA NA
#5: TRUE FALSE NA NA NA
#6: TRUE TRUE NA NA NA |
The question is should we use plain R attributes to keep statistics or should we use new R C level attributes, probably quite experimental for now because introduced in R 3.5.0. x = rnorm(1e2)
y = sort(x)
.Internal(inspect(x))
#@55b95405e2e0 14 REALSXP g0c7 [NAM(3)] (len=100, tl=0) 1.07949,0.227716,1.68177,-0.300367,0.332526,...
.Internal(inspect(y))
#@55b954cf0ad8 14 REALSXP g0c0 [NAM(3)] wrapper [srt=1,no_na=1]
# @55b95405da50 14 REALSXP g0c7 [NAM(3)] (len=100, tl=0) -3.17337,-2.3548,-1.94658,-1.84502,-1.72388,... Notice 2018-09-14: |
One downside to this -- unless I've misunderstood -- is that it doesn't allow the user to access these attributes. Many of them would improve the performance of user functions. |
@HughParsonage I guess the idea would be to provide accessor functions to the attributes in C |
@HughParsonage access can be easily made by accessor functions as Michael commented, but I am not sure if it is good idea. If user would mark column as non-duplicate, or non-NAs, then processing of that column by our algorithms could silently return wrong answer. It would be best to just make functions like |
Sorry, I'm not suggesting the user has 'write' access, so to speak, for these attributes -- only 'read' access. It would definitely be useful and worth the cost. But I think write-access would be useful too. Ultimately the user has to be a bit careful. After all, it's possible currently to mark an unsorted data.table as sorted. We accept that this is worth the cost. |
data.table could collect more statistics about data while processing. This allows potential optimizations, not limited to internal data.table code. Users can use them to speed up their code and design more data-driven functions.
List of measures to collect:
haskey(x)
!is.null(idx<-attr(attr(x, "index"), idx_name))
anyNA
length(attr(idx, "starts"))
attr(idx, "maxgrpn")
attr(idx, "maxgrpn")==1L
x[c(idx[1L], idx[length(idx)])]
{{hasna}} && length(attr(idx, "starts"))==1L
optionally, as I don't see obvious optimizations coming from those:
The text was updated successfully, but these errors were encountered: