Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

group varying factor levels fixed #3906

Merged
merged 3 commits into from
Sep 24, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -325,6 +325,8 @@

43. `:=` could change a `data.table` creation statement in the body of the function calling it, or a variable in calling scope, [#3890](https://github.com/Rdatatable/data.table/issues/3890). Many thanks to @kirillmayantsev for the detailed reports.

44. Grouping could create a `malformed factor` and/or segfault when the factors returned by each group did not have identical levels, [#2199](https://github.com/Rdatatable/data.table/issues/2199) and [#2522](https://github.com/Rdatatable/data.table/issues/2522). Thanks to Václav Hausenblas, @franknarf1, @ben519, and @Henrik-P for reporting.

## NOTES

1. `rbindlist`'s `use.names="check"` now emits its message for automatic column names (`"V[0-9]+"`) too, [#3484](https://github.com/Rdatatable/data.table/pull/3484). See news item 5 of v1.12.2 below.
Expand Down
30 changes: 30 additions & 0 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -16136,6 +16136,36 @@ test(2113.3, rbindlist(list(list(a=numeric(), b=numeric()),
list(a=as.factor(1L), b=as.factor(2:4)))),
data.table(a=as.factor(1L), b=as.factor(2:4)))

# dogroups combine factor levels, #2199 & #2522
DT = data.table(A=1:2)
g = function(x) { if (x==1L) factor(c("a","b888")) else factor(c("b888","c")) } # b888 to cover tl==0 in memrecycle
test(2114.1, DT[,g(.GRP),by=A], data.table(A=INT(1,1,2,2), V1=as.factor(c("a","b888","b888","c"))))
g = function(x) { if (x==1L) factor(c("a","b")) else factor(c("a","b","c")) }
test(2114.2, DT[,g(.GRP),by=A], data.table(A=INT(1,1,2,2,2), V1=as.factor(c("a","b","a","b","c"))))
# original test verbatim from the same issue #2199
set.seed(2)
ids = sample(letters, 20)
dates = 1:40
dt = data.table(CJ(dates, ids, ids))
setnames(dt, c("date", "id1", "id2"))
dt[, value := rnorm(length(date))]
dt = dt[!(date == 1 & (id1 == "a" | id2 == "a"))]
dt = dt[!(date == 4 & (id1 == "e" | id2 == "e"))]
f1 = function(sdt) {
dt1 <- dcast.data.table(sdt, id1 ~ id2)
melt.data.table(dt1, id.vars = "id1")
}
res = dt[, f1(.SD), by=date]
test(2114.3, setnames(res[c(1,.N)],"variable","id2")[,id2:=as.character(id2)][], dt[c(1,.N)])
test(2114.4, print(res), output="date.*0.433")
# and from #2522
DT = data.table(id=1:9, grp=rep(1:3,each=3), val=c("a","b","c", "a","b","c", "a","b","c"))
test(2114.5, as.character(DT[, valfactor1 := factor(val), by = grp]$valfactor1), ans<-rep(c("a","b","c"),3))
test(2114.6, as.character(DT[, valfactor2 := factor(val), by = id]$valfactor2), ans)
DT = data.table(x = rep(letters[c(3, 1, 2)], each = 2))
test(2114.7, DT[, `:=`(g=.GRP, f=factor(.GRP)), by = x],
data.table(x=rep(c("c","a","b"),each=2), g=rep(1:3,each=2), f=factor(rep(as.character(1:3),each=2))))


###################################
# Add new tests above this line #
Expand Down
84 changes: 82 additions & 2 deletions src/assign.c
Original file line number Diff line number Diff line change
Expand Up @@ -818,6 +818,86 @@ const char *memrecycle(SEXP target, SEXP where, int start, int len, SEXP source)
source = PROTECT(copyAsPlain(source)); protecti++;
}
}
const bool sourceIsFactor=isFactor(source), targetIsFactor=isFactor(target);
if (sourceIsFactor && targetIsFactor) {
// TODO: ^^ could be || in future; assign is already handling that so leave that for now. Then move more of assign to call here.
if (!sourceIsFactor || !targetIsFactor) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is always false, maybe just wrapping into /* ... */ instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's there for the future when the todo in that comment is done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense, but could be commented out code block too.
would be good to have a followup issue and link it from there

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be but I'd prefer not to. Ok?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of all the lines in this PR, this is the least significant to pick up on.

// # nocov start
error("Internal error: memrecycle source_is_factor=%d but target_is_factor=%d. They both must be factor or both not factor.",
sourceIsFactor, targetIsFactor);
// # nocov end
}
SEXP sourceLevels = PROTECT(getAttrib(source, R_LevelsSymbol)); protecti++;
SEXP targetLevels = PROTECT(getAttrib(target, R_LevelsSymbol)); protecti++;
if (!R_compute_identical(sourceLevels, targetLevels, 0)) {
const int nTargetLevels=length(targetLevels), nSourceLevels=length(sourceLevels);
const SEXP *targetLevelsD=STRING_PTR(targetLevels), *sourceLevelsD=STRING_PTR(sourceLevels);
SEXP newSource = PROTECT(allocVector(INTSXP, length(source))); protecti++;
savetl_init();
for (int k=0; k<nTargetLevels; ++k) {
const SEXP s = targetLevelsD[k];
const int tl = TRUELENGTH(s);
if (tl>0) {
savetl(s);
} else if (tl<0) {
// # nocov start
for (int j=0; j<k; ++j) SET_TRUELENGTH(s, 0); // wipe our negative usage and restore 0
savetl_end(); // then restore R's own usage (if any)
error("Internal error: levels of target are either not unique or have truelength<0");
// # nocov end
}
SET_TRUELENGTH(s, -k-1);
}
int nAdd = 0;
for (int k=0; k<nSourceLevels; ++k) {
const SEXP s = sourceLevelsD[k];
const int tl = TRUELENGTH(s);
if (tl>=0) {
if (tl>0) savetl(s);
SET_TRUELENGTH(s, -nTargetLevels-(++nAdd));
}
}
const int nSource = length(source);
const int *sourceD = INTEGER(source);
int *newSourceD = INTEGER(newSource);
for (int i=0; i<nSource; ++i) { // convert source integers to refer to target levels
const int val = sourceD[i];
newSourceD[i] = val==NA_INTEGER ? NA_INTEGER : -TRUELENGTH(sourceLevelsD[val-1]);
}
source = newSource;
for (int k=0; k<nTargetLevels; ++k) SET_TRUELENGTH(targetLevelsD[k], 0); // don't need those anymore
if (nAdd) {
// cannot grow the levels yet as that would be R call which could fail to alloc and we have no hook to clear up
SEXP *temp = (SEXP *)malloc(nAdd * sizeof(SEXP *));
if (!temp) {
// # nocov start
for (int k=0; k<nSourceLevels; ++k) SET_TRUELENGTH(sourceLevelsD[k], 0);
savetl_end();
error("Unable to allocate working memory of %d bytes to combine factor levels", nAdd*sizeof(SEXP *));
// # nocov end
}
for (int k=0, thisAdd=0; k<nSourceLevels; ++k) {
SEXP s = sourceLevelsD[k];
int tl = TRUELENGTH(s);
if (tl) {
if (tl != -nTargetLevels-thisAdd-1) error("Internal error: extra level check sum failed"); // # nocov
temp[thisAdd++] = s;
SET_TRUELENGTH(s,0);
}
}
savetl_end();
setAttrib(target, R_LevelsSymbol, targetLevels=growVector(targetLevels, nTargetLevels + nAdd));
for (int k=0; k<nAdd; ++k) {
SET_STRING_ELT(targetLevels, nTargetLevels+k, temp[k]);
}
free(temp);
} else {
// all source levels were already in target levels, but not with the same integers; we're done
savetl_end();
}
// now continue, but with the mapped integers in the (new) source
}
}
if (!length(where)) { // e.g. called from rbindlist with where=R_NilValue
switch (TYPEOF(target)) {
case RAWSXP:
Expand All @@ -842,8 +922,8 @@ const char *memrecycle(SEXP target, SEXP where, int start, int len, SEXP source)
}
break;
case REALSXP : {
bool si64 = INHERITS(source, char_integer64);
bool ti64 = INHERITS(target, char_integer64);
bool si64 = Rinherits(source, char_integer64);
bool ti64 = Rinherits(target, char_integer64);
if (si64 && TYPEOF(source)!=REALSXP)
error("Internal error: source has integer64 attribute but is type '%s' not REALSXP", type2char(TYPEOF(source))); // # nocov
if (si64 == ti64) {
Expand Down
27 changes: 10 additions & 17 deletions src/dogroups.c
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
{
R_len_t rownum, ngrp, nrowgroups, njval=0, ngrpcols, ansloc=0, maxn, estn=-1, thisansloc, grpn, thislen, igrp, origIlen=0, origSDnrow=0;
int nprotect=0;
SEXP ans=NULL, jval, thiscol, BY, N, I, GRP, iSD, xSD, rownames, s, RHS, target, source, tmp;
SEXP ans=NULL, jval, thiscol, BY, N, I, GRP, iSD, xSD, rownames, s, RHS, target, source;
Rboolean wasvector, firstalloc=FALSE, NullWarnDone=FALSE;
clock_t tstart=0, tblock[10]={0}; int nblock[10]={0};

Expand Down Expand Up @@ -304,28 +304,21 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX
if (isNull(target)) {
// first time adding to new column
if (TRUELENGTH(dt) < INTEGER(lhs)[j]) error("Internal error: Trying to add new column by reference but tl is full; setalloccol should have run first at R level before getting to this point in dogroups"); // # nocov
tmp = PROTECT(allocNAVectorLike(RHS, LENGTH(VECTOR_ELT(dt,0))));
// increment length only if the allocation passes, #1676
target = PROTECT(allocNAVectorLike(RHS, LENGTH(VECTOR_ELT(dt,0))));
// Even if we could know reliably to switch from allocNAVectorLike to allocVector for slight speedup, user code could still
// contain a switched halt, and in that case we'd want the groups not yet done to have NA rather than 0 or uninitialized.
// Increment length only if the allocation passes, #1676. But before SET_VECTOR_ELT otherwise attempt-to-set-index-n/n R error
SETLENGTH(dtnames, LENGTH(dtnames)+1);
SETLENGTH(dt, LENGTH(dt)+1);
SET_VECTOR_ELT(dt, INTEGER(lhs)[j]-1, tmp);
SET_VECTOR_ELT(dt, INTEGER(lhs)[j]-1, target);
UNPROTECT(1);
// Even if we could know reliably to switch from allocNAVectorLike to allocVector for slight speedup, user code could still contain a switched halt, and in that case we'd want the groups not yet done to have NA rather than uninitialized or 0.
// dtnames = getAttrib(dt, R_NamesSymbol); // commented this here and added it on the beginning to fix #4990
SET_STRING_ELT(dtnames, INTEGER(lhs)[j]-1, STRING_ELT(newnames, INTEGER(lhs)[j]-origncol-1));
target = VECTOR_ELT(dt,INTEGER(lhs)[j]-1);
copyMostAttrib(RHS, target); // attributes of first group dominate; e.g. initial factor levels come from first group
}
memrecycle(target, order, INTEGER(starts)[i]-1, grpn, RHS); // length mismatch checked above for all jval columns before starting to add any new columns
copyMostAttrib(RHS, target); // not names, otherwise test 778 would fail.
/* OLD FIX: commented now. The fix below resulted in segfault on factor columns because I didn't set the "levels"
Instead of fixing that, I just removed setting class if it's factor. Not appropriate fix.
Correct fix of copying all attributes (except names) added above. Now, everything should be alright.
Test 1144 (#5104) will provide the right output now. Modified accordingly.
OUTDATED: if (!isFactor(RHS)) setAttrib(target, R_ClassSymbol, getAttrib(RHS, R_ClassSymbol));
OUTDATED: // added !isFactor(RHS) to fix #5104 (side-effect of fixing #2531)
See also #155 and #36 */
memrecycle(target, order, INTEGER(starts)[i]-1, grpn, RHS);
// can't error here because length mismatch already checked for all jval columns before starting to add any new columns
}
UNPROTECT(1);
UNPROTECT(1); // jval
continue;
}
maxn = 0;
Expand Down