Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

All non-atomic types are now converted to lists in rbindlist #4196

Open
wants to merge 21 commits into
base: master
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 21 additions & 6 deletions src/rbindlist.c
Original file line number Diff line number Diff line change
Expand Up @@ -514,18 +514,33 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg)
if (w==-1 || !length(thisCol=VECTOR_ELT(li, w))) { // !length for zeroCol warning above; #1871
writeNA(target, ansloc, thisnrow); // writeNA is integer64 aware and writes INT64_MIN
} else {
if ((TYPEOF(target)==VECSXP) && TYPEOF(thisCol)>TYPEOF(target)) {
// Exotic non-atomic types need each element to be wrapped in a list, e.g. expression vectors #546
if ((TYPEOF(target)==VECSXP) && (isVectorAtomic(thisCol) || TYPEOF(thisCol)==LISTSXP)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the desired behaviour here is for a pairlist to be converted to a list, rather than being converted to a list where each element is a 1-element pairlist.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

example of both would help

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense - my understanding is a pairlist is a linked-list, while a regular list is essentially a vector of pointers to the list elements. If we treat a LISTSXP the same way as an EXPRSXP, then you end up with a list, where each element is a pointer to each individual node in the linked list. You end up lose any benefits of having the linked list, because each list element is just a pointer to a 1-node linked list.

E.g the result is something like:

# Linked list, A -> B -> C
ll = pairlist(A=1, B=pairlist(1, 2:3), C='foo')

# Current rbindlist behaviour turns top level pairlist into list (losing the names)
cb = list(1, pairlist(1, 2:3), 'foo')

# Alternate rbindlist behaviour if we treat a pairlist in the same way as expression:
ab = list(pairlist(A=1), pairlist(B=pairlist(1, 2:3)), pairlist(C='foo'))

// do an as.list() on the atomic column; #3528
// pairlists (LISTSXP) can also be coerced to lists using coerceVector
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it coerce recursively unwrapping nested parilists into single list of many elements?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the top level pairlist is coerced to a list:

> A = data.table(c1 = 1:3, c2 = pairlist(A=1, B=list(1, 2:3), C=pairlist(C1=1, C2=3:4)))
> B = data.table(c1 = 2, c2 = 'asd')
> rbind(A,B)
      c1            c2
   <num>        <list>
1:     1             1
2:     2     <list[2]>
3:     3 <pairlist[2]>
4:     2           asd

thisCol = PROTECT(coerceVector(thisCol, TYPEOF(target))); nprotect++;
} else if ((TYPEOF(target)==VECSXP) && TYPEOF(thisCol)==EXPRSXP) {
// For EXPRSXP each element to be wrapped in a list, e.g. expression vectors #546
SEXP thisColList = PROTECT(allocVector(VECSXP, length(thisCol))); nprotect++;
for(int r=0; r<length(thisCol); ++r) {
SEXP thisElement = VECTOR_ELT(thisCol, r);
if (TYPEOF(thisCol) == EXPRSXP) thisElement = PROTECT(coerceVector(thisElement, EXPRSXP)); nprotect++; // otherwise LANGSXP
thisElement = PROTECT(coerceVector(thisElement, EXPRSXP)); nprotect++; // otherwise LANGSXP
SET_VECTOR_ELT(thisColList, r, thisElement);
}
thisCol = thisColList;
} else if ((TYPEOF(target)==VECSXP) && TYPEOF(thisCol)<TYPEOF(target)) {
// do an as.list() on the atomic column; #3528
thisCol = PROTECT(coerceVector(thisCol, TYPEOF(target))); nprotect++;
} else if ((TYPEOF(target)==VECSXP) && !isVector(thisCol) && TYPEOF(thisCol)!=TYPEOF(target)) {
// Anything not a vector we can assign directly through SET_VECTOR_ELT
// Although tecnically there should only be one list element for any type met here,
// the length of the type may be > 1, in which case the other columns in data.table
// will have been recycled. We therefore in turn have to recycle the list elements
// to match the number of rows.
SEXP thisColList = PROTECT(allocVector(VECSXP, length(thisCol))); nprotect++;
for(int r=0; r<length(thisCol); ++r) {
SET_VECTOR_ELT(thisColList, r, thisCol);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Demonstration of the recycling issue and why we need this for loop:

> A = data.table(a=0L, B=quote(1+1))
> A
       a      B
   <int> <call>
1:     0  1 + 1
2:     0  1 + 1
3:     0  1 + 1

The for loop means we get the right result with rbind:

> rbind(A)
       a         B
   <int>    <list>
1:     0 <call[3]>
2:     0 <call[3]>
3:     0 <call[3]>

Otherwise we would have

> rbind(A)
       a         B
   <int>    <list>
1:     0 <call[3]>
2:     0 
3:     0

It would be good to fix this in data.table()/as.data.table() so that these non-vector types are wrapped in list at the time of data.table construction

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, very nice demo

thisCol = thisColList;
} else if ((TYPEOF(target)==VECSXP) && TYPEOF(thisCol)!=TYPEOF(target)) {
// should be unreachable
error("Internal error: rbindlist cannot handle type %s\n", type2char(TYPEOF(thisCol))); // # nocov
}
Copy link
Contributor Author

@sritchie73 sritchie73 Jan 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic abstracted into a separate function in utils.c because it is/will be otherwise duplicated in Casmatrix in #4144. I also envision in a future PR adding an R exposable wrapper so that non-atomic vector columns may be caught and coerced to list in as.data.table and setDT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if a single PROTECT is sufficient here - the PROTECT stack in coerceAsList may be thisnrow + 1 if TYPEOF(thisCol)==EXPRSXP because each element in the vector has to be coerced to an EXPRSXP and passed to SET_VECTOR_ELT in the list wrapper, otherwise the list elements are LANGSXP if you just do coerceVector(thisCol, VECSXP). There is a corresponding UNPROTECT call in coerceAsList that handles this, but I don't know if that means wrapping the result here in a single PROTECT risks this memory being freed before rbindlist returns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method i've settled on after hitting the protect stack limit in testing #4144 is to have the C function called from R manage the PROTECT stack, and for functions it calls to simply increment a protect counter.

I've modified coerceAsList to this effect in the latest commit. Inside coerceAsList, each new allocated list element PROTECT is popped from the stack once it's been assigned into the new list using SET_VECTOR_ELT. In the coercion loop in rbindlist.c I make a separate nprotect counter for coerceAsList, as it may either increment the PROTECT stack by one, or not at all (e.g. when simply returning the input list). This allows us to pop the coerced list from the PROTECT stack after its contents have been copied into the target vector. This means these changes will no longer hit the PROTECT stack limit when handling >50,000 columns that need to be converted to lists.

// else coerces if needed within memrecycle; with a no-alloc direct coerce from 1.12.4 (PR #3909)
const char *ret = memrecycle(target, R_NilValue, ansloc, thisnrow, thisCol, 0, -1, idcol+j+1, foundName);
Expand Down