-
Notifications
You must be signed in to change notification settings - Fork 991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
All non-atomic types are now converted to lists in rbindlist #4196
base: master
Are you sure you want to change the base?
Conversation
Any non-allowed column not listed in the typeorder in src/init.c are now appropriately detected in src/rbindlist.c. In these cases, the column type becomes a list, with each element wrapped in a list.
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #4196 +/- ##
==========================================
- Coverage 99.61% 99.59% -0.02%
==========================================
Files 72 72
Lines 13871 13938 +67
==========================================
+ Hits 13817 13882 +65
- Misses 54 56 +2 ☔ View full report in Codecov by Sentry. |
src/rbindlist.c
Outdated
@@ -514,9 +514,33 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg) | |||
if (w==-1 || !length(thisCol=VECTOR_ELT(li, w))) { // !length for zeroCol warning above; #1871 | |||
writeNA(target, ansloc, thisnrow); // writeNA is integer64 aware and writes INT64_MIN | |||
} else { | |||
if ((TYPEOF(target)==VECSXP || TYPEOF(target)==EXPRSXP) && TYPEOF(thisCol)!=TYPEOF(target)) { | |||
if ((TYPEOF(target)==VECSXP) && (isVectorAtomic(thisCol) || TYPEOF(thisCol)==LISTSXP)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming the desired behaviour here is for a pairlist
to be converted to a list
, rather than being converted to a list where each element is a 1-element pairlist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
example of both would help
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense - my understanding is a pairlist is a linked-list, while a regular list is essentially a vector of pointers to the list elements. If we treat a LISTSXP the same way as an EXPRSXP, then you end up with a list, where each element is a pointer to each individual node in the linked list. You end up lose any benefits of having the linked list, because each list element is just a pointer to a 1-node linked list.
E.g the result is something like:
# Linked list, A -> B -> C
ll = pairlist(A=1, B=pairlist(1, 2:3), C='foo')
# Current rbindlist behaviour turns top level pairlist into list (losing the names)
cb = list(1, pairlist(1, 2:3), 'foo')
# Alternate rbindlist behaviour if we treat a pairlist in the same way as expression:
ab = list(pairlist(A=1), pairlist(B=pairlist(1, 2:3)), pairlist(C='foo'))
src/rbindlist.c
Outdated
SEXP thisColList = PROTECT(allocVector(VECSXP, length(thisCol))); nprotect++; | ||
for(int r=0; r<length(thisCol); ++r) { | ||
SET_VECTOR_ELT(thisColList, r, thisCol); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Demonstration of the recycling issue and why we need this for loop:
> A = data.table(a=0L, B=quote(1+1))
> A
a B
<int> <call>
1: 0 1 + 1
2: 0 1 + 1
3: 0 1 + 1
The for loop means we get the right result with rbind:
> rbind(A)
a B
<int> <list>
1: 0 <call[3]>
2: 0 <call[3]>
3: 0 <call[3]>
Otherwise we would have
> rbind(A)
a B
<int> <list>
1: 0 <call[3]>
2: 0
3: 0
It would be good to fix this in data.table()/as.data.table() so that these non-vector types are wrapped in list at the time of data.table construction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, very nice demo
I've also come across this unexpected behaviour:
When |
src/rbindlist.c
Outdated
// Exotic non-atomic types need each element to be wrapped in a list, e.g. expression vectors #546 | ||
if ((TYPEOF(target)==VECSXP) && (isVectorAtomic(thisCol) || TYPEOF(thisCol)==LISTSXP)) { | ||
// do an as.list() on the atomic column; #3528 | ||
// pairlists (LISTSXP) can also be coerced to lists using coerceVector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it coerce recursively unwrapping nested parilists into single list of many elements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only the top level pairlist is coerced to a list:
> A = data.table(c1 = 1:3, c2 = pairlist(A=1, B=list(1, 2:3), C=pairlist(C1=1, C2=3:4)))
> B = data.table(c1 = 2, c2 = 'asd')
> rbind(A,B)
c1 c2
<num> <list>
1: 1 1
2: 2 <list[2]>
3: 3 <pairlist[2]>
4: 2 asd
Filling with NULL in such a case looks seems fine for me. |
It departs from the expected behaviour of fill=TRUE for all other column types though. |
TODO: list wrapper of atomic vectors is not class aware. |
Wrapping atomic vectors in list now preserves class. The integer64 case errors because memrecycle copies the integer64 class to the list TODO: fix or handle |
Wrapping atomic vectors in list now correctly preserves class. |
does this PR needs update after #546 (comment) ? @sritchie73 |
@jangorecki I'm not sure: I think the issue would be that if you do not wrap each expression in a list, then the length of the expression column will not match the length of the other non-expression columns. |
unless length matches, that would happen when each row is meant to store a language object, rather than expression. |
I was going to provide simple example but I noticed that expression looks to be recycled, making the results not what I really wanted l=list(x=1:2, y=2:3, expr=expression(1+2,2+3))
sapply(l, length)
# x y expr
# 2 2 2
as.data.table(l)
# x y expr
# <int> <int> <expr>
#1: 1 2 expression(1 + 2, 2 + 3)
#2: 2 3 expression(1 + 2, 2 + 3) but after closer look, it seems that the issue is in print data.table, because as.data.table(l)$expr
expression(1 + 2, 2 + 3) @MichaelChirico any idea if there is an open issue for printing expression column behavior? |
If lapply(expression(1+2,3), class)
#[[1]]
#[1] "call"
#
#[[2]]
#[1] "numeric" |
That's essentially the conclusion I came to for any object that's not considered internally to be a vector by R (i.e. !isVector() == true in the R's C interface) because they have a length property that does not correspond to the number of elements. This causes problems when trying to print, and also errors when these types are in the first column of the data.table (because the first column is used to determine .N) |
Agree, could you please ensure PR is ready to review? otherwise better to mark it as WIP. |
I've just reviewed the unit tests and run them; the current behaviour of this PR is to wrap each expression element in a list, even if the input expression column is a vector of length > 1:
I've added a unit test to cover this case. |
Sounds somewhat familiar (it's related to #3011) but I didn't see anything in What's happening is This will be better addressed by #3414 -- in fact I already see a |
@sritchie73 one line seems to be not covered yet: https://codecov.io/gh/Rdatatable/data.table/compare/c0052964694a4c618ab182aa474f924d40576d94...00c6d859b158737f130d3f5a1da416693358e14d/diff |
I'll need to do some digging later. I don't see how that line isn't covered given all the other lines in that for loop are covered... |
I injected a few Rprintf statements and can confirm that the tests are running code within that loop before and after the line that's supposedly not covered, so I'm not sure what to make of this: Running this test:
I get both messages below output to the R console:
|
@sritchie73 just checking if this is still needed? If so we can merge to Also if you'd like someone else to take over the PR, do feel free. cc @ben-schwen since I know you've been working a lot on |
It may be better to merge other rbindlist PR(s) (Ben had something recently from where we took only regression fix), and once the other one is merged master, then merge master to this one. |
#5453 is the other pending |
Closes #546 and fixes #3811
Example:
A consequence of this PR is that where a data.table has > 1 row and an expression column, each expression is separated out into its own list element:
I don't know if this is desired behaviour, or whether we should default back to base R's coercion rules (i.e. returning an expression vector) - presumably if the user has constructed an expression vector column of length > 1 they would desire an expression vector back? E.g. #4040 is requesting expression columns to be preserved.