Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

uniqueN could supports list input #1224

Closed
jangorecki opened this issue Jul 15, 2015 · 13 comments
Closed

uniqueN could supports list input #1224

jangorecki opened this issue Jul 15, 2015 · 13 comments
Assignees

Comments

@jangorecki
Copy link
Member

And not just lists but any type which is not supported by uniqueN could simply redirects to length(unique(.)).

library(data.table)
d1 <- data.table(a = 1:4, l = list(list(letters[1:2]),list(Sys.time()),list(1:10),list(letters[1:2])))
d1[,length(unique(l))]
# [1] 3
d1[,uniqueN(l)]
#Error in uniqueN(l) : 
#  x must be an atomic vector or data.frames/data.tables

Looks pretty easy to do, I will try to make PR for that.

@arunsrinivasan
Copy link
Member

@jangorecki this seems to do something different.

dt = data.table(x=c(1,1,2,3), y=c(2,2,3,4))
nrow(unique(dt)) # [1] 3

length(unique(as.list(dt))) # [1] 2

It seems to not check row-wise, rather if any element of list is identical to another.

@jangorecki
Copy link
Member Author

I used uniqueN / length(unique( on single column while you did it on both columns. To make it more clear I simplified the example data:

library(data.table)
d1 <- data.table(l = list(list(letters[1:2]),list(Sys.time()),list(1:10),list(letters[1:2])))
d1[,length(unique(l))]
d1[,uniqueN(l)]

@arunsrinivasan
Copy link
Member

Okay. I guess the idea here is that a list is a vector as well.. that's fine.
Could you make sure it'd work on multiple columns as well? For example:

d1 <- data.table(a = c(1:3,1), l = list(list(letters[1:2]),list(Sys.time()),list(1:10),list(letters[1:2])))
nrow(unique(as.data.frame(d1))) # [1] 3

We'd want that to work, if someone did:

d1[, uniqueN(.SD)] 

This wouldn't work at the moment. Adding a check as to whether any column is of list type might be quite expensive when one performs a grouping operation (checking type for each group). And this is quite a rare case.. Hm.

@arunsrinivasan
Copy link
Member

Maybe for single column case we can already use the fix you've provided, and take care of .SD / data.table cases later. Will merge asap. Thanks @jangorecki.

@jangorecki
Copy link
Member Author

This should not affect .SD as the check is if (!is.atomic(x) && !is.data.frame(x)) return(length(unique(x))).

data.table(a = list(1:2,4))[,is.data.frame(.SD)]
# [1] TRUE
data.table(a = list(1:2,4))[,uniqueN(a)]
# [1] 2
data.table(a = list(1:2,4))[,uniqueN(.SD)]
#Error in forderv(x, by = by, retGrp = TRUE) : 
#  First column being ordered is type 'list', not yet supported
data.table(aa = 1:2, a = list(1:2,4))[,uniqueN(.SD)]
# [1] 2

It still can throw error in a specific case of list column as first column, but this is related to #1229.
I will update doc, readme, and squash soon.

jangorecki added a commit to jangorecki/data.table that referenced this issue Jul 16, 2015
arunsrinivasan added a commit that referenced this issue Jul 17, 2015
uniqueN supports any types, solves #1224
@arunsrinivasan arunsrinivasan reopened this Mar 1, 2016
@arunsrinivasan
Copy link
Member

This fix returns wrong result for list(1:2, 3:4). length(unique()) should only be dispatched for list of lists.. not list of atomics.

@jangorecki
Copy link
Member Author

@arunsrinivasan I'm not sure if I follow. You are saying that list should be handled same as data.table? I don't think it is a proper behavior for list, while it is for data.table.
I can create uniqueN generic so it can be flexible.

@arunsrinivasan
Copy link
Member

Nope. I'm comfortable with uniqueN not supporting list-of-lists. It was not the original purpose. But it needs to do what forder() does for list of atomics (using mget() in j argument would not work as intended otherwise).

@jangorecki
Copy link
Member Author

I will quote from source code, it is quite on point.

# simple straightforward helper function to get the number 
# of groups in a vector or data.table. Here by data.table, 
# we really mean `.SD` - used in a grouping operation

@jangorecki jangorecki self-assigned this Mar 1, 2016
@arunsrinivasan
Copy link
Member

That was done before your update to lists. It just wasn't updated. What's your point?

@arunsrinivasan
Copy link
Member

Will revisit later if the issue of lists come up. I'll revert this functionality to atomics / data.frames for now.

@jangorecki
Copy link
Member Author

@arunsrinivasan yes, just doing it now.

@arunsrinivasan arunsrinivasan modified the milestone: v1.9.8 Mar 1, 2016
@MichaelChirico
Copy link
Member

MichaelChirico commented Dec 11, 2016

A use case for this is in the syntax:

DT[ , uniqueN(.(var1, var2)), by = var3]

Could be handled with uniqueN(.SD)&.SDcols = c("var1", "var2") or with uniqueN(setDT(var1, var2)).

The latter in particular pokes at: uniqueN should work at least for data.table-like lists.

I was also confused by the documentation, which has:

uniqueN(x, by=if (is.list(x)) seq_along(x) else NULL, na.rm=FALSE)

I was surprised to see a test for is.list(x) in the official usage of uniqueN, but x can't be a list list (though it's made clear in the Arguments section that this isn't handled).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants