Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-Forge #5248] implement "set" (as in setdiff, union etc..) operations for data.tables for convenience #547

Closed
4 tasks done
arunsrinivasan opened this issue Jun 8, 2014 · 8 comments
Assignees
Milestone

Comments

@arunsrinivasan
Copy link
Member

Submitted by: Arun ; Assigned to: Nobody; R-Forge link

data.table has already a very efficient way for getting unique and duplicated. It takes 0.5 seconds where base's unique takes 8 seconds for example - check this benchmark on rpub.

Similarly, why not use the internal engine to generate set operations on data.table?

  • setdiff
  • union
  • intersect
  • setequal

Condition: Number of columns and column classes should match.

@jangorecki
Copy link
Member

setdiff function name might be replaced except (postgres, sqlserver naming convention), to not interfere with set* used by by reference functions.

@jangorecki
Copy link
Member

@arunsrinivasan
Copy link
Member Author

basic benchmark on my machine.

require(data.table)
set.seed(1L)
dt1 = setDT(lapply(1:3, function(x) sample(500:600, 5e7, TRUE)))
dt2 = setDT(lapply(1:3, function(x) sample(560:660, 1e6, TRUE)))

system.time(dt_s <- fsetdiff(dt1, dt2))
#   5.479   0.738   6.264 
system.time(dt_i <- fintersect(dt1, dt2))
#   2.356   0.095   2.475 
system.time(dt_u <- funion(dt1, dt2))
#   2.423   0.333   2.786 

system.time(dt_s2 <- fsetdiff(dt1, dt2, all=TRUE))
#   7.280   0.733   8.080 
system.time(dt_i2 <- fintersect(dt1, dt2, all=TRUE))
#   6.476   0.311   6.851 
system.time(dt_u2 <- funion(dt1, dt2, all=TRUE))
#   0.166   0.163   0.336 

@jangorecki
Copy link
Member

another good reference presenting some graphs on set operators and corresponding SQL queries (sometimes SQL workarounds) for INTERSECT ALL and EXCEPT ALL: http://www.gplivna.eu/papers/sql_set_operators.htm

@arunsrinivasan
Copy link
Member Author

Oh this one's much nicer indeed.

@jangorecki
Copy link
Member

Also leaving some benchmarks on latest data.table and dplyr: Set operators in R
source: https://gist.github.com/jangorecki/0a542fbb8157dec958f5

@jangorecki
Copy link
Member

small volume data re-run:

data 1e6 unique

#DT1: 1000000 rows, 1000000 unq rows
#DT2: 1000000 rows, 1000000 unq rows
#DT1 intersect DT2: 499109 rows
#DT1 setdiff DT2: 500891 rows
#DT1 union DT2: 1500891 rows

## intersect
#Unit: milliseconds
# expr       min        lq      mean    median        uq       max neval
#   df 1599.0095 1651.5923 1743.7876 1714.8873 1826.2016 1929.4169    10
#   dt  818.0777  846.5367  894.1579  902.0847  917.2623  989.7598    10

## setdiff
#Unit: milliseconds
# expr       min        lq      mean    median        uq      max neval
#   df 1610.5838 1648.2851 1862.0443 1771.8324 2054.5137 2331.562    10
#   dt  689.9719  693.1796  722.6169  711.1247  725.3226  807.525    10

## union
#Unit: milliseconds
# expr       min        lq      mean    median        uq       max neval
#   df 2171.7295 2374.7261 2473.9191 2429.1520 2616.0950 2911.9051    10
#   dt  418.4617  433.3666  487.3305  446.4286  573.3986  620.1645    10

## setequal
#Unit: milliseconds
# expr       min       lq      mean    median        uq      max neval
#   df 2083.8287 2125.593 2241.1061 2187.7803 2353.0298 2510.583    10
#   dt  789.9313  795.676  837.7273  811.4293  841.5154 1011.968    10

data 1e6 duplicates

#DT1: 1000000 rows, 432159 unq rows
#DT2: 1000000 rows, 432456 unq rows
#DT1 intersect DT2: 93474 rows
#DT1 setdiff DT2: 338685 rows
#DT1 union DT2: 771141 rows

## intersect
#Unit: milliseconds
# expr       min        lq      mean    median        uq       max neval
#   df 1282.7889 1295.7182 1361.1802 1336.6922 1413.4156 1485.1399    10
#   dt  529.3002  534.2962  554.2797  539.0737  555.6799  668.7858    10

## setdiff
#Unit: milliseconds
# expr       min       lq     mean    median        uq       max neval
#   df 1510.8082 1514.792 1708.682 1776.9783 1821.2516 1912.1937    10
#   dt  642.8345  670.856  710.395  707.8905  757.1905  771.2871    10

## union
#Unit: milliseconds
# expr     min       lq      mean    median        uq       max neval
#   df 1794.22 1795.561 2066.1633 2025.5487 2347.7738 2495.5978    10
#   dt  336.82  357.294  391.5749  377.1024  407.0398  545.6484    10

## setequal
#Unit: milliseconds
# expr       min        lq      mean    median       uq      max neval
#   df 1904.0811 1923.1564 1992.1805 1983.6733 2032.273 2122.502    10
#   dt  929.4357  936.4791  969.8934  947.6446 1015.818 1056.489    10

@jangorecki
Copy link
Member

question where set operators could benefit from by argument: http://stackoverflow.com/q/36112817/2490497 leaving just for future reference

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants