subsetting n rows by .SD is O(e^n) #1400
Interesting. How many groups per 100k rows? If you had done … before this (as an extreme case), then your observation would be confounded, right? If you posted the data somewhere (not into a post here, of course), then we could all see the exact same data and compare alternatives (using …).
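A minimal sketch of that group-count check, assuming the `bus` table and `LICENSE.NUMBER` column from the code later in the thread (`uniqueN()` is in data.table 1.9.6+):

```r
library(data.table)
## unique group counts per 100k-row chunk (sketch)
sapply(seq(1e5, 5e5, by = 1e5), function(n) bus[1:n, uniqueN(LICENSE.NUMBER)])
```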
@franknarf1 Test away! It takes about 10 minutes to download the data the first time, unless someone else has recently done it, in which case the server has a cached copy that it serves up faster. Please do not use this code; see the updated example below.

```r
##------------------------------------------------------------------------------
## Initialize
##------------------------------------------------------------------------------
library(data.table)
devtools::install_github("chicago/RSocrata", ref="sprint7")
bus <- RSocrata::read.socrata(hostname="https://data.cityofchicago.org",
                              resourcePath="r5kz-chrr.csv", ## <- actually JSON
                              keyfield = "id")

## If you want to run this again, save the file rather than downloading it again
# saveRDS(bus, "bus.Rds")
# bus <- readRDS("bus.Rds")

bus <- as.data.table(bus)

##------------------------------------------------------------------------------
## Current function to do subsetting
##------------------------------------------------------------------------------
most_recent_bus_lic_fast <- function(dt){
    dat <- copy(dt)
    ## Keep records where LICENSE_ID is the maximum LICENSE_ID for all records
    ## of a particular LICENSE_NUMBER (in other words, just keep the most
    ## recent license number)
    dat <- dat[ , keep := (LICENSE.ID) == max(LICENSE.ID), LICENSE.NUMBER]
    dat <- dat[keep==TRUE]
    dat <- dat[ , keep:=NULL][]
    return(dat)
}
system.time(bus_recent1 <- most_recent_bus_lic_fast(bus))

##------------------------------------------------------------------------------
## The "living the dream" function to do subsetting
##------------------------------------------------------------------------------
most_recent_bus_lic_slow <- function(dt){
    dat <- copy(dt)
    ## Same as above, but using the preferred .SD method
    dat <- dat[ , .SD[LICENSE.ID == max(LICENSE.ID)], LICENSE.NUMBER]
    return(dat)
}

## Time tests with only 10k chunks of records, previous test was with 100k
setkey(bus, LICENSE.NUMBER, LICENSE.ID)
system.time(bus_recent2a <- most_recent_bus_lic_slow(bus[1:10000]))
system.time(bus_recent2b <- most_recent_bus_lic_slow(bus[1:20000]))
system.time(bus_recent2c <- most_recent_bus_lic_slow(bus[1:30000]))
system.time(bus_recent2d <- most_recent_bus_lic_slow(bus[1:40000]))
system.time(bus_recent2e <- most_recent_bus_lic_slow(bus[1:50000]))
```

PS: I love working in the public sector where I can often share my data.
Hm, haven't downloaded it myself yet, but weren't you doing …? If license IDs are unique or you don't mind just taking the first max, … should work.

One minor point that could be costing you runtime: you shouldn't need to `copy` inside those functions. Copying a big table can take time and ….
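The suggested snippet is elided above; presumably it was along these lines (a sketch, not the exact code: `which.max` takes the first per-group maximum, and no `copy` is needed because nothing is modified):

```r
## keep the first row attaining each group's max LICENSE.ID (sketch)
bus[bus[, .I[which.max(LICENSE.ID)], by = LICENSE.NUMBER]$V1]
```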
@franknarf1 my issue / question is "why does `.SD` take exponentially longer as you linearly add rows?" Originally I showed …. I have good reasons for using …. Yes, it's possible to write the same procedure using the …. Comparison of ….
As @franknarf1 mentioned, to understand the timings you first need to understand the size of the loops. In this case: how many more unique license numbers do you get for each 100k rows you add, i.e. what are these numbers:

```r
nrow(dat[1:1e5, 1, LICENSE_NUMBER])
nrow(dat[1:2e5, 1, LICENSE_NUMBER])
...
```
Sorry, still haven't looked at your data, but here's mine, for which it is linear: …

Note that my benchmark does not have the subsetting operation inside of it.
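The benchmark itself is elided above; here is a sketch of the same shape of test, with made-up synthetic data (not the original code) and the subsetting done outside the timed expression:

```r
library(data.table)
set.seed(1)
## synthetic data: ~30% unique groups, roughly like the license data
DT <- data.table(g  = sample(3e5, 8e5, replace = TRUE),
                 id = seq_len(8e5))
for (n in seq(1e5, 5e5, by = 1e5)) {
  sub <- DT[1:n]                                  # subset outside the timing
  print(system.time(sub[, .SD[id == max(id)], by = g]))
}
```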
There are lots of examples that don't show this problem, which is why I provided this particular example. If it were easy to find a good example, I would have posted this question a year ago when I first had the problem. I've had the same issue with many types of real world data. …

It doesn't take that long to download the data. Use the CSV download and ….

The LICENSE_NUMBER is about 30% unique; see the NAsummary below.

```r
str(fbus)
# Classes ‘data.table’ and 'data.frame': 843765 obs. of 31 variables:
# $ ID : chr "1000000-20020221" "1000049-20010816" "1000049-20020516" "1000049-20020816" ...
# $ LICENSE_ID : num 1000000 1162772 1233615 1265665 1342680 ...
# $ ACCOUNT_NUMBER : num 200001 200068 10141 200068 10141 ...
# $ SITE_NUMBER : num 1 1 2 1 2 1 2 1 1 2 ...
# $ LEGAL_NAME : chr "MARK BOSTON" "ANTONIA CASTREJON" "PEPE\"S RETAIL MEATS, INC." "ANTONIA CASTREJON" ...
# $ DOING_BUSINESS_AS_NAME : chr "COLORS IN MOTION" "ILLUSIONS HAIR DESIGN" "PEREZ MEXICAN FOOD" "ILLUSIONS HAIR DESIGN" ...
# $ ADDRESS : chr "6421 N DAMEN AVE" "3800 W DIVERSEY AVE" "853-855 W RANDOLPH ST 1ST" "3800 W DIVERSEY AVE" ...
# $ CITY : chr "CHICAGO" "CHICAGO" "CHICAGO" "CHICAGO" ...
# $ STATE : chr "IL" "IL" "IL" "IL" ...
# $ ZIP_CODE : chr "60645" "60647" "60607" "60647" ...
# $ WARD : num 50 30 27 30 27 30 27 30 30 27 ...
# $ PRECINCT : num 28 999 1 999 1 999 1 999 999 1 ...
# $ POLICE_DISTRICT : num 24 25 12 25 12 25 12 25 25 12 ...
# $ LICENSE_CODE : num 1011 1010 1006 1010 1006 ...
# $ LICENSE_DESCRIPTION : chr "Home Repair" "Limited Business License" "Retail Food Establishment" "Limited Business License" ...
# $ LICENSE_NUMBER : num 1e+06 1e+06 1e+06 1e+06 1e+06 ...
# $ APPLICATION_TYPE : chr "ISSUE" "RENEW" "RENEW" "RENEW" ...
# $ APPLICATION_CREATED_DATE : IDate, format: "2000-06-19" NA NA ...
# $ APPLICATION_REQUIREMENTS_COMPLETE: IDate, format: "2002-02-15" "2001-06-25" "2002-03-27" ...
# $ PAYMENT_DATE : IDate, format: "2002-02-15" "2001-08-20" "2002-04-17" ...
# $ CONDITIONAL_APPROVAL : chr "N" "N" "N" "N" ...
# $ LICENSE_TERM_START_DATE : IDate, format: "2002-02-21" "2001-08-16" "2002-05-16" ...
# $ LICENSE_TERM_EXPIRATION_DATE : IDate, format: "2002-11-15" "2002-08-15" "2003-05-15" ...
# $ LICENSE_APPROVED_FOR_ISSUANCE : IDate, format: "2002-02-21" "2001-08-20" "2002-04-17" ...
# $ DATE_ISSUED : IDate, format: "2002-02-22" "2002-04-30" "2002-04-18" ...
# $ LICENSE_STATUS : chr "AAI" "AAI" "AAI" "AAI" ...
# $ LICENSE_STATUS_CHANGE_DATE : IDate, format: NA NA NA ...
# $ SSA : num NA NA NA NA NA NA NA NA NA NA ...
# $ LATITUDE : num 42 41.9 41.9 41.9 41.9 ...
# $ LONGITUDE : num -87.7 -87.7 -87.6 -87.7 -87.6 ...
# $ LOCATION : chr "(41.99851437112669, -87.68001090539342)" "(41.931960332638006, -87.72215036594574)" "(41.88426142200001, -87.6495341312589)" "(41.931960332638006, -87.72215036594574)" ...
# - attr(*, ".internal.selfref")=<externalptr>
```

```r
geneorama::NAsummary(fbus)
# col Count nNA rNA nUnique rUnique
# ID 1 843765 0 0.0000 843529 0.9997
# LICENSE_ID 2 843765 0 0.0000 843765 1.0000
# ACCOUNT_NUMBER 3 843765 0 0.0000 146771 0.1739
# SITE_NUMBER 4 843765 0 0.0000 387 0.0004
# LEGAL_NAME 5 843765 0 0.0000 144814 0.1716
# DOING_BUSINESS_AS_NAME 6 843765 0 0.0000 165890 0.1966
# ADDRESS 7 843765 0 0.0000 156344 0.1852
# CITY 8 843765 0 0.0000 1666 0.0019
# STATE 9 843765 0 0.0000 55 0.0000
# ZIP_CODE 10 843765 8 0.0000 2494 0.0029
# WARD 11 843765 61102 0.0724 51 0.0000
# PRECINCT 12 843765 79015 0.0936 79 0.0000
# POLICE_DISTRICT 13 843765 74450 0.0882 32 0.0000
# LICENSE_CODE 14 843765 0 0.0000 132 0.0001
# LICENSE_DESCRIPTION 15 843765 0 0.0000 132 0.0001
# LICENSE_NUMBER 16 843765 1 0.0000 259203 0.3071 <- here you go
# APPLICATION_TYPE 17 843765 0 0.0000 6 0.0000
# APPLICATION_CREATED_DATE 18 843765 651561 0.7722 3813 0.0045
# APPLICATION_REQUIREMENTS_COMPLETE 19 843765 17442 0.0206 4167 0.0049
# PAYMENT_DATE 20 843765 20734 0.0245 4994 0.0059
# CONDITIONAL_APPROVAL 21 843765 0 0.0000 2 0.0000
# LICENSE_TERM_START_DATE 22 843765 2499 0.0029 3574 0.0042
# LICENSE_TERM_EXPIRATION_DATE 23 843765 63 0.0000 1636 0.0019
# LICENSE_APPROVED_FOR_ISSUANCE 24 843765 43718 0.0518 4776 0.0056
# DATE_ISSUED 25 843765 0 0.0000 3478 0.0041
# LICENSE_STATUS 26 843765 0 0.0000 5 0.0000
# LICENSE_STATUS_CHANGE_DATE 27 843765 786570 0.9322 3396 0.0040
# SSA 28 843765 622703 0.7380 54 0.0000
# LATITUDE 29 843765 61956 0.0734 73869 0.0875
# LONGITUDE 30 843765 61956 0.0734 73876 0.0875
# LOCATION 31 843765 0 0.0000 73888 0.0875
```
@geneorama my guess is your expectations are simply off - what slows down loops (aka …) is the number of groups. Compute the numbers I suggested, and if those do not grow "exponentially", then we have something interesting here.
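To illustrate that point with a made-up sketch (not data from the thread): same row count, very different group counts, and the `.SD[...]` form scales with the group count:

```r
library(data.table)
DT <- data.table(id = seq_len(5e5))
DT[, g_few  := id %% 100]   # 100 groups
DT[, g_many := id %% 1e5]   # 100,000 groups
system.time(DT[, .SD[id == max(id)], by = g_few])   # few groups: fast
system.time(DT[, .SD[id == max(id)], by = g_many])  # many groups: much slower
```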
@eantonya I think I (just now) realized your point about how the time increases would look if the growth were truly exponential. It's hard to pick up on the nuances by just reading the thread, and I think you missed the detail that the entire 800k+ rows are processed in less than 1 second if you do the process in steps / use the keep-column approach above.

So sure, the increments could be growing if the distribution is changing, but the whole thing shouldn't take more than a second, and it definitely shouldn't take more than 10 seconds.
```r
system.time(bus[1:800000, .SD[1L], by=LICENSE.NUMBER])
#    user  system elapsed
#  19.379   0.215  19.804

system.time(bus[1:800000, .I[1L], by=LICENSE.NUMBER])
#    user  system elapsed
#   0.085   0.007   0.092

system.time(bus[bus[1:800000, .I[1L], by=LICENSE.NUMBER]$V1])
#    user  system elapsed
#   0.157   0.018   0.176
```
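(What these timings show: `.SD[1L]` materializes a small data.table for every one of the roughly 250k groups before binding them together, while `.I[1L]` only collects one integer row index per group and leaves a single subset at the end to do all the work; hence the ~100x gap.)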
@arunsrinivasan Am I using `.SD` incorrectly?
There's an entire issue for it: #735
For the record, I think @franknarf1 had the best suggestion in the beginning. Since I only wanted to subset the rows within a group, I don't need the overhead of using `.SD`. In my mind, since the S in `.SD` stands for "Subset", ….

This advice was very good advice (and I use this pattern myself quite often): …

As @franknarf1 pointed out, there could be an issue with taking the first max, which is why I prefer the longer route inspired by the SO link. I would just break it up into pieces to make it more readable:

```r
ii <- bus[ , .I[(LICENSE.ID) == max(LICENSE.ID)], LICENSE.NUMBER]
bus_most_recent <- bus[ii$V1]
rm(ii)
```

I'm wondering though, whenever the ….
Yeah, that's the spirit of the optimization for `.SD`. By the way, eantonya is the @eddi you see authoring that now-idiomatic `.I` workaround.
@franknarf1 Thanks. I couldn't follow the linked issue yesterday, so I didn't see the link. Also, thanks for linking eantonya to eddi. I've been using ….

I didn't want to get off track yesterday, but I use …. Also I do the ….
Re-running this in this environment: …

I ran this: …

Output plot: [plot elided]

So, maybe this was really a …? @geneorama are you able to reproduce the slow timing again? I would close this now, but our dependency is still on R 3.1.0.
Here's what I got from running on R 3.1.0 via Jan's Docker image https://hub.docker.com/r/jangorecki/r-3.1.0/~/dockerfile/ : …

So, whatever problem was there doesn't appear to be there anymore... closing.
There is still a huge performance issue, but maybe it doesn't belong in this issue anymore because the growth of the time is no longer exponential. Still, by my calculations it takes about 300 times longer to use `.SD` than the dummy-variable approach.

Originally I focused on the growing time because it was growing so fast that I couldn't estimate how long it would take to do the whole data set. Now the growth is linear, so I can estimate it, which is where I got the 300x figure above.

Since I authored my example, we've updated RSocrata, so the example below is revised. Note: the following will install the RSocrata package from CRAN if you don't already have it.

```r
##------------------------------------------------------------------------------
## Install RSocrata, load libraries, download data
## (note: new url and new field names)
##------------------------------------------------------------------------------
library(data.table)
if(!"RSocrata" %in% rownames(installed.packages())) install.packages("RSocrata")
bus <- as.data.table(
  RSocrata::read.socrata("https://data.cityofchicago.org/resource/xqx5-8hwx.csv"))
##------------------------------------------------------------------------------
## Current function to do subsetting
## This is the best I can do in terms of performance; create a dummy
## variable called "keep" then delete it later.
##------------------------------------------------------------------------------
most_recent_bus_lic_fast <- function(dt){
dat <- copy(dt)
## Keep records where LICENSE_ID is the maximum LICENSE_ID for all records
## of a particular LICENSE_NUMBER (in other words, just keep the most
## recent license number)
dat <- dat[ , keep := (license_id) == max(license_id), license_number]
dat <- dat[keep==TRUE]
dat <- dat[ , keep:=NULL][]
return(dat)
}
## This takes about .7 seconds on my humble laptop
system.time(bus_recent1 <- most_recent_bus_lic_fast(bus))
##------------------------------------------------------------------------------
## This is how I would want to subset; by using the beauty of .SD and
## just one line of code!
##------------------------------------------------------------------------------
most_recent_bus_lic_slow <- function(dt){
dat <- copy(dt)
## Same as above, but using the preferred .SD method
dat <- dat[ , .SD[license_id == max(license_id)], license_number]
return(dat)
}
##------------------------------------------------------------------------------
## Calculate system times
##------------------------------------------------------------------------------
## Set the key for .SD to work, guess it should really be in the function?
setkey(bus, license_number, license_id)
## Initialize table of system times
system_times <- data.table(size = seq(10000, 50000, by=10000),
elapsed = NA_real_)
## Calculate times for each chunk size
for(i in 1:nrow(system_times)){
print(i)
cur_size <- system_times[i, size]
cur_elapsed <- system.time(most_recent_bus_lic_slow(bus[1:cur_size]))[["elapsed"]]
system_times[i, elapsed := cur_elapsed]
}
system_times
##------------------------------------------------------------------------------
## Plot system times
##------------------------------------------------------------------------------
plot(system_times)
##------------------------------------------------------------------------------
## Model and plot system times with estimated system time for nrow(bus)
##------------------------------------------------------------------------------
time_model <- lm(elapsed ~ size, system_times)
pred <- unname(predict(time_model, data.frame(size=nrow(bus)) ))
system_times <- rbind(system_times,
data.table(size = nrow(bus),
elapsed = pred))
plot(system_times)
```

The plot of times looks linear now. [plot elided]

However, it still takes less than a second to subset with the dummy variable, and it would take a long time with `.SD`.

I don't know how this came to your attention, but strangely this came up with a colleague of mine just last week, and I referred him to this issue. He said that he has a similar problem. I just got him on the …. We both agree ….

I remember reading other things that @jangorecki and/or @arunsrinivasan wrote about the behind-the-scenes optimization. Please let me know if you want a new issue, or if one's already open, or if you don't care. I'm not opening an issue unless prompted, thank you!

My current specs: …
@geneorama Dupe of #613? (I'm not sure what that issue is about precisely, but it seems related.)

Btw, Arun answered it on SO: https://stackoverflow.com/a/31854111

Your use of `setkey` already sorts the table, so sorting and dropping dupes is another option. All approaches are fast except for `j = .SD[condition]` ... so don't use that one; study GForce; and keep an eye on #735 until #613 is done ...?
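A sketch of those two options, assuming the key set in the script above and a data.table recent enough for `fromLast` in `unique()` and `on=` joins; the GForce variant keeps ties, the dedup variant keeps one row per group:

```r
## Option 1: keyed by (license_number, license_id), so within each group the
## last row holds the max; drop the earlier duplicates (one row per group)
setkey(bus, license_number, license_id)
recent1 <- unique(bus, by = "license_number", fromLast = TRUE)

## Option 2: GForce-optimized max() per group, then join back (keeps ties)
mx <- bus[, .(license_id = max(license_id)), by = license_number]
recent2 <- bus[mx, on = c("license_number", "license_id")]
```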
|
@franknarf1 look, you're way better at data.table than me. I love dt, but this level of detail hurts my brain. I'm quite happy to use it normally, then when something takes a long time (which is very rare) I work around it. I'm not above converting to data.frame, doing ….

Edit: thank you very much for your comments and analysis. Seriously, you guys are amazing. I read your answers all the time on SO.

Edit2: Actually, Arun's approach makes a lot of sense looking at this again. It's not that complicated. The linked issues are complicated, though.
@geneorama it looks like what you're saying is: the workaround syntax feels much less natural than your first instinct. Therefore one/both of the following could help you as a "lay"/casual (for lack of a better phrasing off the top of my head) user:

1. optimizing the natural `.SD[...]` syntax internally, and/or
2. documenting the faster idioms (e.g. in a vignette).

I think the first is covered in the issues Frank linked. The second might be filed separately or included in an existing issue about vignettes. Thanks for your feedback :)
Please feel free to edit the subject, I may not be saying this right...

I've noticed again and again that subsetting on `.SD` gets less efficient as the size of a data.table increases. This has been consistent across versions of R, Linux and Windows, and versions of data.table. Here's an example based on license data from data.cityofchicago.org: …

Never mind that I'm debugging; the point is that each time I add 100k rows, the additional processing time grows exponentially, even though I set LICENSE_NUMBER as the key.

The times visualized:

```r
plot(diff(c(1.34, 2.95, 4.56, 6.32, 8.85, 11.28, 15.27, 23.22)))
lines(lowess(diff(c(1.34, 2.95, 4.56, 6.32, 8.85, 11.28, 15.27, 23.22))))
```
Has anyone else noticed this?