
Use lookup (join to dictionary) for performance boost #2603

Open
ben519 opened this issue Feb 1, 2018 · 14 comments · May be fixed by #3279

Comments

@ben519

ben519 commented Feb 1, 2018

When I have a large dataset with a date column read in as a character, I can convert it to Date type like so

library(data.table)

set.seed(2018)
dt <- data.table(
  DateChar = as.character(as.Date("2000-01-01") + sample(10, replace = T, size = 10^7))
)

# Simple conversion using as.Date
system.time(dt[, Date1 := as.Date(DateChar)])  # ~59 seconds

# Simple conversion using as.IDate
system.time(dt[, Date2 := as.IDate(DateChar)])  # ~59 seconds

But this is painfully slow. So, I usually build a table of the unique date strings, convert those to Date type, and then join + update back into my original table:

system.time({
  uniqueDates <- unique(dt[, list(DateChar)])
  uniqueDates[, Date3 := as.Date(DateChar)]
  dt[uniqueDates, Date3 := i.Date3, on = "DateChar"]
})  # ~0.5 seconds (about 120 times speedup!)

This results in an enormous speedup. I was wondering if similar logic could be embedded in data.table to do this internally where it's appropriate. I haven't given it a ton of thought, but here are some basic points

  • This could be extended to other methods like day(), weekday(), ... (I don't think you could generalize this behavior to arbitrary functions. Imagine trying to apply this technique when the user does something like dt[, foo := myfunc(foo)] where myfunc <- function(x) ifelse(x < 10, x, rnorm(n = 1)). That said, a curated set of functions where this technique can be applied would still be greatly helpful.)
  • In my example, there would obviously be a performance hit if all the dates in the table were unique. Personally, I'd be willing to live with that occasional hit, but maybe some simple logic could be embedded, like "if the number of values > 1M and the number of unique elements is less than 50% of the total rows, then apply the technique". Counting the unique elements may itself be slow, so perhaps the user could opt in to the behavior, or it could only be applied to columns that already have an index. (A rough sketch of such a check follows this list.)
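
For illustration, a rough sketch of the heuristic in that second bullet. maybe_lookup, its arguments, and the thresholds are hypothetical names for this sketch, not an existing data.table API:

library(data.table)

# Hypothetical helper: apply f to x via a unique-value lookup, but only when
# the column is long and repetitive enough for the dictionary to pay off.
maybe_lookup <- function(x, f, n_min = 1e6, max_unique_frac = 0.5) {
  # note: counting uniques has a cost of its own, as the bullet above says
  if (length(x) >= n_min && uniqueN(x) / length(x) < max_unique_frac) {
    ux <- unique(x)
    return(f(ux)[match(x, ux)])
  }
  f(x)
}

# e.g. dt[, Date1 := maybe_lookup(DateChar, as.IDate)]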

Obviously I haven't fleshed out all the details of how this would work, but hopefully my idea is clear. The performance boost of this technique would be incredibly helpful to my workflow (after all, the reason I use data.table over dplyr and pandas is its superior performance). Searching through the issues, @MichaelChirico touched on this a bit in #2503.

R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.2
data.table_1.10.5
@franknarf1
Contributor

Matt mentioned doing something like that for fwrite:

(...copying here since comments on SO can get deleted...)

Since you have 100M rows, I suppose you have some repeating dates, so it might be faster to do something like A[, theDate := as.character(theDate[1L]), by=theDate]. fwrite is fairly new, so I'm guessing this treatment of dates is an oversight and they'd welcome a feature request on improving it. – Frank May 20 '16 at 14:05

@ Frank Nice idea. Dates are on the long to do list: #1664. Your idea triggered a thought ... fwrite could do a lookup internally rather than converting each and every date separately. – Matt Dowle Jun 21 '16 at 19:01

And looking at fwrite, I see a lookup for month and day, I think, and maybe that can be reused...?

@ben519
Author

ben519 commented Feb 1, 2018

Thanks @franknarf1, but I'm thinking more generally (not just dates). Another area where this comes up a lot for me is with regex. For example:

set.seed(2017)
str_pool <- stringi::stri_rand_strings(n = 1000, length = 20)
dt <- data.table(Foo = sample(str_pool, size = 10^7, replace = T))

system.time(dt[, NumVowels1 := stringr::str_count(Foo, "[A-Za-z]")])  # ~10.798

system.time({
  uniqueFoos <- unique(dt[, list(Foo)])
  uniqueFoos[, NumVowels2 := stringr::str_count(Foo, "[A-Za-z]")]
  dt[uniqueFoos, NumVowels2 := i.NumVowels2, on = "Foo"]
})  # ~0.736

If I had the ability to tell data.table to apply this technique with something like dt[, NumVowels1 := stringr::str_count(Foo, "[A-Za-z]"), useLookup = T] that would be ideal.

EDIT
I just got what you did, @franknarf1. In fact, system.time(dt[, NumVowels3 := stringr::str_count(Foo, "[A-Za-z]")[1L], by = Foo]) # ~0.252 runs even faster than what I had. Still, it would be nice if this technique were more integrated with data.table. For example

dt[, `:=`(
  NumVowels1 = stringr::str_count(Foo, "[A-Za-z]"),
  NumConsonants1 = stringr::str_count(Foo, "[^A-Za-z]"),
  ...,
  useLookup = T
)]

would be convenient.

@st-pasha
Contributor

st-pasha commented Feb 1, 2018

For this technique to be effective, the ratio nUniqueValues / nRows should be significantly less than 1. If it is close to one then we'd be just wasting time on grouping or maintaining a dictionary. Unfortunately figuring out the number of uniques is expensive in itself (unless it is somehow known beforehand)...

@tdeenes
Member

tdeenes commented Feb 1, 2018

I also use this technique when I know beforehand that the number of unique values is limited compared to the number of rows, which is not an uncommon scenario. I think the core message of @ben519 is that data.table could provide a switch (like useLookup = FALSE) which is FALSE by default but can be turned on if required.

@MichaelChirico
Member

question here is about API... I think adding useLookup as an option to [.data.table is probably infeasible given the need to know the uniqueN/N ratio for it to be worthwhile, something that's generally easier for the user than for data.table to know (perhaps, like auto-indexing, we could keep the uniqueN available in attr for columns whenever it's calculated to facilitate this? also, related, #2458).

you're right that doing the lookup & join by hand every time can be tedious... perhaps useLookup could instead be a function which wraps around other functions to apply the technique when prompted, a la useLookup(as.Date(x)) would know to apply as.Date using a lookup-and-join approach rather than applying as.Date to the full vector (lazy-eval is our friend here)...
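
For concreteness, a minimal sketch of that wrapper idea. useLookup below is a hypothetical standalone function, not part of data.table; it captures the call with substitute(), evaluates it on the unique values of its first argument, and expands the result back with match():

# Hypothetical sketch of a lookup-applying wrapper (not a data.table API).
useLookup <- function(expr) {
  cl  <- substitute(expr)          # e.g. as.Date(DateChar), unevaluated
  env <- parent.frame()
  x   <- eval(cl[[2L]], env)       # assumes the first argument is the vector
  ux  <- unique(x)
  cl[[2L]] <- quote(ux)
  res <- eval(cl, list(ux = ux), env)   # run the call on unique values only
  res[match(x, ux)]                     # expand back to full length
}

# e.g. dt[, Date := useLookup(as.Date(DateChar))]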

@jangorecki jangorecki changed the title Potential Performance Boost? Use lookup (join to dictionary) for performance boost Feb 2, 2018
@markdanese

We have tried lots of ways to get dates in faster. It is a huge speed issue for large datasets. There are some functions that parse date time faster, but they have issues. Lubridate has a fast_strptime function that speeds parsing up a bit. There is also the fasttime package, and the iotools package. So, one option is to roll your own parser, or borrow one from somewhere else. This has to be a solved problem somewhere. I would do this if I had the programming ability, but that is well beyond my skill set.

As for the ratio of uniqueN to N, you could just use a rule of thumb. There are only ~36,500 unique dates in the last 100 years. So, any column with over, say, 400,000 values must have a ratio of less than 10% (and probably much lower).
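
The arithmetic behind that rule of thumb:

365.25 * 100    # ~36,525 distinct calendar dates in a century
36500 / 4e5     # ~0.09, i.e. under a 10% unique ratio for 400,000+ rows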

@matthiaskaeding

You could also omit the lookup step via:

dt[, Date3 := as.IDate(.BY[[1L]]), by = DateChar]
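# (with by = DateChar, .BY[[1L]] is the single DateChar value of the current
#  group, so as.IDate is evaluated once per unique date and recycled across rows)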

Maybe this could be somehow done via an additional argument?

@MichaelChirico
Member

MichaelChirico commented Jan 12, 2019

Raising this again. Ran into it yesterday working with a data set spanning 13 days with 25 million rows -- as.IDate on the whole dataset took forever but as.IDate on the unique values + merge back was far faster (especially since the data was already sorted).

Thinking of filing a PR -- idea is to add an argument to as.IDate which would induce the merge-based method (probably by default is the best way? given the results below).

Ran a quick benchmark to evaluate the benefit. Benefits are biggest when there are few unique dates relative to the number of rows (makes sense, and I think this case is common -- there are only so many days after all...)

Larger values of unq_pct are impossible for n = 1e7 since the sampled dates become too large to be valid.

library(data.table)
library(microbenchmark)

set.seed(203949550)
times = CJ(n_rows = 10^(1:7), unq_pct = 1:9/10)
# pre-allocate result columns so the is.na(ratio) filter below works on a fresh run
times[ , c("raw_IDate", "use_merge", "ratio") := NA_real_]

for (ii in times[is.na(ratio), which = TRUE]) {
  input = times[ii, as.character(.Date(sample(unq_pct*n_rows, n_rows, TRUE)))]
  
  t0 = get_nanotime()
  as.IDate(input)
  elapsed = get_nanotime() - t0
  times[ii, raw_IDate := elapsed]
  
  t0 = get_nanotime()
  DT = data.table(input)
  DT[DT[ , TRUE, keyby = input][ , IDate := as.IDate(input)], 
     on = 'input', i.IDate]
  elapsed = get_nanotime() - t0
  times[ii, use_merge := elapsed]
  
  print(times[ii])
}

times[ , ratio := use_merge/raw_IDate][]
#     n_rows unq_pct   raw_IDate   use_merge       ratio
#  1:  1e+01     0.1      199959     3021756  15.1118779
#  2:  1e+01     0.2      165319     3474975  21.0198162
#  3:  1e+01     0.3      176818     3166302  17.9071248
#  4:  1e+01     0.4      158889     3153484  19.8470882
#  5:  1e+01     0.5      171664     3242940  18.8912061
#  6:  1e+01     0.6      178870     3320678  18.5647565
#  7:  1e+01     0.7      186232    19897569 106.8429110
#  8:  1e+01     0.8      195861    15482463  79.0482179
#  9:  1e+01     0.9      177210     9200518  51.9187292
# 10:  1e+02     0.1      632861    20505854  32.4018292
# 11:  1e+02     0.2      885172    20330939  22.9683485
# 12:  1e+02     0.3      615945    22799364  37.0152595
# 13:  1e+02     0.4      711640     6825605   9.5913734
# 14:  1e+02     0.5      605054     3534530   5.8416769
# 15:  1e+02     0.6      637470     3570637   5.6012628
# 16:  1e+02     0.7      643126     4058829   6.3110946
# 17:  1e+02     0.8      613068     3210657   5.2370324
# 18:  1e+02     0.9      629629     3663639   5.8187266
# 19:  1e+03     0.1     4769450     3736763   0.7834788
# 20:  1e+03     0.2     5114147     4461260   0.8723371
# 21:  1e+03     0.3     5005257     5240364   1.0469720
# 22:  1e+03     0.4     5157089     5530070   1.0723239
# 23:  1e+03     0.5     5438516     6854953   1.2604455
# 24:  1e+03     0.6     5372007     7471935   1.3909019
# 25:  1e+03     0.7     5185766     8062086   1.5546567
# 26:  1e+03     0.8     5239942     7201256   1.3743007
# 27:  1e+03     0.9     5318541     7377168   1.3870661
# 28:  1e+04     0.1    52512012    11358165   0.2162965
# 29:  1e+04     0.2    53388903    17137704   0.3209975
# 30:  1e+04     0.3    54443715    23328555   0.4284894
# 31:  1e+04     0.4    55652452    29480601   0.5297269
# 32:  1e+04     0.5    55530846    35698163   0.6428529
# 33:  1e+04     0.6    55306087    36110368   0.6529185
# 34:  1e+04     0.7    55263600    40638049   0.7353493
# 35:  1e+04     0.8    55160108    41273226   0.7482441
# 36:  1e+04     0.9    55478597    43758029   0.7887371
# 37:  1e+05     0.1   351426054    80883563   0.2301581
# 38:  1e+05     0.2   326886378   165915966   0.5075646
# 39:  1e+05     0.3   354084604   212982177   0.6015008
# 40:  1e+05     0.4   336944421   283950983   0.8427235
# 41:  1e+05     0.5   335977548   330784138   0.9845424
# 42:  1e+05     0.6   336208725   338942515   1.0081312
# 43:  1e+05     0.7   346194055   369929968   1.0685625
# 44:  1e+05     0.8   340199877   402226454   1.1823239
# 45:  1e+05     0.9   342883541   418062408   1.2192548
# 46:  1e+06     0.1  3609760940   967885293   0.2681300
# 47:  1e+06     0.2  4047479480  1266096206   0.3128110
# 48:  1e+06     0.3  4380826654  1911159551   0.4362555
# 49:  1e+06     0.4  4570411455  2321844447   0.5080165
# 50:  1e+06     0.5  4298216359  3036054850   0.7063523
# 51:  1e+06     0.6  4351227178  3228498691   0.7419743
# 52:  1e+06     0.7  4704161234  3881075957   0.8250304
# 53:  1e+06     0.8  4762439856  4293166396   0.9014636
# 54:  1e+06     0.9  5161067131  3978574040   0.7708821
# 55:  1e+07     0.1 48453784641 10943946312   0.2258636
# 56:  1e+07     0.2 49662117478 15133634109   0.3047320
# 57:  1e+07     0.3 50916879803          NA          NA
# 58:  1e+07     0.4          NA          NA          NA
# 59:  1e+07     0.5          NA          NA          NA
# 60:  1e+07     0.6          NA          NA          NA
# 61:  1e+07     0.7          NA          NA          NA
# 62:  1e+07     0.8          NA          NA          NA
# 63:  1e+07     0.9          NA          NA          NA
#     n_rows unq_pct   raw_IDate   use_merge       ratio

@jangorecki

This comment was marked as outdated.

@MichaelChirico
Member

@jangorecki I don't think I follow what you have in mind for a more general version of this. Any as.* calls within [.data.table?

As of now, #3279 implements this within as.IDate.default; let's continue discussion there, or perhaps open a new, broader issue?
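
For readers skimming the thread, the core of the merge-based idea reduces to something like the sketch below. This is only an illustration of "convert unique values once, then map back", not the actual #3279 code:

# Illustrative sketch only -- not the #3279 implementation.
library(data.table)
as_idate_via_lookup <- function(x, ...) {
  ux <- unique(x)
  as.IDate(ux, ...)[match(x, ux)]
}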

@MichaelChirico MichaelChirico linked a pull request Jan 13, 2019 that will close this issue
@jangorecki
Member

I should have read the full topic before commenting. Your PR looks like a good way to go with this.

@ben519
Author

ben519 commented Jun 4, 2022

UPDATE, ~4 years later...

I've bumped into this (deficiency?) yet again with another example that I thought was worth sharing. In this example, I have a function, clean_str(x), that, given a vector of strings, returns cleaned strings after a sequence of string-processing steps.

@franknarf1's suggested trick for my original example doesn't seem to boost performance on this example, which I find very odd.

Note that I'm using data.table v 1.14.3

library(data.table)
requireNamespace("stringi")

set.seed(2022)
foo <- data.table(x = stringi::stri_rand_strings(n = 10^7, 4, '[a-z]'))

print(foo)
               x
          <char>
       1:   vqdo
       2:   eqbb
       3:   jtae
       4:   dnpd
       5:   unpw
      ---       
 9999996:   ozbo
 9999997:   npog
 9999998:   schs
 9999999:   qhhn
10000000:   wiec

uniqueN(foo)
456976

clean_str <- function(x){
  # Given a character vector x, "clean" the strings via a sequence of replacements
  # (In reality this does a lot more string processing)
  
  y <- gsub(pattern = "[aeiou]", replacement = " ", x = x)
  y <- gsub(pattern = "(sh|pq|tv)", replacement = " ", x = y)
  y <- gsub(pattern = "^l", replacement = "zz", x = y)
  y <- gsub(pattern = "^\\s", replacement = "", x = y)
  y <- gsub(pattern = "\\s$", replacement = "", x = y)
  y <- gsub(pattern = "\\s+", replacement = "X", x = y)
  
  return(y)
}

# Method 1
system.time(foo[, y := clean_str(x)])
# user  system elapsed 
# 33.563   0.075  33.644

# Method 2
system.time(foo[, y := clean_str(x[1L]), by = x])
# user  system elapsed 
# 32.803   0.138  29.474 

# Method 3
system.time({
  temp <- unique(foo)
  temp[, y := clean_str(x)]
  foo[temp, y := i.y, on = "x"]
})
# user  system elapsed 
# 16.535   0.158   5.055

Notice that the lookup technique is more than 6 times faster than the other two approaches. In my real-world example, the time difference is even more substantial.

@MichaelChirico
Member

FWIW, you might want to try converting to factor and processing on levels(x)
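
A rough sketch of that factor approach, continuing the foo/clean_str example above (untimed here; y_factor is just a column name for the sketch):

f <- factor(foo$x)                                       # one level per unique string
foo[, y_factor := clean_str(levels(f))[as.integer(f)]]   # clean each level once, expand by integer codes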

@jangorecki
Member

> question here is about API... I think adding useLookup as an option to [.data.table is probably infeasible given the need to know the uniqueN/N ratio for it to be worthwhile, something that's generally easier for the user than for data.table to know (perhaps, like auto-indexing, we could keep the uniqueN available in attr for columns whenever it's calculated to facilitate this? also, related, #2458).

This feature is discussed in #4387 and implemented in #4386
