
Use lookup (join to dictionary) for performance boost #2603

Open
ben519 opened this issue Feb 1, 2018 · 14 comments · May be fixed by #3279

Comments

@ben519

ben519 commented Feb 1, 2018

When I have a large dataset with a date column read in as a character, I can convert it to Date type like so

library(data.table)

set.seed(2018)
dt <- data.table(
  DateChar = as.character(as.Date("2000-01-01") + sample(10, replace = T, size = 10^7))
)

# Simple conversion using as.Date
system.time(dt[, Date1 := as.Date(DateChar)])  # ~59 seconds

# Simple conversion using as.IDate
system.time(dt[, Date2 := as.IDate(DateChar)])  # ~59 seconds

But this is painfully slow. So, I usually build a table of the unique date strings, convert those to Date type, and then join + update back into my original table:

system.time({
  uniqueDates <- unique(dt[, list(DateChar)])
  uniqueDates[, Date3 := as.Date(DateChar)]
  dt[uniqueDates, Date3 := i.Date3, on = "DateChar"]
})  # ~0.5 seconds (about 120 times speedup!)

This results in an enormous speedup. I was wondering if similar logic could be embedded in data.table to do this internally where it's appropriate. I haven't given it a ton of thought, but here are some basic points

  • This could be extended to other methods like day(), weekday(), ... (I don't think you could generalize this behavior to arbitrary functions. Imagine trying to apply this technique when the user does something like dt[, foo := myfunc(foo)] where myfunc <- function(x) ifelse(x < 10, x, rnorm(n = 1)). That said, a curated set of functions where this technique can be applied would still be greatly helpful.)
  • In my example, there would obviously be a performance hit if all the dates in the table were unique. Personally, I'd be willing to live with that occasional hit, but maybe some simple logic could be embedded, like "if the number of values > 1M and the number of unique elements is less than 50% of the total rows, then apply the technique". Counting the unique elements may itself be slow, so perhaps the user could opt in to the behavior, or it could only be applied to columns that already have an index. (A rough sketch of such a check follows this list.)
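
For illustration, a rough sketch of the heuristic in that second bullet. maybe_lookup, its arguments, and the thresholds are hypothetical names for this sketch, not an existing data.table API:

library(data.table)

# Hypothetical helper: apply f to x via a unique-value lookup, but only when
# the column is long and repetitive enough for the dictionary to pay off.
maybe_lookup <- function(x, f, n_min = 1e6, max_unique_frac = 0.5) {
  # note: counting uniques has a cost of its own, as the bullet above says
  if (length(x) >= n_min && uniqueN(x) / length(x) < max_unique_frac) {
    ux <- unique(x)
    return(f(ux)[match(x, ux)])
  }
  f(x)
}

# e.g. dt[, Date1 := maybe_lookup(DateChar, as.IDate)]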

Obviously I haven't fleshed out all the details of how this would work, but hopefully my idea is clear. The performance boost of this technique would be incredibly helpful to my workflow (after all, the reason I use data.table over dplyr and pandas is its superior performance). Searching through the issues, @MichaelChirico touched on this a bit in #2503.

R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.2
data.table_1.10.5
@franknarf1
Contributor

Matt mentioned doing something like that for fwrite:

(...copying here since comments on SO can get deleted...)

Since you have 100M rows, I suppose you have some repeating dates, so it might be faster to do something like A[, theDate := as.character(theDate[1L]), by=theDate]. fwrite is fairly new, so I'm guessing this treatment of dates is an oversight and they'd welcome a feature request on improving it. – Frank May 20 '16 at 14:05

@ Frank Nice idea. Dates are on the long to do list: #1664. Your idea triggered a thought ... fwrite could do a lookup internally rather than converting each and every date separately. – Matt Dowle Jun 21 '16 at 19:01

And looking at fwrite, I see a lookup for month and day, I think, and maybe that can be reused...?

@ben519
Author

ben519 commented Feb 1, 2018

Thanks @franknarf1, but I'm thinking more generally (not just dates). Another area where this comes up a lot for me is with regex. For example:

set.seed(2017)
str_pool <- stringi::stri_rand_strings(n = 1000, length = 20)
dt <- data.table(Foo = sample(str_pool, size = 10^7, replace = T))

system.time(dt[, NumVowels1 := stringr::str_count(Foo, "[A-Za-z]")])  # ~10.798

system.time({
  uniqueFoos <- unique(dt[, list(Foo)])
  uniqueFoos[, NumVowels2 := stringr::str_count(Foo, "[A-Za-z]")]
  dt[uniqueFoos, NumVowels2 := i.NumVowels2, on = "Foo"]
})  # ~0.736

If I had the ability to tell data.table to apply this technique with something like dt[, NumVowels1 := stringr::str_count(Foo, "[A-Za-z]"), useLookup = T] that would be ideal.

EDIT
I just got what you did, @franknarf1. In fact, system.time(dt[, NumVowels3 := stringr::str_count(Foo, "[A-Za-z]")[1L], by = Foo]) # ~0.252 runs even faster than what I had. Still, it would be nice if this technique were more integrated with data.table. For example

dt[, `:=`(
  NumVowels1 = stringr::str_count(Foo, "[A-Za-z]"),
  NumConsonants1 = stringr::str_count(Foo, "[^A-Za-z]"),
  ...,
  useLookup = T
)]

would be convenient.

@st-pasha
Contributor

st-pasha commented Feb 1, 2018

For this technique to be effective, the ratio nUniqueValues / nRows should be significantly less than 1. If it is close to one then we'd be just wasting time on grouping or maintaining a dictionary. Unfortunately figuring out the number of uniques is expensive in itself (unless it is somehow known beforehand)...

@tdeenes
Member

tdeenes commented Feb 1, 2018

I also use this technique when I know beforehand that the number of unique values is limited compared to the number of rows, which is not an uncommon scenario. I think the core message of @ben519 is that data.table could provide a switch (like useLookup = FALSE) which is FALSE by default but can be turned on if required.

@MichaelChirico
Member

question here is about API... I think adding useLookup as an option to [.data.table is probably infeasible given the need to know the uniqueN/N ratio for it to be worthwhile, something that's generally easier for the user than for data.table to know (perhaps, like auto-indexing, we could keep the uniqueN available in attr for columns whenever it's calculated to facilitate this? also, related, #2458).

you're right that doing the lookup & join by hand every time can be tedious... perhaps useLookup could instead be a function which wraps around other functions to apply the technique when prompted, a la useLookup(as.Date(x)) would know to apply as.Date using a lookup-and-join approach rather than applying as.Date to the full vector (lazy-eval is our friend here)...
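
For concreteness, a minimal sketch of that wrapper idea. useLookup below is a hypothetical standalone function, not part of data.table; it captures the call with substitute(), evaluates it on the unique values of its first argument, and expands the result back with match():

# Hypothetical sketch of a lookup-applying wrapper (not a data.table API).
useLookup <- function(expr) {
  cl  <- substitute(expr)          # e.g. as.Date(DateChar), unevaluated
  env <- parent.frame()
  x   <- eval(cl[[2L]], env)       # assumes the first argument is the vector
  ux  <- unique(x)
  cl[[2L]] <- quote(ux)
  res <- eval(cl, list(ux = ux), env)   # run the call on unique values only
  res[match(x, ux)]                     # expand back to full length
}

# e.g. dt[, Date := useLookup(as.Date(DateChar))]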

@jangorecki jangorecki changed the title Potential Performance Boost? Use lookup (join to dictionary) for performance boost Feb 2, 2018
@markdanese

We have tried lots of ways to get dates in faster. It is a huge speed issue for large datasets. There are some functions that parse date time faster, but they have issues. Lubridate has a fast_strptime function that speeds parsing up a bit. There is also the fasttime package, and the iotools package. So, one option is to roll your own parser, or borrow one from somewhere else. This has to be a solved problem somewhere. I would do this if I had the programming ability, but that is well beyond my skill set.

As for the ratio of uniqueN to N, you could just use a rule of thumb. There are only ~36,500 unique dates in the last 100 years. So, any column with over, say, 400,000 values must have a ratio of less than 10% (and probably much lower).
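
The arithmetic behind that rule of thumb:

365.25 * 100    # ~36,525 distinct calendar dates in a century
36500 / 4e5     # ~0.09, i.e. under a 10% unique ratio for 400,000+ rows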

@matthiaskaeding

You could also omit the lookup step via:

dt[, Date3 := as.IDate(.BY[[1L]]), by = DateChar]
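# (with by = DateChar, .BY[[1L]] is the single DateChar value of the current
#  group, so as.IDate is evaluated once per unique date and recycled across rows)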

Maybe this could be somehow done via an additional argument?

@MichaelChirico
Member

MichaelChirico commented Jan 12, 2019

Raising this again. Ran into it yesterday working with a data set spanning 13 days with 25 million rows -- as.IDate on the whole dataset took forever but as.IDate on the unique values + merge back was far faster (especially since the data was already sorted).

Thinking of filing a PR -- idea is to add an argument to as.IDate which would induce the merge-based method (probably by default is the best way? given the results below).

Ran a quick benchmark to evaluate the benefit. Benefits are biggest when there are few unique dates relative to the number of rows (makes sense, and I think this case is common -- there are only so many days after all...)

Larger values of unq_pct are impossible for n = 1e7 since the sampled dates become too large to be valid.

library(data.table)
library(microbenchmark)

set.seed(203949550)
times = CJ(n_rows = 10^(1:7), unq_pct = 1:9/10)
# pre-allocate result columns so the is.na(ratio) filter below works on a fresh run
times[ , c("raw_IDate", "use_merge", "ratio") := NA_real_]

for (ii in times[is.na(ratio), which = TRUE]) {
  input = times[ii, as.character(.Date(sample(unq_pct*n_rows, n_rows, TRUE)))]
  
  t0 = get_nanotime()
  as.IDate(input)
  elapsed = get_nanotime() - t0
  times[ii, raw_IDate := elapsed]
  
  t0 = get_nanotime()
  DT = data.table(input)
  DT[DT[ , TRUE, keyby = input][ , IDate := as.IDate(input)], 
     on = 'input', i.IDate]
  elapsed = get_nanotime() - t0
  times[ii, use_merge := elapsed]
  
  print(times[ii])
}

times[ , ratio := use_merge/raw_IDate][]
#     n_rows unq_pct   raw_IDate   use_merge       ratio
#  1:  1e+01     0.1      199959     3021756  15.1118779
#  2:  1e+01     0.2      165319     3474975  21.0198162
#  3:  1e+01     0.3      176818     3166302  17.9071248
#  4:  1e+01     0.4      158889     3153484  19.8470882
#  5:  1e+01     0.5      171664     3242940  18.8912061
#  6:  1e+01     0.6      178870     3320678  18.5647565
#  7:  1e+01     0.7      186232    19897569 106.8429110
#  8:  1e+01     0.8      195861    15482463  79.0482179
#  9:  1e+01     0.9      177210     9200518  51.9187292
# 10:  1e+02     0.1      632861    20505854  32.4018292
# 11:  1e+02     0.2      885172    20330939  22.9683485
# 12:  1e+02     0.3      615945    22799364  37.0152595
# 13:  1e+02     0.4      711640     6825605   9.5913734
# 14:  1e+02     0.5      605054     3534530   5.8416769
# 15:  1e+02     0.6      637470     3570637   5.6012628
# 16:  1e+02     0.7      643126     4058829   6.3110946
# 17:  1e+02     0.8      613068     3210657   5.2370324
# 18:  1e+02     0.9      629629     3663639   5.8187266
# 19:  1e+03     0.1     4769450     3736763   0.7834788
# 20:  1e+03     0.2     5114147     4461260   0.8723371
# 21:  1e+03     0.3     5005257     5240364   1.0469720
# 22:  1e+03     0.4     5157089     5530070   1.0723239
# 23:  1e+03     0.5     5438516     6854953   1.2604455
# 24:  1e+03     0.6     5372007     7471935   1.3909019
# 25:  1e+03     0.7     5185766     8062086   1.5546567
# 26:  1e+03     0.8     5239942     7201256   1.3743007
# 27:  1e+03     0.9     5318541     7377168   1.3870661
# 28:  1e+04     0.1    52512012    11358165   0.2162965
# 29:  1e+04     0.2    53388903    17137704   0.3209975
# 30:  1e+04     0.3    54443715    23328555   0.4284894
# 31:  1e+04     0.4    55652452    29480601   0.5297269
# 32:  1e+04     0.5    55530846    35698163   0.6428529
# 33:  1e+04     0.6    55306087    36110368   0.6529185
# 34:  1e+04     0.7    55263600    40638049   0.7353493
# 35:  1e+04     0.8    55160108    41273226   0.7482441
# 36:  1e+04     0.9    55478597    43758029   0.7887371
# 37:  1e+05     0.1   351426054    80883563   0.2301581
# 38:  1e+05     0.2   326886378   165915966   0.5075646
# 39:  1e+05     0.3   354084604   212982177   0.6015008
# 40:  1e+05     0.4   336944421   283950983   0.8427235
# 41:  1e+05     0.5   335977548   330784138   0.9845424
# 42:  1e+05     0.6   336208725   338942515   1.0081312
# 43:  1e+05     0.7   346194055   369929968   1.0685625
# 44:  1e+05     0.8   340199877   402226454   1.1823239
# 45:  1e+05     0.9   342883541   418062408   1.2192548
# 46:  1e+06     0.1  3609760940   967885293   0.2681300
# 47:  1e+06     0.2  4047479480  1266096206   0.3128110
# 48:  1e+06     0.3  4380826654  1911159551   0.4362555
# 49:  1e+06     0.4  4570411455  2321844447   0.5080165
# 50:  1e+06     0.5  4298216359  3036054850   0.7063523
# 51:  1e+06     0.6  4351227178  3228498691   0.7419743
# 52:  1e+06     0.7  4704161234  3881075957   0.8250304
# 53:  1e+06     0.8  4762439856  4293166396   0.9014636
# 54:  1e+06     0.9  5161067131  3978574040   0.7708821
# 55:  1e+07     0.1 48453784641 10943946312   0.2258636
# 56:  1e+07     0.2 49662117478 15133634109   0.3047320
# 57:  1e+07     0.3 50916879803          NA          NA
# 58:  1e+07     0.4          NA          NA          NA
# 59:  1e+07     0.5          NA          NA          NA
# 60:  1e+07     0.6          NA          NA          NA
# 61:  1e+07     0.7          NA          NA          NA
# 62:  1e+07     0.8          NA          NA          NA
# 63:  1e+07     0.9          NA          NA          NA
#     n_rows unq_pct   raw_IDate   use_merge       ratio

@jangorecki

This comment was marked as outdated.

@MichaelChirico
Member

@jangorecki I don't think I follow what you have in mind for a more general version of this. Any as.* calls within [.data.table?

As of now, #3279 implements this within as.IDate.default; let's continue discussion there, or perhaps open a new, broader issue?
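
For readers skimming the thread, the core of the merge-based idea reduces to something like the sketch below. This is only an illustration of "convert unique values once, then map back", not the actual #3279 code:

# Illustrative sketch only -- not the #3279 implementation.
library(data.table)
as_idate_via_lookup <- function(x, ...) {
  ux <- unique(x)
  as.IDate(ux, ...)[match(x, ux)]
}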

@MichaelChirico MichaelChirico linked a pull request Jan 13, 2019 that will close this issue
@jangorecki
Member

I should have read the full topic before commenting. Your PR looks like a good way to go with this.

@ben519
Author

ben519 commented Jun 4, 2022

UPDATE, ~4 years later...

I've bumped into this (deficiency?) yet again with another example that I thought was worth sharing. In this example, I have a function, clean_str(x), that, given a vector of strings, returns cleaned strings after a sequence of string-processing steps.

@franknarf1's suggested trick for my original example doesn't seem to boost performance on this example, which I find very odd.

Note that I'm using data.table v 1.14.3

library(data.table)
requireNamespace("stringi")

set.seed(2022)
foo <- data.table(x = stringi::stri_rand_strings(n = 10^7, 4, '[a-z]'))

print(foo)
               x
          <char>
       1:   vqdo
       2:   eqbb
       3:   jtae
       4:   dnpd
       5:   unpw
      ---       
 9999996:   ozbo
 9999997:   npog
 9999998:   schs
 9999999:   qhhn
10000000:   wiec

uniqueN(foo)
456976

clean_str <- function(x){
  # Given a character vector x, "clean" the strings via a sequence of replacements
  # (In reality this does a lot more string processing)
  
  y <- gsub(pattern = "[aeiou]", replacement = " ", x = x)
  y <- gsub(pattern = "(sh|pq|tv)", replacement = " ", x = y)
  y <- gsub(pattern = "^l", replacement = "zz", x = y)
  y <- gsub(pattern = "^\\s", replacement = "", x = y)
  y <- gsub(pattern = "\\s$", replacement = "", x = y)
  y <- gsub(pattern = "\\s+", replacement = "X", x = y)
  
  return(y)
}

# Method 1
system.time(foo[, y := clean_str(x)])
# user  system elapsed 
# 33.563   0.075  33.644

# Method 2
system.time(foo[, y := clean_str(x[1L]), by = x])
# user  system elapsed 
# 32.803   0.138  29.474 

# Method 3
system.time({
  temp <- unique(foo)
  temp[, y := clean_str(x)]
  foo[temp, y := i.y, on = "x"]
})
# user  system elapsed 
# 16.535   0.158   5.055

Notice that the lookup technique is more than 6 times faster than the other two approaches. In my real-world example, the time difference is even more substantial.

@MichaelChirico
Member

FWIW, you might want to try converting to factor and processing on levels(x)
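
A rough sketch of that factor approach, continuing the foo/clean_str example above (untimed here; y_factor is just a column name for the sketch):

f <- factor(foo$x)                                       # one level per unique string
foo[, y_factor := clean_str(levels(f))[as.integer(f)]]   # clean each level once, expand by integer codes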

@jangorecki
Member

> question here is about API... I think adding useLookup as an option to [.data.table is probably infeasible given the need to know the uniqueN/N ratio for it to be worthwhile, something that's generally easier for the user than for data.table to know (perhaps, like auto-indexing, we could keep the uniqueN available in attr for columns whenever it's calculated to facilitate this? also, related, #2458).

This feature is discussed in #4387 and implemented in #4386
