Use lookup (join to dictionary) for performance boost #2603
Matt mentioned doing something like that for fwrite. And looking at fwrite, I see a lookup for month and day, I think, and maybe that can be reused...?
Thanks @franknarf1 but I'm thinking more general (not just dates). Another area where this comes up a lot for me is with regex, where the same expensive pattern gets re-applied to many duplicate values; see the sketch below. If I had the ability to tell data.table to apply this technique automatically, that would be convenient.
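A minimal sketch of the kind of regex case meant here (the column `txt` and the pattern are hypothetical stand-ins; the original example is not shown):

```r
library(data.table)
dt <- data.table(txt = sample(sprintf("id-%04d", 1:1000), 1e7, replace = TRUE))

# direct: the regex runs on all 10M rows
dt[, id := sub("^id-", "", txt)]

# lookup: the regex runs once per unique value, then an update join
# copies the results back into the big table
lookup <- data.table(txt = unique(dt$txt))
lookup[, id := sub("^id-", "", txt)]
dt[lookup, on = "txt", id := i.id]
```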
For this technique to be effective, the ratio of unique values to total rows (uniqueN to N) must be small.
I also use this technique when I know beforehand that the number of unique values is limited compared to the number of rows, which is not an uncommon scenario. I think the core message of @ben519 is that data.table could provide a switch to turn this behavior on.
The question here is about API... you're right that doing the lookup & join by hand every time can be tedious... perhaps a dedicated helper or argument would cover it.
We have tried lots of ways to get dates in faster; it is a huge speed issue for large datasets. There are some functions that parse date-times faster, but they have issues. Lubridate has a fast_strptime function that speeds parsing up a bit, and there are also the fasttime and iotools packages. So one option is to roll your own parser, or borrow one from somewhere else; this has to be a solved problem somewhere. I would do this if I had the programming ability, but that is well beyond my skill set. As for the ratio of uniqueN to N, you could just use a rule of thumb: there are only ~36,500 unique dates in the last 100 years, so any date column with over, say, 400,000 values must have a ratio of less than 10% (and probably much lower).
You could also omit the lookup step via grouping, since the j expression runs once per group (i.e. once per unique DateChar):

```r
dt[, Date3 := as.IDate(.BY[[1L]]), by = DateChar]
```

Maybe this could somehow be done via an additional argument?
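A self-contained illustration of that trick, with made-up data:

```r
library(data.table)
dt <- data.table(DateChar = sample(format(Sys.Date() - 0:30, "%Y-%m-%d"),
                                   1e6, replace = TRUE))
# .BY[[1L]] is the scalar group value, so as.IDate() is called once per
# unique DateChar instead of once per row
dt[, Date3 := as.IDate(.BY[[1L]]), by = DateChar]
```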
Raising this again. Ran into it yesterday working with a data set spanning 13 days with 25 million rows. Thinking of filing a PR -- the idea is to add an argument opting into this lookup behavior. Ran a quick benchmark to evaluate the benefit (sketched below). Benefits are biggest when there are few unique dates relative to the number of rows (makes sense, and I think this case is common -- there are only so many days after all...). The benefit shrinks as the share of unique values grows.
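A hedged reconstruction of the kind of benchmark described, with the row count scaled down from 25 million; exact numbers will vary:

```r
library(data.table)
library(microbenchmark)

n  <- 1e6L
dt <- data.table(DateChar = sample(format(Sys.Date() - 0:12, "%Y-%m-%d"),
                                   n, replace = TRUE))  # 13 unique days

microbenchmark(times = 5L,
  direct = dt[, Date := as.IDate(DateChar)],   # parse every row
  lookup = {                                   # parse once per unique value
    lk <- data.table(DateChar = unique(dt$DateChar))
    lk[, Date := as.IDate(DateChar)]
    dt[lk, on = "DateChar", Date := i.Date]
  }
)
```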
@jangorecki I don't think I follow what you have in mind for a more general version of this. As of now, what I implemented in #3279 is implemented internally.
I should have read the full topic before commenting. Your PR looks like a good way to go with this.
UPDATE, ~4 years later: I've bumped into this (deficiency?) yet again with another example that I thought was worth sharing. In this example, I have a function that is expensive to evaluate row by row; the comparison is sketched below. @franknarf1's suggested trick for my original example doesn't seem to boost performance on this example, which I find very odd. Note that I'm using data.table v1.14.3.
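A hedged sketch of the shape of this comparison; `f` and the data below are stand-ins, not the original function:

```r
library(data.table)
# stand-in for an expensive per-element function: reverse each string
f <- function(x) vapply(x, function(s)
  paste(rev(strsplit(s, "")[[1L]]), collapse = ""), character(1L))

dt <- data.table(x = sample(state.name, 1e6, replace = TRUE))

dt[, y1 := f(x)]                  # direct: works through all 1M rows
dt[, y2 := f(.BY[[1L]]), by = x]  # the grouping trick
lk <- data.table(x = unique(dt$x))
lk[, y := f(x)]
dt[lk, on = "x", y3 := i.y]       # lookup + update join
```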
Notice that the lookup technique is nearly 6 times faster than the other two approaches. In my real-world example, the time difference is even more substantial.
FWIW, you might want to try converting to factor and processing on `levels(x)`:
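A minimal sketch of that suggestion (`f` is a stand-in for an expensive function):

```r
f  <- toupper                        # stand-in for an expensive function
x  <- sample(month.name, 1e6, replace = TRUE)
fx <- factor(x)
# f runs once per level; indexing by the integer codes expands the
# results back to full length
y  <- f(levels(fx))[as.integer(fx)]
```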
When I have a large dataset with a date column read in as a character, I can convert it to Date type like so:
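A sketch of the direct conversion, assuming a character column named `DateChar`:

```r
library(data.table)
dt <- data.table(DateChar = rep(c("2018-01-30", "2018-01-31"), 5e6))
dt[, Date := as.Date(DateChar)]  # parses the string on every single row
```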
But this is painfully slow. So I usually build a table of the unique date characters, convert those to Date type, and then do a join + insertion back into my original table:
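Continuing the sketch above, the lookup version might look like:

```r
# dictionary of unique date strings, each converted exactly once
lookup <- data.table(DateChar = unique(dt$DateChar))
lookup[, Date := as.Date(DateChar)]
# update join: copy the parsed dates back into the original table
dt[lookup, on = "DateChar", Date := i.Date]
```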
This results in an enormous speedup. I was wondering if similar logic could be embedded in data.table to do this internally where it's appropriate. I haven't given it a ton of thought, but here are some basic points:

- This could apply for functions like `day()`, `weekday()`, ... (I don't think you could generalize this behavior for any function. Imagine trying to apply this technique when the user does something like `dt[, foo := myfunc(foo)]` where `myfunc <- function(x) ifelse(x < 10, x, rnorm(n = 1))`; since `myfunc` is not deterministic, computing it once per unique value and broadcasting the results would not reproduce the row-by-row behavior. With that said, having a curated set of functions where this technique can be applied would still be greatly helpful; a user-level sketch of the general pattern follows below.)

Obviously I haven't fleshed out all the details of how this would work, but hopefully my idea is clear. The performance boost of this technique would be incredibly helpful to my workflow (after all, the reason I use data.table over dplyr and pandas is its superior performance). Searching through the issues, @MichaelChirico touched on this a bit in #2503.
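For deterministic, vectorised functions, the whole technique fits in a small user-level helper; `map_unique` below is a hypothetical name, not existing data.table API:

```r
map_unique <- function(x, f, ...) {
  ux <- unique(x)
  f(ux, ...)[match(x, ux)]  # compute once per unique value, broadcast back
}

# usage, e.g.: dt[, Date := map_unique(DateChar, as.Date)]
```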