Speed up key lookup in GroupedDataFrame #2095
Yes, that linear search is really bad for performance. Actually we already have a hash table structure, but currently we don't store it.
This is what I thought 😄 (judging by the logic of what we do), but I just do not have enough experience with this part of the code base to be sure how exactly this can be achieved.
This is exactly what I thought. Summing up: do you have time to make a PR for this? (If not, I can have a look.) Thank you!
One subtlety I hadn't spotted is that we have specialized methods for categorical and pooled arrays, which don't create the hash table. And creating it would be quite wasteful when it's not used. So we should probably do the same as for group indices: leave the field unset and only fill it when needed. I probably won't have the time soon, so go ahead.
Agreed, but only in the case where the hash table is not created (if the hash table is created, then I understand it is best to create the mapping immediately). Right?
Yes, basically we should just keep the table instead of throwing it away. Though currently we take a very memory-hungry approach by allocating as many slots as rows. That avoids having to resize the table, but if we keep the structure around it will be a bit wasteful (very much so if you have only ten groups...).
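To make the memory point concrete, here is a tiny back-of-the-envelope illustration (in Python purely for the arithmetic; the numbers are made up):

```python
# Illustration of the memory concern above: a grouping table sized to the
# number of rows wastes almost all of its slots when there are few groups.
nrows, ngroups = 1_000_000, 10

slots_row_sized = nrows    # one slot per row, so the table never needs resizing
slots_compact = ngroups    # what a rebuilt, compact dict would actually need

waste_factor = slots_row_sized / slots_compact
assert waste_factor == 100_000  # 100,000x more slots than groups
```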
I have not analyzed it in detail (and that is why I asked if you could implement this PR 😄), but maybe in the future it could be optimized. If not, maybe there is an efficient way to "compress" this structure. Another alternative is to create a separate structure later, on demand (if we go that way - i.e. by creating a separate structure later - I can implement it; if we go the first way - I think you have more understanding of the details to do it correctly).
I guess it really depends on the use case. With a small number of groups compared to the number of rows, recomputing is OK. For a large number of groups, it will take about as much time as the grouping operation itself. But in the latter case, keeping the hash table that was built for grouping is OK, since it's not too wasteful. So maybe we could apply a threshold, and discard the hash table if the number of groups is too small; for a small number of groups, building the dictionary on the first lookup is cheap anyway.
Sounds good 😄. What threshold do you think is reasonable? 10%? Also, there should be an option to opt out from storing it.
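As a hedged sketch of the threshold-plus-lazy-rebuild idea discussed above (in Python rather than Julia, with all names hypothetical, and 10% used only because it was floated as a possible value):

```python
THRESHOLD = 0.1  # keep the grouping table only if groups >= 10% of rows

class GroupedData:
    """Hypothetical stand-in for a GroupedDataFrame's key-lookup state."""

    def __init__(self, group_keys, nrows, grouping_table=None):
        self.group_keys = group_keys          # one key tuple per group
        ngroups = len(group_keys)
        if grouping_table is not None and ngroups / nrows >= THRESHOLD:
            # Many groups: the grouping-time table is not too wasteful, reuse it.
            self._keymap = grouping_table
        else:
            # Few groups: discard it; rebuild a compact dict lazily.
            self._keymap = None

    def group_index(self, key):
        if self._keymap is None:
            # First lookup pays the (cheap, for few groups) build cost once.
            self._keymap = {k: i for i, k in enumerate(self.group_keys)}
        return self._keymap[key]

gd = GroupedData([("a",), ("b",)], nrows=1000)
assert gd.group_index(("b",)) == 1   # built lazily, then O(1)
```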
I have looked at this.
A possible alternative is to store it.
I have done some tests. There are two possible approaches; the difference between them is only in the signature of one method.
I am not sure which approach would be better.
I agree that the lookup time probably matters more than the construction time, especially if we build a new dict on the first indexing (instead of reusing the existing hash table). Though I wonder whether reusing our hash table couldn't give us the best of both worlds: it's just a few vectors whose types are always the same, and since we control the methods we could reuse them directly.
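To illustrate why reusing the grouping hash table could keep lookup simple: such a table can be just a flat vector mapping slots to group indices, probed with linear open addressing. A minimal Python sketch under stated assumptions (hypothetical names, hashable key tuples, and a table with free slots so probing terminates):

```python
def build_slots(group_keys, nslots):
    """Open-addressing table: slots[i] holds a group index, or -1 if empty."""
    slots = [-1] * nslots
    for g, key in enumerate(group_keys):
        i = hash(key) % nslots
        while slots[i] != -1:          # linear probing on collision
            i = (i + 1) % nslots
        slots[i] = g
    return slots

def lookup(slots, group_keys, key):
    """Find the group index for `key`, or raise KeyError."""
    nslots = len(slots)
    i = hash(key) % nslots
    while slots[i] != -1:
        g = slots[i]
        if group_keys[g] == key:       # confirm: hashes can collide
            return g
        i = (i + 1) % nslots
    raise KeyError(key)

keys = [("a", 1), ("b", 2), ("c", 3)]
slots = build_slots(keys, 8)
assert lookup(slots, keys, ("b", 2)) == 1
```

The point of the sketch is that the lookup needs only plain vectors of fixed type, so keeping the structure around does not force a new specialized container type per key type.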
Why would we have compilation issues? We already specialize grouping methods on the type of grouping columns (assuming the number of different types will be small). Specializing a small lookup function should be OK too.
When #2046 is merged I think we should redesign https://github.com/JuliaData/DataFrames.jl/pull/2046/files#diff-6349421a054bb7e74b7f27ae86304cbaR275 (but I did not want to postpone merging of that PR).
What I mean is that on the first call to this function (or maybe even earlier, when we call `compute_indices`) we should build a dictionary from `Tuple`s to integer indices in a `GroupedDataFrame`. Then the lookup of a `Tuple` would be very fast if done repeatedly.

If we did this we could actually say that the way to add an index to a `DataFrame`, which people very often ask for, is to simply run `groupby` on it. Then you would have a very fast lookup of the index, which is a common operation.

@nalimilan - do you have any thoughts about it (i.e. if we want it - as it would eat up some memory + what would be the best way to implement it - that is: at what moment it would be best to populate such a dictionary)?
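The core of the proposal can be sketched in a few lines (Python for illustration; the helper name is hypothetical, and in DataFrames.jl the key tuples would come from the grouping columns):

```python
def build_keymap(group_keys):
    """Map each group's key tuple to its integer index, in group order."""
    return {key: i for i, key in enumerate(group_keys)}

keymap = build_keymap([("a", 1), ("a", 2), ("b", 1)])
assert keymap[("a", 2)] == 1  # O(1) hash lookup instead of a linear scan
```

Once built, repeated key lookups cost one hash probe each, which is what makes `groupby` usable as a de facto index.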