
Speed up key lookup in GroupedDataFrame #2095

Closed
bkamins opened this issue Jan 31, 2020 · 11 comments · Fixed by #2132
Labels
non-breaking The proposed change is not breaking
performance

Comments

@bkamins
Member

bkamins commented Jan 31, 2020

When #2046 is merged I think we should redesign https://github.com/JuliaData/DataFrames.jl/pull/2046/files#diff-6349421a054bb7e74b7f27ae86304cbaR275 (but I did not want to postpone merging of that PR).

What I mean is that on the first call to this function (or maybe even earlier, when we call compute_indices) we should build a dictionary from Tuples to integer indices in a GroupedDataFrame. Repeated lookups by Tuple would then be very fast.

If we did this, we could actually say that the way to add an index to a DataFrame, which people very often ask for, is simply to run groupby on it. You would then have very fast lookup by index, which is a common operation.

@nalimilan - do you have any thoughts about it? That is: do we want it at all (it would eat up some memory), and what would be the best way to implement it (at what moment would it be best to populate such a dictionary)?

@nalimilan
Member

Yes, that linear search is really bad for performance.

Actually we already have a hash table structure: the gslots vector returned by row_group_slots allows mapping a hash to the index of the first row in the group, and the groups vector allows getting the index of the group corresponding to that row.

But currently we don't store it in GroupedDataFrame as it's not used. Keeping that information would use a significant amount of memory only when the number of groups is large, so maybe that's OK. We could add an option to discard it.
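The lookup path these two vectors enable can be sketched as follows. This is a hypothetical illustration: the names gslots, groups, and keyof are illustrative stand-ins, not the actual DataFrames.jl internals.

```julia
# gslots maps a hash slot to the index of the first row seen with that
# hash (0 = empty slot); groups maps a row index to its group index.
function group_of_key(gslots::Vector{Int}, groups::Vector{Int},
                      key, keyof, h::UInt)
    sz = length(gslots)                    # assumed to be a power of two
    slot = Int(h & UInt(sz - 1)) + 1
    while true
        row = gslots[slot]
        row == 0 && return 0               # empty slot: key not found
        keyof(row) == key && return groups[row]
        slot = slot == sz ? 1 : slot + 1   # linear probing on collision
    end
end

# Toy data: five rows, three groups, and a hand-built 8-slot table.
rowkeys = ["a", "b", "a", "b", "c"]
groups  = [1, 2, 1, 2, 3]
gslots  = zeros(Int, 8)
for row in eachindex(rowkeys)
    slot = Int(hash(rowkeys[row]) & UInt(7)) + 1
    while gslots[slot] != 0 && rowkeys[gslots[slot]] != rowkeys[row]
        slot = slot == 8 ? 1 : slot + 1
    end
    gslots[slot] == 0 && (gslots[slot] = row)
end
group_of_key(gslots, groups, "b", r -> rowkeys[r], hash("b"))  # returns 2
```

The key point is that no new structure is needed at lookup time: hashing the probe key and walking gslots reuses work already done during grouping.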

@bkamins
Member Author

bkamins commented Feb 1, 2020

Actually we already have a hash table structure

This is what I thought 😄 (judging by the logic of what we do), but I just do not have enough experience with this part of code base to be sure how exactly this can be achieved.

We could add an option to discard it.

This is exactly what I thought. If someone uses by, we do not need to store it, as the GroupedDataFrame gets discarded immediately anyway.

Summing up: do you have time to make a PR for this? (If not, I can have a look.) Thank you!

@nalimilan
Member

One subtlety I hadn't spotted is that we have specialized methods for categorical and pooled arrays, which don't create the hash table. And creating it would be quite wasteful when it's not used. So we should probably do the same as for group indices: set it to nothing, and only create it on first access.

I probably won't have the time soon, so go ahead.
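The "set it to nothing, create on first access" pattern could look roughly like this, using a simplified stand-in for GroupedDataFrame; all names here are illustrative, not the real API.

```julia
mutable struct LazyGroups
    keys::Vector{String}                     # one key per group (stands in for key columns)
    keymap::Union{Nothing, Dict{Any, Int}}   # not built until first lookup
end

LazyGroups(keys) = LazyGroups(keys, nothing)

function keymap!(g::LazyGroups)
    if g.keymap === nothing                  # build lazily, exactly once
        d = Dict{Any, Int}()
        for (i, k) in enumerate(g.keys)
            d[k] = i
        end
        g.keymap = d
    end
    return g.keymap
end

# returns the group index for key, or 0 if there is no such group
group_index(g::LazyGroups, key) = get(keymap!(g), key, 0)
```

With this design the specialized categorical/pooled paths pay nothing unless a lookup actually happens.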

@bkamins
Member Author

bkamins commented Feb 2, 2020

Agreed, but only in the case where the hash table is not created (if the hash table is created, then I understand it is best to create the mapping immediately). Right?

@nalimilan
Member

Yes, basically we should just keep the table instead of throwing it away. Though currently we take a very memory-hungry approach by allocating as many slots as rows. That avoids having to resize the table, but if we keep the structure around it will be a bit wasteful (very much so if you have only ten groups...).

@bkamins
Member Author

bkamins commented Feb 3, 2020

I have not analyzed it in detail (and that is why I asked if you could implement this PR 😄), but:

Though currently we take a very memory-hungry approach by allocating as many slots as rows.

Maybe in the future it could be optimized. If not - maybe there is an efficient way to "compress" this structure.

Another alternative is to create this structure as a Dict the first time the user asks for a group using a Tuple, as you mentioned, I think. This would be a bit wasteful (as we would repeat work already done), but I guess it should be fast to create (one just needs to iterate once over the starts field). On my laptop, creating such a structure for 10,000,000 groups takes just over 1 second (and I would assume this is the practical upper bound). E.g. for 1000 groups the time is not noticeable.

(If we went the latter way - i.e. creating a separate structure later - I can implement it; if we go the first way, I think you have a better understanding of the details to do it correctly.)
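A rough way to sanity-check the construction-time estimate above is a sketch like the following; the helper name and the one-Int-key-per-group setup are assumptions made for the sake of the example.

```julia
# Build a tuple-key -> group-index mapping, one entry per group.
function build_index(keys::Vector{Int})
    d = Dict{Tuple{Int}, Int}()
    sizehint!(d, length(keys))   # avoid rehashing while filling
    for (i, k) in enumerate(keys)
        d[(k,)] = i              # tuple key -> group index
    end
    return d
end

build_index(collect(1:10))       # warm up to exclude compilation time
@time build_index(collect(1:1_000_000))
```

Scaled to 10^7 groups on comparable hardware, this is roughly where the one-second figure above comes from; actual times will vary with key types and machine.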

@nalimilan
Member

I guess it really depends on the use case. With a small number of groups compared to number of rows, recomputing is OK. For a large number of groups, it will take about as much time as the grouping operation itself. But in the latter case, keeping the hash table that was built for grouping is OK, since it's not too wasteful. So maybe we could apply a threshold, and discard the hash table if the number of groups is too small. For a small number of groups, on the first getindex call, we would reconstruct the hash table by basically calling row_group_slots on the subset of first rows for each group, and store it.
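The threshold policy described above could be sketched as a single predicate; the function name and the default cutoff are placeholders, not an agreed design.

```julia
# Decide whether to keep the grouping hash table after groupby.
function keep_grouping_hashtable(ngroups::Int, nrows::Int;
                                 threshold::Float64 = 0.1)
    # With many groups relative to rows, rebuilding the table later would
    # cost about as much as the grouping itself, so keep it; with few
    # groups, discard it and rebuild a small one on first getindex.
    return ngroups >= threshold * nrows
end

keep_grouping_hashtable(500_000, 1_000_000)   # returns true: keep it
keep_grouping_hashtable(10, 1_000_000)        # returns false: rebuild lazily
</imports_placeholder>
```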

@bkamins
Member Author

bkamins commented Feb 5, 2020

Sounds good 😄.

What threshold do you think is reasonable? 10%?

Also, there should be an option to opt out of storing it (notably, in by we do not need to store it). So this 10% (or whatever other value) should be a keyword argument of groupby.

@bkamins
Member Author

bkamins commented Feb 25, 2020

I have looked at the gslots data structure and I am not fully clear on how we could use it.
If I understand correctly, we would have to:

  1. compute the hash of the tuple the user wants to look up, using the algorithm from hashrows but for a single value only
  2. use gslots to locate the row
  3. check if the row is a hit or a miss (this will be type unstable now - we could fix it if we stored key columns in a type-stable way in GroupedDataFrame, but that would mean compilation issues for all operations on GroupedDataFrame)
  4. if there is a conflict, repeat the process as in the row_group_slots implementation

A possible alternative is to store a Dict{Any, Int} (to avoid constant recompilation) that would be populated on the first call of getindex on a GroupedDataFrame with a tuple (it would store mappings from tuples to group indices). The cost - as discussed - is that we would do the job twice if there are very many groups, but maybe this is good enough for now? Populating such a dict should take less than 1 second for 10^7 rows (I will benchmark it if we go this way), so I hope it would be an acceptable latency.

@bkamins
Member Author

bkamins commented Feb 26, 2020

I have done some tests with Dict (as it is not clear to me how to do the same with row_group_slots efficiently). Also, I think it is better to have "something" now, and we can always improve it in the future if we have better ideas.

There are two possible approaches. The first one is:

function genidxmap(gdf)
    _genidxmap(gdf, ntuple(i -> parent(gdf)[!, gdf.cols[i]], length(gdf.cols)))
end

function _genidxmap(gdf, cols)
    # dictionary concretely typed on the eltypes of the grouping columns
    d = Dict{Tuple{eltype.(cols)...}, Int}()
    sizehint!(d, length(gdf.starts))
    for (i, s) in enumerate(gdf.starts)
        # gdf.idx[s] is the first parent row of group i; its grouping-column
        # values form the key of that group
        d[getindex.(cols, gdf.idx[s])] = i
    end
    d
end

and the second one:

function genidxmap2(gdf)
    _genidxmap2(gdf, ntuple(i -> parent(gdf)[!, gdf.cols[i]], length(gdf.cols)))
end

function _genidxmap2(gdf, cols)
    # same as above, but with an abstractly-keyed dictionary
    d = Dict{Any, Int}()
    sizehint!(d, length(gdf.starts))
    for (i, s) in enumerate(gdf.starts)
        d[getindex.(cols, gdf.idx[s])] = i
    end
    d
end

The only difference is the type of the d dictionary. Here are the consequences:

  • creating the mapping with genidxmap is ~5x faster than with genidxmap2
  • later indexing is around 3x faster with genidxmap2 than with genidxmap (because with genidxmap we have one dynamic dispatch) - the reason is that Dict{Any, Int} can be stored in GroupedDataFrame as a concrete type, while Dict{Tuple{eltype.(cols)...}, Int} has to be stored as an abstract type, as I am reluctant to put the eltypes of the grouping columns into the signature of GroupedDataFrame, to avoid recompilation.

I am not sure which approach would be better. I would tend to prefer Dict{Any, Int} as I expect that if someone wants to use indexing then it is crucial that later the indexing operation is fast and creation of the index can be slower.

In particular - if there are not many groups, then Dict{Any, Int} should surely be preferred, I think.
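The storage trade-off behind this can be shown in isolation: a Dict{Any, Int} field is concretely typed, while a tuple-keyed Dict has to sit behind an abstract field type if we refuse to parameterize the wrapper on the key eltypes. The struct names below are illustrative, not proposed API.

```julia
struct ConcreteMap
    d::Dict{Any, Int}    # concrete field type: no dynamic dispatch on access
end

struct AbstractMap
    d::Dict              # abstract field type: every lookup is a dynamic dispatch
end

# returns the stored group index, or 0 if the key is absent
lookup(m, key) = get(m.d, key, 0)

cm = ConcreteMap(Dict{Any, Int}(("a", 1) => 7))
am = AbstractMap(Dict{Tuple{String, Int}, Int}(("a", 1) => 7))
lookup(cm, ("a", 1)), lookup(am, ("a", 1))   # both return 7
```

Both give the same answers; they differ only in whether the compiler can see the dictionary's type through the struct field, which is exactly the ~3x indexing gap measured above.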

@nalimilan
Member

I agree that the lookup time probably matters more than the construction time, especially if we build a new dict on the first indexing (instead of reusing the existing hash table).

Though I wonder whether reusing our hash table couldn't give us the best of both worlds. It's just a few vectors whose types are always the same, and since we control the methods we could use @nospecialize where needed (i.e. on indexing), while still benefiting from type stability when building the table (as we do now, but we could also rebuild a smaller one on the first indexing if we discarded it after grouping because it was too large). Hence my question:

  • check if the row is hit or miss (this will be type unstable now - we could fix it if we stored key columns as type stable in GroupedDataFrame, but this means that we would have compilation issues for all operations on GroupedDataFrame)

Why would we have compilation issues? We already specialize grouping methods on the type of grouping columns (assuming the number of different types will be small). Specializing a small lookup function should be OK too.

At any rate, I admit the Dict solution is the simplest and probably good enough (if you don't have a lot of groups), so if you want to implement that go ahead.
