
Speed up key lookup in GroupedDataFrame #2095

Closed
bkamins opened this issue Jan 31, 2020 · 11 comments · Fixed by #2132
Labels
non-breaking The proposed change is not breaking
performance

Comments

@bkamins
Member

bkamins commented Jan 31, 2020

When #2046 is merged I think we should redesign https://github.com/JuliaData/DataFrames.jl/pull/2046/files#diff-6349421a054bb7e74b7f27ae86304cbaR275 (but I did not want to postpone merging of that PR).

What I mean is that on the first call to this function (or maybe even earlier, when we call compute_indices) we should build a dictionary from Tuples to integer indices in a GroupedDataFrame. Repeated lookups by Tuple would then be very fast.

If we did this, we could actually say that the way to add an index to a DataFrame, which people very often ask for, is simply to run groupby on it. You would then have very fast lookup by index, which is a common operation.

@nalimilan - do you have any thoughts about it? That is: do we want it at all (it would eat up some memory), and what would be the best way to implement it (at what moment would it be best to populate such a dictionary)?

@nalimilan
Member

Yes, that linear search is really bad for performance.

Actually we already have a hash table structure: the gslots vector returned by row_group_slots allows mapping a hash to the index of the first row in the group, and the groups vector allows getting the index of the group corresponding to that row.

But currently we don't store it in GroupedDataFrame as it's not used. Keeping that information would use a significant amount of memory only when the number of groups is large, so maybe that's OK. We could add an option to discard it.
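The lookup path these two vectors enable can be sketched as follows. This is a hypothetical illustration: the names gslots, groups, and keyof are illustrative stand-ins, not the actual DataFrames.jl internals.

```julia
# gslots maps a hash slot to the index of the first row seen with that
# hash (0 = empty slot); groups maps a row index to its group index.
function group_of_key(gslots::Vector{Int}, groups::Vector{Int},
                      key, keyof, h::UInt)
    sz = length(gslots)                    # assumed to be a power of two
    slot = Int(h & UInt(sz - 1)) + 1
    while true
        row = gslots[slot]
        row == 0 && return 0               # empty slot: key not found
        keyof(row) == key && return groups[row]
        slot = slot == sz ? 1 : slot + 1   # linear probing on collision
    end
end

# Toy data: five rows, three groups, and a hand-built 8-slot table.
rowkeys = ["a", "b", "a", "b", "c"]
groups  = [1, 2, 1, 2, 3]
gslots  = zeros(Int, 8)
for row in eachindex(rowkeys)
    slot = Int(hash(rowkeys[row]) & UInt(7)) + 1
    while gslots[slot] != 0 && rowkeys[gslots[slot]] != rowkeys[row]
        slot = slot == 8 ? 1 : slot + 1
    end
    gslots[slot] == 0 && (gslots[slot] = row)
end
group_of_key(gslots, groups, "b", r -> rowkeys[r], hash("b"))  # returns 2
```

The key point is that no new structure is needed at lookup time: hashing the probe key and walking gslots reuses work already done during grouping.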

@bkamins
Member Author

bkamins commented Feb 1, 2020

Actually we already have a hash table structure

This is what I thought 😄 (judging by the logic of what we do), but I just do not have enough experience with this part of code base to be sure how exactly this can be achieved.

We could add an option to discard it.

This is exactly what I thought. If someone uses by, we do not need to store it, as the GroupedDataFrame gets discarded immediately anyway.

Summing up: do you have time to make a PR for this? (If not, I can have a look.) Thank you!

@nalimilan
Member

One subtlety I hadn't spotted is that we have specialized methods for categorical and pooled arrays, which don't create the hash table. And creating it would be quite wasteful when it's not used. So we should probably do the same as for group indices: set it to nothing, and only create it on first access.

I probably won't have the time soon, so go ahead.
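The "set it to nothing, create on first access" pattern could look roughly like this, using a simplified stand-in for GroupedDataFrame; all names here are illustrative, not the real API.

```julia
mutable struct LazyGroups
    keys::Vector{String}                     # one key per group (stands in for key columns)
    keymap::Union{Nothing, Dict{Any, Int}}   # not built until first lookup
end

LazyGroups(keys) = LazyGroups(keys, nothing)

function keymap!(g::LazyGroups)
    if g.keymap === nothing                  # build lazily, exactly once
        d = Dict{Any, Int}()
        for (i, k) in enumerate(g.keys)
            d[k] = i
        end
        g.keymap = d
    end
    return g.keymap
end

# returns the group index for key, or 0 if there is no such group
group_index(g::LazyGroups, key) = get(keymap!(g), key, 0)
```

With this design the specialized categorical/pooled paths pay nothing unless a lookup actually happens.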

@bkamins
Member Author

bkamins commented Feb 2, 2020

Agreed, but only in the case where the hash table is not created (if the hash table is created, then I understand it is best to create the mapping immediately). Right?

@nalimilan
Member

Yes, basically we should just keep the table instead of throwing it away. Though currently we take a very memory-hungry approach by allocating as many slots as rows. That avoids having to resize the table, but if we keep the structure around it will be a bit wasteful (very much so if you have only ten groups...).

@bkamins
Member Author

bkamins commented Feb 3, 2020

I have not analyzed it in detail (and that is why I asked if you could implement this PR 😄), but:

Though currently we take a very memory-hungry approach by allocating as many slots as rows.

Maybe in the future it could be optimized. If not - maybe there is an efficient way to "compress" this structure.

Another alternative is to create this structure as a Dict the first time the user asks for a group using a Tuple, as you mentioned, I think. This would be a bit wasteful (as we would repeat work already done), but I guess it should be fast to create (one just needs to iterate once over the starts field). On my laptop, creating such a structure for 10,000,000 groups takes just over 1 second (and I would assume this is the practical upper bound). E.g. for 1000 groups the time is not noticeable.

(If we went the latter way - i.e. creating a separate structure later - I can implement it; if we go the first way, I think you have a better understanding of the details to do it correctly.)
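A rough way to sanity-check the construction-time estimate above is a sketch like the following; the helper name and the one-Int-key-per-group setup are assumptions made for the sake of the example.

```julia
# Build a tuple-key -> group-index mapping, one entry per group.
function build_index(keys::Vector{Int})
    d = Dict{Tuple{Int}, Int}()
    sizehint!(d, length(keys))   # avoid rehashing while filling
    for (i, k) in enumerate(keys)
        d[(k,)] = i              # tuple key -> group index
    end
    return d
end

build_index(collect(1:10))       # warm up to exclude compilation time
@time build_index(collect(1:1_000_000))
```

Scaled to 10^7 groups on comparable hardware, this is roughly where the one-second figure above comes from; actual times will vary with key types and machine.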

@nalimilan
Member

I guess it really depends on the use case. With a small number of groups compared to number of rows, recomputing is OK. For a large number of groups, it will take about as much time as the grouping operation itself. But in the latter case, keeping the hash table that was built for grouping is OK, since it's not too wasteful. So maybe we could apply a threshold, and discard the hash table if the number of groups is too small. For a small number of groups, on the first getindex call, we would reconstruct the hash table by basically calling row_group_slots on the subset of first rows for each group, and store it.
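The threshold policy described above could be sketched as a single predicate; the function name and the default cutoff are placeholders, not an agreed design.

```julia
# Decide whether to keep the grouping hash table after groupby.
function keep_grouping_hashtable(ngroups::Int, nrows::Int;
                                 threshold::Float64 = 0.1)
    # With many groups relative to rows, rebuilding the table later would
    # cost about as much as the grouping itself, so keep it; with few
    # groups, discard it and rebuild a small one on first getindex.
    return ngroups >= threshold * nrows
end

keep_grouping_hashtable(500_000, 1_000_000)   # returns true: keep it
keep_grouping_hashtable(10, 1_000_000)        # returns false: rebuild lazily
</imports_placeholder>
```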

@bkamins
Member Author

bkamins commented Feb 5, 2020

Sounds good 😄.

What threshold do you think is reasonable? 10%?

Also, there should be an option to opt out of storing it (notably, in by we do not need to store it). So this 10% (or whatever other value) should be a keyword argument of groupby.

@bkamins
Member Author

bkamins commented Feb 25, 2020

I have looked at the gslots data structure and I am not fully clear on how we could use it.
If I understand correctly, we would have to:

  1. compute the hash of the tuple the user wants to look up, using the algorithm from hashrows but for a single value only
  2. use gslots to locate the row
  3. check if the row is a hit or a miss (this will be type unstable now - we could fix it if we stored key columns in a type-stable way in GroupedDataFrame, but that would mean compilation issues for all operations on GroupedDataFrame)
  4. if there is a conflict, repeat the process as in the row_group_slots implementation

A possible alternative is to store a Dict{Any, Int} (to avoid constant recompilation) that would be populated on the first call of getindex on a GroupedDataFrame with a tuple (it would store mappings from tuples to group indices). The cost - as discussed - is that we would do the job twice if there are very many groups, but maybe this is good enough for now? Populating such a dict should take less than 1 second for 10^7 rows (I will benchmark it if we go this way), so I hope it would be an acceptable latency.

@bkamins
Member Author

bkamins commented Feb 26, 2020

I have done some tests with Dict (as it is not clear to me how to do the same with row_group_slots efficiently). Also, I think it is better to have "something" now, and we can always improve it in the future if we have better ideas.

There are two possible approaches. The first one is:

function genidxmap(gdf)
    _genidxmap(gdf, ntuple(i -> parent(gdf)[!, gdf.cols[i]], length(gdf.cols)))
end

function _genidxmap(gdf, cols)
    # dictionary concretely typed on the eltypes of the grouping columns
    d = Dict{Tuple{eltype.(cols)...}, Int}()
    sizehint!(d, length(gdf.starts))
    for (i, s) in enumerate(gdf.starts)
        # gdf.idx[s] is the first parent row of group i; its grouping-column
        # values form the key of that group
        d[getindex.(cols, gdf.idx[s])] = i
    end
    d
end

and the second one:

function genidxmap2(gdf)
    _genidxmap2(gdf, ntuple(i -> parent(gdf)[!, gdf.cols[i]], length(gdf.cols)))
end

function _genidxmap2(gdf, cols)
    # same as above, but with an abstractly-keyed dictionary
    d = Dict{Any, Int}()
    sizehint!(d, length(gdf.starts))
    for (i, s) in enumerate(gdf.starts)
        d[getindex.(cols, gdf.idx[s])] = i
    end
    d
end

The only difference is the type of the d dictionary. Here are the consequences:

  • creating the mapping with genidxmap is ~5x faster than with genidxmap2
  • later indexing is around 3x faster with genidxmap2 than with genidxmap (because with genidxmap we have one dynamic dispatch) - the reason is that Dict{Any, Int} can be stored in GroupedDataFrame as a concrete type, while Dict{Tuple{eltype.(cols)...}, Int} has to be stored as an abstract type, as I am reluctant to put the eltypes of the grouping columns into the signature of GroupedDataFrame, to avoid recompilation.

I am not sure which approach would be better. I would tend to prefer Dict{Any, Int} as I expect that if someone wants to use indexing then it is crucial that later the indexing operation is fast and creation of the index can be slower.

In particular - if there are not many groups, then Dict{Any, Int} should surely be preferred, I think.
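The storage trade-off behind this can be shown in isolation: a Dict{Any, Int} field is concretely typed, while a tuple-keyed Dict has to sit behind an abstract field type if we refuse to parameterize the wrapper on the key eltypes. The struct names below are illustrative, not proposed API.

```julia
struct ConcreteMap
    d::Dict{Any, Int}    # concrete field type: no dynamic dispatch on access
end

struct AbstractMap
    d::Dict              # abstract field type: every lookup is a dynamic dispatch
end

# returns the stored group index, or 0 if the key is absent
lookup(m, key) = get(m.d, key, 0)

cm = ConcreteMap(Dict{Any, Int}(("a", 1) => 7))
am = AbstractMap(Dict{Tuple{String, Int}, Int}(("a", 1) => 7))
lookup(cm, ("a", 1)), lookup(am, ("a", 1))   # both return 7
```

Both give the same answers; they differ only in whether the compiler can see the dictionary's type through the struct field, which is exactly the ~3x indexing gap measured above.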

@nalimilan
Member

I agree that the lookup time probably matters more than the construction time, especially if we build a new dict on the first indexing (instead of reusing the existing hash table).

Though I wonder whether reusing our hash table couldn't give us the best of both worlds. It's just a few vectors whose types are always the same, and since we control the methods we could use @nospecialize where needed (i.e. on indexing), while still benefiting from type stability when building the table (as we do now, but we could also rebuild a smaller one on the first indexing if we discarded it after grouping because it was too large). Hence my question:

  • check if the row is hit or miss (this will be type unstable now - we could fix it if we stored key columns as type stable in GroupedDataFrame, but this means that we would have compilation issues for all operations on GroupedDataFrame)

Why would we have compilation issues? We already specialize grouping methods on the type of grouping columns (assuming the number of different types will be small). Specializing a small lookup function should be OK too.

At any rate, I admit the Dict solution is the simplest and probably good enough (if you don't have a lot of groups), so if you want to implement that go ahead.
