groupby performance #2162

sk2 · 2022-08-08T08:09:48Z

Hi, I have a number of large datasets.

Grouping on two columns at once appears to be much slower than grouping on each column individually.
I am following the Pandas-style example:

grouped = df.groupby(["class", "order"])

I saw the groupby code was added later, and perhaps I am pushing it further than is typically the case - as I didn't see any examples of this in the documentation.
Is there a better approach to tackling this problem?
Would converting to categorical improve performance?

Thanks!

sk2 · 2022-08-08T15:42:48Z

Author

I think the groupby result needs to be kept in memory, hence the performance hit.
I was able to make it workable by first doing a simple groupby and then analysing the summary statistics from this aggregation. I could then filter on the values that met a certain threshold, and then do the full (multi-column) groupby.

The filtering step reduces the search space so that the multi column group by is manageable.
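The workaround described above can be sketched as follows. This is only an illustration written with pandas on toy data, not the code actually used in this thread; the threshold `MIN_ROWS` and the sample values are made up, and the same three stages would need to be adapted to the vaex API.

```python
import pandas as pd

# Toy data standing in for the large dataset; the column names follow the
# groupby(["class", "order"]) call from the first post.
df = pd.DataFrame({
    "class": ["A", "A", "A", "B", "C", "A", "B"],
    "order": ["x", "y", "x", "x", "z", "y", "x"],
})

MIN_ROWS = 2  # hypothetical threshold, chosen only for this example

# Stage 1: cheap single-column groupby to get per-class summary statistics.
class_counts = df.groupby("class").size()

# Stage 2: keep only the rows whose class meets the threshold.
keep = class_counts[class_counts >= MIN_ROWS].index
filtered = df[df["class"].isin(keep)]

# Stage 3: the multi-column groupby now runs on the reduced data.
grouped = filtered.groupby(["class", "order"]).size()
print(grouped)
```

The saving comes from stage 2: every class dropped there removes all of its (class, order) combinations from the final aggregation.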

3 replies

JovanVeljanoski Aug 8, 2022
Maintainer

Indeed - while the computations are out of core, the result of the groupby is an in-memory dataframe. So you should make sure you have enough RAM to hold the output of the groupby aggregation.

Then there is the question of sparseness: by default vaex uses a heuristic to decide whether to compute the full Cartesian product of the groupby columns or only the combinations for which data is actually present. You might want to play with that parameter depending on your use case or data.
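To see why this matters for memory, here is a small plain-Python illustration (toy data, not vaex code) comparing the number of cells in the full Cartesian product with the number of combinations that actually occur. The parameter mentioned above is not named in this reply; in the vaex versions I am aware of it is presumably the `assume_sparse` argument of `groupby`, but treat that name as an assumption and check the signature in your installed version.

```python
from itertools import product

# Toy (class, order) pairs standing in for two groupby columns.
rows = [("A", "x"), ("A", "y"), ("B", "x"), ("C", "z"), ("A", "x")]

classes = sorted({c for c, _ in rows})
orders = sorted({o for _, o in rows})

# Dense layout: one output cell per element of the Cartesian product.
dense_cells = len(list(product(classes, orders)))

# Sparse layout: one output row per combination that is actually present.
observed_cells = len(set(rows))

print(dense_cells, observed_cells)
```

With many high-cardinality columns the dense count grows as the product of the cardinalities, while the observed count can never exceed the number of rows.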

sk2 Aug 10, 2022
Author

Thanks! I’ll have another look into it.

I’ll also have a shot at creating the hashes as virtual columns and then grouping on these, to see if that performs any differently.

A somewhat related question (happy to open a new thread if it helps keep things clearer): can I do a group by on one statistic (such as city) and then a binby within these?
Essentially combining the two. The alternative would be to calculate the bins as a virtual column (eg modulo division to a 10 min period) and then do a normal group by on these.
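The modulo idea can be sketched like this in plain Python. The city names, timestamps and the helper `bin_start` are invented for illustration; on a dataframe the same arithmetic would be written as a column expression rather than a Python loop.

```python
from collections import Counter

PERIOD = 600  # 10 minutes, in seconds

# Toy (city, unix timestamp in seconds) records.
events = [
    ("Oslo", 1000),
    ("Oslo", 1150),
    ("Oslo", 1250),
    ("Rome", 1199),
    ("Rome", 1800),
]

def bin_start(ts, period=PERIOD):
    """Round a timestamp down to the start of its period."""
    return ts - ts % period

# Grouping on (city, bin_start) is then an ordinary two-key group by.
counts = Counter((city, bin_start(ts)) for city, ts in events)
print(counts)
```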

JovanVeljanoski Aug 11, 2022
Maintainer

A somewhat related question (happy to open a new thread if it helps keep things clearer): can I do a group by on one statistic (such as city) and then a binby within these?

I would just ordinal_encode the non-numeric column, and then you can do binby as usual. Vaex recognizes if a column has been encoded, and the bins for that column (or those columns) will be set automatically, one per value - kind of like groupby already. Check it out:

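import vaex

df = vaex.datasets.titanic()
# Encode the string column as integers so it can be used as a binning axis.
df = df.ordinal_encode('embarked')
# Counts on a 2-D grid: 'embarked' (encoded) by 'age' (10 bins).
counts = df.count(binby=['embarked', 'age'], shape=10)
# The same counts, returned as a labelled xarray object.
result = df.count(binby=['embarked', 'age'], shape=10, array_type='xarray')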

sk2 · 2022-08-11T06:26:25Z

Author

I am trying to pre-calculate the hash, so that groupby only needs to work on one column rather than ten, but am having challenges vectorising the following:

def hashed(a, b, c, d):
    return hash(a + b + c + d)

df['hashed'] = hashed(df.a, df.b, df.c, df.d)

I get

TypeError: unhashable type: 'Expression'

I've looked into the vaex.hash code but couldn't see an easy way to use it for this purpose.
Is there a way to hash across multiple columns?
Thanks!

2 replies

JovanVeljanoski Aug 11, 2022
Maintainer

Anything to do with hashing is currently not exposed for external use in any way.

If you want to do anything with hashing, I am afraid that for the time being you'll have to handle it externally (see register_function or apply).

Maybe you can just create a single string out of the 4 columns, and ordinal_encode that?
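A minimal plain-Python sketch of that idea, on made-up rows: join the four values into one string key, then map each distinct key to an integer code. The separator `SEP` is my own addition and not part of the suggestion above; without one, different rows such as ("ab", "c") and ("a", "bc") would collapse into the same key. How the concatenation is written as a vaex expression is left out here.

```python
# Toy rows standing in for the columns a, b, c, d.
rows = [
    ("ab", "c", "d", "e"),
    ("a", "bc", "d", "e"),
    ("ab", "c", "d", "e"),
]

# Assumption: this character never occurs in the data itself.
SEP = "\x1f"

# Plain concatenation merges distinct rows into the same key.
naive_keys = ["".join(r) for r in rows]

# Joining with a separator keeps distinct rows distinct.
keys = [SEP.join(r) for r in rows]

# Ordinal encoding: each new key gets the next integer code.
codes = {}
encoded = [codes.setdefault(k, len(codes)) for k in keys]
print(encoded)
```

The resulting integer column plays the role of the single grouping key that the hash was meant to provide.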

sk2 Aug 12, 2022
Author

Thanks, I’ll give that a try
