GROUPING SETS feature in Polars #7948
Comments
Are you looking for window functions?
Using the example from the postgres docs:
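For context, a minimal sketch of the "DuckDB method" that the reply below refers to, querying a Polars DataFrame through DuckDB's GROUPING SETS support (the table and columns here are illustrative, not from the original comment):

import duckdb
import polars as pl

# illustrative data, not from the issue
df = pl.DataFrame({
    "brand": ["Foo", "Foo", "Bar", "Bar"],
    "size": ["L", "M", "L", "M"],
    "sales": [10, 20, 15, 5],
})

# DuckDB can scan the in-scope Polars frame by name and return a Polars result
duckdb.sql("""
    SELECT brand, size, SUM(sales) AS sales
    FROM df
    GROUP BY GROUPING SETS ((brand, size), (brand), (size), ())
""").pl()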
No to the first: I'm not looking for window functions. Yes to the second: the DuckDB method is what I ended up using on the Polars DataFrame that I wanted aggregated. I was asking whether Polars could implement this feature natively, something like this:
To produce a table like:
The grouping sets API could also have options for performing either a ROLLUP or CUBE variant of the GROUPING SETS function.
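Purely as an illustration of the requested call shape (the group_by_sets method and its variant argument are hypothetical and do not exist in Polars; the column names are borrowed from the example data used later in this thread):

# hypothetical sketch only -- not an existing Polars API
result = df.group_by_sets(
    [["center_id", "location", "service_type"], ["location"], []],
    variant="grouping_sets",  # or "rollup" / "cube"
).agg(
    pl.col("cpu_usage").mean(),
    pl.col("mem_usage").mean(),
)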
I was curious to see if it was possible to create such a table with current polars syntax:
An attempt at a general approach:
Code:
I believe you need to concat. They are described as unions over group-by aggregations, if I am not mistaken.
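A minimal sketch of that concat approach (the column names are assumed for illustration; how="diagonal" fills the missing grouping columns with nulls, mirroring SQL GROUPING SETS):

import polars as pl

df = pl.DataFrame({
    "location": ["East", "West", "East", "West"],
    "service_type": ["Web Hosting", "Cloud Storage", "Web Hosting", "Data Processing"],
    "cpu_usage": [50.0, 60.0, 40.0, 55.0],
})

pl.concat(
    [
        df.group_by("location", "service_type").agg(pl.col("cpu_usage").mean()),
        df.group_by("location").agg(pl.col("cpu_usage").mean()),
        df.select(pl.col("cpu_usage").mean()),  # the empty grouping set (grand total)
    ],
    how="diagonal",
)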
Yes, as coded by @avimallu in the first post. Grouping sets are simply concatenated sets of group-by queries meant to calculate aggregations on measures for different combinations of dimensions (grouping levels).
I fully agree with @avimallu. GROUP BY CUBE, ROLLUP and GROUPING SETS are common in SQL and also in PySpark. I also miss them in Polars, and a workaround requires several lines of code, as @cmdlineluser showed above. Please implement. 👍
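For comparison, a minimal sketch of the PySpark equivalents mentioned here (the data is illustrative):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [("Center_0", "East", 50.0), ("Center_1", "West", 60.0)],
    ["center_id", "location", "cpu_usage"],
)

# PySpark exposes rollup/cube directly on the DataFrame
sdf.rollup("center_id", "location").agg(F.avg("cpu_usage")).show()
sdf.cube("center_id", "location").agg(F.avg("cpu_usage")).show()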
The combinatoric groups can be created at the Python level quite readily:

import polars as pl
from itertools import islice, combinations
from pandas import Timestamp, to_timedelta
from numpy.random import default_rng
rng = default_rng(0)
center_ids = [f"Center_{i}" for i in range(2)]
locations = ["East", "West"]
service_types = ["Web Hosting", "Data Processing", "Cloud Storage"]
ldf = pl.DataFrame({
"center_id": rng.choice(center_ids, size=(size := 100)),
"location": rng.choice(locations, size=size),
"service_type": rng.choice(service_types, size=size),
"timestamp": Timestamp.now() - to_timedelta(rng.integers(0, 3_650, size=size), unit='D'),
"cpu_usage": rng.uniform(0, 100, size=size),
"mem_usage": rng.uniform(0, 64, size=size),
}).lazy()
def grouping_sets(ldf, groups, exprs):
    """generic implementation of grouping sets functionality
    Evaluates exprs within each of the passed groups.
    """
    frames = []
    for i, gs in enumerate(groups):
        if not gs:
            query = ldf.select(exprs)
        else:
            query = ldf.group_by(gs).agg(exprs)
        frames.append(
            query.with_columns(
                # columns to track groupings
                groupings=pl.lit(gs),
                group_id=pl.lit(i),
            )
        )
    return pl.concat(frames, how='diagonal')
# combinatoric helpers
def rollup(*items):
    """produce shrinking subsets of items"""
    return (
        [*islice(items, i, None)] for i in range(len(items)+1)
    )
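# note: rollup() above yields suffixes (it drops leading items), whereas SQL's
# ROLLUP conventionally drops trailing items instead: (a, b, c), (a, b), (a), ()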
def cube(*items):
    """reversed version of itertools powerset recipe"""
    for size in range(len(items)+1, -1, -1):
        for combo in combinations(items, size):
            yield [*combo]
print(
grouping_sets( # explicit grouping sets groups
ldf,
groups=[['center_id', 'location', 'service_type'], ['location'], []],
exprs=[pl.col('cpu_usage').mean(), pl.col('mem_usage').mean()]
)
.collect(),
# shape: (15, 7)
# ┌───────────┬──────────┬─────────────────┬───────────┬───────────┬─────────────────────────────────┬──────────┐
# │ center_id ┆ location ┆ service_type ┆ cpu_usage ┆ mem_usage ┆ groupings ┆ group_id │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ list[str] ┆ i32 │
# ╞═══════════╪══════════╪═════════════════╪═══════════╪═══════════╪═════════════════════════════════╪══════════╡
# │ Center_0 ┆ West ┆ Web Hosting ┆ 65.743955 ┆ 40.348164 ┆ ["center_id", "location", "ser… ┆ 0 │
# │ Center_1 ┆ East ┆ Web Hosting ┆ 53.60846 ┆ 27.885183 ┆ ["center_id", "location", "ser… ┆ 0 │
# │ Center_1 ┆ West ┆ Cloud Storage ┆ 67.4061 ┆ 32.278916 ┆ ["center_id", "location", "ser… ┆ 0 │
# │ Center_0 ┆ West ┆ Data Processing ┆ 54.954341 ┆ 21.972752 ┆ ["center_id", "location", "ser… ┆ 0 │
# │ Center_0 ┆ East ┆ Web Hosting ┆ 49.510622 ┆ 31.492153 ┆ ["center_id", "location", "ser… ┆ 0 │
# │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
# │ Center_1 ┆ West ┆ Web Hosting ┆ 49.162421 ┆ 35.982805 ┆ ["center_id", "location", "ser… ┆ 0 │
# │ Center_0 ┆ West ┆ Cloud Storage ┆ 59.67797 ┆ 39.92657 ┆ ["center_id", "location", "ser… ┆ 0 │
# │ null ┆ West ┆ null ┆ 55.707917 ┆ 33.647801 ┆ ["location"] ┆ 1 │
# │ null ┆ East ┆ null ┆ 52.259052 ┆ 30.260479 ┆ ["location"] ┆ 1 │
# │ null ┆ null ┆ null ┆ 54.155928 ┆ 32.123506 ┆ [] ┆ 2 │
# └───────────┴──────────┴─────────────────┴───────────┴───────────┴─────────────────────────────────┴──────────┘
grouping_sets( # rollup pattern
ldf,
groups=rollup('center_id', 'location', 'service_type'),
exprs=[pl.col('cpu_usage').mean(), pl.col('mem_usage').mean()]
)
.collect(),
# shape: (22, 7)
# ┌───────────┬──────────┬─────────────────┬───────────┬───────────┬─────────────────────────────────┬──────────┐
# │ center_id ┆ location ┆ service_type ┆ cpu_usage ┆ mem_usage ┆ groupings ┆ group_id │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ list[str] ┆ i32 │
# ╞═══════════╪══════════╪═════════════════╪═══════════╪═══════════╪═════════════════════════════════╪══════════╡
# │ Center_1 ┆ West ┆ Web Hosting ┆ 49.162421 ┆ 35.982805 ┆ ["center_id", "location", "ser… ┆ 0 │
# │ Center_0 ┆ East ┆ Web Hosting ┆ 49.510622 ┆ 31.492153 ┆ ["center_id", "location", "ser… ┆ 0 │
# │ Center_0 ┆ West ┆ Cloud Storage ┆ 59.67797 ┆ 39.92657 ┆ ["center_id", "location", "ser… ┆ 0 │
# │ Center_1 ┆ East ┆ Data Processing ┆ 52.559742 ┆ 35.97784 ┆ ["center_id", "location", "ser… ┆ 0 │
# │ Center_0 ┆ East ┆ Data Processing ┆ 38.898137 ┆ 31.946376 ┆ ["center_id", "location", "ser… ┆ 0 │
# │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
# │ null ┆ East ┆ Cloud Storage ┆ 54.270371 ┆ 29.515165 ┆ ["location", "service_type"] ┆ 1 │
# │ null ┆ null ┆ Data Processing ┆ 47.754155 ┆ 29.978704 ┆ ["service_type"] ┆ 2 │
# │ null ┆ null ┆ Cloud Storage ┆ 58.605781 ┆ 32.230542 ┆ ["service_type"] ┆ 2 │
# │ null ┆ null ┆ Web Hosting ┆ 54.883262 ┆ 33.790533 ┆ ["service_type"] ┆ 2 │
# │ null ┆ null ┆ null ┆ 54.155928 ┆ 32.123506 ┆ [] ┆ 3 │
# └───────────┴──────────┴─────────────────┴───────────┴───────────┴─────────────────────────────────┴──────────┘
grouping_sets( # cube pattern
ldf,
groups=cube('center_id', 'location', 'service_type'),
exprs=[pl.col('cpu_usage').mean(), pl.col('mem_usage').mean()]
)
.collect(),
# shape: (36, 7)
# ┌───────────┬──────────┬─────────────────┬───────────┬───────────┬─────────────────────────────────┬──────────┐
# │ center_id ┆ location ┆ service_type ┆ cpu_usage ┆ mem_usage ┆ groupings ┆ group_id │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ list[str] ┆ i32 │
# ╞═══════════╪══════════╪═════════════════╪═══════════╪═══════════╪═════════════════════════════════╪══════════╡
# │ Center_1 ┆ East ┆ Cloud Storage ┆ 58.068076 ┆ 26.131858 ┆ ["center_id", "location", "ser… ┆ 0 │
# │ Center_1 ┆ West ┆ Cloud Storage ┆ 67.4061 ┆ 32.278916 ┆ ["center_id", "location", "ser… ┆ 0 │
# │ Center_0 ┆ East ┆ Cloud Storage ┆ 48.573814 ┆ 34.590125 ┆ ["center_id", "location", "ser… ┆ 0 │
# │ Center_1 ┆ West ┆ Data Processing ┆ 40.087963 ┆ 35.195383 ┆ ["center_id", "location", "ser… ┆ 0 │
# │ Center_1 ┆ West ┆ Web Hosting ┆ 49.162421 ┆ 35.982805 ┆ ["center_id", "location", "ser… ┆ 0 │
# │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
# │ null ┆ East ┆ null ┆ 52.259052 ┆ 30.260479 ┆ ["location"] ┆ 5 │
# │ null ┆ null ┆ Data Processing ┆ 47.754155 ┆ 29.978704 ┆ ["service_type"] ┆ 6 │
# │ null ┆ null ┆ Web Hosting ┆ 54.883262 ┆ 33.790533 ┆ ["service_type"] ┆ 6 │
# │ null ┆ null ┆ Cloud Storage ┆ 58.605781 ┆ 32.230542 ┆ ["service_type"] ┆ 6 │
# │ null ┆ null ┆ null ┆ 54.155928 ┆ 32.123506 ┆ [] ┆ 7 │
# └───────────┴──────────┴─────────────────┴───────────┴───────────┴─────────────────────────────────┴──────────┘
sep='\n',
)
add the native
Problem description
Recently, I needed to filter data to look at rolled up aggregates at different levels in Python.
I was quite fond of data.table's implementation of it in R, and I searched for something similar in Polars, but it wasn't there. Thankfully, DuckDB's Polars integration came to the rescue here, but I would love to see it implemented in Polars if it isn't too much effort.
Handy references for GROUPING SETS, ROLLUP and CUBE: PostgreSQL, Oracle and SQL Server.
As a distinct feature which would be difficult in plain SQL, the issue of using null as the column value for higher-level rollups could be addressed with an argument for how to represent it (such as "Total").
Similar to #6645, but not as a feature request and not as a year/month/day rollup. The solution there works fine for fewer columns, but not for larger combinations.
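A sketch of that relabelling on top of the concat-based workarounds above, assuming the grouping columns are strings:

import polars as pl

# `rolled_up` stands in for a diagonally concatenated result like those shown above
rolled_up = pl.DataFrame({
    "location": ["East", "West", None],
    "cpu_usage": [52.3, 55.7, 54.2],
})

# replace the nulls introduced by the higher-level groupings with a label
rolled_up.with_columns(pl.col("location").fill_null("Total"))
# caveat: this also relabels any genuine nulls in the data, which is one reason
# a dedicated argument on a native API would be preferable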