GROUPING SETS feature in Polars #7948

Open · avimallu opened this issue Apr 3, 2023 · 8 comments
Labels: enhancement (New feature or an improvement of an existing feature)

Comments

avimallu (Contributor) commented Apr 3, 2023

Problem description

Recently, I needed to filter data to look at rolled-up aggregates at different levels in Python.

I was quite fond of data.table's implementation of this in R, and I searched for something similar in Polars, but it wasn't there. Thankfully, DuckDB's Polars integration came to the rescue here, but I would love to see it implemented in Polars if it isn't too much effort.

Handy references for GROUPING SETS, ROLLUP and CUBE: PostgreSQL, Oracle and SQL Server.

As a distinct feature that would be difficult in plain SQL, the use of null as the column value for higher-level rollups could be replaced with an argument specifying how to represent it (such as "Total").

Similar to #6645, though that was not phrased as a feature request and was specific to a year/month/day rollup. The solution there works fine for a few columns, but not for larger combinations:

import polars as pl
# exdf: any frame with a 'date' column and a measure column 'x'
groupingsets = [
    [pl.col("date").dt.year().alias("year"), pl.col("date").dt.month().alias("month")],
    [pl.col("date").dt.year().alias("year")],
    [pl.lit(None).cast(pl.Int32()).alias("year")],
]
pl.concat(
    [exdf.groupby(g).agg(pl.col("x").mean()) for g in groupingsets],
    how="diagonal",
).sort(["year", "month"])
@cmdlineluser (Contributor)

Are you looking for window functions?

df.with_columns(
   per_brand = pl.sum("sales").over("brand"),
   per_size = pl.sum("sales").over("size"),
   total = pl.sum("sales")
)
shape: (4, 6)
┌───────┬──────┬───────┬───────────┬──────────┬───────┐
│ brand ┆ size ┆ sales ┆ per_brand ┆ per_size ┆ total │
│ ---   ┆ ---  ┆ ---   ┆ ---       ┆ ---      ┆ ---   │
│ str   ┆ str  ┆ i64   ┆ i64       ┆ i64      ┆ i64   │
╞═══════╪══════╪═══════╪═══════════╪══════════╪═══════╡
│ Foo   ┆ L    ┆ 10    ┆ 30        ┆ 15       ┆ 50    │
│ Foo   ┆ M    ┆ 20    ┆ 30        ┆ 35       ┆ 50    │
│ Bar   ┆ M    ┆ 15    ┆ 20        ┆ 35       ┆ 50    │
│ Bar   ┆ L    ┆ 5     ┆ 20        ┆ 15       ┆ 50    │
└───────┴──────┴───────┴───────────┴──────────┴───────┘

Using the example from the postgres docs:

import duckdb
import polars as pl

df = pl.DataFrame({
   "brand": ["Foo", "Foo", "Bar", "Bar"],
   "size": ["L", "M", "M", "L"],
   "sales": [10, 20, 15, 5]
})
duckdb.sql("""
FROM df
SELECT brand, size, sum(sales) 
GROUP BY GROUPING SETS ((brand), (size), ())
""")
┌─────────┬─────────┬────────────┐
│  brand  │  size   │ sum(sales) │
│ varchar │ varchar │   int128   │
├─────────┼─────────┼────────────┤
│ Foo     │ NULL    │         30 │
│ Bar     │ NULL    │         20 │
│ NULL    │ L       │         15 │
│ NULL    │ M       │         35 │
│ NULL    │ NULL    │         50 │
└─────────┴─────────┴────────────┘

avimallu (Contributor, Author) commented Apr 3, 2023

No to the first - not looking for window functions.

Yes to the second - the DuckDB method is what I ended up using on the Polars DataFrame that I wanted aggregated. I was asking whether Polars could implement this feature natively, something like this:

import polars as pl

df = pl.DataFrame({
   "brand": ["Foo", "Foo", "Bar", "Bar"],
   "size": ["L", "M", "M", "L"],
   "sales": [10, 20, 15, 5]
})

(
  df
  .grouping_sets(("brand",), ("size",), (), replace_null_as="Total")
  .agg(
     pl.col("sales").sum(),
     # Additional aggregations
     pl.col("sales").mean(),
  )
)

To produce a table like:

┌─────────┬─────────┬────────────┐
│  brand  │  size   │ sum(sales) │
│ varchar │ varchar │   int128   │
├─────────┼─────────┼────────────┤
│ Foo     │ Total   │         30 │
│ Bar     │ Total   │         20 │
│ Total   │ L       │         15 │
│ Total   │ M       │         35 │
│ Total   │ Total   │         50 │
└─────────┴─────────┴────────────┘
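Until something like this exists natively, the replace_null_as behaviour can be approximated with current syntax by concatenating the per-set group_by results diagonally and then filling the nulls. A rough sketch, not the proposed API (group_by is the newer spelling of groupby):

import polars as pl

df = pl.DataFrame({
   "brand": ["Foo", "Foo", "Bar", "Bar"],
   "size": ["L", "M", "M", "L"],
   "sales": [10, 20, 15, 5]
})

(
   pl.concat(
      [df.group_by(g).agg(pl.col("sales").sum()) for g in (["brand"], ["size"])]
      + [df.select(pl.col("sales").sum())],  # the () grouping set
      how="diagonal",
   )
   # represent the higher-level rollups as "Total" instead of null
   .with_columns(pl.col("brand", "size").fill_null("Total"))
)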

The grouping_sets API could also have options for performing the ROLLUP or CUBE variants of GROUPING SETS, as sketched below.
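For reference, ROLLUP and CUBE are just shorthands for particular lists of grouping sets, so they could be expanded at the API boundary. A minimal sketch of that expansion (hypothetical helper names; a later comment in this thread develops the same idea further):

from itertools import combinations

def rollup_sets(*cols):
    # ROLLUP(a, b) expands to [(a, b), (a,), ()]
    return [cols[:i] for i in range(len(cols), -1, -1)]

def cube_sets(*cols):
    # CUBE(a, b) expands to every subset: [(a, b), (a,), (b,), ()]
    return [s for n in range(len(cols), -1, -1) for s in combinations(cols, n)]

print(rollup_sets("brand", "size"))  # [('brand', 'size'), ('brand',), ()]
print(cube_sets("brand", "size"))    # [('brand', 'size'), ('brand',), ('size',), ()]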

cmdlineluser (Contributor) commented Apr 4, 2023

I was curious to see if it was possible to create such a table with current polars syntax:

(
   df
   .select(structs = 
      pl.struct(
         pl.struct("brand", pl.sum("sales").over("brand")),
         pl.struct("size",  pl.sum("sales").over("size")),
      ))
   .unnest("structs")
   .melt()
   .unnest("value")
   .unique(maintain_order=True)
   .drop("variable")
)
shape: (4, 3)
┌───────┬───────┬──────┐
│ brand ┆ sales ┆ size │
│ ---   ┆ ---   ┆ ---  │
│ str   ┆ i64   ┆ str  │
╞═══════╪═══════╪══════╡
│ Foo   ┆ 30    ┆ null │
│ Bar   ┆ 20    ┆ null │
│ null  ┆ 15    ┆ L    │
│ null  ┆ 35    ┆ M    │
└───────┴───────┴──────┘

An attempt at a general approach:

grouping_sets(
    df, 
    groups=("brand", "size"), 
    exprs=(
       pl.col("sales").sum().alias("sum"), 
       pl.col("sales").mean().alias("mean")
    )
)
shape: (5, 4)
┌───────┬─────┬──────┬──────┐
│ brand ┆ sum ┆ mean ┆ size │
│ ---   ┆ --- ┆ ---  ┆ ---  │
│ str   ┆ i64 ┆ f64  ┆ str  │
╞═══════╪═════╪══════╪══════╡
│ Foo   ┆ 30  ┆ 15.0 ┆ null │
│ Bar   ┆ 20  ┆ 10.0 ┆ null │
│ null  ┆ 15  ┆ 7.5  ┆ L    │
│ null  ┆ 35  ┆ 17.5 ┆ M    │
│ null  ┆ 50  ┆ 12.5 ┆ null │
└───────┴─────┴──────┴──────┘

Code:

def grouping_sets(df, groups, exprs, with_total=True):
    # one field per group: a struct of the group column plus each
    # aggregate evaluated as a window function over that group
    structs = pl.struct(
        pl.struct(
            group,
            *(expr.over(group).alias(expr.meta.output_name())
              for expr in exprs)
        )
        for group in groups
    )

    result = (
        df.select(structs=structs)
          .unnest("structs")
          .melt()                       # one row per (group, value) pair
          .unnest("value")
          .unique(maintain_order=True)  # window values repeat per row; keep one
          .drop("variable")
    )

    if with_total:
        # an all-null row to hold the global aggregates, i.e. the () set
        null = result.select(
            pl.lit(None).alias(col).cast(dtype)
            for col, dtype in result.schema.items()
        )

        result = pl.concat([
            result,
            null.with_columns(df.select(*exprs))
        ])

    return result

with_total is used in the example to represent the () behaviour.
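For completeness, passing with_total=False skips the final all-null totals row and returns only the per-group rows, e.g.:

grouping_sets(
    df,
    groups=("brand", "size"),
    exprs=(pl.col("sales").sum().alias("sum"),),
    with_total=False,
)  # the four per-group rows only; no grand-total row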

@ritchie46 (Member)

> I was curious to see if it was possible to create such a table with current polars syntax:

I believe you need to concat. They are described as unions over groupby aggregations, if I am not mistaken.

@jangorecki

>> I was curious to see if it was possible to create such a table with current polars syntax:
>
> I believe you need to concat. They are described as unions over groupby aggregations, if I am not mistaken.

Yes, as coded by @avimallu in the first post. Grouping sets are simply concatenated sets of groupby queries, meant to calculate aggregations on measures for different combinations of dimensions (grouping levels).

nobites commented Apr 26, 2024

I fully agree with @avimallu. GROUP BY CUBE, ROLLUP, and GROUPING SETS are common in SQL and also in PySpark. I also miss them in Polars, and a workaround requires several lines of code, as @cmdlineluser showed above. Please implement. 👍

@camriddell

The combinatoric groups can be created at the Python level quite readily with itertools, and Polars can then perform the heavy lifting for evaluation during .collect:

import polars as pl
from itertools import islice, combinations
from pandas import Timestamp, to_timedelta
from numpy.random import default_rng


rng = default_rng(0)

center_ids = [f"Center_{i}" for i in range(2)]
locations = ["East", "West"]
service_types = ["Web Hosting", "Data Processing", "Cloud Storage"]

ldf = pl.DataFrame({
    "center_id":    rng.choice(center_ids, size=(size := 100)),
    "location":     rng.choice(locations, size=size),
    "service_type": rng.choice(service_types, size=size),
    "timestamp":    Timestamp.now() - to_timedelta(rng.integers(0, 3_650, size=size), unit='D'),
    "cpu_usage":    rng.uniform(0, 100, size=size),
    "mem_usage":    rng.uniform(0, 64, size=size),
}).lazy()

def grouping_sets(ldf, groups, exprs):
    """generic implementation of grouping sets functionality

    Evaluates exprs within each of the passed groups.
    """
    frames = []
    for i, gs in enumerate(groups):
        if not gs:
            query = ldf.select(exprs)
        else:
            query = ldf.group_by(gs).agg(exprs)
        frames.append(
            query.with_columns(
                # columns to track groupings
                groupings=pl.lit(gs),
                group_id=pl.lit(i),
            )
        )
    return pl.concat(frames, how='diagonal')

# combinatoric helpers
def rollup(*items):
    """produce shrinking subsets of items

    Note: this drops items from the left, whereas SQL ROLLUP drops
    trailing items; pass the arguments in reverse for SQL-style order.
    """
    return (
        [*islice(items, i, None)] for i in range(len(items)+1)
    )
def cube(*items):
    """reversed version of the itertools powerset recipe"""
    for size in range(len(items), -1, -1):
        for combo in combinations(items, size):
            yield [*combo]

print(
    grouping_sets( # explicit grouping sets groups
        ldf,
        groups=[['center_id', 'location', 'service_type'], ['location'], []],
        exprs=[pl.col('cpu_usage').mean(), pl.col('mem_usage').mean()]
    )
    .collect(),
    # shape: (15, 7)
    # ┌───────────┬──────────┬─────────────────┬───────────┬───────────┬─────────────────────────────────┬──────────┐
    # │ center_id ┆ location ┆ service_type    ┆ cpu_usage ┆ mem_usage ┆ groupings                       ┆ group_id │
    # │ ---       ┆ ---      ┆ ---             ┆ ---       ┆ ---       ┆ ---                             ┆ ---      │
    # │ str       ┆ str      ┆ str             ┆ f64       ┆ f64       ┆ list[str]                       ┆ i32      │
    # ╞═══════════╪══════════╪═════════════════╪═══════════╪═══════════╪═════════════════════════════════╪══════════╡
    # │ Center_0  ┆ West     ┆ Web Hosting     ┆ 65.743955 ┆ 40.348164 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_1  ┆ East     ┆ Web Hosting     ┆ 53.60846  ┆ 27.885183 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_1  ┆ West     ┆ Cloud Storage   ┆ 67.4061   ┆ 32.278916 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_0  ┆ West     ┆ Data Processing ┆ 54.954341 ┆ 21.972752 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_0  ┆ East     ┆ Web Hosting     ┆ 49.510622 ┆ 31.492153 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ …         ┆ …        ┆ …               ┆ …         ┆ …         ┆ …                               ┆ …        │
    # │ Center_1  ┆ West     ┆ Web Hosting     ┆ 49.162421 ┆ 35.982805 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_0  ┆ West     ┆ Cloud Storage   ┆ 59.67797  ┆ 39.92657  ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ null      ┆ West     ┆ null            ┆ 55.707917 ┆ 33.647801 ┆ ["location"]                    ┆ 1        │
    # │ null      ┆ East     ┆ null            ┆ 52.259052 ┆ 30.260479 ┆ ["location"]                    ┆ 1        │
    # │ null      ┆ null     ┆ null            ┆ 54.155928 ┆ 32.123506 ┆ []                              ┆ 2        │
    # └───────────┴──────────┴─────────────────┴───────────┴───────────┴─────────────────────────────────┴──────────┘

    grouping_sets( # rollup pattern
        ldf,
        groups=rollup('center_id', 'location', 'service_type'),
        exprs=[pl.col('cpu_usage').mean(), pl.col('mem_usage').mean()]
    )
    .collect(),
    # shape: (22, 7)
    # ┌───────────┬──────────┬─────────────────┬───────────┬───────────┬─────────────────────────────────┬──────────┐
    # │ center_id ┆ location ┆ service_type    ┆ cpu_usage ┆ mem_usage ┆ groupings                       ┆ group_id │
    # │ ---       ┆ ---      ┆ ---             ┆ ---       ┆ ---       ┆ ---                             ┆ ---      │
    # │ str       ┆ str      ┆ str             ┆ f64       ┆ f64       ┆ list[str]                       ┆ i32      │
    # ╞═══════════╪══════════╪═════════════════╪═══════════╪═══════════╪═════════════════════════════════╪══════════╡
    # │ Center_1  ┆ West     ┆ Web Hosting     ┆ 49.162421 ┆ 35.982805 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_0  ┆ East     ┆ Web Hosting     ┆ 49.510622 ┆ 31.492153 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_0  ┆ West     ┆ Cloud Storage   ┆ 59.67797  ┆ 39.92657  ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_1  ┆ East     ┆ Data Processing ┆ 52.559742 ┆ 35.97784  ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_0  ┆ East     ┆ Data Processing ┆ 38.898137 ┆ 31.946376 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ …         ┆ …        ┆ …               ┆ …         ┆ …         ┆ …                               ┆ …        │
    # │ null      ┆ East     ┆ Cloud Storage   ┆ 54.270371 ┆ 29.515165 ┆ ["location", "service_type"]    ┆ 1        │
    # │ null      ┆ null     ┆ Data Processing ┆ 47.754155 ┆ 29.978704 ┆ ["service_type"]                ┆ 2        │
    # │ null      ┆ null     ┆ Cloud Storage   ┆ 58.605781 ┆ 32.230542 ┆ ["service_type"]                ┆ 2        │
    # │ null      ┆ null     ┆ Web Hosting     ┆ 54.883262 ┆ 33.790533 ┆ ["service_type"]                ┆ 2        │
    # │ null      ┆ null     ┆ null            ┆ 54.155928 ┆ 32.123506 ┆ []                              ┆ 3        │
    # └───────────┴──────────┴─────────────────┴───────────┴───────────┴─────────────────────────────────┴──────────┘

    grouping_sets( # cube pattern
        ldf,
        groups=cube('center_id', 'location', 'service_type'),
        exprs=[pl.col('cpu_usage').mean(), pl.col('mem_usage').mean()]
    )
    .collect(),
    # shape: (36, 7)
    # ┌───────────┬──────────┬─────────────────┬───────────┬───────────┬─────────────────────────────────┬──────────┐
    # │ center_id ┆ location ┆ service_type    ┆ cpu_usage ┆ mem_usage ┆ groupings                       ┆ group_id │
    # │ ---       ┆ ---      ┆ ---             ┆ ---       ┆ ---       ┆ ---                             ┆ ---      │
    # │ str       ┆ str      ┆ str             ┆ f64       ┆ f64       ┆ list[str]                       ┆ i32      │
    # ╞═══════════╪══════════╪═════════════════╪═══════════╪═══════════╪═════════════════════════════════╪══════════╡
    # │ Center_1  ┆ East     ┆ Cloud Storage   ┆ 58.068076 ┆ 26.131858 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_1  ┆ West     ┆ Cloud Storage   ┆ 67.4061   ┆ 32.278916 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_0  ┆ East     ┆ Cloud Storage   ┆ 48.573814 ┆ 34.590125 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_1  ┆ West     ┆ Data Processing ┆ 40.087963 ┆ 35.195383 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_1  ┆ West     ┆ Web Hosting     ┆ 49.162421 ┆ 35.982805 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ …         ┆ …        ┆ …               ┆ …         ┆ …         ┆ …                               ┆ …        │
    # │ null      ┆ East     ┆ null            ┆ 52.259052 ┆ 30.260479 ┆ ["location"]                    ┆ 5        │
    # │ null      ┆ null     ┆ Data Processing ┆ 47.754155 ┆ 29.978704 ┆ ["service_type"]                ┆ 6        │
    # │ null      ┆ null     ┆ Web Hosting     ┆ 54.883262 ┆ 33.790533 ┆ ["service_type"]                ┆ 6        │
    # │ null      ┆ null     ┆ Cloud Storage   ┆ 58.605781 ┆ 32.230542 ┆ ["service_type"]                ┆ 6        │
    # │ null      ┆ null     ┆ null            ┆ 54.155928 ┆ 32.123506 ┆ []                              ┆ 7        │
    # └───────────┴──────────┴─────────────────┴───────────┴───────────┴─────────────────────────────────┴──────────┘

    sep='\n',
)
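The groupings / group_id columns here also stand in for SQL's GROUPING() function, identifying which set each row came from, so individual levels can be filtered back out. A usage sketch against the helpers above (list.len needs a reasonably recent Polars; older versions spell it list.lengths):

(
    grouping_sets(
        ldf,
        groups=cube('center_id', 'location'),
        exprs=[pl.col('cpu_usage').mean()],
    )
    .filter(pl.col('groupings').list.len() == 0)  # keep only the () grand total
    .collect()
)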

l1t1 commented Oct 18, 2024

Adding the native GROUP BY CUBE/ROLLUP/GROUPING SETS SQL syntax would be more convenient, and is necessary.
