GROUPING SETS feature in Polars #7948

Open · avimallu opened this issue Apr 3, 2023 · 8 comments
Labels: enhancement (New feature or an improvement of an existing feature)

Comments

avimallu (Contributor) commented Apr 3, 2023

Problem description

Recently, I needed to filter data to look at rolled-up aggregates at different levels in Python.

I was quite fond of data.table's implementation of this in R, and I searched for something similar in Polars, but it wasn't there. Thankfully, DuckDB's Polars integration came to the rescue here, but I would love to see it implemented in Polars if it isn't too much effort.

Handy references for GROUPING SETS, ROLLUP and CUBE: PostgreSQL, Oracle and SQL Server.

As a distinct feature that would be difficult in plain SQL, the use of null as the column value for higher-level rollups could be replaced with an argument specifying how to represent it (such as "Total").

Similar to #6645, though that was not phrased as a feature request and was specific to a year/month/day rollup. The solution there works fine for a few columns, but not for larger combinations:

import polars as pl
# exdf: any frame with a 'date' column and a measure column 'x'
groupingsets = [
    [pl.col("date").dt.year().alias("year"), pl.col("date").dt.month().alias("month")],
    [pl.col("date").dt.year().alias("year")],
    [pl.lit(None).cast(pl.Int32()).alias("year")],
]
pl.concat(
    [exdf.groupby(g).agg(pl.col("x").mean()) for g in groupingsets],
    how="diagonal",
).sort(["year", "month"])
@cmdlineluser (Contributor)

Are you looking for window functions?

df.with_columns(
   per_brand = pl.sum("sales").over("brand"),
   per_size = pl.sum("sales").over("size"),
   total = pl.sum("sales")
)
shape: (4, 6)
┌───────┬──────┬───────┬───────────┬──────────┬───────┐
│ brand ┆ size ┆ sales ┆ per_brand ┆ per_size ┆ total │
│ ---   ┆ ---  ┆ ---   ┆ ---       ┆ ---      ┆ ---   │
│ str   ┆ str  ┆ i64   ┆ i64       ┆ i64      ┆ i64   │
╞═══════╪══════╪═══════╪═══════════╪══════════╪═══════╡
│ Foo   ┆ L    ┆ 10    ┆ 30        ┆ 15       ┆ 50    │
│ Foo   ┆ M    ┆ 20    ┆ 30        ┆ 35       ┆ 50    │
│ Bar   ┆ M    ┆ 15    ┆ 20        ┆ 35       ┆ 50    │
│ Bar   ┆ L    ┆ 5     ┆ 20        ┆ 15       ┆ 50    │
└───────┴──────┴───────┴───────────┴──────────┴───────┘

Using the example from the postgres docs:

import duckdb
import polars as pl

df = pl.DataFrame({
   "brand": ["Foo", "Foo", "Bar", "Bar"],
   "size": ["L", "M", "M", "L"],
   "sales": [10, 20, 15, 5]
})
duckdb.sql("""
FROM df
SELECT brand, size, sum(sales) 
GROUP BY GROUPING SETS ((brand), (size), ())
""")
┌─────────┬─────────┬────────────┐
│  brand  │  size   │ sum(sales) │
│ varchar │ varchar │   int128   │
├─────────┼─────────┼────────────┤
│ Foo     │ NULL    │         30 │
│ Bar     │ NULL    │         20 │
│ NULL    │ L       │         15 │
│ NULL    │ M       │         35 │
│ NULL    │ NULL    │         50 │
└─────────┴─────────┴────────────┘

avimallu (Contributor, Author) commented Apr 3, 2023

No to the first - not looking for window functions.

Yes to the second - the DuckDB method is what I ended up using on the Polars DataFrame that I wanted aggregated. I was asking whether Polars could implement this feature natively, something like this:

import polars as pl

df = pl.DataFrame({
   "brand": ["Foo", "Foo", "Bar", "Bar"],
   "size": ["L", "M", "M", "L"],
   "sales": [10, 20, 15, 5]
})

(
  df
  .grouping_sets(("brand",), ("size",), (), replace_null_as="Total")
  .agg(
     pl.col("sales").sum(),
     # Additional aggregations
     pl.col("sales").mean(),
  )
)

To produce a table like:

┌─────────┬─────────┬────────────┐
│  brand  │  size   │ sum(sales) │
│ varchar │ varchar │   int128   │
├─────────┼─────────┼────────────┤
│ Foo     │ Total   │         30 │
│ Bar     │ Total   │         20 │
│ Total   │ L       │         15 │
│ Total   │ M       │         35 │
│ Total   │ Total   │         50 │
└─────────┴─────────┴────────────┘
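Until something like this exists natively, the replace_null_as behaviour can be approximated with current syntax by concatenating the per-set group_by results diagonally and then filling the nulls. A rough sketch, not the proposed API (group_by is the newer spelling of groupby):

import polars as pl

df = pl.DataFrame({
   "brand": ["Foo", "Foo", "Bar", "Bar"],
   "size": ["L", "M", "M", "L"],
   "sales": [10, 20, 15, 5]
})

(
   pl.concat(
      [df.group_by(g).agg(pl.col("sales").sum()) for g in (["brand"], ["size"])]
      + [df.select(pl.col("sales").sum())],  # the () grouping set
      how="diagonal",
   )
   # represent the higher-level rollups as "Total" instead of null
   .with_columns(pl.col("brand", "size").fill_null("Total"))
)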

The grouping_sets API could also have options for performing the ROLLUP or CUBE variants of GROUPING SETS, as sketched below.
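For reference, ROLLUP and CUBE are just shorthands for particular lists of grouping sets, so they could be expanded at the API boundary. A minimal sketch of that expansion (hypothetical helper names; a later comment in this thread develops the same idea further):

from itertools import combinations

def rollup_sets(*cols):
    # ROLLUP(a, b) expands to [(a, b), (a,), ()]
    return [cols[:i] for i in range(len(cols), -1, -1)]

def cube_sets(*cols):
    # CUBE(a, b) expands to every subset: [(a, b), (a,), (b,), ()]
    return [s for n in range(len(cols), -1, -1) for s in combinations(cols, n)]

print(rollup_sets("brand", "size"))  # [('brand', 'size'), ('brand',), ()]
print(cube_sets("brand", "size"))    # [('brand', 'size'), ('brand',), ('size',), ()]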

cmdlineluser (Contributor) commented Apr 4, 2023

I was curious to see if it was possible to create such a table with current polars syntax:

(
   df
   .select(structs = 
      pl.struct(
         pl.struct("brand", pl.sum("sales").over("brand")),
         pl.struct("size",  pl.sum("sales").over("size")),
      ))
   .unnest("structs")
   .melt()
   .unnest("value")
   .unique(maintain_order=True)
   .drop("variable")
)
shape: (4, 3)
┌───────┬───────┬──────┐
│ brand ┆ sales ┆ size │
│ ---   ┆ ---   ┆ ---  │
│ str   ┆ i64   ┆ str  │
╞═══════╪═══════╪══════╡
│ Foo   ┆ 30    ┆ null │
│ Bar   ┆ 20    ┆ null │
│ null  ┆ 15    ┆ L    │
│ null  ┆ 35    ┆ M    │
└───────┴───────┴──────┘

An attempt at a general approach:

grouping_sets(
    df, 
    groups=("brand", "size"), 
    exprs=(
       pl.col("sales").sum().alias("sum"), 
       pl.col("sales").mean().alias("mean")
    )
)
shape: (5, 4)
┌───────┬─────┬──────┬──────┐
│ brand ┆ sum ┆ mean ┆ size │
│ ---   ┆ --- ┆ ---  ┆ ---  │
│ str   ┆ i64 ┆ f64  ┆ str  │
╞═══════╪═════╪══════╪══════╡
│ Foo   ┆ 30  ┆ 15.0 ┆ null │
│ Bar   ┆ 20  ┆ 10.0 ┆ null │
│ null  ┆ 15  ┆ 7.5  ┆ L    │
│ null  ┆ 35  ┆ 17.5 ┆ M    │
│ null  ┆ 50  ┆ 12.5 ┆ null │
└───────┴─────┴──────┴──────┘

Code:

def grouping_sets(df, groups, exprs, with_total=True):
    # one field per group: a struct of the group column plus each
    # aggregate evaluated as a window function over that group
    structs = pl.struct(
        pl.struct(
            group,
            *(expr.over(group).alias(expr.meta.output_name())
              for expr in exprs)
        )
        for group in groups
    )

    result = (
        df.select(structs=structs)
          .unnest("structs")
          .melt()                       # one row per (group, value) pair
          .unnest("value")
          .unique(maintain_order=True)  # window values repeat per row; keep one
          .drop("variable")
    )

    if with_total:
        # an all-null row to hold the global aggregates, i.e. the () set
        null = result.select(
            pl.lit(None).alias(col).cast(dtype)
            for col, dtype in result.schema.items()
        )

        result = pl.concat([
            result,
            null.with_columns(df.select(*exprs))
        ])

    return result

with_total is used in the example to represent the () behaviour.
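For completeness, passing with_total=False skips the final all-null totals row and returns only the per-group rows, e.g.:

grouping_sets(
    df,
    groups=("brand", "size"),
    exprs=(pl.col("sales").sum().alias("sum"),),
    with_total=False,
)  # the four per-group rows only; no grand-total row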

@ritchie46 (Member)

> I was curious to see if it was possible to create such a table with current polars syntax:

I believe you need to concat. They are described as unions over groupby aggregations, if I am not mistaken.

@jangorecki

>> I was curious to see if it was possible to create such a table with current polars syntax:
>
> I believe you need to concat. They are described as unions over groupby aggregations, if I am not mistaken.

Yes, as coded by @avimallu in the first post. Grouping sets are simply concatenated sets of groupby queries, meant to calculate aggregations on measures for different combinations of dimensions (grouping levels).

nobites commented Apr 26, 2024

I fully agree with @avimallu. GROUP BY CUBE, ROLLUP, and GROUPING SETS are common in SQL and also in PySpark. I also miss them in Polars, and a workaround requires several lines of code, as @cmdlineluser showed above. Please implement. 👍

@camriddell

The combinatoric groups can be created at the Python level quite readily with itertools, and Polars can then perform the heavy lifting for evaluation during .collect:

import polars as pl
from itertools import islice, combinations
from pandas import Timestamp, to_timedelta
from numpy.random import default_rng


rng = default_rng(0)

center_ids = [f"Center_{i}" for i in range(2)]
locations = ["East", "West"]
service_types = ["Web Hosting", "Data Processing", "Cloud Storage"]

ldf = pl.DataFrame({
    "center_id":    rng.choice(center_ids, size=(size := 100)),
    "location":     rng.choice(locations, size=size),
    "service_type": rng.choice(service_types, size=size),
    "timestamp":    Timestamp.now() - to_timedelta(rng.integers(0, 3_650, size=size), unit='D'),
    "cpu_usage":    rng.uniform(0, 100, size=size),
    "mem_usage":    rng.uniform(0, 64, size=size),
}).lazy()

def grouping_sets(ldf, groups, exprs):
    """generic implementation of grouping sets functionality

    Evaluates exprs within each of the passed groups.
    """
    frames = []
    for i, gs in enumerate(groups):
        if not gs:
            query = ldf.select(exprs)
        else:
            query = ldf.group_by(gs).agg(exprs)
        frames.append(
            query.with_columns(
                # columns to track groupings
                groupings=pl.lit(gs),
                group_id=pl.lit(i),
            )
        )
    return pl.concat(frames, how='diagonal')

# combinatoric helpers
def rollup(*items):
    """produce shrinking subsets of items

    Note: this drops items from the left, whereas SQL ROLLUP drops
    trailing items; pass the arguments in reverse for SQL-style order.
    """
    return (
        [*islice(items, i, None)] for i in range(len(items)+1)
    )
def cube(*items):
    """reversed version of the itertools powerset recipe"""
    for size in range(len(items), -1, -1):
        for combo in combinations(items, size):
            yield [*combo]

print(
    grouping_sets( # explicit grouping sets groups
        ldf,
        groups=[['center_id', 'location', 'service_type'], ['location'], []],
        exprs=[pl.col('cpu_usage').mean(), pl.col('mem_usage').mean()]
    )
    .collect(),
    # shape: (15, 7)
    # ┌───────────┬──────────┬─────────────────┬───────────┬───────────┬─────────────────────────────────┬──────────┐
    # │ center_id ┆ location ┆ service_type    ┆ cpu_usage ┆ mem_usage ┆ groupings                       ┆ group_id │
    # │ ---       ┆ ---      ┆ ---             ┆ ---       ┆ ---       ┆ ---                             ┆ ---      │
    # │ str       ┆ str      ┆ str             ┆ f64       ┆ f64       ┆ list[str]                       ┆ i32      │
    # ╞═══════════╪══════════╪═════════════════╪═══════════╪═══════════╪═════════════════════════════════╪══════════╡
    # │ Center_0  ┆ West     ┆ Web Hosting     ┆ 65.743955 ┆ 40.348164 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_1  ┆ East     ┆ Web Hosting     ┆ 53.60846  ┆ 27.885183 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_1  ┆ West     ┆ Cloud Storage   ┆ 67.4061   ┆ 32.278916 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_0  ┆ West     ┆ Data Processing ┆ 54.954341 ┆ 21.972752 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_0  ┆ East     ┆ Web Hosting     ┆ 49.510622 ┆ 31.492153 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ …         ┆ …        ┆ …               ┆ …         ┆ …         ┆ …                               ┆ …        │
    # │ Center_1  ┆ West     ┆ Web Hosting     ┆ 49.162421 ┆ 35.982805 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_0  ┆ West     ┆ Cloud Storage   ┆ 59.67797  ┆ 39.92657  ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ null      ┆ West     ┆ null            ┆ 55.707917 ┆ 33.647801 ┆ ["location"]                    ┆ 1        │
    # │ null      ┆ East     ┆ null            ┆ 52.259052 ┆ 30.260479 ┆ ["location"]                    ┆ 1        │
    # │ null      ┆ null     ┆ null            ┆ 54.155928 ┆ 32.123506 ┆ []                              ┆ 2        │
    # └───────────┴──────────┴─────────────────┴───────────┴───────────┴─────────────────────────────────┴──────────┘

    grouping_sets( # rollup pattern
        ldf,
        groups=rollup('center_id', 'location', 'service_type'),
        exprs=[pl.col('cpu_usage').mean(), pl.col('mem_usage').mean()]
    )
    .collect(),
    # shape: (22, 7)
    # ┌───────────┬──────────┬─────────────────┬───────────┬───────────┬─────────────────────────────────┬──────────┐
    # │ center_id ┆ location ┆ service_type    ┆ cpu_usage ┆ mem_usage ┆ groupings                       ┆ group_id │
    # │ ---       ┆ ---      ┆ ---             ┆ ---       ┆ ---       ┆ ---                             ┆ ---      │
    # │ str       ┆ str      ┆ str             ┆ f64       ┆ f64       ┆ list[str]                       ┆ i32      │
    # ╞═══════════╪══════════╪═════════════════╪═══════════╪═══════════╪═════════════════════════════════╪══════════╡
    # │ Center_1  ┆ West     ┆ Web Hosting     ┆ 49.162421 ┆ 35.982805 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_0  ┆ East     ┆ Web Hosting     ┆ 49.510622 ┆ 31.492153 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_0  ┆ West     ┆ Cloud Storage   ┆ 59.67797  ┆ 39.92657  ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_1  ┆ East     ┆ Data Processing ┆ 52.559742 ┆ 35.97784  ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_0  ┆ East     ┆ Data Processing ┆ 38.898137 ┆ 31.946376 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ …         ┆ …        ┆ …               ┆ …         ┆ …         ┆ …                               ┆ …        │
    # │ null      ┆ East     ┆ Cloud Storage   ┆ 54.270371 ┆ 29.515165 ┆ ["location", "service_type"]    ┆ 1        │
    # │ null      ┆ null     ┆ Data Processing ┆ 47.754155 ┆ 29.978704 ┆ ["service_type"]                ┆ 2        │
    # │ null      ┆ null     ┆ Cloud Storage   ┆ 58.605781 ┆ 32.230542 ┆ ["service_type"]                ┆ 2        │
    # │ null      ┆ null     ┆ Web Hosting     ┆ 54.883262 ┆ 33.790533 ┆ ["service_type"]                ┆ 2        │
    # │ null      ┆ null     ┆ null            ┆ 54.155928 ┆ 32.123506 ┆ []                              ┆ 3        │
    # └───────────┴──────────┴─────────────────┴───────────┴───────────┴─────────────────────────────────┴──────────┘

    grouping_sets( # cube pattern
        ldf,
        groups=cube('center_id', 'location', 'service_type'),
        exprs=[pl.col('cpu_usage').mean(), pl.col('mem_usage').mean()]
    )
    .collect(),
    # shape: (36, 7)
    # ┌───────────┬──────────┬─────────────────┬───────────┬───────────┬─────────────────────────────────┬──────────┐
    # │ center_id ┆ location ┆ service_type    ┆ cpu_usage ┆ mem_usage ┆ groupings                       ┆ group_id │
    # │ ---       ┆ ---      ┆ ---             ┆ ---       ┆ ---       ┆ ---                             ┆ ---      │
    # │ str       ┆ str      ┆ str             ┆ f64       ┆ f64       ┆ list[str]                       ┆ i32      │
    # ╞═══════════╪══════════╪═════════════════╪═══════════╪═══════════╪═════════════════════════════════╪══════════╡
    # │ Center_1  ┆ East     ┆ Cloud Storage   ┆ 58.068076 ┆ 26.131858 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_1  ┆ West     ┆ Cloud Storage   ┆ 67.4061   ┆ 32.278916 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_0  ┆ East     ┆ Cloud Storage   ┆ 48.573814 ┆ 34.590125 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_1  ┆ West     ┆ Data Processing ┆ 40.087963 ┆ 35.195383 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ Center_1  ┆ West     ┆ Web Hosting     ┆ 49.162421 ┆ 35.982805 ┆ ["center_id", "location", "ser… ┆ 0        │
    # │ …         ┆ …        ┆ …               ┆ …         ┆ …         ┆ …                               ┆ …        │
    # │ null      ┆ East     ┆ null            ┆ 52.259052 ┆ 30.260479 ┆ ["location"]                    ┆ 5        │
    # │ null      ┆ null     ┆ Data Processing ┆ 47.754155 ┆ 29.978704 ┆ ["service_type"]                ┆ 6        │
    # │ null      ┆ null     ┆ Web Hosting     ┆ 54.883262 ┆ 33.790533 ┆ ["service_type"]                ┆ 6        │
    # │ null      ┆ null     ┆ Cloud Storage   ┆ 58.605781 ┆ 32.230542 ┆ ["service_type"]                ┆ 6        │
    # │ null      ┆ null     ┆ null            ┆ 54.155928 ┆ 32.123506 ┆ []                              ┆ 7        │
    # └───────────┴──────────┴─────────────────┴───────────┴───────────┴─────────────────────────────────┴──────────┘

    sep='\n',
)
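The groupings / group_id columns here also stand in for SQL's GROUPING() function, identifying which set each row came from, so individual levels can be filtered back out. A usage sketch against the helpers above (list.len needs a reasonably recent Polars; older versions spell it list.lengths):

(
    grouping_sets(
        ldf,
        groups=cube('center_id', 'location'),
        exprs=[pl.col('cpu_usage').mean()],
    )
    .filter(pl.col('groupings').list.len() == 0)  # keep only the () grand total
    .collect()
)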

l1t1 commented Oct 18, 2024

Adding the native GROUP BY CUBE/ROLLUP/GROUPING SETS SQL syntax would be more convenient, and is necessary.
