
Inefficiency of over expression #10063

Closed · paladin158 opened this issue Jul 24, 2023 · 5 comments

@paladin158

Research

  • I have searched the above Polars tags on Stack Overflow for similar questions.

  • I have asked my usage related question on Stack Overflow.

Link to question on Stack Overflow

https://stackoverflow.com/questions/76757987/polars-inefficiency-of-over-expression

Question about Polars

I found that, at least for the scenario below, doing over is much slower (2-3x) than doing groupby/agg + explode, even though the results are exactly the same.

Based on this finding, I have the following questions:

  • Is this behaviour expected? If so, should we always prefer the two-step procedure (groupby/agg + explode) instead of using over directly?
  • Or does this mean there may be some room to optimize over?
  • Or does the relative performance of the two approaches depend on the problem setup, so that users should benchmark both and pick whichever fits better?

import time

import numpy as np
import polars as pl
from polars.testing import assert_frame_equal

## setup
rng = np.random.default_rng(1)

nrows = 20_000_000
df = pl.DataFrame(
    dict(
        id=rng.integers(1, 50, nrows),
        id2=rng.integers(1, 500, nrows),
        v=rng.normal(0, 1, nrows),
        v1=rng.normal(0, 1, nrows),
        v2=rng.normal(0, 1, nrows),
        v3=rng.normal(0, 1, nrows),
        v4=rng.normal(0, 1, nrows),
        v5=rng.normal(0, 1, nrows),
        v6=rng.normal(0, 1, nrows),
        v7=rng.normal(0, 1, nrows),
        v8=rng.normal(0, 1, nrows),
        v9=rng.normal(0, 1, nrows),
        v10=rng.normal(0, 1, nrows),
    )
)

## over
start = time.perf_counter()
res = (
    df.lazy()
    .select(
        [
            "id",
            "id2",
            *[
                (pl.col(f"v{i}") - pl.col(f"v{i}").mean().over(["id", "id2"]))
                / pl.col(f"v{i}").std().over(["id", "id2"])
                for i in range(1, 11)
            ],
        ]
    )
    .collect()
)
time.perf_counter() - start
# 8.541702497983351

## groupby/agg + explode
start = time.perf_counter()
res2 = (
    df.lazy()
    .groupby(["id", "id2"])
    .agg(
        [
            (pl.col(f"v{i}") - pl.col(f"v{i}").mean()) / pl.col(f"v{i}").std()
            for i in range(1, 11)
        ],
    )
    .explode(pl.exclude(["id", "id2"]))
    .collect()
)
time.perf_counter() - start
# 3.1841439900454134

## compare results
assert_frame_equal(res.sort(["id", "id2"]), res2.sort(["id", "id2"])[res.columns])
@marius-mather

One suggestion for improving the over() performance: I think you can do it like this to combine it into a single over() expression:

(pl.col(f"v{i}") - pl.col(f"v{i}").mean().truediv(pl.col(f"v{i}").std()).over(["id", "id2"]))

In my quick testing this reduces the difference but doesn't eliminate it.
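
For context, a sketch of that suggestion applied to the original benchmark (my reading of the intent: the single .over() wraps the whole centered-and-scaled expression; the name res3 is just for illustration):

## over, one window expression per column (sketch)
res3 = (
    df.lazy()
    .select(
        [
            "id",
            "id2",
            *[
                (pl.col(f"v{i}") - pl.col(f"v{i}").mean())
                .truediv(pl.col(f"v{i}").std())
                .over(["id", "id2"])
                for i in range(1, 11)
            ],
        ]
    )
    .collect()
)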

@paladin158
Author

Even after removing the denominator entirely from both approaches, there is still about a 2x performance difference in my benchmarking.
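
To make that concrete, the stripped-down comparison looks roughly like this (a sketch: mean-centering only, with the std denominator dropped from both versions):

## over, numerator only (sketch)
res_over = (
    df.lazy()
    .select(
        [
            "id",
            "id2",
            *[pl.col(f"v{i}") - pl.col(f"v{i}").mean().over(["id", "id2"]) for i in range(1, 11)],
        ]
    )
    .collect()
)

## groupby/agg + explode, numerator only (sketch)
res_grouped = (
    df.lazy()
    .groupby(["id", "id2"])
    .agg([pl.col(f"v{i}") - pl.col(f"v{i}").mean() for i in range(1, 11)])
    .explode(pl.exclude(["id", "id2"]))
    .collect()
)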

@cmdlineluser
Contributor

There was a @cbilot answer on SO that discussed/benchmarked the overhead of window expressions, but I cannot seem to find it anymore.

While searching, I did find:

"Note that window functions are very powerful, but also relatively expensive."

https://stackoverflow.com/a/71554447/

But I'm not sure whether that statement is about a comparison against an equivalent .groupby() operation.

Perhaps another question to ask is: can Polars rewrite the .over() version into the .groupby().explode() version for you?

I'm not sure about the technical details, so maybe this is already happening internally.
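
One complication with such a rewrite, worth noting: .over() preserves the input row order, while .groupby().agg().explode() does not; hence the .sort() before assert_frame_equal above. A sketch of an order-restoring rewrite, assuming an explicit row index is acceptable:

## groupby/agg + explode that restores the original row order (sketch)
res_ordered = (
    df.lazy()
    .with_row_count("row_nr")  # helper column to remember the input order
    .groupby(["id", "id2"])
    .agg(
        [
            pl.col("row_nr"),
            *[(pl.col(f"v{i}") - pl.col(f"v{i}").mean()) / pl.col(f"v{i}").std() for i in range(1, 11)],
        ]
    )
    .explode(pl.exclude(["id", "id2"]))
    .sort("row_nr")
    .drop("row_nr")
    .collect()
)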

@ritchie46
Member

Then choose agg + explode. The assumptions are different; therefore, the aggregation can parallelize over all aggregation functions.

We cannot expect different queries that hit different code paths to have equal performance. Especially not if the constraints are different.

@paladin158
Author

Then choose agg + explode. The assumptions are different; therefore, the aggregation can parallelize over all aggregation functions.

We cannot expect different queries that hit different code paths to have equal performance. Especially not if the constraints are different.

Is it possible to optimize the case where multiple expressions use the same over window, so that all expressions share the same window and the computations can parallelize over all functions?
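
For illustration, one way to share a single window across all expressions manually today is to compute the per-group statistics once and join them back; a sketch (column names like v{i}_mean are hypothetical, and this is not benchmarked against the versions above):

## shared window via one groupby + join back (sketch)
stats = (
    df.lazy()
    .groupby(["id", "id2"])
    .agg(
        [pl.col(f"v{i}").mean().alias(f"v{i}_mean") for i in range(1, 11)]
        + [pl.col(f"v{i}").std().alias(f"v{i}_std") for i in range(1, 11)]
    )
)
res_joined = (
    df.lazy()
    .join(stats, on=["id", "id2"])
    .select(
        ["id", "id2"]
        + [
            ((pl.col(f"v{i}") - pl.col(f"v{i}_mean")) / pl.col(f"v{i}_std")).alias(f"v{i}")
            for i in range(1, 11)
        ]
    )
    .collect()
)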
