
Inefficiency of over expression #10063

Closed · paladin158 opened this issue Jul 24, 2023 · 5 comments

@paladin158

Research

  • I have searched the above Polars tags on Stack Overflow for similar questions.

  • I have asked my usage related question on Stack Overflow.

Link to question on Stack Overflow

https://stackoverflow.com/questions/76757987/polars-inefficiency-of-over-expression

Question about Polars

I found that, at least for the scenario below, doing over is much slower (2-3x) than doing groupby/agg + explode, even though the results are exactly the same.

Based on this finding, I have the following questions:

  • Is this behaviour expected? If so, should we always prefer the two-step procedure (groupby/agg + explode) instead of using over directly?
  • Or does this mean there may be some room to optimize over?
  • Or does the relative performance of the two approaches depend on the problem setup, so that users should benchmark both and pick whichever fits better?

import time

import numpy as np
import polars as pl
from polars.testing import assert_frame_equal

## setup
rng = np.random.default_rng(1)

nrows = 20_000_000
df = pl.DataFrame(
    dict(
        id=rng.integers(1, 50, nrows),
        id2=rng.integers(1, 500, nrows),
        v=rng.normal(0, 1, nrows),
        v1=rng.normal(0, 1, nrows),
        v2=rng.normal(0, 1, nrows),
        v3=rng.normal(0, 1, nrows),
        v4=rng.normal(0, 1, nrows),
        v5=rng.normal(0, 1, nrows),
        v6=rng.normal(0, 1, nrows),
        v7=rng.normal(0, 1, nrows),
        v8=rng.normal(0, 1, nrows),
        v9=rng.normal(0, 1, nrows),
        v10=rng.normal(0, 1, nrows),
    )
)

## over
start = time.perf_counter()
res = (
    df.lazy()
    .select(
        [
            "id",
            "id2",
            *[
                (pl.col(f"v{i}") - pl.col(f"v{i}").mean().over(["id", "id2"]))
                / pl.col(f"v{i}").std().over(["id", "id2"])
                for i in range(1, 11)
            ],
        ]
    )
    .collect()
)
time.perf_counter() - start
# 8.541702497983351

## groupby/agg + explode
start = time.perf_counter()
res2 = (
    df.lazy()
    .groupby(["id", "id2"])
    .agg(
        [
            (pl.col(f"v{i}") - pl.col(f"v{i}").mean()) / pl.col(f"v{i}").std()
            for i in range(1, 11)
        ],
    )
    .explode(pl.exclude(["id", "id2"]))
    .collect()
)
time.perf_counter() - start
# 3.1841439900454134

## compare results
assert_frame_equal(res.sort(["id", "id2"]), res2.sort(["id", "id2"])[res.columns])
@marius-mather

One suggestion for improving the over() performance: I think you can do it like this to combine it into a single over() expression:

(pl.col(f"v{i}") - pl.col(f"v{i}").mean().truediv(pl.col(f"v{i}").std()).over(["id", "id2"]))

In my quick testing this reduces the difference but doesn't eliminate it.
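
For context, a sketch of that suggestion applied to the original benchmark (my reading of the intent: the single .over() wraps the whole centered-and-scaled expression; the name res3 is just for illustration):

## over, one window expression per column (sketch)
res3 = (
    df.lazy()
    .select(
        [
            "id",
            "id2",
            *[
                (pl.col(f"v{i}") - pl.col(f"v{i}").mean())
                .truediv(pl.col(f"v{i}").std())
                .over(["id", "id2"])
                for i in range(1, 11)
            ],
        ]
    )
    .collect()
)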

@paladin158
Author

Even after removing the denominator entirely from both approaches, there is still about a 2x performance difference in my benchmarking.
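
To make that concrete, the stripped-down comparison looks roughly like this (a sketch: mean-centering only, with the std denominator dropped from both versions):

## over, numerator only (sketch)
res_over = (
    df.lazy()
    .select(
        [
            "id",
            "id2",
            *[pl.col(f"v{i}") - pl.col(f"v{i}").mean().over(["id", "id2"]) for i in range(1, 11)],
        ]
    )
    .collect()
)

## groupby/agg + explode, numerator only (sketch)
res_grouped = (
    df.lazy()
    .groupby(["id", "id2"])
    .agg([pl.col(f"v{i}") - pl.col(f"v{i}").mean() for i in range(1, 11)])
    .explode(pl.exclude(["id", "id2"]))
    .collect()
)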

@cmdlineluser
Contributor

There was a @cbilot answer on SO that discussed/benchmarked the overhead of window expressions, but I cannot seem to find it anymore.

While searching, I did find:

"Note that window functions are very powerful, but also relatively expensive."

https://stackoverflow.com/a/71554447/

But I'm not sure whether that statement is about a comparison against an equivalent .groupby() operation.

Perhaps another question to ask is: can Polars rewrite the .over() version into the .groupby().explode() version for you?

I'm not sure about the technical details, so maybe this is already happening internally.
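
One complication with such a rewrite, worth noting: .over() preserves the input row order, while .groupby().agg().explode() does not; hence the .sort() before assert_frame_equal above. A sketch of an order-restoring rewrite, assuming an explicit row index is acceptable:

## groupby/agg + explode that restores the original row order (sketch)
res_ordered = (
    df.lazy()
    .with_row_count("row_nr")  # helper column to remember the input order
    .groupby(["id", "id2"])
    .agg(
        [
            pl.col("row_nr"),
            *[(pl.col(f"v{i}") - pl.col(f"v{i}").mean()) / pl.col(f"v{i}").std() for i in range(1, 11)],
        ]
    )
    .explode(pl.exclude(["id", "id2"]))
    .sort("row_nr")
    .drop("row_nr")
    .collect()
)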

@ritchie46
Member

Then choose agg + explode. The assumptions are different; therefore, the aggregation can parallelize over all aggregation functions.

We cannot expect different queries that hit different code paths to have equal performance. Especially not if the constraints are different.

@paladin158
Author

Then choose agg + explode. The assumptions are different; therefore, the aggregation can parallelize over all aggregation functions.

We cannot expect different queries that hit different code paths to have equal performance. Especially not if the constraints are different.

Is it possible to optimize the case where multiple expressions use the same over window, so that all expressions share the same window and the computations can parallelize over all functions?
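
For illustration, one way to share a single window across all expressions manually today is to compute the per-group statistics once and join them back; a sketch (column names like v{i}_mean are hypothetical, and this is not benchmarked against the versions above):

## shared window via one groupby + join back (sketch)
stats = (
    df.lazy()
    .groupby(["id", "id2"])
    .agg(
        [pl.col(f"v{i}").mean().alias(f"v{i}_mean") for i in range(1, 11)]
        + [pl.col(f"v{i}").std().alias(f"v{i}_std") for i in range(1, 11)]
    )
)
res_joined = (
    df.lazy()
    .join(stats, on=["id", "id2"])
    .select(
        ["id", "id2"]
        + [
            ((pl.col(f"v{i}") - pl.col(f"v{i}_mean")) / pl.col(f"v{i}_std")).alias(f"v{i}")
            for i in range(1, 11)
        ]
    )
    .collect()
)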
