
cumulative calculation based on previous row's result #7658

Closed
cmdlineluser opened this issue Mar 20, 2023 · 8 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@cmdlineluser
Contributor

Problem description

Taken from this Stack Overflow question.

I'm not sure if there is an actual name for this:

"Where a row-based calculation depends on the result from the previous row."

import polars as pl

start = pl.lit("2023-01-01").str.strptime(pl.Date)

df = pl.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [6, 7, 8, 9, 10],
    "C": [5, 4, 3, 2, 1]
}).with_columns(
    date = pl.date_range(start, start + pl.duration(days=4)),
    open = 0,
    close = 0
)
shape: (5, 6)
┌─────┬─────┬─────┬────────────┬──────┬───────┐
│ A   ┆ B   ┆ C   ┆ date       ┆ open ┆ close │
│ --- ┆ --- ┆ --- ┆ ---        ┆ ---  ┆ ---   │
│ i64 ┆ i64 ┆ i64 ┆ date       ┆ i32  ┆ i32   │
╞═════╪═════╪═════╪════════════╪══════╪═══════╡
│ 1   ┆ 6   ┆ 5   ┆ 2023-01-01 ┆ 0    ┆ 0     │
│ 2   ┆ 7   ┆ 4   ┆ 2023-01-02 ┆ 0    ┆ 0     │
│ 3   ┆ 8   ┆ 3   ┆ 2023-01-03 ┆ 0    ┆ 0     │
│ 4   ┆ 9   ┆ 2   ┆ 2023-01-04 ┆ 0    ┆ 0     │
│ 5   ┆ 10  ┆ 1   ┆ 2023-01-05 ┆ 0    ┆ 0     │
└─────┴─────┴─────┴────────────┴──────┴───────┘

Basically, it's when you need to resort to a for loop in Python:

initial_value = 10
running_total = initial_value

df_new = df.clone()

for idx, row in enumerate(df.iter_rows(named=True)):
    if idx > 0:
        # carry the previous row's result forward as this row's "open"
        row["open"] = running_total
        running_total *= row["A"] + row["B"]
        running_total -= row["C"]

    row["close"] = running_total

    df_new[idx, "open"] = row["open"]
    df_new[idx, "close"] = row["close"]
┌─────┬─────┬─────┬────────────┬───────┬────────┐
│ A   ┆ B   ┆ C   ┆ date       ┆ open  ┆ close  │
│ --- ┆ --- ┆ --- ┆ ---        ┆ ---   ┆ ---    │
│ i64 ┆ i64 ┆ i64 ┆ date       ┆ i32   ┆ i32    │
╞═════╪═════╪═════╪════════════╪═══════╪════════╡
│ 1   ┆ 6   ┆ 5   ┆ 2023-01-01 ┆ 0     ┆ 10     │
│ 2   ┆ 7   ┆ 4   ┆ 2023-01-02 ┆ 10    ┆ 86     │
│ 3   ┆ 8   ┆ 3   ┆ 2023-01-03 ┆ 86    ┆ 943    │
│ 4   ┆ 9   ┆ 2   ┆ 2023-01-04 ┆ 943   ┆ 12257  │
│ 5   ┆ 10  ┆ 1   ┆ 2023-01-05 ┆ 12257 ┆ 183854 │
└─────┴─────┴─────┴────────────┴───────┴────────┘

It sort of looks like a .cumfold operation - but then the result is also brought forward to the next row.

Would it be possible to encode this behaviour in polars to essentially have the for loop executed in Rust?

Hypothetical syntax:

pl.running_total(
    initial_value = pl.lit(10),
    old = "open",
    new = "close",
    exprs =
        (pl.element() * (pl.col("A") + pl.col("B"))) - pl.col("C")
)

Where you could choose target columns for the pre-calculation and post-calculation results, and then define a "formula" for the calculation itself.

pl.element() is used here as a placeholder for the "running total" value.
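
As an aside: this particular example happens to be a first-order linear recurrence (the running total is only ever multiplied and shifted), so it can already be vectorized in closed form with a cumulative product and sum. A NumPy sketch, only for this special case (the general request is about recurrences with no such closed form, and this is float-based, so it is exact only while the products stay within float64 precision):

import numpy as np

# t[i] = t[i-1] * (A[i] + B[i]) - C[i], with t[0] = 10.
# With p[i] = cumprod(A + B), the solution is t[i] = p[i] * (t0 - cumsum(C / p)[i]).
a = (df["A"] + df["B"]).cast(pl.Float64).to_numpy().copy()
c = df["C"].cast(pl.Float64).to_numpy().copy()
a[0], c[0] = 1.0, 0.0              # row 0 only carries the initial value
p = np.cumprod(a)
close = p * (10 - np.cumsum(c / p))
df.with_columns(
    close=pl.Series(close),
    open=pl.Series(close).shift(1).fill_null(0.0),
)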

Thanks.

@cmdlineluser cmdlineluser added the enhancement New feature or an improvement of an existing feature label Mar 20, 2023
@ritchie46
Member

This would be much slower than a Python loop, as it must run every single element through our expression interpreter.

The interpreter is optimized for columns, not single elements.

@cmdlineluser
Contributor Author

cmdlineluser commented Mar 21, 2023

@ritchie46 Thanks for the feedback.

Could it really be that much slower than looping in Python?

As an experiment, I attempted to modify .cumulative_eval to have it process each row instead. (I'm probably breaking all sorts of coding standards here, as I don't really know what I'm doing in Rust - apologies.)

https://gist.github.com/cmdlineluser/f9d657bdf04ba64e2ab9a63cdefa2dc4

With the df from earlier enlarged to 25000 rows:

df = pl.concat([df] * 5000)

The Rust version takes 1.21s:

Wall time: 1.21 s

The Python for loop takes 7.92s:

Wall time: 7.92 s

The gap continues to widen as the number of rows increases.

I used values in the struct as placeholders:

(df.with_columns(_initial_value = 1.0, _running_total = 0.0)
   .with_columns(total = pl.struct(pl.all())
      .cumulative_eval(
         (pl.element().struct.field("_running_total")
          * (pl.element().struct.field("A") + pl.element().struct.field("B")))
         - pl.element().struct.field("C"))))

I tried to return a struct of {"before", "after"} values from the function but .into_static() complained - so it just returns the "after" value.
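
For completeness: another common escape hatch for this class of problem, assuming numba is available, is to compile the loop itself rather than push single elements through the expression engine. A sketch:

import numba
import numpy as np

@numba.njit
def running(a, b, c, init):
    # the same loop as the Python version, compiled to native code
    out = np.empty(len(a))
    total = init
    out[0] = total
    for i in range(1, len(a)):
        total = total * (a[i] + b[i]) - c[i]
        out[i] = total
    return out

close = running(df["A"].to_numpy(), df["B"].to_numpy(), df["C"].to_numpy(), 10.0)
df.with_columns(close=pl.Series(close), open=pl.Series(close).shift(1).fill_null(0.0))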

@cmdlineluser
Contributor Author

Experimenting with this further, it seems like creating a single row of structs is a decently fast operation:

>>> df.slice(1).select(cols = pl.struct(pl.col("A", "B", "C")).list().arr.to_struct()).unnest("cols")
shape: (1, 4)
┌───────────┬───────────┬───────────┬───────────┐
│ field_0   ┆ field_1   ┆ field_2   ┆ field_3   │
│ ---       ┆ ---       ┆ ---       ┆ ---       │
│ struct[3] ┆ struct[3] ┆ struct[3] ┆ struct[3] │
╞═══════════╪═══════════╪═══════════╪═══════════╡
│ {2,7,4}   ┆ {3,8,3}   ┆ {4,9,2}   ┆ {5,10,1}  │
└───────────┴───────────┴───────────┴───────────┘

We can then use pl.cumfold:

initial_value = 10

df.with_columns(
   df.slice(1)
     .select(cols = pl.struct(pl.col("A", "B", "C")).list().arr.to_struct())
     .unnest("cols")
     .select(tally = 
        pl.cumfold(
           acc = initial_value, 
           exprs = pl.all(),
           function = lambda total, col: 
              (total * (col.struct["A"] + col.struct["B"])) - col.struct["C"]))
     .unnest("tally")
     .select(tally = 
        pl.concat_list(pl.lit(initial_value), pl.all())
          .reshape((-1, 1))
          .flatten())
)
shape: (5, 6)
┌─────┬─────┬─────┬──────┬───────┬────────┐
│ A   ┆ B   ┆ C   ┆ open ┆ close ┆ tally  │
│ --- ┆ --- ┆ --- ┆ ---  ┆ ---   ┆ ---    │
│ i64 ┆ i64 ┆ i64 ┆ i32  ┆ i32   ┆ i64    │
╞═════╪═════╪═════╪══════╪═══════╪════════╡
│ 1   ┆ 6   ┆ 5   ┆ 0    ┆ 0     ┆ 10     │
│ 2   ┆ 7   ┆ 4   ┆ 0    ┆ 0     ┆ 86     │
│ 3   ┆ 8   ┆ 3   ┆ 0    ┆ 0     ┆ 943    │
│ 4   ┆ 9   ┆ 2   ┆ 0    ┆ 0     ┆ 12257  │
│ 5   ┆ 10  ┆ 1   ┆ 0    ┆ 0     ┆ 183854 │
└─────┴─────┴─────┴──────┴───────┴────────┘

It runs in 3.7s with the 25000-row example, which is ~50% faster than the Python loop version.

Maybe it can be improved?
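
(A note for anyone reading this on a newer Polars version: .list() was renamed to .implode(), the .arr namespace to .list, and pl.cumfold to pl.cum_fold, so the same idea would look roughly like this sketch, not tested against any particular release:)

initial_value = 10

(df.slice(1)
   .select(cols=pl.struct("A", "B", "C").implode().list.to_struct())
   .unnest("cols")
   .select(tally=pl.cum_fold(
       acc=initial_value,
       function=lambda total, col:
           total * (col.struct.field("A") + col.struct.field("B")) - col.struct.field("C"),
       exprs=pl.all(),
   ))
)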

@buckleyc

I am pleased that you raised this issue here, @cmdlineluser. I am the person who posted the original question on Stack Overflow.
I really like the potential of Polars. I also realize that the common user base for Polars and Pandas has been focused on mining previously generated data. However, the code I have written tries to simulate a future timeline: given a starting balance at a given date, I want to extrapolate gains and losses every period (e.g., monthly) based on the opening balance of that period. This is easy enough when one is only adding or subtracting other unrelated numbers, but quickly slows down when the numbers depend on the opening balance. Examples include the return on an investment gain (which obviously depends on how much money one starts with), and the amount of taxes paid on any withdrawals (either during the current period or within a previous year).
I realize that any row-wise calculation would slow Polars down, but having a native function for this would seem to provide a more optimized (and faster) solution than resorting to a Python loop.
On a side note, I have spent a few days looking into Rust coding, but I quickly realized that getting up to speed enough to code this in Rust would be a notable time commitment. So you have my vote if someone wants to take on this enhancement within Polars. Many thanks.

@cmdlineluser
Contributor Author

cmdlineluser commented Apr 14, 2023

@buckleyc Hello there!

Did you happen to try the .struct / .cumfold approach to see how it performed for your particular use case?

Edit: It appears that I wasn't testing with large enough data. The .arr.to_struct() call starts to take a long time and makes it unusable (e.g., with 1 million rows).

@buckleyc

@cmdlineluser, I finally got around to testing the usability of this method, and it is daunting. The issue quickly becomes the complexity/difficulty of trying to fit more than a simple financial calculation inside the .cum_fold(). For this financial simulation, I can pre-calculate various columns across the life of a sample timeline (e.g., cumulative inflation, cumulative cost-of-living adjustments, potential period expenses). Polars is great for this.
The running tally requires taking an opening balance, calculating any investment gain, adding any additional incomes, subtracting any period expenses, subtracting any taxes paid (based on incomes and capital gains for the previous period or year), and then carrying this tally over to the next period/row.
That seems like a lot to do within the interior of this cum_fold(), especially since the simulation would need to call various functions along the way (e.g., income_tax_liability()). I am not sure how to cobble the various function calls into the interior of that cum_fold.

    return df.with_columns(
       df.slice(1)
         .select(cols = pl.struct(pl.col("earning_period", "ssa_benefit", "expense")).list().arr.to_struct())
         .unnest("cols")
         .select(tally = 
            pl.cumfold(
               acc = initial_value, 
               exprs = pl.all(),
               function = lambda total, col: 
                  (
                      (total * (1 + col.struct["earning_period"])) + col.struct["ssa_benefit"] - col.struct["expense"]
                  ).round(2)
            ))
         .unnest("tally")
         .select(tally = 
            pl.concat_list(pl.lit(initial_value), pl.all())
              .reshape((-1, 1))
              .flatten())
    )
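
One way to at least keep the interior readable is to factor the per-period logic into a named step function and pass that to the fold instead of an inline lambda. A sketch (income_tax_liability is the hypothetical helper mentioned above; its signature here is made up for illustration):

import polars as pl

def period_step(total: pl.Series, col: pl.Series) -> pl.Series:
    # one simulated period: grow the balance, add income, subtract expenses
    balance = total * (1 + col.struct.field("earning_period"))
    balance = balance + col.struct.field("ssa_benefit") - col.struct.field("expense")
    # balance = balance - income_tax_liability(balance, col)  # hypothetical helper
    return balance.round(2)

# ... then, in the pipeline above:
#     pl.cumfold(acc=initial_value, exprs=pl.all(), function=period_step)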

Any ideas how best to approach this complexity within Polars, or is this level of row complexity best left to a different tool?

@cmdlineluser
Contributor Author

@buckleyc Yeah, apologies.

It was perhaps improper of me to make that suggestion, as it's not an actual solution to the problem, and could probably be considered a misuse of polars syntax.

"preoccupied with whether they could, they didn't stop to think if they should" - as they say.

I thought a generalized "accumulate" type function (similar to .cumulative_eval) would be possible.

@deoqc

deoqc commented Aug 9, 2023

Hello @buckleyc @cmdlineluser

I have been digging around for something like this too. Unfortunately, @ritchie46 has already weighed in with discouraging thoughts.

But I will leave my use case and comments here in case this changes.

Feature requested:

  • fold/reduce vertically, across rows. Something like foldv below:
df.with_columns(
    pl.col("a").foldv(lambda acc, x: acc + x, 100, acc_initial=0).alias("cumsum"),
    pl.col("a").foldv(lambda acc, x: acc*x, 100, acc_initial=1).alias("cumprod"),
)
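
For what it's worth, a foldv like this can be emulated today with itertools.accumulate, though it still runs at Python speed, which is exactly what this request is trying to avoid. A sketch:

import operator
from itertools import accumulate
import polars as pl

def foldv(s: pl.Series, fn, acc_initial) -> pl.Series:
    # drop the leading acc_initial so the output lines up with the input rows
    return pl.Series(s.name, list(accumulate(s, fn, initial=acc_initial))[1:])

df = pl.DataFrame({"a": [1, 2, 3, 4]})
df.with_columns(
    foldv(df["a"], operator.add, acc_initial=0).alias("cumsum"),
    foldv(df["a"], operator.mul, acc_initial=1).alias("cumprod"),
)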

My use case:

  • Also financial industry; I need to run a few dozen complicated functions in a loop over a few data sets of ~1 million rows.
  • I already have most of the regular data pipeline done in polars, but not being able to carry the result from one row to the next leaves the most important part of the pipe lacking. The expressiveness of the polars DSL would handle everything in pure polars, except for this part.

Originally posted here https://discord.com/channels/908022250106667068/911186243465904178/1138871006065344522
