
cumulative calculation based on previous row's result #7658

Closed
cmdlineluser opened this issue Mar 20, 2023 · 8 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@cmdlineluser
Contributor

Problem description

Taken from this Stack Overflow question.

I'm not sure if there is an actual name for this:

"Where a row-based calculation depends on the result from the previous row."

import polars as pl

start = pl.lit("2023-01-01").str.strptime(pl.Date)

df = pl.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [6, 7, 8, 9, 10],
    "C": [5, 4, 3, 2, 1]
}).with_columns(
    date = pl.date_range(start, start + pl.duration(days=4)),
    open = 0,
    close = 0
)
shape: (5, 6)
┌─────┬─────┬─────┬────────────┬──────┬───────┐
│ A   ┆ B   ┆ C   ┆ date       ┆ open ┆ close │
│ --- ┆ --- ┆ --- ┆ ---        ┆ ---  ┆ ---   │
│ i64 ┆ i64 ┆ i64 ┆ date       ┆ i32  ┆ i32   │
╞═════╪═════╪═════╪════════════╪══════╪═══════╡
│ 1   ┆ 6   ┆ 5   ┆ 2023-01-01 ┆ 0    ┆ 0     │
│ 2   ┆ 7   ┆ 4   ┆ 2023-01-02 ┆ 0    ┆ 0     │
│ 3   ┆ 8   ┆ 3   ┆ 2023-01-03 ┆ 0    ┆ 0     │
│ 4   ┆ 9   ┆ 2   ┆ 2023-01-04 ┆ 0    ┆ 0     │
│ 5   ┆ 10  ┆ 1   ┆ 2023-01-05 ┆ 0    ┆ 0     │
└─────┴─────┴─────┴────────────┴──────┴───────┘

Basically, it's when you need to resort to a for loop in Python:

initial_value = 10
running_total = initial_value

df_new = df.clone()

for idx, row in enumerate(df.iter_rows(named=True)):
    if idx > 0:
        # carry the previous row's result forward as this row's "open"
        row["open"] = running_total
        running_total *= row["A"] + row["B"]
        running_total -= row["C"]

    row["close"] = running_total

    df_new[idx, "open"] = row["open"]
    df_new[idx, "close"] = row["close"]
┌─────┬─────┬─────┬────────────┬───────┬────────┐
│ A   ┆ B   ┆ C   ┆ date       ┆ open  ┆ close  │
│ --- ┆ --- ┆ --- ┆ ---        ┆ ---   ┆ ---    │
│ i64 ┆ i64 ┆ i64 ┆ date       ┆ i32   ┆ i32    │
╞═════╪═════╪═════╪════════════╪═══════╪════════╡
│ 1   ┆ 6   ┆ 5   ┆ 2023-01-01 ┆ 0     ┆ 10     │
│ 2   ┆ 7   ┆ 4   ┆ 2023-01-02 ┆ 10    ┆ 86     │
│ 3   ┆ 8   ┆ 3   ┆ 2023-01-03 ┆ 86    ┆ 943    │
│ 4   ┆ 9   ┆ 2   ┆ 2023-01-04 ┆ 943   ┆ 12257  │
│ 5   ┆ 10  ┆ 1   ┆ 2023-01-05 ┆ 12257 ┆ 183854 │
└─────┴─────┴─────┴────────────┴───────┴────────┘

It sort of looks like a .cumfold operation - but then the result is also brought forward to the next row.

Would it be possible to encode this behaviour in polars to essentially have the for loop executed in Rust?

Hypothetical syntax:

pl.running_total(
    initial_value = pl.lit(10),
    old = "open",
    new = "close",
    exprs =
        (pl.element() * (pl.col("A") + pl.col("B"))) - pl.col("C")
)

Where you could choose target columns for the pre-calculation and post-calculation results, and then define a "formula" for the calculation itself.

pl.element() is used here as a placeholder for the "running total" value.
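
As an aside: this particular example happens to be a first-order linear recurrence (the running total is only ever multiplied and shifted), so it can already be vectorized in closed form with a cumulative product and sum. A NumPy sketch, only for this special case (the general request is about recurrences with no such closed form, and this is float-based, so it is exact only while the products stay within float64 precision):

import numpy as np

# t[i] = t[i-1] * (A[i] + B[i]) - C[i], with t[0] = 10.
# With p[i] = cumprod(A + B), the solution is t[i] = p[i] * (t0 - cumsum(C / p)[i]).
a = (df["A"] + df["B"]).cast(pl.Float64).to_numpy().copy()
c = df["C"].cast(pl.Float64).to_numpy().copy()
a[0], c[0] = 1.0, 0.0              # row 0 only carries the initial value
p = np.cumprod(a)
close = p * (10 - np.cumsum(c / p))
df.with_columns(
    close=pl.Series(close),
    open=pl.Series(close).shift(1).fill_null(0.0),
)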

Thanks.

@cmdlineluser cmdlineluser added the enhancement New feature or an improvement of an existing feature label Mar 20, 2023
@ritchie46
Member

This would be much slower than a Python loop, as it must run every single element through our expression interpreter.

The interpreter is optimized for columns, not single elements.

@cmdlineluser
Contributor Author

cmdlineluser commented Mar 21, 2023

@ritchie46 Thanks for the feedback.

Could it really be that much slower than looping in Python?

As an experiment, I attempted to modify .cumulative_eval to have it process each row instead. (I'm probably breaking all sorts of coding standards here, as I don't really know what I'm doing in Rust - apologies.)

https://gist.github.com/cmdlineluser/f9d657bdf04ba64e2ab9a63cdefa2dc4

With the df from earlier enlarged to 25000 rows:

df = pl.concat([df] * 5000)

The Rust version takes 1.21s:

Wall time: 1.21 s

The Python for loop takes 7.92s:

Wall time: 7.92 s

The gap continues to widen as the number of rows increases.

I used values in the struct as placeholders:

(df.with_columns(_initial_value = 1.0, _running_total = 0.0)
   .with_columns(total = pl.struct(pl.all())
      .cumulative_eval(
         (pl.element().struct.field("_running_total")
          * (pl.element().struct.field("A") + pl.element().struct.field("B")))
         - pl.element().struct.field("C"))))

I tried to return a struct of {"before", "after"} values from the function but .into_static() complained - so it just returns the "after" value.
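
For completeness: another common escape hatch for this class of problem, assuming numba is available, is to compile the loop itself rather than push single elements through the expression engine. A sketch:

import numba
import numpy as np

@numba.njit
def running(a, b, c, init):
    # the same loop as the Python version, compiled to native code
    out = np.empty(len(a))
    total = init
    out[0] = total
    for i in range(1, len(a)):
        total = total * (a[i] + b[i]) - c[i]
        out[i] = total
    return out

close = running(df["A"].to_numpy(), df["B"].to_numpy(), df["C"].to_numpy(), 10.0)
df.with_columns(close=pl.Series(close), open=pl.Series(close).shift(1).fill_null(0.0))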

@cmdlineluser
Contributor Author

Experimenting with this further, it seems like creating a single row of structs is a decently fast operation:

>>> df.slice(1).select(cols = pl.struct(pl.col("A", "B", "C")).list().arr.to_struct()).unnest("cols")
shape: (1, 4)
┌───────────┬───────────┬───────────┬───────────┐
│ field_0   ┆ field_1   ┆ field_2   ┆ field_3   │
│ ---       ┆ ---       ┆ ---       ┆ ---       │
│ struct[3] ┆ struct[3] ┆ struct[3] ┆ struct[3] │
╞═══════════╪═══════════╪═══════════╪═══════════╡
│ {2,7,4}   ┆ {3,8,3}   ┆ {4,9,2}   ┆ {5,10,1}  │
└───────────┴───────────┴───────────┴───────────┘

We can then use pl.cumfold:

initial_value = 10

df.with_columns(
   df.slice(1)
     .select(cols = pl.struct(pl.col("A", "B", "C")).list().arr.to_struct())
     .unnest("cols")
     .select(tally = 
        pl.cumfold(
           acc = initial_value, 
           exprs = pl.all(),
           function = lambda total, col: 
              (total * (col.struct["A"] + col.struct["B"])) - col.struct["C"]))
     .unnest("tally")
     .select(tally = 
        pl.concat_list(pl.lit(initial_value), pl.all())
          .reshape((-1, 1))
          .flatten())
)
shape: (5, 6)
┌─────┬─────┬─────┬──────┬───────┬────────┐
│ A   ┆ B   ┆ C   ┆ open ┆ close ┆ tally  │
│ --- ┆ --- ┆ --- ┆ ---  ┆ ---   ┆ ---    │
│ i64 ┆ i64 ┆ i64 ┆ i32  ┆ i32   ┆ i64    │
╞═════╪═════╪═════╪══════╪═══════╪════════╡
│ 1   ┆ 6   ┆ 5   ┆ 0    ┆ 0     ┆ 10     │
│ 2   ┆ 7   ┆ 4   ┆ 0    ┆ 0     ┆ 86     │
│ 3   ┆ 8   ┆ 3   ┆ 0    ┆ 0     ┆ 943    │
│ 4   ┆ 9   ┆ 2   ┆ 0    ┆ 0     ┆ 12257  │
│ 5   ┆ 10  ┆ 1   ┆ 0    ┆ 0     ┆ 183854 │
└─────┴─────┴─────┴──────┴───────┴────────┘

It runs in 3.7s with the 25000-row example, which is ~50% faster than the Python loop version.

Maybe it can be improved?
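
(A note for anyone reading this on a newer Polars version: .list() was renamed to .implode(), the .arr namespace to .list, and pl.cumfold to pl.cum_fold, so the same idea would look roughly like this sketch, not tested against any particular release:)

initial_value = 10

(df.slice(1)
   .select(cols=pl.struct("A", "B", "C").implode().list.to_struct())
   .unnest("cols")
   .select(tally=pl.cum_fold(
       acc=initial_value,
       function=lambda total, col:
           total * (col.struct.field("A") + col.struct.field("B")) - col.struct.field("C"),
       exprs=pl.all(),
   ))
)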

@buckleyc

I am pleased that you raised this issue here, @cmdlineluser. I am the person who posted the original question on Stack Overflow.
I really like the potential of Polars. I also realize that the common user base for Polars and Pandas has been focused on mining previously generated data. However, the code I have written tries to simulate a future timeline: given a starting balance at a given date, I want to extrapolate gains and losses every period (e.g., monthly) based on the opening balance of that period. This is easy enough when one is only adding or subtracting other unrelated numbers, but quickly slows down when the numbers depend on the opening balance. Examples include the return on an investment gain (which obviously depends on how much money one starts with), and the amount of taxes paid on any withdrawals (either during the current period or within a previous year).
I realize that any row-wise calculation would slow Polars down, but having a native function for this would seem to provide a more optimized (and faster) solution than resorting to a Python loop.
On a side note, I have spent a few days looking into Rust coding, but I quickly realized that getting up to speed enough to code this in Rust would be a notable time commitment. So you have my vote if someone wants to take on this enhancement within Polars. Many thanks.

@cmdlineluser
Contributor Author

cmdlineluser commented Apr 14, 2023

@buckleyc Hello there!

Did you happen to try the .struct / .cumfold approach to see how it performed for your particular use case?

Edit: It appears that I wasn't testing with large enough data. The .arr.to_struct() call starts to take a long time and makes it unusable (e.g., with 1 million rows).

@buckleyc

@cmdlineluser, I finally got around to testing the usability of this method, and it is daunting. The issue quickly becomes the complexity/difficulty of trying to fit more than a simple financial calculation inside the .cum_fold(). For this financial simulation, I can pre-calculate various columns across the life of a sample timeline (e.g., cumulative inflation, cumulative cost-of-living adjustments, potential period expenses). Polars is great for this.
The running tally requires taking an opening balance, calculating any investment gain, adding any additional incomes, subtracting any period expenses, subtracting any taxes paid (based on incomes and capital gains for the previous period or year), and then carrying this tally over to the next period/row.
That seems like a lot to do within the interior of this cum_fold(), especially since the simulation would need to call various functions along the way (e.g., income_tax_liability()). I am not sure how to cobble the various function calls into the interior of that cum_fold.

    return df.with_columns(
       df.slice(1)
         .select(cols = pl.struct(pl.col("earning_period", "ssa_benefit", "expense")).list().arr.to_struct())
         .unnest("cols")
         .select(tally = 
            pl.cumfold(
               acc = initial_value, 
               exprs = pl.all(),
               function = lambda total, col: 
                  (
                      (total * (1 + col.struct["earning_period"])) + col.struct["ssa_benefit"] - col.struct["expense"]
                  ).round(2)
            ))
         .unnest("tally")
         .select(tally = 
            pl.concat_list(pl.lit(initial_value), pl.all())
              .reshape((-1, 1))
              .flatten())
    )
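
One way to at least keep the interior readable is to factor the per-period logic into a named step function and pass that to the fold instead of an inline lambda. A sketch (income_tax_liability is the hypothetical helper mentioned above; its signature here is made up for illustration):

import polars as pl

def period_step(total: pl.Series, col: pl.Series) -> pl.Series:
    # one simulated period: grow the balance, add income, subtract expenses
    balance = total * (1 + col.struct.field("earning_period"))
    balance = balance + col.struct.field("ssa_benefit") - col.struct.field("expense")
    # balance = balance - income_tax_liability(balance, col)  # hypothetical helper
    return balance.round(2)

# ... then, in the pipeline above:
#     pl.cumfold(acc=initial_value, exprs=pl.all(), function=period_step)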

Any ideas how best to approach this complexity within Polars, or is this level of row complexity best left to a different tool?

@cmdlineluser
Contributor Author

@buckleyc Yeah, apologies.

It was perhaps improper of me to make that suggestion, as it's not an actual solution to the problem, and could probably be considered a misuse of polars syntax.

"preoccupied with whether they could, they didn't stop to think if they should" - as they say.

I thought a generalized "accumulate" type function (similar to .cumulative_eval) would be possible.

@deoqc

deoqc commented Aug 9, 2023

Hello @buckleyc @cmdlineluser

I have been digging around for something like this too. Unfortunately, @ritchie46 has already weighed in with discouraging thoughts.

But I will leave my use case and comments here in case this changes.

Feature requested:

  • fold/reduce vertically, across rows. Something like foldv below:
df.with_columns(
    pl.col("a").foldv(lambda acc, x: acc + x, 100, acc_initial=0).alias("cumsum"),
    pl.col("a").foldv(lambda acc, x: acc*x, 100, acc_initial=1).alias("cumprod"),
)
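
For what it's worth, a foldv like this can be emulated today with itertools.accumulate, though it still runs at Python speed, which is exactly what this request is trying to avoid. A sketch:

import operator
from itertools import accumulate
import polars as pl

def foldv(s: pl.Series, fn, acc_initial) -> pl.Series:
    # drop the leading acc_initial so the output lines up with the input rows
    return pl.Series(s.name, list(accumulate(s, fn, initial=acc_initial))[1:])

df = pl.DataFrame({"a": [1, 2, 3, 4]})
df.with_columns(
    foldv(df["a"], operator.add, acc_initial=0).alias("cumsum"),
    foldv(df["a"], operator.mul, acc_initial=1).alias("cumprod"),
)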

My use case:

  • Also financial industry; I need to run a few dozen complicated functions in a loop over a few data sets of ~1 million rows.
  • I already have most of the regular data pipeline done in polars, but not being able to carry the result from one row to the next leaves the most important part of the pipe lacking. The expressiveness of the polars DSL would handle everything in pure polars, except for this part.

Originally posted here https://discord.com/channels/908022250106667068/911186243465904178/1138871006065344522
