Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional temporal arithmetic #696

Merged
merged 28 commits into from
Aug 29, 2023

Conversation

billylanchantin
Copy link
Contributor

Description

This PR attempts to round off work started in #683 by adding support for the following operations:

  • date - date
  • date + duration
  • date - duration
  • duration + date
  • duration * integer
  • duration * float
  • duration / integer
  • duration / float
  • integer * duration
  • float * duration

Example

alias Explorer.DataFrame, as: DF
require DF

df = DF.new(aug_20: [~D[2023-08-20]], aug_21: [~D[2023-08-21]])
#=> #Explorer.DataFrame<
#=>   Polars[1 x 2]
#=>   aug_20 date [2023-08-20]
#=>   aug_21 date [2023-08-21]
#=> >

DF.mutate(df, diff: aug_20 - aug_21)
#=> #Explorer.DataFrame<
#=>   Polars[1 x 3]
#=>   aug_20 date [2023-08-20]
#=>   aug_21 date [2023-08-21]
#=>   diff duration[ms] [-1d]
#=> >

Notes

I handled both of those items as best I could, but I don't feel too strongly about my choices. Happy to run with any feedback :)

Copy link
Contributor Author

@billylanchantin billylanchantin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some specific call outs:

lib/explorer/series.ex Show resolved Hide resolved
Comment on lines 2597 to 2603
defp maybe_swap_args_add([left, right]) do
case {dtype(left), dtype(right)} do
# `duration + date` is not supported by polars for some reason.
{{:duration, _}, :date} -> [right, left]
_ -> [left, right]
end
end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surprisingly, polars doesn't support duration + date even though it does support duration + datetime and date + duration. I'm thinking this is an oversight? Regardless, I chose to swap the arguments rather than disallow because add/2 should definitely be commutative.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any backend specific logic should be in the backend, not here. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed by:

Comment on lines 2713 to 2717
# Polars currently returns inconsistent dtypes, e.g.:
# * `integer * duration -> duration` when `integer` is a scalar
# * `integer * duration -> integer` when `integer` is a series
# We need to return duration in these cases, so we need an additional cast.
|> cast(dtype)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still found the need to post-cast in these specific scenarios (same with divide). The root cause is the inconsistency I noted here: #683 (comment).

This means that there is currently a discrepancy between what the lazy operations return and this one. Example (continuing from above):

Explorer.Series.multiply(2, df["diff"])
#=> #Explorer.Series<
#=>   Polars[1]
#=>   duration[ms] [-2d]
#=> >

DF.mutate(df, mul: 2 * diff)        
#=> #Explorer.DataFrame<
#=>   Polars[1 x 4]
#=>   aug_20 date [2023-08-20]
#=>   aug_21 date [2023-08-21]
#=>   diff duration[ms] [-1d]
#=>   mul integer [-172800000]
#=> >

Thoughts about how to handle?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd need to handle this in the backend. We also need to make sure we will handle this for polars lazy queries.

Quick question, does it help if we swap the arguments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it help if we swap the arguments?

No, I checked :(

From #683:

let diff = Series::new("diff", &dt2 - &dt1);
let ten = Series::new("diff", [10]);
println!("{:?}", Series::new("prod3", &ten * &diff));    // [864000000]
println!("{:?}", Series::new("prod4", &diff * &ten));    // [864000000]

We'd need to handle this in the backend. We also need to make sure we will handle this for polars lazy queries.

👍

Comment on lines 220 to 225
@doc """
The return dtype for the basic arithmetic operations: add, subtract, multiply, and divide.

This function assumes that the inputs have already had the highest precision enforced.
"""
def dtype_for_basic_arithmetic(operation, left_dtype, right_dtype) do
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function seemed maybe out of place?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. My suggestion is to move the cast_to_* functions here and keep this one where it is used. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed by: 093e9ce

Comment on lines +546 to +552
test "fractional parts of floats work (roughly) as expected" do
# This test is not exhaustive. Rather, its purpose is to give us a reasonable confidence that
# multiplying durations by floats is fairly accurate.
#
# The exact answers we see here are subject to implementation details outside our control.
# If we find that this test breaks unexpectedly (e.g. from a dependency update), then we may
# wish to remove it.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I included this test mostly because I wanted to see what we could expect from float * duration. It is quite brittle and I'm happy to remove it.


def subtract(left, right), do: basic_numeric_operation(:subtract, left, right)
@doc false
def cast_to_subtract(:integer, :integer), do: :integer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid @doc false functions because if the user does import Explorer.Series, they will be imported into their namespace. One option is to underscore them, but my suggestion is to move them to Explorer.Shared.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed by: 093e9ce

Comment on lines 696 to 703
# Arg-swapping does NOT work.
# [src/dataframe.rs:507] &mutations = [
# [(col("sub")) + ([(col("aug_20")) + (col("sub"))])].alias("add2"),
# ]
df2 = DF.mutate(df, add2: sub + (aug_20 + sub))
assert df2["add2"].dtype == :date
assert Series.to_list(df2["add2"]) == [~D[2023-08-22]]
end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey so I'm a little stuck at this point.

My workaround for Explorer.Series.add(duration, date) was to swap the arguments. I tried that same thing with expressions, but now I don't think that will work. There's no way to "see through" the parentheses in expressions like:

[(col("sub")) + ([(col("aug_20")) + (col("sub"))])]

meaning I miss situations where I ought to swap.

Perhaps there's some way to salvage this approach? Or perhaps there's another way to achieve what I'm going for? Any advice would be appreciated!

P.S., here's a pseudo-call-stack when calling DF.mutate/2:

Click to show/hide
Explorer.DataFrame.mutate(df, mutations) # macro
  Explorer.DataFrame.mutate_with(unquote(df), Explorer.Query.query(unquote(mutations)))
    Explorer.Shared.apply_impl(df, :mutate_with, [df_out, column_pairs])
      Explorer.PolarsBackend.DataFrame.mutate_with(df, out_df, column_pairs)
        Explorer.PolarsBackend.Shared.apply_dataframe(df, out_df, :df_mutate_with_exprs, [exprs, df.groups])
          Explorer.PolarsBackend.Native.df_mutate_with_exprs(df, exprs, groups)
            native/explorer/src/dataframe.rs::df_mutate_with_exprs

I found it helpful when trying to figure out a good place to intervene.

args
end

{%Explorer.Backend.LazySeries{op: :column, args: [col_name]} = last, rest} ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps the best place to swap args is in expression.ex. In there we convert all lazy frame operations into Polars expressions, so you can override add, multiply, and friends in there to both swap and post-cast.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried there initially, but lazy series didn't seem to contain the dtype information I needed to make decisions about whether to swap, cast, etc.

E.g. if we add this custom to_expr clause for add:

  def to_expr(%LazySeries{op: :add, args: [left, right]}) do
    left |> IO.inspect(label: :left, width: 120)
    right |> IO.inspect(label: :right, width: 120)

    Native.expr_add(to_expr(left), to_expr(right))
  end

We see the following after running df1 = DF.mutate(df, add1: sub + aug_20):

left: %Explorer.Backend.LazySeries{op: :column, args: ["aug_20"], aggregation: false}
right: %Explorer.Backend.LazySeries{op: :column, args: ["sub"], aggregation: false}

I don't know the dtypes of those args, so I don't know if I should swap or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GAH. Does it work in expression.rs? Can we fetch the dtype there and swap there?

If not, we would need to start storing the dtype then... it is more work though, so it probably makes sense to break this apart, add the dtype field to lazy series, and then come back to this. If we indeed can't do it in Rust, then we should merge a subset of this PR that doesn't require the arg swapping, and work on the other fix later.

I guess there is always the option of fixing Polars too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it work in expression.rs? Can we fetch the dtype there and swap there?

I don't think so? I'll try to check more thoroughly later today tho. Polars expressions weren't super well documented so I'll have to read some source code.

If we indeed can't do it in Rust, then we should merge a subset of this PR that doesn't require the arg swapping, and work on the other fix later.

Do you mean supporting Explorer.Series.add(duration, date) a letting a lazy expression possibly panic? Or dropping duration + date support altogether until we can address the issue?

I guess there is always the option of fixing Polars too.

You read my mind...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean supporting Explorer.Series.add(duration, date) a letting a lazy expression possibly panic?

I meant adding a field called dtype to Explorer.Backend.LazySeries. If we do this though, I would make it so we change all Explorer.Backends.Series callbacks to receive the output dtype as argument (or an out_series, which has the correct result type, like we do for dataframes).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I meant what should we do for this PR? Definitely coming back later and adding the dtype info we need to make decisions sounds good.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this PR, let's only support the dtype pairs we can implement today (without changing polars or without changing lazy series). Or do you mean we cannot support any of them? :D

Copy link
Member

@josevalim josevalim Aug 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, let's keep the arg swap in Explorer.Series for now. It is not ideal but it is semantically fine. We can add a TODO to them.

For the integer vs duration, which requires the post cast, we can pre-cast the integer to duration, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, let's keep the arg swap in Explorer.Series for now. It is not ideal but it is semantically fine. We can add a TODO to them.

👍

For the integer vs duration, which requires the post cast, we can pre-cast the integer to duration, right?

No, duration * duration isn't supported. I think only way to do it is post-cast.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly think opening up a PR in Polars might be the least painful way accomplish this 😅

@billylanchantin
Copy link
Contributor Author

@josevalim Recap:

Some of the new operations have fixes in the PolarsBackend.Series module. I haven't found a good way to do a similar thing for expressions without input/output dtype information to lazy series. So here's the state of affairs:

Operation Eager Lazy Problem Fix (Eager)
date - date -- --
date - duration -- --
date + duration -- --
duration + date 🛠️ Polars panics Arg-swap
duration * integer 🛠️ Scalar != element-wise Post-cast
duration / integer 🛠️ Scalar != element-wise Post-cast
duration * float 🛠️ Scalar != element-wise Post-cast
duration / float 🛠️ Scalar != element-wise Post-cast
integer * duration 🛠️ Scalar != element-wise Post-cast
float * duration 🛠️ Scalar != element-wise Post-cast
  • ✅ = OK
  • 🛠️ = Fixed
  • ❌ = Still broken

We're gonna come back and add that dtype information in a 2nd pass.

@josevalim josevalim merged commit 48415da into elixir-explorer:main Aug 29, 2023
4 checks passed
@josevalim
Copy link
Member

💚 💙 💜 💛 ❤️

@josevalim
Copy link
Member

@billylanchantin thank you so much. I will take a quick stab at the problem and try some ideas, I will let you know how it goes. :)

@josevalim
Copy link
Member

Hi @billylanchantin, I tried a couple ideas, none of them worked, but I have a new insight.

The LazySeries actually has the dtype indirectly, because the lazy series receives and returns Explorer.Series (which has the dtype field). Therefore, if we change LazySeries to keep Series as arguments, then we can propagate this information forward. In other words, we could change:

defp series_or_lazy_series!(%Series{} = series), do: data!(series)

to

defp series_or_lazy_series!(%Series{} = series), do: series

And then we unpack it when building the expression, which would also get access to the dtypes, allowing us to hopefully fix the problem.

Then we could later change Series to compute the dtype and pass it as argument to callbacks, just so we don't have to compute it again in LazySeries :)

@billylanchantin
Copy link
Contributor Author

Hi @josevalim,

Sounds like a plan! One consideration:

I ran into an issue where I needed to know both the input and output dtypes of the operations because of expressions like duration + (date + duration). There we had (roughly):

%LazySeries{
  op: :add,
  args: [
    :duration,
    %LazySeries{ # <-- this evaluates to :date
      op: :add,
      args: [:date, :duration]
    }
  ]
}

This never matched the duration + date case I was looking for. I would need to know the output dtype of the inner lazy series. Would we always be able to determine that?

Even if we didn't, though, we could annotate the expression tree with the output dtypes we know and leave the others as :unknown or something. Then even if we didn't know every output dtype, we'd (probably) know it when it matters.

@josevalim
Copy link
Member

The goal is that such tree would look like this:

%Series{
  dtype: :foo,
  data: %LazySeries{
    op: :add,
    args: [
      %Series{
        dtype: :bar,
        data: %LazySeries{op: ...},
      },
      %Series{
        dtype: :foo,
        data: %LazySeries{op: ...},
      }
    ]
  }
}

In fact, that's already how it looks like for the root node, but we unpack the inner nodes as we compose them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants