-
Notifications
You must be signed in to change notification settings - Fork 121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add filter
and filter_with
to Series
#728
Add filter
and filter_with
to Series
#728
Conversation
So clean! 😍 The biggest question is the API. Personally, I am not the biggest fan of the macro version in this case, but I will gladly concede it is about taste. Then we need to decide on the names. Today the names we use for DF and Series do not necessarily match. We have @billylanchantin / @cigrainger / @philss, so we need to decide:
Please let me know your thoughts! |
@josevalim Beautifully summarized! I like being able to write: Since the later decisions hinge on including macros or not, I'll leave it it at that for now. I think we can decide on the names once we reach a decision on macros. |
This is great work. My 2 cents:
|
Ah I've just put my finger on what bugs me about a macro here: it's too close to Enum and feels like the functions should apply to each element. But they apply to the entire series. I think it's confusing. I know it's a bit more verbose but I honestly don't think it's that ugly to do |
So map/filter/sort and then transform which actually goes element by element (as it works today)? map should be straightforward to implement but sort will require adding support to nulls_first in arrange (but it should be straightforward). |
So I'm gonna try to make the case for macros. Sorry for the text wall, but I wanna give it a fair shake! The goals of macros on
Used well (and sparingly), macros are more readable. Here's a good example from this project: DF.mutate(df, c: a + b)
# vs.
DF.mutate_with(df, &[c: Series.add(&1["a"], &1["b"])]) The If we were to support macros on Series, we'd get that same benefit: S.mutate(s, x: 1 + x + x**2/2)
# vs.
S.mutate_with(s, &(&1 |> Series.add(1) |> Series.add(&1 |> Series.pow(2) |> Series.divide(2)))) It's also consistent with what's available in DataFrames. DataFrames allow this syntax, and it seems like the lack of availability on Series was driven mostly by what Polars happened to make easy. I think there's agreement on this point:
Certainly when I started using the library, I wondered why macros were restricted to DataFrames.
I feel like this is the strongest objection. But I'd argue less that # I'd be fine with this too, though it's not evocative of the `DF.mutate` syntax.
filter(s, n, n > 2)
# Less preferred: `filter_with` also uses anonymous functions and I think that's confusing.
filter(s, &(&1 > 2)) Also, I think a premise here is that an "Elixir purist" feel has some friction. That's why, I assume, macros were introduced to DataFrames in the first place (and same for Nx): long, math-y computations are hard to read when they're made up of multiple, piped function calls.
I think I see this too, though I'm less bothered by it. There is a learning curve when you start using Explorer with the difference between I'll end by saying that, while obviously this is my first choice, I'll be happy with either decision! This is all good stuff and I don't think we can go wrong either way :) |
Well that is a really convincing argument and I think I'm coming around. Thanks for laying it out so clearly @billylanchantin. I'm actually starting to come around to the @josevalim @philss are there dangers lurking with macros here? For example, what happens if we use a macro Series.map inside a macro DF.mutate.
Yep I think that's the cleanest. |
Re: macros. In this example: DF.mutate(df, c: a + b)
# vs.
DF.mutate_with(df, &[c: Series.add(&1["a"], &1["b"])]) The ugliest part of mutate comes from accessing the columns. For example, if we could somehow magically bind the function argument names to columns, we could write this instead: DF.mutate_with(df, fn a, b -> [c: Series.add(a, b)] end) Which is a bit more acceptable. The issue is that series have no name to access, so we need to introduce an artificial name: S.mutate(s, x: 1 + x + x**2/2) In the example above, A more Elixirish approach would be:
Or mirroring Ecto:
The last one is syntactically cleaner, IMO, but The other aspect of the macros are operator conveniences. I don't disagree the operator conveniences help but for series they have limited use because I can only perform series operations against myself. So only operators that work against myself are useful (and they are not that many). Although I can't deny doing this sort of stuff with If this is really a concern, we can always document the approach used in this PR and say "hey, you want to do crazy stuff, convert it to a DF like this".
Series.map/filter/sort/transform for a lazy data frame should raise, so that's not a concern here. |
Btw, awesome input on the discussion @billylanchantin. You definitely brought up good points. |
That made my morning. You all are very nice to work with :) Ok, this is my takeaway: macros might be nice, but the syntax is a sticking point. There doesn't seem to be a way to do it without being confusing or non-idiomatic. Assuming
|
I thought about the |
Ok, if we don't think there's a way to syntactically introduce a name for binding (a hard requirement) without leading to confusion, then I concede on macros :)
I'll wait for consensus on names, etc. before I make any changes to the PR. |
At the risk of popping my head into an area wherein I am a novice, is this true? My sense is that the
If that's the case then the parallel for series seems actually rather clear. The name is more "anonymous" than a column but in both cases you are:
Right. But isn't that just down to the nature of the data structure? The ordinary expectation of operating on a series is that you are talking about quasi-anonymous values
and it's just the case that there is some variance in the structures between what counstitutes EDIT: OK I see the flaw in my argument. In dataframes, the binding of column names to variables happens regardless of what goes before the |
It is an assignment after the fact but not for the current operation. The point is that in a dataframe, all bound names have been given before. There is no such thing for series, hence the need to bind and assign. For example, I would prefer something like this |
Another idea. Since series are not named, we can do this:
where |
Scala has entered the chat... (also I'm totally down 😄)
Yep this is the heart of it. Whatever we pick will introduce a new syntax. So the question to answer is: is the feature worth it? You get some cool stuff, but that's the price of admission. |
Of all syntaxes for filter proposed so far, which one is your favorite? Maybe we pick one and then put it for voting. |
Could we set a default name with the |
@cigrainger we can but to me all names are arbitrary unless we use a special token (such as |
That's fair. I'm actually on board with that even if a bit begrudgingly supporting macros here generally. I think it's familiar enough from other languages and everything. Would it not cause potential issues if someone inevitably pushes things too far and tries to write something longer that might have pattern matching? |
Yeah okay I can get behind |
I don't think we can have pattern matching inside queries, so that wouldn't be a concern. |
I'm on board then. I think it's the cleanest and most straightforward to use |
To recap:
Yes, with
Now this is tricky. We already have |
I'm also on board with:
I think it's a clean way to get around the ambiguous variable issue. (Plus it wins code golf! 😄)
Agreed. I'll make the change for the |
Two nits then feel free to merge it! |
Co-authored-by: José Valim <jose.valim@gmail.com>
@josevalim Small hiccup: I can't merge PRs yet 😬 |
Please try again. |
Description
Adds
Series.filter
Series.filter_with
As was discussed in:
Series.mask
vs.Series.filter
#726Support macros and
_with
variants?If/how to include macro and
_with
-callback versions of functions is up for debate. Here's how I'm proposing it would work forfilter
:Happy to discuss :)