
Should there be namespace.col for filtering? #229

Closed
MarcoGorelli opened this issue Aug 21, 2023 · 16 comments

@MarcoGorelli
Contributor

MarcoGorelli commented Aug 21, 2023

TL;DR

Add namespace.col, to allow lazy columns and lazy column reductions to work seamlessly in lazy implementations

Then:

  • use col = df.get_column_by_name('a') if you want an eager column 'a', on which you can call reductions and comparisons (like col.mean(), col > 0). Not necessarily available for all implementations (e.g. Ibis)
  • use namespace.col('a') if you want a (possibly lazy) expression which you intend to use for filtering, like df.get_rows_by_mask((namespace.col('a') - namespace.col('a').mean()) > 0)
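
A rough sketch of the two styles (not normative; assumes namespace is the dataframe's namespace object, e.g. as returned by __dataframe_namespace__()):

# Eager: materialise column 'a', then reduce/compare it directly.
col = df.get_column_by_name('a')
mask = (col - col.mean()) > 0  # col.mean() is a concrete scalar here

# Lazy-friendly: build an expression; nothing is evaluated until the
# filter is executed against df.
expr = (namespace.col('a') - namespace.col('a').mean()) > 0
df = df.get_rows_by_mask(expr)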

Longer proposal

I'm a bit concerned about some operations just not being available in lazy cases in general, such as

mask = df2.get_column_by_name('a') > 0
df1.get_rows_by_mask(mask)

Ibis, at least, doesn't allow this: filtering by a mask is only allowed if the mask is a column which comes from the same dataframe that's being filtered on - so

mask = df1.get_column_by_name('a') > 0
df1.get_rows_by_mask(mask)

would be allowed instead. Such comparisons are intentionally not implemented in polars, see pola-rs/polars#10274 for a related issue (__eq__ between two different dataframes)

I'd like to suggest, therefore, that we introduce namespace.col, so that the above could become

mask = namespace.col('a') > 0
df1.get_rows_by_mask(mask)

And then there doesn't need to be any documentation like "this won't work for some implementations if the mask was created from a different dataframe"

This would also feel more familiar to developers:

  • pyspark already has this:
    from pyspark.sql.functions import col
    df.filter(col("state") == "OH")
  • polars already has this (same as above, but from polars import col)
  • pandas may add it (see EuroScipy 2023 lightning talk by Joris)
  • ibis has _
  • siuba also has _
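
For instance, the Polars version of the PySpark snippet above:

import polars as pl

# pl.col builds an expression; it's only resolved against df inside filter
df.filter(pl.col("state") == "OH")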

EDIT: have tried clarifying the issue. Have also replaced Polars with Ibis in the example as the Consortium seems more interested in that

@kkraus14
Collaborator

Polars at least doesn't allow this. In the implementation I've put together, filtering by a mask is only allowed if the mask is a column which comes from the same dataframe that's being filtered on

This feels like an arbitrary limitation of Polars that I haven't seen in other implementations where my inclination would be to say that this should be addressed in Polars as opposed to having the limitation built into the standard.

From my perspective, there shouldn't be two APIs to return a Column: someone should be able to write code in a single way that plays nicely in both eager and lazy execution paradigms.

@rgommers
Member

I'd like to suggest, therefore, that we introduce namespace.col, so that the above could become

It looks a lot less obvious to me what this would do; namespace.col('a') > 0 is meaningless by itself.

This feels like an arbitrary limitation of Polars that I haven't seen in other implementations

It seems like that to me too. Of course if there is a good reason for this, it'd be great to surface that so we can think about how to take it into account.

From my perspective, there shouldn't be two APIs to return a Column: someone should be able to write code in a single way that plays nicely in both eager and lazy execution paradigms.

+1 for this principle. That may mean forbidding some things that don't work in lazy mode, or whatever else is needed to make things uniform across execution paradigms (eager/lazy, and distributed too).

@MarcoGorelli
Contributor Author

my inclination would be to say that this should be addressed in Polars

It's intentionally not implemented - see pola-rs/polars#10274 (comment)

That may mean forbidding some things that don't work in lazy mode

As in forbidding them across the board, or noting that they may not work in lazy cases?

If the former, then I'd suggest getting rid of Column and only having namespace.col - it's a lot cleaner to work with.

If the latter, then I'll document what the expectations should be for Column in lazy cases

@rgommers
Member

It's intentionally not implemented - see pola-rs/polars#10274 (comment)

That is about this not being supported:

>>> import polars as pl
>>> df1 = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df2 = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 7]})
>>> df1 == df2

That seems like a very different and broader limitation (perhaps with the same root cause?). It looks to me like we need a good description from Polars here about the types of operations that it doesn't want to support in lazy mode. Because these are not things that can't be supported in lazy mode - they are choices, and the reasons for them may be interesting to see.

@MarcoGorelli
Contributor Author

MarcoGorelli commented Aug 22, 2023

perhaps with the same root cause

yup, this is it - we're trying to compare objects from different dataframes, and polars-lazy wants us to join them beforehand

limitation of Polars that I haven't seen in other implementations

Polars takes query optimisation to another level compared with other implementations. In dask, for example:

# not actual dask syntax, just pseudocode
df = read_file(...)
df = df.select_columns(...)
df = df.rename(...)

column selection can be pushed down to the reading stage, but all you need to do is swap the operations around and the optimisation no longer happens - it now has to read the entire file:

# still not actual dask syntax, just pseudocode
df = read_file(...)
df = df.rename(...)
df = df.select_columns(...)

Whereas in polars, you can chain together dozens of calls and still get serious optimisations - I'd suggest this talk if you're interested: https://www.youtube.com/live/aEaa1uI64zE?feature=share&t=6073

So, it may look like an arbitrary limitation, but it allows for optimisations that otherwise wouldn't be possible
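
For illustration, here's a minimal polars-lazy sketch (assuming a local "data.parquet"): the column selection is pushed down into the scan even though it comes after the rename, because the whole query is only optimised when collect() is called:

import polars as pl

lf = (
    pl.scan_parquet("data.parquet")
    .rename({"a": "alpha"})
    .select("alpha")
)
print(lf.explain())  # the optimised plan shows the projection pushed into the scan
df = lf.collect()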

we need a good description from Polars here about the types of operations that it doesn't want to support in lazy mode

There's no Series / Column in lazy mode. If you want to use a column for filtering, for creating a new column, or for sorting, you need to use the expressions API. Polars initially tried copying the pandas API, but then changed course because they found it limiting

In my implementation, I've worked around this by returning pl.col('a') for df.get_column_by_name('a') and keeping track of the id of the dataframe it was taken from. This is fine for now, but should probably be documented
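
A stripped-down sketch of that workaround (names and structure illustrative, not the actual implementation):

import polars as pl

class Column:
    # Wraps a polars expression plus the id of the dataframe it came from.
    def __init__(self, expr: pl.Expr, df_id: int):
        self.expr = expr
        self.df_id = df_id

    def __gt__(self, other) -> "Column":
        return Column(self.expr > other, self.df_id)

class DataFrame:
    def __init__(self, lf: pl.LazyFrame):
        self.lf = lf

    def get_column_by_name(self, name: str) -> Column:
        # Return an expression tagged with this dataframe's id,
        # rather than a materialised column.
        return Column(pl.col(name), id(self))

    def get_rows_by_mask(self, mask: Column) -> "DataFrame":
        if mask.df_id != id(self):
            raise ValueError("mask must be derived from the same dataframe")
        return DataFrame(self.lf.filter(mask.expr))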

@MarcoGorelli
Contributor Author

MarcoGorelli commented Aug 22, 2023

From my perspective, there shouldn't be two APIs to return a Column

Sorry to nitpick, but just to clarify - namespace.col wouldn't return a Column, but an expression (which you can think of as a function which takes a dataframe as input and returns a column as output)

I haven't seen in other implementations

I just tried Ibis out, and it looks like they don't allow it either?

In [1]: >>> import ibis
   ...: >>> from ibis import _
   ...: >>> ibis.options.interactive = True
   ...: >>> con = ibis.sqlite.connect("geography.db")
   ...: >>> con.tables
Out[1]:
Tables
------
- countries
- gdp
- independence

In [2]: countries = con.tables.countries

In [3]: gdp = con.tables.gdp

In [4]: countries = con.tables.countries.head(5)

In [5]: gdp = con.tables.gdp.head(5)

In [6]: countries.filter(gdp.country_code=='ABW')
---------------------------------------------------------------------------
RelationError: Predicate doesn't share any roots with table

I'm quite new to Ibis, but I'm fairly sure I've understood this correctly - if I've misunderstood how to use it, please do correct me.

@MarcoGorelli
Contributor Author

MarcoGorelli commented Aug 23, 2023

@kkraus14 here you'd brought up lack of row ordering / sortedness in some implementations?

If rows are unordered, then surely (as far as I understand) columns from different dataframes can't be compared, and need to have been joined beforehand?

@kkraus14
Collaborator

@kkraus14 here you'd brought up lack of row ordering / sortedness in some implementations?

If rows are unordered, then surely (as far as I understand) columns from different dataframes can't be compared, and need to have been joined beforehand?

If it were allowed, it would be undefined behavior, since there's no guaranteed ordering. Disallowing this is reasonable behavior.

I see the point you're making though where even if there's no defined or guaranteed row ordering, within a DataFrame there is a guarantee that the columns have the same row ordering.

Constructing two columns from lists (which implies an ordering in both cases) and then comparing them is quite common, so I'm not sure we can drop that as a use case to support.

@rgommers
Member

rgommers commented Aug 24, 2023

Such comparisons are intentionally not implemented in polars, see pola-rs/polars#10274 for a related issue (__eq__ between two different dataframes)

The rationale given there is "They would need to block predicate and slice pushdown. I don't want that complexity in the query optimizer. Data should be joined first before it can be compared" - which is more than a little too terse for me to understand the real reason. However, I thought about it some more and translated it to a shape requirement, which I hope is correct.

For comparisons, shapes should match exactly. For lazy dataframes, the number of rows is in general unknown until the previous computation graph is executed. However, it's not like nothing is known about shapes. I think they can have 3 states:

  1. unknown shape
  2. unknown shape, but with a known relation to the shape of other dataframes
  3. known shape

Once dataframes are joined, we go from (1) to (2). And only once .collect()/.compute() is called do we go to (3).

Here the check needed is "are the shapes the same", which can be done with (2). And it's a needed check, because the library has to raise an exception if there's a shape mismatch. That exception could also be delayed, though - I'm not quite sure what would be wrong with that.
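
To make the three states concrete, here's a toy model (purely illustrative, not any library's API):

from dataclasses import dataclass
from typing import Optional

@dataclass
class LazyShape:
    n_cols: int                       # usually known even in lazy mode
    n_rows: Optional[int] = None      # None until the graph is executed (state 3 once known)
    row_group: Optional[int] = None   # frames in the same group share a row count (state 2)

def comparable(a: LazyShape, b: LazyShape) -> bool:
    # State (3): both row counts known, compare directly.
    if a.n_rows is not None and b.n_rows is not None:
        return a.n_rows == b.n_rows
    # State (2): known to share a row count, e.g. after a join.
    return a.row_group is not None and a.row_group == b.row_group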

So, it may look like an arbitrary limitation, but it allows for optimisations that otherwise wouldn't be possible

Based on experience with PyTorch, which went way further down this path: it's not that it otherwise wouldn't be possible, but more that it'd be more effort. It can be understood from first principles what's actually not possible vs. what works in principle but may be difficult. This kind of thing should be in the latter bucket, while things like "bool() forces me to have an actual Python scalar" are in the former bucket.

That doesn't mean I'm saying that Polars is doing anything wrong here - not at all, they are doing very interesting and meaningful work. But we do have to recognize that a decision like raising an exception on == is made for pragmatic reasons rather than a fundamental "this really cannot work in lazy mode". And it still matters for designing a standard to understand the difference between these things.

Doing something like constructing two columns from lists which then implies ordering in both situations and comparing them is quite common where I'm not sure we can get rid of that as a use case to support.

This is indeed common. I'm curious about the recommended solution for that in Polars, for the simple df1 == df2 case. What's the actual .join or other operation to apply there?

I have a sense that I have a lack of intuition for "dataframe as a SQL front-end", while I do have it for "a 2-D labeled array with per-column dtypes".

@MarcoGorelli
Contributor Author

MarcoGorelli commented Aug 24, 2023

within a DataFrame there is a guarantee

@kkraus14 I'm asking about the case when the columns are not within the same dataframe.
Do you agree that that's not possible for ibis?

If so, we can agree that we need to either document what the guarantees are, or consider changing the API?

This is why I'm suggesting the namespace.col syntax

It makes it clear that the columns need to be part of the same dataframe, and that if they're not, you need to do a join

Note that Ibis also has this syntax (but for them it's _):

df.filter(_.ymax == _.zmax)

@kkraus14
Collaborator

@kkraus14 I'm asking about the case when the columns are not within the same dataframe.

Yes, I understand, but there are also cases when the columns are not part of any DataFrame and are just standalone columns.

E.g.:

import polars as pl

# Assume these aren't just nicely defined like this, but that something handed you Python lists
my_list_1 = ['a', 'b', 'c']
my_list_2 = ['c', 'b', 'a']

out = pl.Series(my_list_1) == pl.Series(my_list_2)

I've commonly seen this type of pattern in the wild, and even Polars supports it in its eager implementation.

Do you agree that that's not possible for ibis?

Yes, this is not possible in Ibis currently, because Ibis is an API and not an actual implementation. It was historically built for interfacing with SQL databases, which don't have a concept of constructing dataframes / columns the way that DataFrame libraries typically do. This has changed over time with Ibis backends for Pandas, Polars, DataFusion, etc.

If so, we can agree that we need to either document what the guarantees are, or consider changing the API?

This is why I'm suggesting the namespace.col syntax

It makes it clear that the columns need to be part of the same dataframe, and that if they're not, you need to do a join

I don't think we should constrain columns to having to belong to a DataFrame, nor should a column only be able to belong to a single DataFrame. I think the idea behind the namespace.col functionality you're proposing makes sense, but it would definitely add non-trivial complexity to the standard, so I'd love some other opinions here.

@MarcoGorelli
Contributor Author

Thanks for your response

I've commonly seen this type of pattern in the wild

Sure, but is that a reason to include something? I've also seen GroupBy.__iter__ - not just in the wild, but in libraries which I used to consider potential users of the Standard - but you vetoed it

This brings us back to #201. I'd like to suggest:

  • level 0: only high-performance methods which are guaranteed to work for lazy or eager engines. No Column, no DataFrame.__eq__ or other dataframe comparisons, column operations can only be done with namespace.col
  • level 1: only high-performance methods, not guaranteed for all dataframes. So, there is Column, there are dataframe comparisons, and you can filter one dataframe based on the columns of another
  • level 2: not-necessarily high-performance methods, like Groupby.__iter__

So then support could be:

  • level 0: everything
  • level 1: excludes polars-lazy, ibis
  • level 2: possibly only includes polars-eager and pandas

There would also need to be some way of moving between levels - for example, to go from level 0 to level 1, polars-lazy could call .collect. To go from level 1 to level 2, cudf could interchange to polars-eager.


There's something I really need to get off my chest: I sense a general attitude of "we know what's best, we'll define the API, and if this goes against some library's design, then that's their problem". I thought the goal was to agree on a minimal API which all dataframe libraries could support, and was hoping for a more collaborative attitude

You asked pandas to support the API. I was collaborative, and have driven the progress which has happened over the last 6 months:

  • first tagged version
  • EuroScipy2023 talk
  • entrypoints to the standard available in both pandas and Polars
  • fully-tested implementations available for both pandas and Polars
  • libraries (scikit-learn and skrub) trying it out

You couldn't have done this without me. You need a pandas maintainer (it's been stated many times that whatever pandas does, other libraries will follow) and I'm the only pandas maintainer you've been able to find who's been willing to enthusiastically drive progress.
It would not have made it into pandas without me - this isn't one of those "everybody's replaceable" tasks.
But my enthusiasm is at risk of waning.

I'm also a Polars maintainer, and am asking that the API be designed in such a way that polars-lazy can support it. I was expecting a similarly collaborative attitude, but instead the response has generally been:

  • calling some decisions "arbitrary limitations"
  • saying that polars-lazy should just implement things because the Consortium has decided that they need to be in the Standard
  • being OK with polars-lazy just not being Standard-compliant

I'm not saying that my API suggestions are the right ones. I might be wrong about everything.

But I am expecting a collaborative attitude, which is inclusive of Polars, especially if you're expecting that I keep driving progress

@kkraus14
Collaborator

Sure, but is that a reason to include something? I've also seen GroupBy.__iter__ - not just in the wild, but in libraries which I used to consider potential users of the Standard - but you vetoed it

I shared my concerns and opinions, but I do not think I nor anyone else really has veto power. There's been multiple instances where I proposed things or expressed my opinions / experience and a decision was made to go in a different direction. If there's consensus that things like GroupBy.__iter__ should be in the standard, so be it.

There's something I really need to get off my chest: I sense a general attitude of "we know what's best, we'll define the API, and if this goes against some library's design, then that's their problem". I thought the goal was to agree on a minimal API which all dataframe libraries could support, and was hoping for a more collaborative attitude

I can't speak for others, but that is not my intended mentality or attitude in working with this group. I have been doing my best to share my experience in building performance-oriented DataFrame libraries that take advantage of hardware acceleration and all of the API challenges that I've encountered in doing so. This often comes across as me pushing back against a lot of APIs that are somewhat common, but I really just hope to drive a future in the ecosystem where hardware acceleration of DataFrames can be similar to that of arrays.

Building a minimal API that "all" (there's always going to be outliers...) DataFrame libraries could reasonably support is the goal, but in my opinion it's reasonable to push on libraries to make changes / enhancements when possible in order to deliver what we as a group think is the best user experience in using this API.

I'm also a Polars maintainer, and am asking that the API be designed in such a way that polars-lazy can support it. I was expecting a similarly collaborative attitude, but instead the response has generally been:

  • calling some decisions "arbitrary limitations"
  • saying that polars-lazy should just implement things because the Consortium has decided that they need to be in the Standard
  • being OK with polars-lazy just not being Standard-compliant

I apologize if the way I communicated things came across as making statements, as opposed to my intention of asking questions and gathering information. I obviously don't have the same background and understanding of Polars and Polars-lazy that you do. A lot of the back and forth we've had has been around me trying to understand the boundaries of where things don't work because they haven't been implemented yet, versus where things can't work today because of the current design of Polars-lazy, versus where things will never be able to work because of the nature of lazy evaluation / computation.

Do you think it's unreasonable for things to be included in the API if there's not support in Polars-lazy today, but there's a reasonably clear path to implement them? I'm specifically talking about things where there's functionality gaps as opposed to things that fundamentally change the design of how it functions. I believe we've encountered this same situation with a couple of other libraries, i.e. cudf, and the response was generally to address the implementation in said library.

You asked pandas to support the API. I was collaborative, and have driven the progress which has happened over the last 6 months:

  • first tagged version
  • EuroScipy2023 talk
  • entrypoints to the standard available in both pandas and Polars
  • fully-tested implementations available for both pandas and Polars
  • libraries (scikit-learn and skrub) trying it out

You couldn't have done this without me. You need a pandas maintainer (it's been stated many times that whatever pandas does, other libraries will follow) and I'm the only pandas maintainer you've been able to find who's been willing to enthusiastically drive progress. It would not have made it into pandas without me - this isn't one of those "everybody's replaceable" tasks. But my enthusiasm is at risk of waning.

I don't think anyone has said it nearly enough, but thank you for all of the work you're doing related to this effort @MarcoGorelli. You are absolutely correct this wouldn't be possible without you and everything you're doing.

@kkraus14
Collaborator

This brings us back to #201. I'd like to suggest:

  • level 0: only high-performance methods which are guaranteed to work for lazy or eager engines. No Column, no DataFrame.__eq__ or other dataframe comparisons, column operations can only be done with namespace.col
  • level 1: only high-performance methods, not guaranteed for all dataframes. So, there is Column, there are dataframe comparisons, and you can filter one dataframe based on the columns of another
  • level 2: not-necessarily high-performance methods, like Groupby.__iter__

So then support could be:

  • level 0: everything
  • level 1: excludes polars-lazy, ibis
  • level 2: possibly only includes polars-eager and pandas

There would also need to be some way of moving between levels - for example, to go from level 0 to level 1, polars-lazy could call .collect. To go from level 1 to level 2, cudf could interchange to polars-eager.

I'm concerned that we're going to be adding a ton of complexity for users of this API if we go this route, and that it will result either in people only using level 0 and then moving to a library of choice like Pandas or Polars when they need to break out of level 0, or in people writing code that only works with level 2 libraries.

I took a pass, and these are the APIs I identified where there's maybe some question of how to handle them in a lazy implementation:

DataFrame APIs:

  • shape --> Looks like this isn't supported by Polars-lazy today? It looks like the number of columns isn't lazy but the number of rows is. Any ideas here? I think we could imagine this as a Tuple of "LazyScalar" objects?
  • get_column_by_name / col --> Can we support both lazy and eager execution in a way we're happy with in a single function?
  • insert_column --> Polars-lazy allows doing this via with_columns, but as far as I can tell only allows doing it with expressions from the same table, literals, or pl.Series objects which aren't lazy?
  • get_column_names --> Outside of Polars-lazy, you could imagine a lazy library choosing to have actual column names be lazily resolved as well. Do we wish to support that? I.E. something like one hot encoding yields a data dependent number and names of columns.
  • sorted_indices / unique_indices / any_rowwise / all_rowwise / etc. --> What type should these return? Polars-lazy doesn't have a Column class today and returns a 1 column DataFrame.
  • __eq__ (and other binary operators) --> Polars-lazy and Ibis don't support these today across different tables, could they / should they? How much value do these operations add at the DataFrame level versus the column level?
  • to_array_object --> Any concerns here? Does this trigger computation for something like Polars-lazy which presumably doesn't have a corresponding lazy array implementation (please correct me if I'm wrong)?

Column APIs:

  • __len__ --> Does Python allow ducktyping this? (See the note after this list.) Looks like Polars-lazy has a len() function that returns a 1 row column in this case currently? It would feel very reasonable to me to abandon __len__ if it can't be ducktyped, and just have a len() method or just use shape.
  • get_value --> Some kind of "LazyScalar"? Looks like Polars-lazy returns a 1 row column in this case currently (via pl.Expr.take)?
  • any (and other reductions) --> Some kind of "LazyScalar"? Looks like Polars-lazy returns a 1 row column in this case currently?
  • to_array_object --> Same as DataFrame
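
On the __len__ question above: CPython requires __len__ to return an actual non-negative integer, so it can't be duck-typed into returning a lazy scalar. A quick demonstration:

class LazyColumn:
    def __len__(self):
        return object()  # stand-in for a hypothetical "LazyScalar"

len(LazyColumn())  # TypeError: 'object' object cannot be interpreted as an integer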

Based on this list, the biggest things that stick out to me are:

  • How to handle scalars: whether we want an explicit collect type of method, or whether they should allow computation to happen implicitly through methods like __int__ and __bool__ (see the sketch after this list). Lazy and eager implementations starkly differ in this regard. We could also just punt on Scalars entirely, yield Columns, and force the user to call explicit APIs to get a scalar out of the Column.
  • Inserting Columns into DataFrames / constructing DataFrames from Columns, where lazy and eager DataFrames again differ quite a bit. I personally think the restrictions that Polars-lazy and Ibis enforce here are quite reasonable and very friendly from a performance perspective, but there's definitely a lot of code out there that uses the eager paradigms.
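
A sketch of what implicit materialisation through dunders could look like (a hypothetical LazyScalar, not a real API):

class LazyScalar:
    def __init__(self, compute):
        self._compute = compute  # deferred computation returning a Python scalar

    def __bool__(self):
        # Implicitly triggers execution - convenient, but it hides a
        # potentially expensive collect() from the user.
        return bool(self._compute())

result = LazyScalar(lambda: 42 > 0)
if result:  # __bool__ runs the deferred computation here
    print("materialised implicitly")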

@MarcoGorelli
Contributor Author

Thanks for your response

If there's no support in Polars-lazy for something, and there is a reasonable path forwards, then it's reasonable to include it. For example, I'm working on a PR to sort out the return dtypes for __pow__.

What I don't think is acceptable is to include something which has been explicitly rejected by Polars. The most glaring example is LazyFrame.__eq__ (pola-rs/polars#10274 (comment)), so I have a PR to remove it from the Standard: #242

Here's what rubbed me the wrong way: if the Consortium wants to dispute a library's decision to intentionally not implement a feature, then the onus is on the Consortium to articulate why that library should implement that feature. Not the other way round.

What's the benefit of allowing df1 == df2? Do we have a single use-case in mind?

I suggest we start by resolving this one. Then, we can move on to the rest

@MarcoGorelli
Contributor Author

it would definitely add non-trivial complexity to the standard

I've tried putting together a proposal: #247

It's a lot simpler than I was expecting, and adds very little complexity. I'll give a demo at today's call - just sharing in case anyone wants to take a look beforehand.
