Should there be `namespace.col` for filtering? #229
This feels like an arbitrary limitation of Polars that I haven't seen in other implementations. My inclination would be to say that this should be addressed in Polars, as opposed to having the limitation built into the standard. From my perspective, there shouldn't be two APIs to return a Column: someone should be able to write code in a single way that plays nicely in both eager and lazy execution paradigms.
It looks a lot less obvious to me what this would do.
It seems like that to me too. Of course if there is a good reason for this, it'd be great to surface that so we can think about how to take it into account.
+1 for this principle. That may mean forbidding some things that don't work in lazy mode, or whatever else is needed to make things uniform across execution paradigms (eager/lazy, and distributed too).
It's intentionally not implemented, see here pola-rs/polars#10274 (comment)
As in forbidding them across the board, or noting that they may not work in lazy cases? If the former, then I'd suggest getting rid of If the latter, then I'll document what the expectations should be for
That is about this not being supported:

```python
>>> df1 = DataFrame({'a': [1,2,3], 'b': [4,5,6]})
>>> df2 = DataFrame({'a': [1,2,3], 'b': [4,5,7]})
>>> df1 == df2
```

That seems like a very different and broader limitation (perhaps with the same root cause?). It looks to me like we need a good description from Polars here about the types of operations that it doesn't want to support in lazy mode. Because these are not things that can't be supported in lazy mode, only choices - the reasons for which may be interesting to see.
yup, this is it - we're trying to compare objects from different dataframes, and polars-lazy wants us to join them beforehand
Polars takes query optimisation to another level compared with other implementations. In dask, for example:

```python
# not actual dask syntax, just pseudocode
df = read_file(...)
df = df.select_columns(...)
df = df.rename(...)
```

column selection can be pushed down to the reading stage. But all you need to do is swap operations around and the optimisation no longer happens; it now has to read the entire file:

```python
# still not actual dask syntax, just pseudocode
df = read_file(...)
df = df.rename(...)
df = df.select_columns(...)
```

Whereas in Polars, you can chain together dozens of calls and get serious optimisations - I'd suggest this talk if you're interested: https://www.youtube.com/live/aEaa1uI64zE?feature=share&t=6073

So, it may look like an arbitrary limitation, but it allows for optimisations that otherwise wouldn't be possible.
There's no In my implementation, I've worked around this by returning
Sorry to nitpick, but just to clarify -
I just tried Ibis out, and it looks like they don't allow it either?

```python
In [1]: import ibis
   ...: from ibis import _
   ...: ibis.options.interactive = True
   ...: con = ibis.sqlite.connect("geography.db")
   ...: con.tables
Out[1]:
Tables
------
- countries
- gdp
- independence

In [2]: countries = con.tables.countries

In [3]: gdp = con.tables.gdp

In [4]: countries = con.tables.countries.head(5)

In [5]: gdp = con.tables.gdp.head(5)

In [6]: countries.filter(gdp.country_code=='ABW')
---------------------------------------------------------------------------
RelationError: Predicate doesn't share any roots with table
```
If it were allowed, it would be undefined behavior, since there's no guaranteed ordering; disallowing this is reasonable. I do see the point you're making, though: even if there's no defined or guaranteed row ordering, within a DataFrame there is a guarantee that the columns share the same row ordering. And constructing two columns from lists, which implies an ordering in both, and then comparing them is quite common; I'm not sure we can drop that as a use case to support.
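The eager pattern described above, building two free-standing columns from lists and comparing them elementwise, could be sketched like this. The `Column` class is hypothetical, not any library's actual API; the point is the shape contract it encodes: list construction implies an ordering, positions align, and lengths must match before an elementwise comparison is meaningful.

```python
# Minimal sketch of an eager, standalone column (hypothetical class,
# not a real library's API). Construction from a list fixes an ordering;
# elementwise comparison requires equal lengths.

class Column:
    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        if len(self.values) != len(other.values):
            raise ValueError("shape mismatch: columns must have equal length")
        # positional comparison: row i of self vs row i of other
        return Column([a == b for a, b in zip(self.values, other.values)])

a = Column([1, 2, 3])
b = Column([1, 5, 3])
mask = a == b
print(mask.values)  # [True, False, True]
```

With lazy, dataframe-bound columns there is no such standalone ordering to rely on, which is where the disagreement in this thread comes from.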
The rationale given there is "They would need to block predicate and slice pushdown. I don't want that complexity in the query optimizer. Data should be joined first before it can be compared" - which is more than a little too terse for me to understand the real reason. However, I thought about it some more and translated it to a shape requirement, which I hope is correct. For comparisons, shapes should match exactly. For lazy dataframes, the number of rows is in general unknown until the previous computation graph is executed. However, it's not like nothing is known about shapes. I think they can have 3 states:
Once dataframes are joined, we go from (1) to (2). And only once Here the check needed is "are the shapes the same", which can be done with (2). And it's a needed check, because the library has to raise an exception if there's a shape mismatch. That exception could also be delayed though, I'm not quite sure what would be wrong with that.
Based on experience with PyTorch, which went way further down this path: it's not that it otherwise wouldn't be possible, but more that it'd be more effort. It can be understood from first principles what's actually not possible vs. what works in principle but may be difficult. This kind of thing should be in the latter bucket, while things like "bool() forces me to have an actual Python scalar" are in the former bucket. That doesn't mean I'm saying that Polars is doing anything wrong here - not at all, they are doing very interesting and meaningful work. But we do have to recognize that a decision like raising an exception on
This is indeed common. I'm curious about the recommended solution for that in Polars, for the simple I have a sense that I have a lack of intuition for "dataframe as a SQL front-end", while I do have it for "a 2-D labeled array with per-column dtypes".
@kkraus14 I'm asking about the case when the columns are not within the same dataframe. If so, we can agree that we need to either document what the guarantees are, or This is why I'm suggesting the It makes it clear that the columns need to be part of the same dataframe, and Note that Ibis also has this syntax (but for them it's `df.filter(_.ymax == _.zmax)`)
Yes, I understand, but there are also cases when the columns are not part of any DataFrame and are just standalone columns, i.e.
I've commonly seen this type of pattern in the wild and even Polars supports it in the eager implementation.
Yes, this is not possible for Ibis currently because Ibis is an API and not an actual implementation. It was historically built for interfacing to SQL databases which don't have a concept of constructing dataframes / columns the same way that DataFrame libraries typically do. This has changed over time with Ibis backends for Pandas, Polars, DataFusion, etc.
I don't think we should constrain columns to having to belong to a DataFrame, nor should a column only be able to belong to a single DataFrame. I think the idea behind the
Thanks for your response
Sure, but is that a reason to include something? I've also seen This brings us back to #201. I'd like to suggest:
So then support could be:
There would also need to be some way of moving between levels - for example, to go from level 0 to level 1, polars-lazy could call

There's something I really need to get off my chest: I sense a general attitude of "we know what's best, we'll define the API, and if this goes against some library's design, then that's their problem". I thought the goal was to agree on a minimal API which all dataframe libraries could support, and I was hoping for a more collaborative attitude.

You asked pandas to support the API. I was collaborative, and have driven the progress which has happened over the last 6 months:
You couldn't have done this without me. You need a pandas maintainer (it's been stated many times that whatever pandas does, other libraries will follow), and I'm the only pandas maintainer you've been able to find who's been willing to enthusiastically drive progress. I'm also a Polars maintainer, and am asking that the API be designed in such a way that polars-lazy can support it. I was expecting a similarly collaborative attitude, but instead the response has generally been
I'm not saying that my API suggestions are the right ones. I might be wrong about everything. But I am expecting a collaborative attitude, one which is inclusive of Polars, especially if you're expecting that I keep driving progress.
I shared my concerns and opinions, but I do not think I nor anyone else really has veto power. There's been multiple instances where I proposed things or expressed my opinions / experience and a decision was made to go in a different direction. If there's consensus that things like
I can't speak for others, but that is not my intended mentality or attitude in working with this group. I have been doing my best to share my experience in building performance-oriented DataFrame libraries that take advantage of hardware acceleration and all of the API challenges that I've encountered in doing so. This often comes across as me pushing back against a lot of APIs that are somewhat common, but I really just hope to drive a future in the ecosystem where hardware acceleration of DataFrames can be similar to that of arrays. Building a minimal API that "all" (there's always going to be outliers...) DataFrame libraries could reasonably support is the goal, but in my opinion it's reasonable to push on libraries to make changes / enhancements when possible in order to deliver what we as a group think is the best user experience in using this API.
I apologize if the way I communicated things came across as making statements as opposed to my intention of asking questions / gathering information. I obviously don't have the same background / understanding of Polars and Polars-lazy that you do. A lot of the back and forth we've had has been around me trying to understand the boundaries of where things don't work because they haven't been implemented yet versus where things can't work today because of the current design of Polars-lazy versus where things will never be able to work because of the nature of lazy evaluation / computation. Do you think it's unreasonable for things to be included in the API if there's not support in Polars-lazy today, but there's a reasonably clear path to implement them? I'm specifically talking about things where there's functionality gaps as opposed to things that fundamentally change the design of how it functions. I believe we've encountered this same situation with a couple of other libraries, i.e. cudf, and the response was generally to address the implementation in said library.
I don't think anyone has said it nearly enough, but thank you for all of the work you're doing related to this effort @MarcoGorelli. You are absolutely correct that this wouldn't be possible without you and everything you're doing.
I'm concerned we're going to be adding a ton of complexity for users of this API if we go this route: it will result either in people only using level 0 and then moving to a library of choice like Pandas or Polars when they need to break out of level 0, or in people writing code that only works with level 2 libraries. I took a pass, and these are the APIs I identified where there's maybe some question of how to handle them in a lazy implementation: DataFrame APIs:
Column APIs:
Based on this list, the biggest things that stick out to me are:
Thanks for your response

If there's no support in Polars lazy for something, and there is a reasonable path forwards, then it's reasonable to include it. For example, I'm working on a PR to sort out the return dtypes for

What I don't think is acceptable is to include something which has been explicitly rejected by Polars. The most glaring example is

Here's what rubbed me the wrong way: if the Consortium wants to dispute a library's decision to intentionally not implement a feature, then the onus is on the Consortium to articulate why that library should implement that feature. Not the other way round. What's the benefit of allowing

I suggest we start by resolving this one. Then, we can move on to the rest.
I've tried putting together a proposal: #247. It's a lot simpler than I was expecting, and adds very little complexity. I'll give a demo at today's call; just sharing in case anyone wanted to take a look beforehand.
TL;DR

Add `namespace.col`, to allow lazy columns / lazy column reductions to work seamlessly in lazy implementations. Then:

- `col = df.get_column_by_name('a')` if you want an eager column 'a', on which you can call reductions and comparisons (like `col.mean()`, `col > 0`). Not necessarily available for all implementations (e.g. Ibis)
- `namespace.col('a')` if you want a (possibly lazy) expression which you intend to use for filtering, like `df.get_rows_by_mask((namespace.col('a') - namespace.col('a').mean()) > 0)`

Longer proposal

I'm a bit concerned about some operations just not being available in lazy cases in general, such as

Ibis at least doesn't allow this: filtering by a mask is only allowed if the mask is a column which comes from the same dataframe that's being filtered on, so would be allowed instead.

Such comparisons are intentionally not implemented in Polars; see pola-rs/polars#10274 for a related issue (`__eq__` between two different dataframes).

I'd like to suggest, therefore, that we introduce `namespace.col`, so that the above could become

And then there doesn't need to be any documentation like "this won't work for some implementations if the mask was created from a different dataframe".

This would also feel more familiar to developers:

- `pyspark` already has this
- `polars` already has this (same as above, but `from polars import col`)
- `pandas` may add it (see the EuroScipy 2023 lightning talk by Joris)

EDIT: I have tried clarifying the issue. I have also replaced Polars with Ibis in the example, as the Consortium seems more interested in that.
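A minimal sketch of what the proposed `namespace.col` could look like. This is a hypothetical illustration, not the proposal's actual implementation: an expression object records operations and is only evaluated against a concrete dataframe inside the filter call, so the same user code can work whether the backing implementation is eager or lazy. The names `col` and `get_rows_by_mask` follow the proposal; the dict-of-lists dataframe is a stand-in for a real one.

```python
# Hypothetical sketch of a deferred column expression, in the spirit of
# the namespace.col proposal. Not a real library's API.

class Expr:
    def __init__(self, fn):
        # fn: dataframe (dict of column-name -> list) -> list of row values
        self.fn = fn

    def mean(self):
        # a reduction, broadcast back to column length so it composes
        return Expr(lambda df: [sum(self.fn(df)) / len(self.fn(df))] * len(self.fn(df)))

    def __sub__(self, other):
        return Expr(lambda df: [a - b for a, b in zip(self.fn(df), other.fn(df))])

    def __gt__(self, value):
        return Expr(lambda df: [v > value for v in self.fn(df)])

def col(name):
    # nothing is computed here; we just record which column to read
    return Expr(lambda df: df[name])

def get_rows_by_mask(df, expr):
    # evaluation happens only now, against the dataframe being filtered
    mask = expr.fn(df)
    return {k: [v for v, keep in zip(vals, mask) if keep] for k, vals in df.items()}

df = {"a": [1, 2, 3, 4]}
out = get_rows_by_mask(df, (col("a") - col("a").mean()) > 0)
print(out)  # {'a': [3, 4]}
```

Because the expression never holds data of its own, there is no cross-dataframe comparison for a lazy engine to reject: the mask is by construction rooted in the dataframe being filtered.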