Feature request: `get_column_by_name` impairs readability #287

Hey dataframe-api team!

First off, thank you for this effort. I'm a developer of skrub, a data-wrangling package. We aim to support many dataframes, starting with Pandas and Polars, and currently have duplicate logic. We think the dataframe-API is the way for us (see this discussion) and hope to make the switch soon.

I followed the discussion of #249 and I'm pretty upset about having to use `df.get_column_by_name("a")` instead of a simpler `df["a"]` or `col("a")`. This will obfuscate our code and impair readability, and therefore we may consider keeping our duplicate logic. To keep an acceptable UX, could we consider having at least a shorter alias to `get_column_by_name`?
The dataframe standard seems, to me, to graft a backwards approach onto dataframe compatibility. In a sanely planned-out world (which didn't develop, for good reasons of iterative development and other factors), I would think that most dataframe libraries would have verbose names to begin with. In that case, the dataframe_standard would be in an easier place: write your syntactic sugar library on top of multiple dataframe backends. What has actually developed is that the dataframe_standard is built on different libraries' syntactic sugar, and it is cumbersome to use the dataframe_standard library.

For some applications, like plotting libraries, parts of buckaroo (core table translation, pluggable_analysis_framework), and a lot of data engineering tasks, the dataframe_standard API is workable. However, when you want to write actual analysis code, like the analysis/summary-stats functions in buckaroo, or skrub-data code, the dataframe_standard API becomes very cumbersome. The dataframe_standard API looks like Java code to me.

What would it look like to split Polars into a polars_core which exposes the dataframe_standard API, and a polars_analysis built on top of polars_core (or any other library implementing dataframe_standard)?

I could be way off with some of the assumptions I have made here. Everyone involved is tackling a hard problem and doing good work; I'm not trying to denigrate any of it.
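To make the layering idea concrete, here is a toy sketch of what a syntactic-sugar layer over a standard-compliant core could look like. Everything here is hypothetical - `SugarFrame` is an illustrative name, and the only standard method assumed is the `get_column_by_name` spelling this issue is about:

```python
class SugarFrame:
    """Hypothetical sugar layer over any dataframe_standard-compliant backend."""

    def __init__(self, std_df):
        self._df = std_df  # any object exposing the standard API

    def __getitem__(self, name):
        # the short spelling simply delegates to the verbose standard name
        return self._df.get_column_by_name(name)


# usage (assuming `core_df` is e.g. a hypothetical polars_core dataframe):
# sugar = SugarFrame(core_df)
# sugar['a']  # equivalent to core_df.get_column_by_name('a')
```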
First, to address @Vincent-Maladiere's issue: I am in agreement. Second, to comment on what @paddymul said above: I think it's an accurate summary of the state of things. The standard API is "workable" (certainly not easy or pretty) if you're a library developer wanting to write your code in a way that people can use it with multiple dataframe implementations. If you're an end user or data scientist, you are likely not going to like using the standard API (because it is too minimal and sometimes verbose). For better or for worse, developing a standard focused on library developers rather than the data scientist is the goal this group has aligned on.

All this being said, the point is taken that, to the extent possible, we should endeavor to make the standard API as friendly as we can. Like any API, I expect that will be an iterative process.
Thanks all for your feedback. Library developers count as users - they value strictness more than flexibility (compared with analysts), but nobody wants messy code such as that presented in #286.
Someone else reached out to me on LinkedIn to express the same feeling. With that said, I'm -1 on anything longer than `col`.
I don't think the longer name is such a problem in practice. If the variable referring to the dataframe is particularly long and one wishes to improve readability, an alias works just fine. For example, in Pandas it's not unreasonable to do the following:

```python
# instead of:
ans = (
    my_long_variable_name['column_name'].max() -
    my_long_variable_name['column_name'].min()
) / my_long_variable_name['column_name'].mean()

# write:
vals = my_long_variable_name['column_name']
ans = (vals.max() - vals.min()) / vals.mean()
```
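For what it's worth, the same aliasing trick works with the standard API's longer spelling - a minimal sketch, assuming a standard-compliant `df` whose columns support `min`/`max`/`mean` reductions:

```python
# alias the verbose accessor once, then the analysis code stays terse:
vals = df.get_column_by_name('column_name')
ans = (vals.max() - vals.min()) / vals.mean()
```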
Yes, and this is also a fairly common complaint about pandas syntax :) Anyway, happy to agree to disagree on this point, so long as we can move forwards - let's discuss.
Just to add to the discussion though - it's not just about readability, but performance as well. The cuDF API is limited here (because the pandas one is):

```python
df = df.filter(df.col('a') > df.col('b') + 1)
```

means you have to do several passes through the dataframe: one to materialize `df.col('b') + 1`, one to materialize the boolean comparison, and one to perform the filter itself.

Whereas in Polars, even for an eager dataframe, you can still express the above using expressions, as in `df.filter(pl.col('a') > pl.col('b') + 1)`, and let the query engine evaluate it without materializing the intermediates.
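For reference, here is what that looks like in eager Polars - a minimal, runnable sketch of the pattern described above:

```python
import polars as pl

df = pl.DataFrame({'a': [1, 10, 5], 'b': [1, 2, 3]})

# the comparison is a Polars expression, not a materialized column;
# `filter` hands it to the query engine to evaluate:
out = df.filter(pl.col('a') > pl.col('b') + 1)
```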
Absolutely - materialization of intermediate results is a fact of life when you have expressions like these. Still, I think it is the implementation, not necessarily the API, that should worry about lazy semantics or materialization of intermediates. For example, Dask (I believe) can do the above lazily without needing a change to the Pandas API for indexing.

Incidentally, cuDF does have the equivalent of Pandas' `query`.
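A rough illustration of the Dask point - the same filter built from unchanged pandas-style indexing, evaluated lazily (a sketch; `npartitions=1` is just for the example):

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'a': [1, 10, 5], 'b': [1, 2, 3]})
df = dd.from_pandas(pdf, npartitions=1)

# building the boolean mask and indexing with it only records the task
# graph; nothing is materialized until compute() is called:
out = df[df['a'] > df['b'] + 1]
result = out.compute()
```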
Say you have

```python
df = df.filter(df.col('a') > df.col('b') + 1)
```

How would your implementation evaluate this without the extra materialisation steps?
To be clear, our implementation wouldn't evaluate this without extra materialization, although an implementation could. Conceptually (I hope you'll forgive my shortcut of using a library for expression manipulation):

```python
from pymbolic.primitives import Variable, Comparison

import cudf


class MyColumn(Variable):
    def __gt__(self, other):
        return Comparison(self, ">", other)


class MyDataFrame(cudf.DataFrame):
    def col(self, name):
        # return a symbolic column rather than materialized data
        return MyColumn(name)

    def filter(self, expr):
        # the expression stringifies to something `query` understands
        return super().query(str(expr))
```

```python
In [2]: df = MyDataFrame({'a': [1, 10, 5], 'b': [1, 2, 3]})

In [3]: df
Out[3]:
    a  b
0   1  1
1  10  2
2   5  3

In [4]: df.col('a')
Out[4]: MyColumn('a')

In [5]: df.col('a') > df.col('b') + 1
Out[5]: Comparison(MyColumn('a'), '>', Sum((MyColumn('b'), 1)))

In [6]: df.filter(df.col('a') > df.col('b') + 1)
Out[6]:
    a  b
1  10  2
2   5  3
```
Thanks for showing that example - could you also show how such an implementation would deal with reductions and conversions, like `df.col('a').mean()` or `df.col('a').to_array()`?
One way to do it is for the column to keep track of its originating dataframe:

```python
class MyColumn(Variable):
    def __init__(self, df, name):
        self._df = df  # remember the dataframe this column came from
        super().__init__(name)

    def __gt__(self, other):
        return Comparison(self, ">", other)

    def to_array(self):
        return self._df[self.name].values

    def mean(self):
        # could return a `MyScalar` here instead:
        return self._df[self.name].mean()


class MyDataFrame(cudf.DataFrame):
    def col(self, name):
        return MyColumn(self, name)

    def filter(self, expr):
        return self.query(str(expr))
```

```python
In [2]: df = MyDataFrame({'a': [1, 10, 5], 'b': [1, 2, 3]})

In [3]: df.col('a').mean()
Out[3]: 5.333333333333333

In [4]: df.col('a').to_array()
Out[4]: array([ 1, 10,  5])
```

If you prefer, we can move this discussion to a separate issue, as we seem to be getting considerably further away from the readability of `get_column_by_name`.
Yup, that's my conclusion too - which strengthens my desire to disallow comparisons between columns which originate from different dataframes. It's not just a lazy-execution thing. I'll open a PR and we can discuss there.

To be clear, I think it's an implementation detail that the Column needs to do this. But if we want to discuss making it explicitly part of the spec for a Column, I'm open to that.

Agree that it's an implementation detail - and to make it possible for an implementation like the one you've shown to be standard-compliant, I think we need to state that comparisons between columns from different dataframes are outside the remit of the standard: #298.
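To illustrate why such comparisons are problematic, reusing the hypothetical `MyColumn`/`MyDataFrame` sketch from earlier in this thread:

```python
df1 = MyDataFrame({'a': [1, 2, 3]})
df2 = MyDataFrame({'a': [3, 2, 1]})

# both sides stringify to the bare column name 'a', so the originating
# dataframes are lost and `query` ends up comparing df1's 'a' with itself:
expr = df1.col('a') > df2.col('a')
df1.filter(expr)  # evaluates query('a > a') - silently wrong
```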