Feature request: get_column_by_name impairs readability #287

Closed
Vincent-Maladiere opened this issue Oct 20, 2023 · 14 comments · Fixed by #290

Comments

@Vincent-Maladiere

Hey dataframe-api team!

First off, thank you for this effort. I'm a developer of skrub, a data-wrangling package. We aim to support many dataframes, starting with Pandas and Polars, and currently have duplicate logic.

We think the dataframe-API is the way for us (see this discussion) and hope to make the switch soon.

I followed the discussion of #249 and I'm pretty upset about having to use df.get_column_by_name("a") instead of a simpler df["a"] or col("a"). This will obfuscate our code and impair readability, and therefore we may consider keeping our duplicate logic.

To keep an acceptable UX, could we consider having at least a shorter alias for get_column_by_name?
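For concreteness, here is the difference a shorter spelling makes (df.col is illustrative only, not something the spec currently defines):

# what the spec requires today:
result = df.get_column_by_name("a") > df.get_column_by_name("b")

# what a shorter alias could look like:
result = df.col("a") > df.col("b")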

@paddymul

To me, the dataframe standard seems to graft a backwards approach onto dataframe compatibility.

In a sanely planned-out world (which didn't develop, for good reasons of iterative development and other factors), I would think that most dataframe libraries would have verbose names like get_column_by_name. I could then see different layers of syntactic sugar (col("a"), df["a"], df.loc[:,"a"]) built on top of the core verbose syntax, for different preferences in iterative data science.

In that case, the dataframe_standard would be in an easier place... write your syntactic sugar library on top of multiple dataframe backends. What has actually developed is that the dataframe_standard is built on different libraries' syntactic sugar. It is cumbersome to use the dataframe_standard library.
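A hedged sketch of that layering (SugarFrame is a hypothetical name, not an existing library):

# verbose core API, as the standard defines it:
value = df.get_column_by_name("a")

# a thin sugar layer any library could ship on top of the core:
class SugarFrame:
    def __init__(self, df):
        self._df = df

    def __getitem__(self, name):
        # lets users write frame["a"] in terms of the verbose core call
        return self._df.get_column_by_name(name)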

For some applications, like plotting libraries, parts of buckaroo (core table translation, pluggable_analysis_framework), and a lot of data engineering tasks, the dataframe_standard api is workable. However, when you want to write actual analysis code, like the analysis/summary-stats functions in buckaroo or skrub-data code, the dataframe_standard api becomes very cumbersome. It looks like Java code to me.

What would it look like to split polars into polars core which exposes the dataframe_standard api, then polars_analysis which is built on top of the polars_core (or any other implementing dataframe_standard library)?

I could be way off with some of the assumptions I have made here. Everyone involved is tackling a hard problem and doing good work; I'm not trying to denigrate any of it.

@shwina
Contributor

shwina commented Oct 20, 2023

First, to address @Vincent-Maladiere's issue: I agree that get_column_by_name (in particular) is somewhat verbose, and we should consider shortening the name of such a commonly used API. There are reasons for and against doing that, but at the very least we should reconsider.

Second, to comment on what @paddymul said above: I think it's an accurate summary of the state of things. The standard API is "workable" (certainly not easy or pretty) if you're a library developer wanting to write your code in a way that people can use it with multiple DataFrame implementations. If you're an end user or data scientist, you are likely not going to like using the standard API (because it is too minimal and sometimes verbose). For better or for worse, developing a standard focused on library developers rather than the data scientist is the goal this group has aligned on.

All this being said, the point is taken that to the extent possible, we should endeavor to make the standard API as friendly as we can. Like any API, I expect that will be an iterative process.

@MarcoGorelli
Contributor

Thanks all for your feedback

Library developers count as users - they value strictness more than flexibility (compared with analysts), but nobody wants messy code such as that presented in #286

and therefore we may consider keeping our duplicate logic

Someone else reached out to me on LinkedIn to express the same feeling.
I was hoping we'd be able to disprove this comment, but I wouldn't take it for granted.

With that said, I'm -1 on anything longer than df.col (even that's too long really, as it requires you to repeat the dataframe name over and over again...)

@shwina
Contributor

shwina commented Oct 23, 2023

as it requires you to repeat the dataframe name over and over again

I don't think that namespace.col() is the right answer to this particular problem. The same problem arises for any Python object you wish to index multiple times in the same expression (e.g., a list).

If the variable referring to the dataframe is particularly long and one wishes to improve readability, an alias would work just fine. For example, in Pandas it's not unreasonable to do the following:

# instead of:
ans = (
    my_long_variable_name['column_name'].max()
    - my_long_variable_name['column_name'].min()
) / my_long_variable_name['column_name'].mean()

# write:
vals = my_long_variable_name['column_name']
ans = (vals.max() - vals.min()) / vals.mean()

@MarcoGorelli
Contributor

Yes, and this is also a fairly common complaint about pandas syntax :)

Anyway, happy to agree to disagree on this point, so long as we can move forwards - let's discuss DataFrame.col in the next call

@MarcoGorelli
Contributor

Just to add to the discussion, though: it's not just about readability, but performance as well.

The cuDF API is limited here (because the pandas one is):

df = df.filter(df.col('a') > df.col('b') + 1)

means you have to do several passes through the dataframe (sketched in code below):

  • first, you need to materialise df.col('a')
  • then, you need to materialise df.col('b')
  • then you need to materialise df.col('b') + 1
  • then you need to materialise df.col('a') > df.col('b') + 1
  • finally, you can do your filter using a boolean mask
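Spelled out as (hypothetical) eager code, those passes look like:

a = df.col('a')          # pass 1: materialise column 'a'
b = df.col('b')          # pass 2: materialise column 'b'
b_plus_1 = b + 1         # pass 3: materialise b + 1
mask = a > b_plus_1      # pass 4: materialise the boolean mask
df = df.filter(mask)     # finally, filter with the mask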

Whereas in Polars, even for eager dataframes, you can still express the above using pl.col (a "lazy column") without having to materialise intermediate results:

df = df.filter(pl.col('a') > pl.col('b') + 1)

as pl.col('a') > pl.col('b') + 1 stays lazy.
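A runnable illustration (the data values here are made up):

import polars as pl

df = pl.DataFrame({"a": [1, 10, 5], "b": [1, 2, 3]})

# pl.col builds an expression tree; nothing is materialised until
# .filter() evaluates the whole predicate in a single pass
print(df.filter(pl.col("a") > pl.col("b") + 1))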

@shwina
Contributor

shwina commented Oct 24, 2023

Absolutely - materialization of intermediate results is a fact of life when you have expressions like a + b + c + .... This motivates the development of libraries such as numexpr, and a significant appeal of lazy evaluation is that it can avoid the problem.

Still, I think it is the implementation, not necessarily the API, that should worry about lazy semantics or the materialization of intermediates.

For example, Dask (I believe) can do the above lazily without having needed a change to the Pandas API for indexing.

The cuDF API is limited here (because the pandas one is):

Incidentally, cuDF does have the equivalent of Pandas' .apply(), .eval(), and .where() methods, all of which solve this problem.
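For instance, .eval() receives the whole predicate as a string, so the engine can evaluate it without materialising each intermediate. A minimal pandas sketch (per the above, cuDF exposes an equivalent):

import pandas as pd

df = pd.DataFrame({"a": [1, 10, 5], "b": [1, 2, 3]})

# the engine sees the full expression at once rather than a chain of
# already-materialised temporaries
mask = df.eval("a > b + 1")
print(df[mask])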

@MarcoGorelli
Contributor

Say you have df which is an (eager) cudf dataframe, and a user writes

df = df.filter(df.col('a') > df.col('b') + 1)

How would your implementation evaluate this without the extra materialisation steps?

@shwina
Contributor

shwina commented Oct 24, 2023

How would your implementation evaluate this without the extra materialisation steps?

To be clear, our implementation wouldn't evaluate this without extra materialization, although an implementation could.

Conceptually (I hope you'll forgive my shortcut of using a library for expression manipulation):

from pymbolic.primitives import Variable, Comparison
import cudf

class MyColumn(Variable):
    # columns are symbolic: comparing them builds an expression tree
    # rather than materialising a boolean mask
    def __gt__(self, other):
        return Comparison(self, ">", other)

class MyDataFrame(cudf.DataFrame):
    def col(self, name):
        return MyColumn(name)

    def filter(self, expr):
        # render the expression tree to a string and let query()
        # evaluate it in a single pass
        return super().query(str(expr))

In [2]: df = MyDataFrame({'a': [1, 10, 5], 'b': [1, 2, 3]})

In [3]: df
Out[3]:
    a  b
0   1  1
1  10  2
2   5  3

In [4]: df.col('a')
Out[4]: MyColumn('a')

In [5]: df.col('a') > df.col('b') + 1
Out[5]: Comparison(MyColumn('a'), '>', Sum((MyColumn('b'), 1)))

In [6]: df.filter(df.col('a') > df.col('b') + 1)
Out[6]:
    a  b
1  10  2
2   5  3

@MarcoGorelli
Contributor

Thanks for showing that example - could you also show how such an implementation would deal with:

  • df.col('a').mean()
  • df.col('a').to_array()

?

@shwina
Contributor

shwina commented Oct 24, 2023

One way to do it is for the column to keep track of its originating DataFrame:

class MyColumn(Variable):
    def __init__(self, df, name):
        # keep a reference to the parent dataframe so that terminal
        # operations (mean, to_array, ...) can evaluate against it
        self._df = df
        super().__init__(name)

    def __gt__(self, other):
        return Comparison(self, ">", other)

    def to_array(self):
        return self._df[self.name].values

    def mean(self):
        # could return a `MyScalar` here instead:
        return self._df[self.name].mean()


class MyDataFrame(cudf.DataFrame):
    def col(self, name):
        return MyColumn(self, name)

    def filter(self, expr):
        return self.query(str(expr))

In [2]: df = MyDataFrame({'a': [1, 10, 5], 'b': [1, 2, 3]})

In [3]: df.col('a').mean()
Out[3]: 5.333333333333333

In [4]: df.col('a').to_array()
Out[4]: array([ 1, 10,  5])

If you prefer, we can move this discussion to a separate issue, as we seem to be getting considerably further away from the readability of get_column_by_name().

@MarcoGorelli
Contributor

One way to do it is for the column to keep track of its originating DataFrame:

Yup, that's my conclusion too - which strengthens my desire to disallow comparisons between columns which originate from different dataframes. It's not just a lazy-execution thing. I'll open a PR, and we can discuss there.
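A sketch of the ambiguity (df1 and df2 are hypothetical standard-compliant dataframes):

# with parent-tracking columns, this comparison has two candidate parents:
mask = df1.col("a") > df2.col("b")

# which dataframe should evaluate the mask? df1.filter(mask) and
# df2.filter(mask) are both plausible readings, so the standard can
# simply leave cross-dataframe comparisons out of scope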

@shwina
Contributor

shwina commented Oct 24, 2023

To be clear, I think it's an implementation detail that the Column needs to do this. But, if we want to discuss making it explicitly part of the spec for a Column, I'm open to that.

@MarcoGorelli
Contributor

Agree that it's an implementation detail - and to make it possible for an implementation like the one you've shown to be standard-compliant, I think we need to state that comparisons between columns from different dataframes are outside the remit of the standard: #298
