Add GroupBy.aggregate (and tpch-1 query to examples) #286

MarcoGorelli · 2023-10-17T16:56:36Z

closes #274

the gist of the PR is that it lets you write

    result = (
        lineitem.filter(mask)
        .group_by(["l_returnflag", "l_linestatus"])
        .aggregate(
            namespace.Aggregation.sum("l_quantity").rename("sum_qty"),
            namespace.Aggregation.sum("l_extendedprice").rename("sum_base_price"),
            namespace.Aggregation.sum("l_disc_price").rename("sum_disc_price"),
            namespace.Aggregation.sum("l_charge").rename("sum_charge"),
            namespace.Aggregation.mean("l_quantity").rename("avg_qty"),
            namespace.Aggregation.mean("l_discount").rename("avg_disc"),
            namespace.Aggregation.size().rename("count_order"),
        )
        .sort(["l_returnflag", "l_linestatus"])
    )
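
For readers unfamiliar with the query, here is a minimal sketch of what a grouped aggregation like the one above computes. It is written in plain Python rather than the Standard's API, the rows are made up, and `group_and_aggregate` is a hypothetical helper used only for illustration. It covers one sum, two means and the row count.

```python
from collections import defaultdict

# Made-up rows standing in for the filtered `lineitem` table.
rows = [
    {"l_returnflag": "A", "l_linestatus": "F", "l_quantity": 10, "l_discount": 0.5},
    {"l_returnflag": "A", "l_linestatus": "F", "l_quantity": 30, "l_discount": 0.25},
    {"l_returnflag": "N", "l_linestatus": "O", "l_quantity": 5, "l_discount": 0.0},
]


def group_and_aggregate(rows, keys):
    """Group rows by `keys`, then compute a sum, two means and a row count per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)

    result = []
    for key in sorted(groups):  # plays the role of the trailing .sort(keys)
        members = groups[key]
        quantities = [r["l_quantity"] for r in members]
        discounts = [r["l_discount"] for r in members]
        out = dict(zip(keys, key))
        out["sum_qty"] = sum(quantities)
        out["avg_qty"] = sum(quantities) / len(members)
        out["avg_disc"] = sum(discounts) / len(members)
        out["count_order"] = len(members)
        result.append(out)
    return result


result = group_and_aggregate(rows, ["l_returnflag", "l_linestatus"])
```

Each output row carries the group keys plus one value per renamed aggregation, which is the shape the `.rename(...)` calls in the PR description are naming.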

MarcoGorelli · 2023-10-17T20:20:14Z

spec/API_specification/examples/tpch/q1.py

+    lineitem = lineitem.assign(
+        [
+            (
+                lineitem.get_column_by_name("l_extendedprice")
+                * (1 - lineitem.get_column_by_name("l_discount"))
+            ).rename("l_disc_price"),
+            (
+                lineitem.get_column_by_name("l_extendedprice")
+                * (1 - lineitem.get_column_by_name("l_discount"))
+                * (1 + lineitem.get_column_by_name("l_tax"))
+            ).rename("l_charge"),
+        ]
+    )


this syntax though...

Hi @shwina - are you ok with this syntax?

Personally I think it's worse than what's in any existing dataframe library, and I can't imagine any user ever wanting to write code like this

but maybe it's just me

Please correct me if I'm wrong, but I thought the goal of the standard right now is to provide an API focused on third-party library developers (not end users). This is why we have been comfortable sacrificing syntactic crispness or an expressive API in favor of being the "lowest common denominator" that all libraries can implement.

I think this necessarily means the API isn't quite as nice to work with for the end-user.

For example, changing get_column_by_name to just [ ] in the code above would be a massive boost in readability, but we explicitly decided against it because (IIRC) we wanted library authors to have the freedom to decide what [ ] should mean for their library

That being said, I agree with you 100% that this looks a mess. The question is whether library developers are going to be OK with dealing with a messy API to get cross-library compatibility in return...
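
To make the readability trade-off concrete, here is a rough sketch that computes the same discounted price with both spellings. `ToyColumn`, `ToyFrame` and `ToyFrameWithBrackets` are hypothetical classes invented for this illustration. They are not part of the Standard or of any library, and the data is made up.

```python
class ToyColumn:
    """Hypothetical column supporting just enough arithmetic for the example."""

    def __init__(self, values):
        self.values = list(values)

    def _binary(self, other, op):
        # Broadcast scalars; pair up element-wise with another column.
        if isinstance(other, ToyColumn):
            others = other.values
        else:
            others = [other] * len(self.values)
        return ToyColumn(op(a, b) for a, b in zip(self.values, others))

    def __mul__(self, other):
        return self._binary(other, lambda a, b: a * b)

    def __rsub__(self, other):
        # Handles `1 - column`: `other` is the left operand.
        return self._binary(other, lambda a, b: b - a)


class ToyFrame:
    """Hypothetical frame exposing only the verbose, explicitly named lookup."""

    def __init__(self, data):
        self._data = {name: ToyColumn(values) for name, values in data.items()}

    def get_column_by_name(self, name):
        return self._data[name]


class ToyFrameWithBrackets(ToyFrame):
    """A library's own frame may give [] any meaning; here it is column lookup."""

    def __getitem__(self, name):
        return self.get_column_by_name(name)


data = {"l_extendedprice": [100.0, 200.0], "l_discount": [0.5, 0.25]}

standard = ToyFrame(data)
verbose = standard.get_column_by_name("l_extendedprice") * (
    1 - standard.get_column_by_name("l_discount")
)

native = ToyFrameWithBrackets(data)
concise = native["l_extendedprice"] * (1 - native["l_discount"])
```

Both expressions produce the same values; the only difference is how much of the line is taken up by column lookup.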

> I agree with you 100% that this looks a mess

Well I'm glad we could find some common ground 😄

Let's discuss more next week - I'm genuinely interested in finding a solution that works for everybody

My current prediction is that, unless the standard drastically improves, libraries will just support pandas and Polars and ignore the standard completely

The end result for cudf will be that you'll be no better off than you are now

As I was saying... (emphasis mine)

> I'm pretty upset about having to use df.get_column_by_name("a") instead of a simpler df["a"] or col("a"). This will obfuscate our code and impair readability, and therefore we may consider keeping our duplicate logic

#287

That's fair. We should shorten the name.

any suggestions?

Being addressed in #290

spec/API_specification/dataframe_api/groupby_object.py

kkraus14

LGTM other than the move to df.col

MarcoGorelli · 2023-10-26T10:18:25Z

thanks for your review

we can rename to col if/when the other PR is in

MarcoGorelli added the API design label Oct 17, 2023

MarcoGorelli commented Oct 17, 2023

MarcoGorelli mentioned this pull request Oct 23, 2023

Feature request: get_column_by_name impairs readability #287

Closed

MarcoGorelli added 4 commits October 23, 2023 14:46

add Aggregation API

21be6ff

fixup

e0681ab

add q1

4ac1d5d

note what happens if rename isnt called

abc3092

MarcoGorelli marked this pull request as draft October 23, 2023 15:38

MarcoGorelli force-pushed the group-by-agg branch from 977a80d to fdc1c55 Compare October 23, 2023 15:58

MarcoGorelli marked this pull request as ready for review October 23, 2023 15:58

MarcoGorelli force-pushed the group-by-agg branch from fdc1c55 to 21be6ff Compare October 23, 2023 15:59

MarcoGorelli requested review from rgommers, jorisvandenbossche and kkraus14 October 23, 2023 16:01

MarcoGorelli mentioned this pull request Oct 23, 2023

GroupBy.aggregate #274

Closed

kkraus14 reviewed Oct 23, 2023

spec/API_specification/dataframe_api/groupby_object.py (outdated, resolved)

kkraus14 approved these changes Oct 23, 2023

MarcoGorelli added 4 commits October 24, 2023 08:31

Merge remote-tracking branch 'upstream/main' into group-by-agg

32256ed

Merge remote-tracking branch 'upstream/main' into group-by-agg

06681eb

typing

e55ebd8

fixup;

5112b12

MarcoGorelli merged commit 1addc2a into data-apis:main Oct 26, 2023
