Skip to content

Add GroupBy.aggregate (and tpch-1 query to examples) #286

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Oct 26, 2023

Conversation

MarcoGorelli
Copy link
Contributor

@MarcoGorelli MarcoGorelli commented Oct 17, 2023

closes #274

the gist of the PR is that it lets you write

    result = (
        lineitem.filter(mask)
        .group_by(["l_returnflag", "l_linestatus"])
        .aggregate(
            namespace.Aggregation.sum("l_quantity").rename("sum_qty"),
            namespace.Aggregation.sum("l_extendedprice").rename("sum_base_price"),
            namespace.Aggregation.sum("l_disc_price").rename("sum_disc_price"),
            namespace.Aggregation.sum("change").rename("sum_charge"),
            namespace.Aggregation.mean("l_quantity").rename("avg_qty"),
            namespace.Aggregation.mean("l_discount").rename("avg_disc"),
            namespace.Aggregation.size().rename("count_order"),
        )
        .sort(["l_returnflag", "l_linestatus"])
    )

Comment on lines 12 to 24
lineitem = lineitem.assign(
[
(
lineitem.get_column_by_name("l_extended_price")
* (1 - lineitem.get_column_by_name("l_discount"))
).rename("l_disc_price"),
(
lineitem.get_column_by_name("l_extended_price")
* (1 - lineitem.get_column_by_name("l_discount"))
* (1 + lineitem.get_column_by_name("l_tax"))
).rename("l_charge"),
]
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this syntax though...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @shwina - are you ok with this syntax?

Personally I think it's worse than what's in any existing dataframe library, and I can't imagine any user ever wanting to write code like this

but maybe it's just me

Copy link
Contributor

@shwina shwina Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please correct me if I'm wrong, but I thought the goal of the standard right now is to provide an API focused on third-party library developers (not end users). This is why we have been comfortable sacrificing syntactic crispness or an expressive API in favor of being the "lowest common denominator" that all libraries can implement.

I think this necessarily means the API isn't quite as nice to work with for the end-user.

For example, changing get_column_by_name to just [ ] in the code above would be a massive boost in readability, but we explicitly decided against it because (IIRC) we wanted library authors to have the freedom to decide what [ ] should mean for their library

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That being said, I agree with you 100% that this looks a mess. It's a question whether library developers are going to be OK with dealing with a messy API to get cross-library compatibility in return...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you 100% that this looks a mess

Well I'm glad we could find some common ground 😄

Let's discuss more next week - I'm genuinely interested in finding a solution that works for everybody

My current prediction is that, unless the standard drastically improves, that libraries will just support pandas and Polars and ignore the standard completely

The end result for cudf will be that you'll be no better off than you are now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I was saying...(emphasis mine)

I'm pretty upset about having to use df.get_column_by_name("a") instead of a simpler df["a"] or col("a"). This will obfuscate our code and impair readability, and therefore we may consider keeping our duplicate logic

#287

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. We should shorten the name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any suggestions?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being addressed in #290

@MarcoGorelli MarcoGorelli marked this pull request as draft October 23, 2023 15:38
@MarcoGorelli MarcoGorelli marked this pull request as ready for review October 23, 2023 15:58
@MarcoGorelli MarcoGorelli mentioned this pull request Oct 23, 2023
Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than the move to df.col

@MarcoGorelli
Copy link
Contributor Author

thanks for your review

we can rename to col if/when the other PR is in

@MarcoGorelli MarcoGorelli merged commit 1addc2a into data-apis:main Oct 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GroupBy.aggregate
3 participants