No query or describe - call out common but missing methods? #440

Closed
ianozsvald opened this issue Nov 29, 2023 · 5 comments

@ianozsvald

Having just experimented with dask-expr (it is nice!), might it be worth noting that whilst query isn't available, a boolean mask works as expected, e.g. ddfx.query("customer_years < 3 and customer_type=='b'").compute() is the same as

mask = (ddfx['customer_years'] < 3) & (ddfx['customer_type'] == 'b')
ddfx[mask].compute()
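
A small sketch of the same equivalence in plain pandas, assuming made-up sample data (the values below are illustrative and not from the original report). The mask expression is the same one that would be applied to the dask-expr frame:

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the example above.
pdf = pd.DataFrame(
    {
        "customer_years": [1, 5, 2, 7, 0],
        "customer_type": ["b", "b", "a", "c", "b"],
    }
)

# String-based filter, as query would express it.
via_query = pdf.query("customer_years < 3 and customer_type == 'b'")

# Equivalent boolean mask built from column comparisons.
mask = (pdf["customer_years"] < 3) & (pdf["customer_type"] == "b")
via_mask = pdf[mask]

# Both approaches select the same rows.
assert via_query.equals(via_mask)
```
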

Also mean isn't listed as being available on the homepage (outside of a groupby/resample) but it worked ok for me on a column (e.g. ddfx[['d_0']].mean()).

It would be lovely to see .describe() just because it is used early in a workflow, so it is likely that new users will hit its absence (which presumably means you'd need median, percentile etc).

@phofl
Collaborator

phofl commented Nov 29, 2023

Thx for the report!

I will add mean to the list of available methods, thx for pointing it out.

We are making a push to fill out the API at the moment, so I am hopeful that this will change very soon. Describe isn't as trivial as the others, unfortunately, since we need to do some groundwork there, but query is easy. You are correct in your assumption about median and percentile: both of them require array integration that isn't built yet (that should change relatively soon as well).

@phofl
Collaborator

phofl commented Nov 29, 2023

I've added query and pushed out a new release (was due anyway), so it is available if you are on 0.2.5 (#441)

Just a note: query will block optimisations for now, since we don't know which columns you are accessing until #386 is available.

@gwvr

gwvr commented Nov 29, 2023

@phofl I'm working with @ianozsvald and I've just started experimenting with dask-expr, replicating some well understood computations that we've also implemented with Dask & Polars.
I'm in the habit of method chaining my computations, so tried to work around the lack of query using both pipe & map_partitions. The first isn't implemented and the second seems to hurt performance, presumably (and unsurprisingly) because it breaks pushdown?

ETA - having worked around the lack of query, performance is very good 👍

@phofl
Collaborator

phofl commented Nov 29, 2023

> ETA - having worked around the lack of query, performance is very good 👍

That sounds good! Feel free to let me know if you find anything that's slower than you would expect.

> The first isn't implemented and the second seems to hurt performance, presumably (and unsurprisingly) because it breaks pushdown?

Yes, that's correct. It is a little bit annoying, but introspection on Python UDFs is unfortunately not reliable. That's my main motivation for #386 (which is basically the col expression from Spark and Polars). This expression type would solve that problem for UDFs in all of these methods.
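
To illustrate the point, here is a minimal pure-Python sketch of why a col-style expression helps where a UDF does not. The `Expr`, `Col`, `Lit` and `BinOp` classes and their `columns()` / `evaluate()` methods are hypothetical names invented for this example. They are not dask-expr's API, nor the design proposed in #386. The idea is only that an expression tree can report which columns it touches, whereas a plain Python function is opaque to an optimiser:

```python
import operator


class Expr:
    """Hypothetical expression node (illustrative only, not dask-expr's API)."""

    def columns(self):
        raise NotImplementedError

    def evaluate(self, row):
        raise NotImplementedError

    # Operators build a new expression node instead of computing a value.
    def __lt__(self, other):
        return BinOp(operator.lt, self, other)

    def __eq__(self, other):
        return BinOp(operator.eq, self, other)

    def __and__(self, other):
        return BinOp(operator.and_, self, other)


class Col(Expr):
    """Reference to a named column."""

    def __init__(self, name):
        self.name = name

    def columns(self):
        return {self.name}

    def evaluate(self, row):
        return row[self.name]


class Lit(Expr):
    """Constant value; touches no columns."""

    def __init__(self, value):
        self.value = value

    def columns(self):
        return set()

    def evaluate(self, row):
        return self.value


class BinOp(Expr):
    """Binary operation over two sub-expressions."""

    def __init__(self, op, left, right):
        self.op = op
        self.left = left
        # Wrap plain Python values so every operand is an expression.
        self.right = right if isinstance(right, Expr) else Lit(right)

    def columns(self):
        return self.left.columns() | self.right.columns()

    def evaluate(self, row):
        return self.op(self.left.evaluate(row), self.right.evaluate(row))


# Expression form: the required columns can be read off without running it.
predicate = (Col("customer_years") < 3) & (Col("customer_type") == "b")
needed = predicate.columns()

# UDF form: it can only be called, so all columns must be assumed to be needed.
udf = lambda row: row["customer_years"] < 3 and row["customer_type"] == "b"

row = {"customer_years": 1, "customer_type": "b", "d_0": 0.5}
assert predicate.evaluate(row) == udf(row)
```

In this sketch, `needed` contains only `customer_years` and `customer_type`, so a projection could in principle drop `d_0` before the filter runs. Nothing equivalent can be derived reliably from `udf`.
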

@phofl phofl added this to the 1.0 milestone Dec 8, 2023
@phofl
Collaborator

phofl commented Jan 4, 2024

We added describe as well now, will push out a release later today, so closing here

@phofl phofl closed this as completed Jan 4, 2024