Add dim-expression support in Dataset.select #3920
Adds new `compute` and `keep_index` kwargs to the `values` method of all data interfaces. For dask data structures, `compute` controls whether the value is evaluated before being returned. For pandas/dask Series, `keep_index` controls whether the values are returned as a series (`keep_index=True`) or an array (`keep_index=False`).
Makes it possible to filter a dataset using a pandas or dask series whose index matches the index of the element's `data` data frame.
This may be set to a dim predicate expression indicating which rows should be kept. squash into 433640
Nice! `selection_expr` is quite wordy, though. `expr`, maybe?
In any case, is there a strong reason `dim` objects can't simply be accepted the same as any other selection specifier, as the first argument? That would seem much more natural and expected, to me. Are there cases where that would somehow be ambiguous?
The current (without this PR) API signature for `def select(self, selection_specs=None, **selection): ...` So right now, selections are only specified as kwargs, and the first positional argument is treated as a selection specification that I believe configures the nested application of selections. I went with
Ok, that's a good reason for having a longer name, but isn't
No, `selection_specs` specifies which object the selection applies to; `selection_expr` applies to the actual selected objects.
Ah, ok; I had to go back and look at examples to see what was up here; I was only remembering part of how to use To make that work, it seems like all we'd need to do is to make
I'd be happier having One thing that we could do would be to make @philippjfr @jlstevens Do you have strong feelings about the compatibility/convenience trade-off here?
I think that's too much magic to happen silently, but it seems reasonable if accompanied by a warning message saying that people need to use the
Add error message if first arg is not a dim expression, referring users to the selection_specs keyword argument.
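The check described in this commit message can be sketched in plain Python. This is a minimal illustration, not the HoloViews implementation: `DimExpr` is a hypothetical placeholder for the dim expression type, the error wording is invented for the example, and the actual selection logic is omitted.

```python
class DimExpr:
    """Hypothetical placeholder for a dim expression type (not holoviews.dim)."""


def select(data, selection_expr=None, selection_specs=None, **selection):
    # Reject anything other than a dim expression in the first positional
    # slot, and point the caller at the selection_specs keyword instead.
    if selection_expr is not None and not isinstance(selection_expr, DimExpr):
        raise ValueError(
            "The first positional argument to select is expected to be a dim "
            "expression. Use the selection_specs keyword argument to pass a "
            "selection specification."
        )
    return data  # the selection itself is left out of this sketch


# Old-style positional call: raises ValueError mentioning selection_specs.
try:
    select([1, 2], [dict])
except ValueError as err:
    message = str(err)

# Keyword call: accepted.
result = select([1, 2], selection_specs=[dict])
```

Failing loudly here means code that still passes a selection specification positionally gets an actionable message rather than a silently different selection.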
Alright, in 6c1937a I moved The first example above may now be written as: `ds.select((dim('b') >= 20) & (dim('a') % 2 == 0))`
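To show how an expression of this shape can act as a row predicate, here is a pure-Python sketch of deferred evaluation. The `Dim` class and `select` function are hypothetical stand-ins written for this illustration. They are not `holoviews.dim` or `Dataset.select`, and the rows are made-up data.

```python
import operator


class Dim:
    """Hypothetical deferred column expression (a toy, not holoviews.dim)."""

    def __init__(self, name, ops=()):
        self.name = name
        self.ops = tuple(ops)  # recorded (function, operand) pairs

    def _extend(self, fn, other):
        # Return a new expression with one more recorded operation.
        return Dim(self.name, self.ops + ((fn, other),))

    def __ge__(self, other):
        return self._extend(operator.ge, other)

    def __mod__(self, other):
        return self._extend(operator.mod, other)

    def __eq__(self, other):
        return self._extend(operator.eq, other)

    def __and__(self, other):
        return self._extend(operator.and_, other)

    def apply(self, row):
        # Replay the recorded operations against one row (a dict).
        value = row[self.name]
        for fn, other in self.ops:
            if isinstance(other, Dim):
                other = other.apply(row)
            value = fn(value, other)
        return value


def select(rows, expr):
    """Keep the rows for which the expression evaluates to a truthy value."""
    return [row for row in rows if expr.apply(row)]


rows = [
    {"a": 1, "b": 10},
    {"a": 2, "b": 20},
    {"a": 3, "b": 30},
    {"a": 4, "b": 40},
]
selected = select(rows, (Dim("b") >= 20) & (Dim("a") % 2 == 0))
# selected keeps the rows with a == 2 and a == 4
```

Nothing is computed when the expression is built. The comparison and arithmetic operators only record what to do, so the same expression object can later be applied to whatever data it is given.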
I think usage of
But with the simpler names, won't it conflict with possible dimension names as Jon mentioned? Seems likely there will be data sources with columns named
Oh right, missed that and yes that's a good point.
Reading the above before looking at the code, I am happy to see that all the suggestions I was thinking of were proposed and accepted as I was thinking them up. I agree that usage of
@@ -477,7 +477,7 @@ def initialize_unbounded(obj, dimensions, key):
     """
     select = dict(zip([d.name for d in dimensions], key))
     try:
-        obj.select([DynamicMap], **select)
+        obj.select(selection_specs=[DynamicMap], **select)
I count only four times where `selection_specs` had to be specified as a keyword instead of by position! If that is how often that positional argument was used in our own codebase, I am pretty certain users barely used it (if at all).
Looks good to me!
It looks like the appveyor failures are independent of this PR? Seems like they need to be addressed in any case. Otherwise this seems mergeable.
Thanks for the feedback all. Merging!
This PR adds a new `selection_expr` argument to the `Dataset.select` method. This argument may be set to a `holoviews.dim` expression as a way to select rows from a dataset using symbolic expressions.

For example:

As long as the expressions are compatible with Dask DataFrames, this also works for `Dataset`s backed by Dask dataframes.

Not all expressions currently work with Dask data structures. This is something I'd like to improve, but I think that should be in a separate PR.
@jlstevens @philippjfr