Bincount fix slicing #7391
Conversation
The CI is currently failing because, in my attempt to fix the failing cupy test, I've used the NEP-35 numpy array creation dispatch, which is only available in NumPy >= 1.20.
Thanks @GenevieveBuckley, I can confirm the test passes on my end as well. The fact that we now need NumPy >= 1.20 is technically a regression, but probably a small one; I'm fine with telling users they need to upgrade. But could you add the line below to test_bincount to ensure it doesn't fail on older NumPy versions?
@pytest.mark.skipif(not _numpy_120, reason="NEP-35 is not available")
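For context, here is a minimal sketch of how that marker might sit on the test. The test body is illustrative only, and _numpy_120 is dask's internal NumPy-version flag (import path assumed):

import pytest
import numpy as np
import dask.array as da
from dask.array.numpy_compat import _numpy_120  # True on NumPy >= 1.20

@pytest.mark.skipif(not _numpy_120, reason="NEP-35 is not available")
def test_bincount():
    # Illustrative body: check that bincount output reports its shape.
    x = da.from_array(np.array([0, 1, 1, 2]), chunks=2)
    result = da.bincount(x, minlength=3)
    assert result.shape == (3,)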
Thanks @GenevieveBuckley for finding this! In xhistogram (xgcm/xhistogram#27) we are experiencing problems with bincount. Specifically:

import dask.array as dsa

ones = dsa.ones(shape=(100,), chunks=(110), dtype='int')
print(dsa.bincount(ones, minlength=2))
# -> dask.array<_bincount_agg-aggregate, shape=(), dtype=int64, chunksize=(), chunktype=numpy.ndarray>

The output of bincount has empty shape and chunks attributes. It looks like this PR will fix that. So 🙌 for your work!
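For comparison, a hedged sketch of the behavior this PR should restore for the snippet above (the exact repr details are illustrative):

result = dsa.bincount(ones, minlength=2)
print(result.shape)          # expected: (2,) instead of ()
print(result[1].compute())   # expected: 100 -- slicing works again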
It's nice to hear that @rabernat - this has been an unexpectedly high-impact change. I only knew about the problems in scikit-image when I started looking into it.
The cupy bincount test passes locally now (I have cupy version 9.0.0b3 and numpy version 1.20.1).

I think this is ready to merge @dask/maintenance
@GenevieveBuckley thanks for the latest changes. I added another suggestion on making this a bit safer; not sure if that's really possible to happen in Dask, but I guess there's no harm in doing so.
Thanks for your work on this @GenevieveBuckley! Since axis is always (0,), we can streamline some of the logic around constructing chunks tuples (see the diff below). My hope is that this will make reasoning about the logic here easier in the future.

Diff:
diff --git a/dask/array/routines.py b/dask/array/routines.py
index 8d53541e..69f56b63 100644
--- a/dask/array/routines.py
+++ b/dask/array/routines.py
@@ -645,7 +645,6 @@ def bincount(x, weights=None, minlength=0, split_every=None):
         if weights.chunks != x.chunks:
             raise ValueError("Chunks of input array x and weights must match.")

-    axis = (0,)
     token = tokenize(x, weights, minlength)
     args = [x, "i"]
     if weights is not None:
@@ -655,32 +654,27 @@
         meta = array_safe(np.bincount([]), x._meta)

     if minlength == 0:
-        output_size = np.nan
+        output_size = (np.nan,)
     else:
-        output_size = minlength
+        output_size = (minlength,)

     chunked_counts = blockwise(
         partial(np.bincount, minlength=minlength), "i", *args, token=token, meta=meta
     )
-    chunked_counts._chunks = tuple(
-        (output_size,) * len(c) if i in axis else c
-        for i, c in enumerate(chunked_counts.chunks)
-    )
+    chunked_counts._chunks = (output_size * len(chunked_counts.chunks[0]), *chunked_counts.chunks[1:])

     from .reductions import _tree_reduce

     output = _tree_reduce(
         chunked_counts,
         aggregate=partial(_bincount_agg, dtype=meta.dtype),
-        axis=axis,
+        axis=(0,),
         keepdims=True,
         dtype=meta.dtype,
         split_every=split_every,
         concatenate=False,
     )
-    output._chunks = tuple(
-        (output_size,) if i in axis else c for i, c in enumerate(chunked_counts.chunks)
-    )
+    output._chunks = (output_size, *chunked_counts.chunks[1:])
     output._meta = meta
     return output
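For intuition, here is a small sketch of the chunks bookkeeping this diff performs. The chunk values below are hypothetical, chosen only for illustration:

# Suppose the input has three chunks along axis 0 and minlength=5.
output_size = (5,)            # a 1-tuple, as in the diff
chunks_before = ((4, 4, 2),)  # hypothetical chunks of the blockwise result

# Before tree-reduction, each per-block bincount is declared to have
# length minlength, so every chunk along axis 0 becomes output_size:
chunks_declared = (output_size * len(chunks_before[0]), *chunks_before[1:])
assert chunks_declared == ((5, 5, 5),)

# After tree-reduction, a single chunk of length minlength remains:
chunks_final = (output_size, *chunks_before[1:])
assert chunks_final == ((5,),)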
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
Thank you for the feedback @pentschev and @jrbourbeau
LGTM, thanks @GenevieveBuckley! 🙂
Thanks for all your work on this @GenevieveBuckley and for reviewing @pentschev! This is in.

@pentschev could you clarify when NumPy >= 1.20 is needed? I noticed we're only skipping …
@jrbourbeau I hope this helps explain it: if we use …
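The cut-off explanation above presumably concerns NEP-35 dispatch. As a rough, simplified sketch of the mechanism (this is not dask's exact array_safe helper; the function name is illustrative):

import numpy as np

def make_meta_like(values, like):
    # NEP-35 (NumPy >= 1.20) adds a `like=` argument to array creation
    # functions such as np.array; it dispatches creation to the array
    # type of `like`, so e.g. a cupy meta array stays a cupy array.
    # On older NumPy the keyword does not exist, hence the version skip.
    return np.array(values, like=like)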
Great, thanks for that very clear explanation @GenevieveBuckley |
* Add VDS creation command-line tool
* Add Dask image binner command-line tool, with options for single-image, multi-image sweep and multi-image pump-probe binning.
* Add command line tool to inspect cue messages
* Avoid Dask v2021.03.0 due to a regression in that release — the dask.array.bincount function does not permit slicing in that version (see dask/dask#7391).
* Add more cue message info and tidy up the functions for discovering the timestamps of chosen cues
We've found problems trying to slice the output of dask.array.bincount() following #7183.

* Passes black dask / flake8 dask
* Tests added / passed
  * A test was added to dask/array/tests/test_routines.py to check that the output of da.bincount can be sliced
  * pytest skimage/filters/tests/test_thresholding.py::test_thresholds_dask_compatibility -vv

Relevant issues:
* … (bincount)

Summary:

When dask.array.bincount was modified in #7183, we lost the ability to slice the output returned by this function: the output's shape and chunks attributes were empty tuples instead of what we expected. We want to keep the benefits of PR #7183 because it is much better at managing memory when given large datasets. So, to add back the ability to slice the output from bincount, I've added some of the extra logic normally handled by dask.array.reductions.reduction() to keep track of the array shape and chunk sizes.
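To illustrate the restored behavior, a minimal usage sketch (the array values are chosen only for illustration):

import numpy as np
import dask.array as da

x = da.from_array(np.array([0, 1, 1, 2, 2, 2]), chunks=3)
counts = da.bincount(x, minlength=4)

# The output now carries shape and chunks metadata, so it can be sliced:
print(counts.shape)           # (4,)
print(counts[:2].compute())   # [1 2] -- counts of values 0 and 1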