Feature/dask poc #192
Conversation
…ut sizing mismatch error
for more information, see https://pre-commit.ci
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #192      +/-   ##
==========================================
+ Coverage   83.90%   84.19%   +0.29%
==========================================
  Files          12       13       +1
  Lines        1379     1474      +95
==========================================
+ Hits         1157     1241      +84
- Misses        222      233      +11
```
Continue to review full report at Codecov.
ome_zarr/scale.py (outdated)

```python
)
if isinstance(out, dask.array.Array):
    new_stack = dask.array.zeros(
        (*shape_dims, out.shape[0], out.shape[1]),
```
the use of out.shape here is questionable because the result of resize was delayed
this is now resolved because the dask chunks were set appropriately
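A minimal sketch of the issue (not this PR's code; plain dask.array with a toy downsample standing in for the delayed resize): when the output chunks of a block-wise, size-changing operation are not declared, the array's reported shape is stale, which is exactly why relying on `out.shape` was risky until the chunks were set appropriately.

```python
import dask.array as da

def downsample(block):
    # toy nearest-style downsample: keep every other pixel in Y and X
    return block[::2, ::2]

x = da.ones((256, 256), chunks=(128, 128))

# Output chunks not declared: dask assumes they are unchanged,
# so the reported shape does not match what compute() will produce.
wrong = x.map_blocks(downsample)
print(wrong.shape)   # (256, 256) -- misleading

# Declaring the output chunks keeps .shape trustworthy.
right = x.map_blocks(downsample, chunks=(64, 64))
print(right.shape)   # (128, 128)
```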
…rr-py into feature/dask-poc
# Conflicts:
#   ome_zarr/scale.py
#   ome_zarr/writer.py
for more information, see https://pre-commit.ci
After the last couple of commits, the writer seems to be working in my simple test case (only the "nearest" method has been modified to use dask in the Scaler).
Calling compute_mip before trying to store anything out is potentially problematic for very large data, even in dask land.
cc: @jakirkham here just in case he has any https://data-apis.org/ insight.
The delayed create_mip is definitely slow with dask, and it is one of those steps where it looks like nothing is happening in between writing out levels. Something seems inefficient: maybe we could process all levels for each plane while we have it in memory, but then the write pattern changes completely. On the other hand, maybe this is as good as it gets?
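For illustration, a hedged sketch (not this PR's implementation) of the lazy pattern hinted at above: each pyramid level is written with `da.to_zarr` so that no level is fully materialized before storing, and the next level is derived lazily from the previous one. The path, level count, and the 2x slicing downsample are illustrative only.

```python
import dask.array as da
import zarr

data = da.random.random((64, 1024, 1024), chunks=(1, 256, 256))
store = zarr.DirectoryStore("pyramid_example.zarr")   # hypothetical output path
root = zarr.group(store=store, overwrite=True)        # NGFF metadata omitted here

level = data
for i in range(3):
    # store this level under "0", "1", "2" without holding it all in memory
    da.to_zarr(level.rechunk((1, 256, 256)), root.store,
               component=str(i), overwrite=True)
    # derive the next level lazily (2x nearest-style downsample in YX)
    level = level[..., ::2, ::2]
```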
I tested a simple case using the sample code at https://ome-zarr.readthedocs.io/en/stable/python.html#writing-ome-ngff-images. I also tried a more 'real' example, loading data from "https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.1/6001247.zarr" (see below). This takes a while, and I end up with the
I haven't looked at any resize issues yet. @toloudis, do you want to share the code you're using to test this?
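For reference, a small sketch of the kind of simple test described above, adapted from the linked docs example but with the numpy data swapped for a dask array (the path and sizes are arbitrary; treat the exact kwargs as illustrative):

```python
import dask.array as da
import zarr
from ome_zarr.io import parse_url
from ome_zarr.writer import write_image

path = "test_dask_image.zarr"   # hypothetical output location
data = da.random.poisson(lam=10, size=(10, 256, 256),
                         chunks=(1, 256, 256)).astype("uint8")

store = parse_url(path, mode="w").store
root = zarr.group(store=store)
write_image(image=data, group=root, axes="zyx",
            storage_options=dict(chunks=(1, 256, 256)))
```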
The code I am using to test is inside of aicsimageio's feature/zarrwriter branch: https://github.com/AllenCellModeling/aicsimageio/tree/feature/zarrwriter. The key files are … One interesting thing is that the OmeZarrWriter (which calls this branch of ome-zarr-py) selects chunk sizes using a configurable size limit (I chose 16 MB for testing, but it could be parameterized).
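A hypothetical sketch of that kind of size-capped chunk selection (`pick_chunks` and the 16 MB default are illustrative, not the OmeZarrWriter's actual algorithm): halve the largest dimension until a single chunk fits under the byte budget.

```python
import numpy as np

def pick_chunks(shape, dtype, limit_bytes=16 * 2**20):
    """Halve the largest dimension until one chunk fits under limit_bytes."""
    chunks = list(shape)
    itemsize = np.dtype(dtype).itemsize
    while int(np.prod(chunks)) * itemsize > limit_bytes and max(chunks) > 1:
        axis = int(np.argmax(chunks))
        chunks[axis] = (chunks[axis] + 1) // 2
    return tuple(chunks)

# e.g. a TCZYX uint16 volume -> chunk shape capped at ~16 MB
print(pick_chunks((1, 4, 60, 2048, 2048), np.uint16))
```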
cc: @aeisenbarth if there's anything from the https://github.com/aeisenbarth/ngff-writer front.
@will-moore, I think your missing data is due to the fact that you're testing with a v0.1 dataset that uses
Thanks to @joshmoore for pointing out the errors in my script above (wrong …). That script is here (it takes ~4 minutes to rewrite the zarr, using this PR): write_dask_to_omezarr.py
To compare with ngff-writer: ngff-writer-from-url.py
It might take a bit of work to adopt the approach that @aeisenbarth uses, but it's certainly what's needed!
Looking into this, I noticed that
However, if I also tell diff
Another issue with the current writing to disk is that the
I am unable to work on this over the next couple of weeks. I will return to it when I can, but please feel free to move forward.
resize() gets messed up, so the pyramid gets bigger instead of smaller.
That's a fair point - there are certainly cases where the data may be delayed but doesn't necessarily suffer the performance hit we saw. So we should probably have dask tests for all of the write multiscale/image/labels functions.
In my tests, I use a similar pattern for testing multiple array implementations:

```python
@pytest.mark.parametrize("array_constructor", [np.array, da.from_array, zarr.array])
def test_func(array_constructor):
    ...
    data = array_constructor(data)
```
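A self-contained, runnable variant of that pattern (a hypothetical test, not one from this repository's suite), showing the three constructors working interchangeably:

```python
import dask.array as da
import numpy as np
import pytest
import zarr

@pytest.mark.parametrize("array_constructor", [np.array, da.from_array, zarr.array])
def test_shape_preserved(array_constructor):
    data = np.arange(64, dtype=np.uint8).reshape(8, 8)
    data = array_constructor(data)
    assert data.shape == (8, 8)
```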
# Conflicts:
#   tests/test_writer.py
for more information, see https://pre-commit.ci
Re-launched workflows. @toloudis, anything outstanding from your side?
There is probably some work to be done to consolidate the different methods inside of Scaler (by_plane vs. the dask code path, which does something different), and, possibly related, I still have performance concerns for non-
```diff
@@ -302,6 +315,7 @@ def write_multiscales_metadata(
         dict(
             version=fmt.version,
             datasets=_validate_datasets(datasets, ndim, fmt),
+            name=name if name else group.name,
```
This line might be too heavy-handed, as the spec says that name "SHOULD" be provided but not "MUST". Is it OK for this code to write out group.name? I believe there were tests that depended on name having been written.
Interesting proposal. I would also tend to agree that writers should attempt to be as compliant with the recommended specification as possible.

I also suspect that using the group name is already the fallback behavior of consumers like napari when no name metadata is found. In the case of the HCS specification (and other nested layouts), this behavior might lead to uninformative names (like 0, 1, ...), but I don't know that it leaves us in a worse place than now.
The thing I said about refining the Scaler is probably good for a separate PR and not this one. Merging this PR should not be disruptive to the non-dask workflow, and only enables dask arrays to be passed in.
Is there anything else needed here before merging? Sounds like no, but wanted to check 🙂
A few changes requested inline but nothing major.
Coming back to the API questions raised in #192 (review), I think the following commits addressed most of my questions, but the resolution state is a bit unclear. Would it be possible to have a short summary of the APIs which have been updated to support dask, either as a comment or as an update of the PR description? This could also be used directly in the release notes.
A few additional questions regarding integration & release:

- What is the status in terms of functional testing with the latest state of this proposed change? Note that "Best practices for generating multiscale zarr data?" #215 might also provide a potential use case for the functionality.
- Are there any follow-up todos planned on top of this work before it can be included in an upcoming 0.x.0 release?
- As this is probably the biggest ongoing piece of work on this library, are we happy to merge once all outstanding comments have been cleared, and possibly release an alpha/beta/release candidate to gain some experience before making a final release?
```diff
@@ -176,6 +178,7 @@ def write_multiscale(
     axes: Union[str, List[str], List[Dict[str, str]]] = None,
     coordinate_transformations: List[List[Dict[str, Any]]] = None,
     storage_options: Union[JSONDict, List[JSONDict]] = None,
+    name: str = None,
```
I was confused that this was not already implemented, since write_multiscale_labels delegates name=name, but I think this is handled via the **metadata kwargs.

I am unaware of any issue with being explicit, but the API doc of write_multiscale will need to be reviewed and updated accordingly. For the same API, the docstring of pyramid could also be made explicit about the type of arrays supported.
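A hedged usage sketch of the explicit kwarg (an in-memory group and toy pyramid, purely to illustrate the API shape):

```python
import numpy as np
import zarr
from ome_zarr.writer import write_multiscale

root = zarr.group()   # in-memory, for illustration only
pyramid = [
    np.zeros((4, 64, 64), dtype=np.uint8),
    np.zeros((4, 32, 32), dtype=np.uint8),
]
write_multiscale(pyramid, root, axes="zyx", name="my_image")
print(root.attrs["multiscales"][0]["name"])   # "my_image"
```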
```diff
@@ -573,7 +703,7 @@ def write_multiscale_labels(


 def write_labels(
-    labels: np.ndarray,
+    labels: Union[np.ndarray, da.Array],
```
Same as above: this could use ArrayLike if we define this alias in the class.
```diff
@@ -125,6 +130,30 @@ def __create_group(
             series.append({"path": path})
         return grp

+    def resize_image(self, image: ArrayLike) -> ArrayLike:
```
This method is different from all the others on the Scaler class, as it returns a single array (not a pyramid of arrays). It is only used if the image in writer.write_image(image, group, scaler) is a dask array. Otherwise, mip = scaler.nearest(image) is used to create a pyramid for non-dask data. If you pass in a scaler instance that doesn't implement scaler.resize_image(), then write_image(dask_image, group, scaler) will fail. This is unexpected and not documented anywhere.

When a scaler instance is passed in to write_image(), none of the other methods on it are used (gaussian(), laplacian(), local_mean(), zoom()). If you wish to use one of these methods for downsampling (non-dask only), you need to do it before passing the data to write_multiscale(). So maybe these methods should be on a different class than the Scaler that is used for write_image(i, g, scaler), and the same scaling method should be used for dask and non-dask data (returning a single image, not a pyramid)?

The write_image() docs for the scaler parameter state "If this is None, no downsampling will be performed.", but that isn't true for dask data - this will fail if scaler is None.
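To make the current contract concrete, here is a hedged sketch of a minimal custom scaler for the dask path (the class name and the 2x slicing are illustrative; only resize_image() is relied on here because that is all write_image() calls for dask input):

```python
import dask.array as da
from ome_zarr.scale import Scaler

class NearestDaskScaler(Scaler):
    """Hypothetical scaler: only resize_image() is needed for dask data."""

    def resize_image(self, image: da.Array) -> da.Array:
        # nearest-style downsample of the trailing YX axes by self.downscale
        factor = int(self.downscale)
        return image[..., ::factor, ::factor]

# write_image(dask_image, group, scaler=NearestDaskScaler()) would then call
# resize_image() once per extra pyramid level, up to max_layer.
```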
I guess that none of these issues is a blocker:

- Improving the Scaler documentation and usage
- Using scaler.resize_image() instead of scaler.nearest() for dask AND non-dask data, possibly removing other methods from the class, etc.

These could come in a follow-up PR (even a follow-up release). cc @joshmoore?

Probably the most important fix is handling write_image(dask_data, group, scaler=None), which I think would currently fail.
I would agree that dask versions of gaussian, laplacian, etc. could be done in later PRs. I can try to do one more pass this afternoon to make sure scaler=None does not break with a dask array.

As I'm sure is true for many others, I am splitting my time among many different projects, so thanks for living with these intermittent pieces of work on this PR.
Had a quick look, and I am already handling scaler=None in _write_dask_image: (1) max_layer is set to 0 if scaler is None, and (2) scaler.resize_image is not called if scaler is None.
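Schematically, that guard looks something like the sketch below (hypothetical helper name and store step, assuming `group` is the root group; the PR's real _write_dask_image differs in detail):

```python
import dask.array as da

def _write_dask_image_sketch(image, group, scaler=None):
    # (1) no scaler -> a single level; (2) resize_image is never called when scaler is None
    max_layer = 0 if scaler is None else scaler.max_layer
    level = image
    for path in range(max_layer + 1):
        da.to_zarr(level, group.store, component=str(path), overwrite=True)
        if scaler is not None and path < max_layer:
            level = scaler.resize_image(level)
```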
I'm wondering what the status is on this PR.
@toloudis thanks for re-upping the priority of this now that everyone is more or less back from holidays. The questions you bring up pretty much echo what I said in #192 (review), so I believe we are all on the same page, although a bit looking at each other :)

As a preamble, I am transitioning to a new role where I may or may not remain actively involved in the code base. In this context, it feels sensible to let you guys drive the next set of additions to the library, and I'll make sure that Will and the rest of the OME team have the necessary permissions to do so.

My personal opinion is that dask support is an in-demand feature, and this is a large contribution, so there is interest in prioritizing its integration and planning its official release. Both you and @will-moore have the best knowledge of the state of things. Assuming you are both happy with the current state, my inclination would be to get this merged in.

This leaves the wider question of the roadmap/timeline towards a final release with dask support. If that's something that cannot be decided at this time, maybe it is worth creating a
Thanks all! It is great to see this in 😄
```python
ListOfArrayLike = Union[List[da.Array], List[np.ndarray]]
ArrayLike = Union[da.Array, np.ndarray]
```
Perhaps in the future these could be some kind of general Array (once that is more fleshed out).
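For context, a hedged restatement of the aliases together with a toy annotated helper (`chunk_hint` is illustrative and not part of the library):

```python
from typing import List, Tuple, Union

import dask.array as da
import numpy as np

ArrayLike = Union[da.Array, np.ndarray]
ListOfArrayLike = Union[List[da.Array], List[np.ndarray]]

def chunk_hint(arr: ArrayLike) -> Tuple[int, ...]:
    """Toy helper using the alias: dask arrays carry chunk sizes, numpy arrays don't."""
    return arr.chunksize if isinstance(arr, da.Array) else arr.shape

print(chunk_hint(np.zeros((4, 4))))                  # (4, 4)
print(chunk_hint(da.zeros((4, 4), chunks=(2, 2))))   # (2, 2)
```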
This is now released in
Admittedly I am an extremely naive dask user, but I'm taking a stab at this because I really want to pass a large dask array into the ome-zarr writer here.
I wonder if anyone can chime in on how to make the resize work inside of Scaler - or indeed how to make this code more correct. This is heavily WIP, just to try to get anything working at first. (For the array I'm passing in now, there's some size broadcast mismatch error when it gets to compute in the resizing code.)

I am loading some data using aicsimageio's get_image_dask_data and then basically forwarding it into this writer (through the WIP aicsimageio.writers.OmeZarrWriter).
Possibly addressing #187