
Store masked array to zarr #194

Closed
tcompa opened this issue Nov 9, 2022 · 15 comments · Fixed by #306
Labels: Priority Important, but not the highest priority · Tables AnnData and ROI/feature tables
tcompa commented Nov 9, 2022

(branching from #45)

tcompa commented Jan 25, 2023

Let's better define this issue.

First requirement:

  • Write a 4x4 zarr array to disk (e.g. an array of all ones).
  • Define a mask that does not include all 4^2 pixels. For instance, take the pixels (0,0), (0,1), (1,0), (1,1).
  • Load the array from disk and modify all pixels (e.g. assign a new value, say 2).
  • Write the new array to disk in such a way that the pixels not included in the mask are not overwritten. E.g. make sure that the pixel (2, 2) still has value 1.
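The four steps above can be sketched with plain numpy, using an in-memory array to stand in for the zarr store (so only the masked-write logic is shown, no actual I/O; all names are illustrative):

```python
import numpy as np

# "On-disk" 4x4 array of ones (standing in for the zarr store)
disk = np.ones((4, 4), dtype=int)

# Mask covering only the top-left 2x2 block of the 16 pixels
mask = np.zeros((4, 4), dtype=bool)
mask[0:2, 0:2] = True

# "Load" the array and modify all of its pixels
loaded = disk.copy()
loaded[:] = 2

# "Write" back so that pixels outside the mask are not overwritten
disk[mask] = loaded[mask]

assert disk[0, 0] == 2   # inside the mask: updated
assert disk[2, 2] == 1   # outside the mask: preserved
```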

Slightly more realistic requirement:

  • Store a 100x100 2D zarr array to disk.
  • Define two ROIs, corresponding to the pixels [0:60]x[0:60] and [40:100]x[40:100]. The two ROIs overlap.
  • Define, inside each ROI, a masked region, e.g. the regions [0:20]x[0:20] (in the first ROI) and the region [80:100]x[80:100] (in the second ROI). The two masked regions do not overlap.
  • Load a certain ROI from disk, and modify its pixels.
  • Write the result to disk using the region keyword, making sure that the non-masked pixels (which may belong to both ROIs) are not overwritten.
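The overlapping-ROI case can be sketched the same way, again with an in-memory numpy array standing in for the zarr store (no I/O; the loop body is where the `region`-restricted write would happen):

```python
import numpy as np

# 100x100 "on-disk" array (an in-memory stand-in for the zarr store)
disk = np.zeros((100, 100), dtype=int)

# Two overlapping ROIs, each with a masked region; the masked regions do not overlap
roi_1 = (slice(0, 60), slice(0, 60))
roi_2 = (slice(40, 100), slice(40, 100))
mask_1 = np.zeros((60, 60), dtype=bool)
mask_1[0:20, 0:20] = True                 # global [0:20]x[0:20], within ROI 1
mask_2 = np.zeros((60, 60), dtype=bool)
mask_2[40:60, 40:60] = True               # global [80:100]x[80:100], within ROI 2

for roi, mask, new_value in [(roi_1, mask_1, 1), (roi_2, mask_2, 2)]:
    chunk = disk[roi].copy()              # "load the ROI from disk"
    chunk[:] = new_value                  # modify all of its pixels
    disk[roi][mask] = chunk[mask]         # write back only the masked pixels

assert disk[10, 10] == 1     # masked region of ROI 1
assert disk[90, 90] == 2     # masked region of ROI 2
assert disk[50, 50] == 0     # ROI overlap, outside both masks: untouched
```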

jluethi commented Jan 25, 2023

Sounds like a very good start! If those requirements / tests work, I'm confident that the overall approach can work :) Let's see if this will be easy to implement or not

tcompa commented Feb 14, 2023

For the record, here is a first quick attempt with dask masked arrays. TL;DR: I could only make it "work" when writing to a new zarr array, not when overwriting an existing one.

Expected behavior:

  • I can define an array with masked elements (e.g. the organoid label is 1, and all label-array elements different from 1 should be masked).
  • Trivial operations only act on the non-masked elements (note: what if we have more complex operations? to be checked)
  • Masked elements are not read, they are not modified, and they are not written to disk.
  • I can write back the processed array to disk, and only non-masked elements are written.

Test

import dask.array as da
import numpy as np


def cleanup_files():
    # Remove some existing files, to avoid ContainsArrayError
    import os
    import shutil
    for filename in ["x.zarr", "x_new.zarr", "y.zarr"]:
        if os.path.exists(filename):
            shutil.rmtree(filename)


def prepare_masked_array():
    cleanup_files()
    # Write a 4x4 array to disk
    da.array([[1, 1, 2, 2],
              [1, 1, 2, 2],
              [3, 3, 4, 4],
              [3, 3, 4, 4]]
             ).to_zarr("x.zarr")

    # Load the array (lazily)
    x = da.from_zarr("x.zarr")

    # Mask elements where x != 1 (x == 1 defines the "organoid" label)
    mx = da.ma.masked_not_equal(x, 1)

    # An (equivalent?) alternative:
    # mx = da.ma.masked_where(x != 1, x)

    return mx


def example_1_new_zarr():
    print("[example_1_new_zarr] Start")
    # Prepare and modify a masked array
    mx = prepare_masked_array()
    mx += 5
    # Write the masked array to disk
    print("[example_1_new_zarr] Now write the following array to x_new.zarr")
    print(mx.compute())
    mx.to_zarr("x_new.zarr")
    # Load array from disk and check output
    new_x = da.from_zarr("x_new.zarr").compute()
    print("[example_1_new_zarr] I loaded this array:")
    print(new_x)
    success = np.array_equal(
            new_x,
            np.array([[6, 6, 2, 2],
                      [6, 6, 2, 2],
                      [3, 3, 4, 4],
                      [3, 3, 4, 4]]
                     )
            )
    if success:
        print("[example_1_new_zarr] Success")
    else:
        print("[example_1_new_zarr] Failed")
    print()


def example_2_overwrite_zarr():
    print("[example_2_overwrite_zarr] Start")
    # Prepare and modify a masked array
    mx = prepare_masked_array()
    mx += 5
    # Write the masked array to disk
    print("[example_2_overwrite_zarr] Now write the following array to x.zarr")
    print(mx.compute())
    mx.to_zarr("x.zarr", overwrite=True)
    # Load array from disk and check output
    new_x = da.from_zarr("x.zarr").compute()
    print("[example_2_overwrite_zarr] I loaded this array:")
    print(new_x)
    success = np.array_equal(
            new_x,
            np.array([[6, 6, 2, 2],
                      [6, 6, 2, 2],
                      [3, 3, 4, 4],
                      [3, 3, 4, 4]]
                     )
            )
    if success:
        print("[example_2_overwrite_zarr] Success")
    else:
        print("[example_2_overwrite_zarr] Failed")
    print()


if __name__ == "__main__":
    example_1_new_zarr()
    example_2_overwrite_zarr()

Output:

[example_1_new_zarr] Start
[example_1_new_zarr] Now write the following array to x_new.zarr
[[6 6 -- --]
 [6 6 -- --]
 [-- -- -- --]
 [-- -- -- --]]
[example_1_new_zarr] I loaded this array:
[[6 6 2 2]
 [6 6 2 2]
 [3 3 4 4]
 [3 3 4 4]]
[example_1_new_zarr] Success

[example_2_overwrite_zarr] Start
[example_2_overwrite_zarr] Now write the following array to x.zarr
[[6 6 -- --]
 [6 6 -- --]
 [-- -- -- --]
 [-- -- -- --]]
[example_2_overwrite_zarr] I loaded this array:
[[0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]]
[example_2_overwrite_zarr] Failed

tcompa commented Feb 14, 2023

An alternative option is to use "standard" dask arrays (that is, not masked arrays).
In that case, I think the issue is not about modifying parts of the array, but rather about handling the undefined shapes/chunks in the delayed operations (e.g. array[mask] has an undefined shape until it is computed).

In to_zarr, it's worth noting that:

Raises: ValueError
    If arr has unknown chunk sizes, which is not supported by Zarr. If region is specified and url is not a zarr.Array

The solution in that case could be to use compute_chunk_sizes right before writing to disk, but it requires some testing (especially to make sure that we do not end up re-computing the whole array).
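A toy illustration of the unknown-chunks issue and of compute_chunk_sizes (values are illustrative, and this does not yet involve zarr at all):

```python
import numpy as np
import dask.array as da

x = da.from_array(np.arange(16).reshape(4, 4), chunks=(2, 2))

# Boolean indexing yields an array with unknown (nan) chunk sizes,
# which to_zarr refuses to write
masked = x[x % 2 == 0]
print(masked.chunks)   # chunk sizes are nan at this point

# compute_chunk_sizes() resolves them; it evaluates the mask,
# though whether it avoids recomputing the full array should be checked
masked = masked.compute_chunk_sizes()
print(masked.chunks)   # now concrete integer chunk sizes
```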

tcompa commented Feb 14, 2023

Self-reminder: both previous comments are not dealing with regions, which would increase complexity a bit further.

jluethi commented Feb 14, 2023

TL;DR I could only make it "work" when writing to a new zarr array, not when overwriting an existing one

Interesting. The main use cases I would foresee are:

  1. Creating new label masks that are contained within a ROI. e.g. we run organoid segmentation (=> find big objects). Within those big objects, we run nuclear segmentation (find small objects).
  2. Make measurements only within certain regions (e.g. only measure objects within a certain masked region => make a measurement within the cells of a given organoid that was segmented).

I don't foresee using this approach to do masked image overwriting (I'm also not expecting masked image writing to a new image, but I'm less sure there).
We may hit masked label writing, but mostly to new label images (because those are cheap anyway).
And a lot of masked reading :)

Self-reminder: both previous comments are not dealing with regions, which would increase complexity a bit further.

What does it mean that they don't deal with regions? What are regions in this context?


Let's discuss the details tomorrow :)

tcompa commented Feb 14, 2023

Briefly (and then let's clarify tomorrow):

  1. I also guessed that masked writing is not the main goal, and that this issue was slightly out of scope. This would simplify things, I think.
  2. Regions as in something.to_zarr(url=something, region=region, ...).

jluethi commented Feb 14, 2023

Ah, these regions! Yes, the main thought is to use it for work with such regions.
(or to refactor the way we loop over the ROIs in a different way, if to_zarr with region becomes the roadblock)

tcompa added the High Priority Current Priorities & Blocking Issues label Feb 15, 2023
tcompa commented Feb 15, 2023

After discussions with @jluethi, we better defined the expected behavior.
Briefly:

  1. We have image data on disk, in a zarr array
  2. We perform "primary" (i.e. organoid) segmentation of the image data, and we store the results (both the label image and the label anndata table) into the zarr array
  3. We now want to perform "secondary" (i.e. nuclear) segmentation of the image data, using the organoid labels as a filter for what should enter the segmentation functions (say this is a cellpose segmentation). More in detail:
    a. Before calling a cellpose function, we should make sure that its input image array has no data (read: zero values) for elements that do not belong to an organoid.
    b. The output label image will be non-zero only in the relevant region (that is, where the input was non-zero).
    c. Upon storing the output label image, we should not overwrite any existing element outside the relevant region.

Because 3c is complex to achieve by directly calling something like to_zarr, we may have to resort to "backup" copies of the array. In that case, we would do something along the following lines (possibly to be optimized for efficiency):

  • Create a working copy of the (relevant portion of) image data
  • Set some elements of this array to zero, by checking the condition of organoid labels being equal to a certain label value.
  • Use this array as an input for cellpose.
  • Load the current (relevant portion of the) array of nuclear labels. Note: this is created afresh when processing the first organoid, but upon processing the other organoids it already exists and it already contains data.
  • Proceed with a partial update of the (relevant portion of the) nuclear label data, where only the non-zero labels from the current segmentation are updated.
  • Write to disk, and move on to the next organoid.

Note that "relevant portion of" always refers to an additional layer of partitioning, which in our case corresponds, e.g., to organoid-bounding-box regions.

tcompa commented Feb 15, 2023

To better explain the expected behavior, here is a pure-numpy example that goes in the intended direction - as far as I understand. Note that it does not include the concept of intermediate partitioning levels (like FOVs or bounding boxes), but only the logic of making temporary copies and then doing partial updates. Also: it can most likely be optimized further, but that's not the point here.

@jluethi Does the logic in this example resemble what we discussed this morning? Do you see some crucial missing steps (apart from the obvious reading-from-disk and writing-to-disk, and apart from adding zarr+dask where needed)?
If this looks reasonable, I will move towards a more realistic example.

Script:

import numpy as np


def segmentation(img_data, shift):
    return np.random.randint(1, 4, size=img_data.shape) + shift


# Image data
img = np.array([[111, 111, 222, 222],
                [111, 111, 222, 222],
                [222, 222, 222, 222],
                [222, 222, 222, 222]])

# Primary (organoid) labels
organoid_labels = np.array([[1, 1, 2, 2],
                            [1, 1, 2, 2],
                            [2, 2, 2, 2],
                            [2, 2, 2, 2]])


# Secondary (nuclear) labeling, looping over organoids
shift = 0  # useful for relabeling
nuclei_labels = np.zeros_like(organoid_labels)
for organoid_label_value in np.sort(np.unique(organoid_labels)):
    print(f"\n--- Start processing {organoid_label_value=} ---\n")

    organoid_mask = organoid_labels == organoid_label_value
    background_mask = organoid_labels != organoid_label_value
    print("Organoid mask:")
    print(organoid_mask)
    print()

    img_to_process = img.copy()
    img_to_process[background_mask] = 0 
    print("Image data to process:")
    print(img_to_process)
    print()

    tmp_nuclei_labels = segmentation(img_to_process, shift=shift)
    print("New nuclei labels: "
          "(note that values outside the organoid will be discarded)")
    print(tmp_nuclei_labels)
    print()

    nuclei_labels[organoid_mask] = tmp_nuclei_labels[organoid_mask]
    shift = np.amax(nuclei_labels)
    print("Current status of nuclei labels:")
    print(nuclei_labels)
    print()

Output:

--- Start processing organoid_label_value=1 ---

Organoid mask:
[[ True  True False False]
 [ True  True False False]
 [False False False False]
 [False False False False]]

Image data to process:
[[111 111   0   0]
 [111 111   0   0]
 [  0   0   0   0]
 [  0   0   0   0]]

New nuclei labels: (note that values outside the organoid will be discarded)
[[2 3 1 1]
 [2 2 1 3]
 [2 3 1 3]
 [1 1 2 3]]

Current status of nuclei labels:
[[2 3 0 0]
 [2 2 0 0]
 [0 0 0 0]
 [0 0 0 0]]


--- Start processing organoid_label_value=2 ---

Organoid mask:
[[False False  True  True]
 [False False  True  True]
 [ True  True  True  True]
 [ True  True  True  True]]

Image data to process:
[[  0   0 222 222]
 [  0   0 222 222]
 [222 222 222 222]
 [222 222 222 222]]

New nuclei labels: (note that values outside the organoid will be discarded)
[[5 6 4 4]
 [5 6 5 6]
 [4 4 6 4]
 [5 5 6 4]]

Current status of nuclei labels:
[[2 3 4 4]
 [2 2 5 6]
 [4 4 6 4]
 [5 5 6 4]]

jluethi commented Feb 15, 2023

Load the current (relevant portion of the) array of nuclear labels. Note: this is created afresh when processing the first organoid, but upon processing the other organoids it already exists and it already contains data.

We only load a bounding box over a given object, e.g. a given organoid. The bounding box for the next organoid probably has very small overlap with that, so I'm not sure what you're referencing here that already contains data. Or is it the placeholder object that exists and will be overwritten?

Regarding the numpy code example:
Yes, this looks like what I'd imagine! :)

It obviously needs to work in combination with regions, i.e. only load the bounding box and only write the masked bounding box back, because otherwise too much data would need to be loaded into memory at once (especially in 3D cases). But the core logic is sound!
If we can make this work, our ROIs become much more flexible and we can segment within ROIs, e.g. we can segment nuclei within organoids [minus the 2D vs. 3D mixing question]

tcompa commented Feb 15, 2023

We only load a bounding box over a given object, e.g. a given organoid. The bounding box for the next organoid probably has very small overlap with that, so I'm not sure what you're referencing here that already contains data. Or is it the placeholder object that exists and will be overwritten?

My bad.
I was still thinking that we would first load FOV ROIs, and then organoid ROIs inside them. But that would make no sense, since we already have a table listing all organoid ROIs that we can loop over.

TL;DR we should never talk about FOVs here (and we should also update the variable names in cellpose_segmentation, since they do refer to FOVs)


I could get an (admittedly very hack-ish) prototype to work, in the following way:

  • I ran the standard cellpose example, with minor changes (adding the bounding_box_ROI_table_name="label_DAPI_bbox" argument, and removing the overlapping-bounding-box check). This produced a dataset with image data and label data (those are nuclear labels, but I'm using them as if they were the organoids)

  • I prepared a new custom cellpose_segmentation task (we may discuss later on the tradeoffs for having everything in the same task), which is quite similar to the standard one in the boilerplate code but has some new logic related to the pure-numpy example above - see code below. I ran it with arguments ROI_table_name="label_DAPI_bbox" and output_label_name="label_DAPI_secondary".

  • I replaced the cellpose segmentation with some random numbers, just to see some "secondary" labels appearing inside the "primary" ones (which in this case are nuclei, but are meant to be organoids).

The output seems reasonable, see following screenshots of (1) the standard ("primary") labels, (2) both "primary" and "secondary" labels:
[screenshots omitted: (1) primary labels only, (2) primary + secondary labels]

(note that I stopped processing at the 100-th ROI, so that not all nuclei also have the internal structure)

Bounding boxes clearly overlap, but it seems like the background was not modified. Of course this could be by accident (it's hard to say whether random numbers come from one nucleus or from another one), and I shall make the test more compelling.

The relevant block of code (this is all in examples/05_cardio_tiny_dataset_cellpose/prototype_secondary_labeling.py, in the 194-store-masked-array-to-zarr branch) looks like

    for i_ROI, indices in enumerate(list_indices):
        logger.info(f"[{well_id}] Now processing ROI {i_ROI+1}/{num_ROIs}")

        # Define region
        s_z, e_z, s_y, e_y, s_x, e_x = indices[:]

        # Prepare input for cellpose
        input_image_array = data_zyx[s_z:e_z, s_y:e_y, s_x:e_x].compute()

        # Load current primary labels (FIXME: do not hard-code path)
        organoid_labels = da.from_zarr(f"{zarrurl}labels/label_DAPI/0")[s_z:e_z, s_y:e_y, s_x:e_x].compute()

        # Set label_value (FIXME: is this correct?)
        label_value = int(ROI_table.obs.index[i_ROI]) + 1

        # Define organoid/background masks (FIXME: make variable names more general)
        organoid_mask = organoid_labels == label_value
        background_mask = organoid_labels != label_value

        # Load current secondary labels
        old_mask = da.from_zarr(f"{zarrurl}labels/{output_label_name}/0")[s_z:e_z, s_y:e_y, s_x:e_x].compute()

        # Compute new secondary labels for current ROI
        new_mask = np.zeros_like(input_image_array)
        new_mask[organoid_mask] = np.random.randint(0, 100, size=new_mask.shape)[organoid_mask]

        # IMPORTANT STEP: restore original background 
        new_mask[background_mask] = old_mask[background_mask]

        # Compute and store 0-th level to disk
        region = (slice(s_z, e_z), slice(s_y, e_y), slice(s_x, e_x))
        da.array(new_mask).to_zarr(
            url=mask_zarr,
            region=region,
            compute=True,
        )

(needless to say: this is far from being clean..)

tcompa commented Feb 15, 2023

A simple to-do that will clarify things [but not today ;)]:

  • Improve the fake segmentation, to make it clear whether we are overwriting secondary labels from other neighboring ROIs.

tcompa commented Feb 16, 2023

Concerning previous comments: with a minor update to the prototype (and upon inspecting the labels in napari), I can confirm that "secondary" labels for neighboring "primary" labels are not overwritten (that is, only the non-background pixels are updated, and then the whole new array is written to disk).

I think this means that the core feature is now well captured by the prototype. Now we should aim at making it a more polished cellpose_segmentation_secondary task (that's the simplest way to get started, but will only exist in the current branch), and then likely integrate it into the cellpose_segmentation.

Discussion about napari_workflows_wrapper will come later.

tcompa commented Feb 17, 2023

This version of example_06 starts to look promising, apart from:

Other than that, we can now perform (only through the example scripts) secondary labeling inside the organoids, even if the quality is terrible and it comes nowhere close to identifying nuclei.

[screenshot from 2023-02-17 omitted]

tcompa linked a pull request Feb 20, 2023 that will close this issue
tcompa added the Priority Important, but not the highest priority and High Priority Current Priorities & Blocking Issues labels, and removed the High Priority Current Priorities & Blocking Issues label Feb 23, 2023
tcompa self-assigned this Mar 1, 2023
tcompa removed the High Priority Current Priorities & Blocking Issues label Mar 1, 2023
tcompa added the Tables AnnData and ROI/feature tables label Sep 19, 2023