
Store masked array to zarr #194

Closed
tcompa opened this issue Nov 9, 2022 · 15 comments · Fixed by #306
Labels: Priority Important, but not the highest priority · Tables AnnData and ROI/feature tables
tcompa commented Nov 9, 2022

(branching from #45)

tcompa commented Jan 25, 2023

Let's better define this issue.

First requirement:

  • Write a 4x4 zarr array to disk (e.g. an array of all ones).
  • Define a mask that does not include all 4^2 pixels. For instance, take the pixels (0,0), (0,1), (1,0), (1,1).
  • Load the array from disk and modify all pixels (e.g. assign a new value, say 2).
  • Write the new array to disk in such a way that the pixels not included in the mask are not overwritten. E.g. make sure that the pixel (2, 2) still has value 1.
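The four steps above can be sketched with plain numpy, using an in-memory array to stand in for the zarr store (so only the masked-write logic is shown, no actual I/O; all names are illustrative):

```python
import numpy as np

# "On-disk" 4x4 array of ones (standing in for the zarr store)
disk = np.ones((4, 4), dtype=int)

# Mask covering only the top-left 2x2 block of the 16 pixels
mask = np.zeros((4, 4), dtype=bool)
mask[0:2, 0:2] = True

# "Load" the array and modify all of its pixels
loaded = disk.copy()
loaded[:] = 2

# "Write" back so that pixels outside the mask are not overwritten
disk[mask] = loaded[mask]

assert disk[0, 0] == 2   # inside the mask: updated
assert disk[2, 2] == 1   # outside the mask: preserved
```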

Slightly more realistic requirement:

  • Store a 100x100 2D zarr array to disk.
  • Define two ROIs, corresponding to the pixels [0:60]x[0:60] and [40:100]x[40:100]. The two ROIs overlap.
  • Define, inside each ROI, a masked region, e.g. the regions [0:20]x[0:20] (in the first ROI) and the region [80:100]x[80:100] (in the second ROI). The two masked regions do not overlap.
  • Load a certain ROI from disk, and modify its pixels.
  • Write the result to disk using the region keyword, making sure that the non-masked pixels (which may belong to both ROIs) are not overwritten.
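The overlapping-ROI case can be sketched the same way, again with an in-memory numpy array standing in for the zarr store (no I/O; the loop body is where the `region`-restricted write would happen):

```python
import numpy as np

# 100x100 "on-disk" array (an in-memory stand-in for the zarr store)
disk = np.zeros((100, 100), dtype=int)

# Two overlapping ROIs, each with a masked region; the masked regions do not overlap
roi_1 = (slice(0, 60), slice(0, 60))
roi_2 = (slice(40, 100), slice(40, 100))
mask_1 = np.zeros((60, 60), dtype=bool)
mask_1[0:20, 0:20] = True                 # global [0:20]x[0:20], within ROI 1
mask_2 = np.zeros((60, 60), dtype=bool)
mask_2[40:60, 40:60] = True               # global [80:100]x[80:100], within ROI 2

for roi, mask, new_value in [(roi_1, mask_1, 1), (roi_2, mask_2, 2)]:
    chunk = disk[roi].copy()              # "load the ROI from disk"
    chunk[:] = new_value                  # modify all of its pixels
    disk[roi][mask] = chunk[mask]         # write back only the masked pixels

assert disk[10, 10] == 1     # masked region of ROI 1
assert disk[90, 90] == 2     # masked region of ROI 2
assert disk[50, 50] == 0     # ROI overlap, outside both masks: untouched
```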

jluethi commented Jan 25, 2023

Sounds like a very good start! If those requirements / tests work, I'm confident that the overall approach can work :) Let's see if this will be easy to implement or not

tcompa commented Feb 14, 2023

For the record, here is a first quick attempt with dask masked arrays. TL;DR: I could only make it "work" when writing to a new zarr array, not when overwriting an existing one.

Expected behavior:

  • I can define an array with masked elements (e.g. the organoid label is 1, and all label-array elements different from 1 should be masked).
  • Trivial operations only act on the non-masked elements (note: what if we have more complex operations? to be checked)
  • Masked elements are not read, they are not modified, and they are not written to disk.
  • I can write back the processed array to disk, and only non-masked elements are written.

Test

import dask.array as da
import numpy as np


def cleanup_files():
    # Remove some existing files, to avoid ContainsArrayError
    import os
    import shutil
    for filename in ["x.zarr", "x_new.zarr", "y.zarr"]:
        if os.path.exists(filename):
            shutil.rmtree(filename)


def prepare_masked_array():
    cleanup_files()
    # Write a 4x4 array to disk
    da.array([[1, 1, 2, 2],
              [1, 1, 2, 2],
              [3, 3, 4, 4],
              [3, 3, 4, 4]]
             ).to_zarr("x.zarr")

    # Load the array (lazily)
    x = da.from_zarr("x.zarr")

    # Mask elements where x != 1 (x == 1 defines the "organoid" label)
    mx = da.ma.masked_not_equal(x, 1)

    # An (equivalent?) alternative:
    # mx = da.ma.masked_where(x != 1, x)

    return mx


def example_1_new_zarr():
    print("[example_1_new_zarr] Start")
    # Prepare and modify a masked array
    mx = prepare_masked_array()
    mx += 5
    # Write the masked array to disk
    print("[example_1_new_zarr] Now write the following array to x_new.zarr")
    print(mx.compute())
    mx.to_zarr("x_new.zarr")
    # Load array from disk and check output
    new_x = da.from_zarr("x_new.zarr").compute()
    print("[example_1_new_zarr] I loaded this array:")
    print(new_x)
    success = np.array_equal(
            new_x,
            np.array([[6, 6, 2, 2],
                      [6, 6, 2, 2],
                      [3, 3, 4, 4],
                      [3, 3, 4, 4]]
                     )
            )
    if success:
        print("[example_1_new_zarr] Success")
    else:
        print("[example_1_new_zarr] Failed")
    print()


def example_2_overwrite_zarr():
    print("[example_2_overwrite_zarr] Start")
    # Prepare and modify a masked array
    mx = prepare_masked_array()
    mx += 5
    # Write the masked array to disk
    print("[example_2_overwrite_zarr] Now write the following array to x.zarr")
    print(mx.compute())
    mx.to_zarr("x.zarr", overwrite=True)
    # Load array from disk and check output
    new_x = da.from_zarr("x.zarr").compute()
    print("[example_2_overwrite_zarr] I loaded this array:")
    print(new_x)
    success = np.array_equal(
            new_x,
            np.array([[6, 6, 2, 2],
                      [6, 6, 2, 2],
                      [3, 3, 4, 4],
                      [3, 3, 4, 4]]
                     )
            )
    if success:
        print("[example_2_overwrite_zarr] Success")
    else:
        print("[example_2_overwrite_zarr] Failed")
    print()


if __name__ == "__main__":
    example_1_new_zarr()
    example_2_overwrite_zarr()

Output:

[example_1_new_zarr] Start
[example_1_new_zarr] Now write the following array to x_new.zarr
[[6 6 -- --]
 [6 6 -- --]
 [-- -- -- --]
 [-- -- -- --]]
[example_1_new_zarr] I loaded this array:
[[6 6 2 2]
 [6 6 2 2]
 [3 3 4 4]
 [3 3 4 4]]
[example_1_new_zarr] Success

[example_2_overwrite_zarr] Start
[example_2_overwrite_zarr] Now write the following array to x.zarr
[[6 6 -- --]
 [6 6 -- --]
 [-- -- -- --]
 [-- -- -- --]]
[example_2_overwrite_zarr] I loaded this array:
[[0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]]
[example_2_overwrite_zarr] Failed

tcompa commented Feb 14, 2023

An alternative option is to use "standard" dask arrays (that is, not masked arrays).
In that case, I think the issue is not about modifying parts of the array, but rather about handling the undefined shapes/chunks in the delayed operations (e.g. array[mask] has an undefined shape until it is computed).

In to_zarr, it's worth noting that:

Raises: ValueError
    If arr has unknown chunk sizes, which is not supported by Zarr. If region is specified and url is not a zarr.Array

The solution in that case could be to use compute_chunk_sizes right before writing to disk, but it requires some testing (especially to make sure that we do not end up re-computing the whole array).
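A toy illustration of the unknown-chunks issue and of compute_chunk_sizes (values are illustrative, and this does not yet involve zarr at all):

```python
import numpy as np
import dask.array as da

x = da.from_array(np.arange(16).reshape(4, 4), chunks=(2, 2))

# Boolean indexing yields an array with unknown (nan) chunk sizes,
# which to_zarr refuses to write
masked = x[x % 2 == 0]
print(masked.chunks)   # chunk sizes are nan at this point

# compute_chunk_sizes() resolves them; it evaluates the mask,
# though whether it avoids recomputing the full array should be checked
masked = masked.compute_chunk_sizes()
print(masked.chunks)   # now concrete integer chunk sizes
```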

tcompa commented Feb 14, 2023

Self-reminder: both previous comments are not dealing with regions, which would increase complexity a bit further.

jluethi commented Feb 14, 2023

TL;DR I could only make it "work" when writing to a new zarr array, not when overwriting an existing one

Interesting. The main use cases I would foresee are:

  1. Creating new label masks that are contained within a ROI. e.g. we run organoid segmentation (=> find big objects). Within those big objects, we run nuclear segmentation (find small objects).
  2. Make measurements only within certain regions (e.g. only measure objects within a certain masked region => make a measurement within the cells of a given organoid that was segmented).

I don't foresee using this approach to do masked image overwriting (I'm also not expecting masked image writing to a new image, but I'm less sure there).
We may hit masked label writing, but mostly to new label images (because those are cheap anyway).
And a lot of masked reading :)

Self-reminder: both previous comments are not dealing with regions, which would increase complexity a bit further.

What does it mean that they don't deal with regions? What are regions in this context?


Let's discuss the details tomorrow :)

tcompa commented Feb 14, 2023

Briefly (and then let's clarify tomorrow):

  1. I also guessed that masked writing is not the main goal, and that this issue was slightly out of scope. This would simplify things, I think.
  2. Regions as in something.to_zarr(url=something, region=region, ...).

jluethi commented Feb 14, 2023

Ah, these regions! Yes, the main thought is to use it for work with such regions.
(or to refactor the way we loop over the ROIs in a different way, if to_zarr with region becomes the roadblock)

tcompa added the High Priority Current Priorities & Blocking Issues label Feb 15, 2023
tcompa commented Feb 15, 2023

After discussions with @jluethi, we better defined the expected behavior.
Briefly:

  1. We have image data on disk, in a zarr array
  2. We perform "primary" (i.e. organoid) segmentation of the image data, and we store the results (both the label image and the label anndata table) into the zarr array
  3. We now want to perform "secondary" (i.e. nuclear) segmentation of the image data, using the organoid labels as a filter for what should enter the segmentation functions (say this is a cellpose segmentation). More in detail:
    a. Before calling a cellpose function, we should make sure that its input image array has no data (read: zero values) for elements that do not belong to an organoid.
    b. The output label image will be non-zero only in the relevant region (that is, where the input was non-zero).
    c. Upon storing the output label image, we should not overwrite any existing element outside the relevant region.

Because 3c is complex to achieve by directly calling something like to_zarr, we may have to resort to "backup" copies of the array. In that case, we would do something along the following lines (possibly to be optimized for efficiency):

  • Create a working copy of the (relevant portion of) image data
  • Set some elements of this array to zero, by checking the condition of organoid labels being equal to a certain label value.
  • Use this array as an input for cellpose.
  • Load the current (relevant portion of the) array of nuclear labels. Note: this is created afresh when processing the first organoid, but upon processing the other organoids it already exists and it already contains data.
  • Proceed with a partial update of the (relevant portion of the) nuclear label data, where only the non-zero labels from the current segmentation are updated.
  • Write to disk, and move on to the next organoid.

Note that "relevant portion of" always refers to an additional layer of partitioning, which in our case corresponds, e.g., to organoid-bounding-box regions.

tcompa commented Feb 15, 2023

To better explain the expected behavior, here is a pure-numpy example that goes in the intended direction - as far as I understand. Note that it does not include the concept of intermediate partitioning levels (like FOVs or bounding boxes), but only the logic of making temporary copies and then doing partial updates. Also: it can most likely be optimized further, but that's not the point here.

@jluethi Does the logic in this example resemble what we discussed this morning? Do you see some crucial missing steps (apart from the obvious reading-from-disk and writing-to-disk, and apart from adding zarr+dask where needed)?
If this looks reasonable, I will move towards a more realistic example.

Script:

import numpy as np


def segmentation(img_data, shift):
    return np.random.randint(1, 4, size=img_data.shape) + shift


# Image data
img = np.array([[111, 111, 222, 222],
                [111, 111, 222, 222],
                [222, 222, 222, 222],
                [222, 222, 222, 222]])

# Primary (organoid) labels
organoid_labels = np.array([[1, 1, 2, 2],
                            [1, 1, 2, 2],
                            [2, 2, 2, 2],
                            [2, 2, 2, 2]])


# Secondary (nuclear) labeling, looping over organoids
shift = 0  # useful for relabeling
nuclei_labels = np.zeros_like(organoid_labels)
for organoid_label_value in np.sort(np.unique(organoid_labels)):
    print(f"\n--- Start processing {organoid_label_value=} ---\n")

    organoid_mask = organoid_labels == organoid_label_value
    background_mask = organoid_labels != organoid_label_value
    print("Organoid mask:")
    print(organoid_mask)
    print()

    img_to_process = img.copy()
    img_to_process[background_mask] = 0 
    print("Image data to process:")
    print(img_to_process)
    print()

    tmp_nuclei_labels = segmentation(img_to_process, shift=shift)
    print("New nuclei labels: "
          "(note that values outside the organoid will be discarded)")
    print(tmp_nuclei_labels)
    print()

    nuclei_labels[organoid_mask] = tmp_nuclei_labels[organoid_mask]
    shift = np.amax(nuclei_labels)
    print("Current status of nuclei labels:")
    print(nuclei_labels)
    print()

Output:

--- Start processing organoid_label_value=1 ---

Organoid mask:
[[ True  True False False]
 [ True  True False False]
 [False False False False]
 [False False False False]]

Image data to process:
[[111 111   0   0]
 [111 111   0   0]
 [  0   0   0   0]
 [  0   0   0   0]]

New nuclei labels: (note that values outside the organoid will be discarded)
[[2 3 1 1]
 [2 2 1 3]
 [2 3 1 3]
 [1 1 2 3]]

Current status of nuclei labels:
[[2 3 0 0]
 [2 2 0 0]
 [0 0 0 0]
 [0 0 0 0]]


--- Start processing organoid_label_value=2 ---

Organoid mask:
[[False False  True  True]
 [False False  True  True]
 [ True  True  True  True]
 [ True  True  True  True]]

Image data to process:
[[  0   0 222 222]
 [  0   0 222 222]
 [222 222 222 222]
 [222 222 222 222]]

New nuclei labels: (note that values outside the organoid will be discarded)
[[5 6 4 4]
 [5 6 5 6]
 [4 4 6 4]
 [5 5 6 4]]

Current status of nuclei labels:
[[2 3 4 4]
 [2 2 5 6]
 [4 4 6 4]
 [5 5 6 4]]

jluethi commented Feb 15, 2023

Load the current (relevant portion of the) array of nuclear labels. Note: this is created afresh when processing the first organoid, but upon processing the other organoids it already exists and it already contains data.

We only load a bounding box over a given object, e.g. a given organoid. The bounding box for the next organoid probably has very small overlap with that, so I'm not sure what you're referencing here that already contains data. Or is it the placeholder object that exists and will be overwritten?

Regarding the numpy code example:
Yes, this looks like what I'd imagine! :)

It obviously needs to work in combination with regions, i.e. only load the bounding box and only write the masked bounding box back, because otherwise too much data would need to be loaded into memory at once (especially in 3D cases). But the core logic is sound!
If we can make this work, our ROIs become much more flexible and we can segment within ROIs, e.g. we can segment nuclei within organoids [minus the 2D vs. 3D mixing question]

tcompa commented Feb 15, 2023

We only load a bounding box over a given object, e.g. a given organoid. The bounding box for the next organoid probably has very small overlap with that, so I'm not sure what you're referencing here that already contains data. Or is it the placeholder object that exists and will be overwritten?

My bad.
I was still thinking that we would first load FOV ROIs, and then organoid ROIs inside them. But that would make no sense, since we already have a table listing all organoid ROIs that we can loop over.

TL;DR we should never talk about FOVs here (and we should also update the variable names in cellpose_segmentation, since they do refer to FOVs)


I could get an (admittedly very hack-ish) prototype to work, in the following way:

  • I ran the standard cellpose example, with minor changes (adding the bounding_box_ROI_table_name="label_DAPI_bbox" argument, and removing the overlapping-bounding-box check). This produced a dataset with image data and label data (those are nuclear labels, but I'm using them as if they were the organoids)

  • I prepared a new custom cellpose_segmentation task (we may discuss later on the tradeoffs for having everything in the same task), which is quite similar to the standard one in the boilerplate code but has some new logic related to the pure-numpy example above - see code below. I ran it with arguments ROI_table_name="label_DAPI_bbox" and output_label_name="label_DAPI_secondary".

  • I replaced the cellpose segmentation with some random numbers, just to see some "secondary" labels appearing inside the "primary" ones (which in this case are nuclei, but are meant to be organoids).

The output seems reasonable, see following screenshots of (1) the standard ("primary") labels, (2) both "primary" and "secondary" labels:
[screenshots omitted: (1) primary labels only, (2) primary + secondary labels]

(note that I stopped processing at the 100-th ROI, so that not all nuclei also have the internal structure)

Bounding boxes clearly overlap, but it seems like the background was not modified. Of course this could be by accident (it's hard to say whether random numbers come from one nucleus or from another one), and I shall make the test more compelling.

The relevant block of code (this is all in examples/05_cardio_tiny_dataset_cellpose/prototype_secondary_labeling.py, in the 194-store-masked-array-to-zarr branch) looks like

    for i_ROI, indices in enumerate(list_indices):
        logger.info(f"[{well_id}] Now processing ROI {i_ROI+1}/{num_ROIs}")

        # Define region
        s_z, e_z, s_y, e_y, s_x, e_x = indices[:]

        # Prepare input for cellpose
        input_image_array = data_zyx[s_z:e_z, s_y:e_y, s_x:e_x].compute()

        # Load current primary labels (FIXME: do not hard-code path)
        organoid_labels = da.from_zarr(f"{zarrurl}labels/label_DAPI/0")[s_z:e_z, s_y:e_y, s_x:e_x].compute()

        # Set label_value (FIXME: is this correct?)
        label_value = int(ROI_table.obs.index[i_ROI]) + 1

        # Define organoid/background masks (FIXME: make variable names more general)
        organoid_mask = organoid_labels == label_value
        background_mask = organoid_labels != label_value

        # Load current secondary labels
        old_mask = da.from_zarr(f"{zarrurl}labels/{output_label_name}/0")[s_z:e_z, s_y:e_y, s_x:e_x].compute()

        # Compute new secondary labels for current ROI
        new_mask = np.zeros_like(input_image_array)
        new_mask[organoid_mask] = np.random.randint(0, 100, size=new_mask.shape)[organoid_mask]

        # IMPORTANT STEP: restore original background 
        new_mask[background_mask] = old_mask[background_mask]

        # Compute and store 0-th level to disk
        region = (slice(s_z, e_z), slice(s_y, e_y), slice(s_x, e_x))
        da.array(new_mask).to_zarr(
            url=mask_zarr,
            region=region,
            compute=True,
        )

(needless to say: this is far from being clean..)

tcompa commented Feb 15, 2023

A simple to-do that will clarify things [but not today ;)]:

  • Improve the fake segmentation, to make it clear whether we are overwriting secondary labels from other neighboring ROIs.

tcompa commented Feb 16, 2023

Concerning previous comments: with a minor update to the prototype (and upon inspecting the labels in napari), I can confirm that "secondary" labels for neighboring "primary" labels are not overwritten (that is, only the non-background pixels are updated, and then the whole new array is written to disk).

I think this means that the core feature is now well captured by the prototype. Now we should aim at making it a more polished cellpose_segmentation_secondary task (that's the simplest way to get started, but will only exist in the current branch), and then likely integrate it into the cellpose_segmentation.

Discussion about napari_workflows_wrapper will come later.

tcompa commented Feb 17, 2023

This version of example_06 starts to look promising, apart from:

Other than that, we can now perform (only through the example scripts) secondary labeling inside the organoids, even if the quality is terrible and it comes nowhere close to identifying nuclei.

[screenshot from 2023-02-17 omitted]

tcompa linked a pull request Feb 20, 2023 that will close this issue
tcompa added the Priority Important, but not the highest priority and High Priority Current Priorities & Blocking Issues labels, and removed the High Priority Current Priorities & Blocking Issues label Feb 23, 2023
tcompa self-assigned this Mar 1, 2023
tcompa removed the High Priority Current Priorities & Blocking Issues label Mar 1, 2023
tcompa added the Tables AnnData and ROI/feature tables label Sep 19, 2023