Store masked array to zarr #194
Let's better define this issue. First requirement:
Slightly more realistic requirement:
Sounds like a very good start! If those requirements/tests work, I'm confident that the overall approach can work :) Let's see whether this will be easy to implement or not.
For the record, here is a first quick attempt with dask masked arrays. TL;DR: I could only make it "work" when writing to a new zarr array, not when overwriting an existing one.

Expected behavior:
Test:

```python
import dask.array as da
import numpy as np


def cleanup_files():
    # Remove some existing files, to avoid ContainsArrayError
    import os
    import shutil
    for filename in ["x.zarr", "x_new.zarr", "y.zarr"]:
        if os.path.exists(filename):
            shutil.rmtree(filename)


def prepare_masked_array():
    cleanup_files()
    # Write a 4x4 array to disk
    da.array([[1, 1, 2, 2],
              [1, 1, 2, 2],
              [3, 3, 4, 4],
              [3, 3, 4, 4]]
             ).to_zarr("x.zarr")
    # Load the array (lazily)
    x = da.from_zarr("x.zarr")
    # Mask elements where x!=1 (x=1 defined the "organoid" label)
    mx = da.ma.masked_not_equal(x, 1)
    # Here is a (equivalent?) alternative way
    # mx = da.ma.masked_where(x != 1, x)
    return mx


def example_1_new_zarr():
    print("[example_1_new_zarr] Start")
    # Prepare and modify a masked array
    mx = prepare_masked_array()
    mx += 5
    # Write the masked array to disk
    print("[example_1_new_zarr] Now write the following array to x_new.zarr")
    print(mx.compute())
    mx.to_zarr("x_new.zarr")
    # Load array from disk and check output
    new_x = da.from_zarr("x_new.zarr").compute()
    print("[example_1_new_zarr] I loaded this array:")
    print(new_x)
    success = np.array_equal(
        new_x,
        np.array([[6, 6, 2, 2],
                  [6, 6, 2, 2],
                  [3, 3, 4, 4],
                  [3, 3, 4, 4]]
                 )
    )
    if success:
        print("[example_1_new_zarr] Success")
    else:
        print("[example_1_new_zarr] Failed")
    print()


def example_2_overwrite_zarr():
    print("[example_2_overwrite_zarr] Start")
    # Prepare and modify a masked array
    mx = prepare_masked_array()
    mx += 5
    # Write the masked array to disk
    print("[example_2_overwrite_zarr] Now write the following array to x.zarr")
    print(mx.compute())
    mx.to_zarr("x.zarr", overwrite=True)
    # Load array from disk and check output
    new_x = da.from_zarr("x.zarr").compute()
    print("[example_2_overwrite_zarr] I loaded this array:")
    print(new_x)
    success = np.array_equal(
        new_x,
        np.array([[6, 6, 2, 2],
                  [6, 6, 2, 2],
                  [3, 3, 4, 4],
                  [3, 3, 4, 4]]
                 )
    )
    if success:
        print("[example_2_overwrite_zarr] Success")
    else:
        print("[example_2_overwrite_zarr] Failed")
    print()


if __name__ == "__main__":
    example_1_new_zarr()
    example_2_overwrite_zarr()
```

Output:
An alternative option is to use "standard" dask arrays (that is, not masked arrays). In
The solution in that case could be to use compute_chunk_sizes right before writing to disk, but it requires some testing (especially to make sure that we do not end up re-computing the whole array); a rough sketch follows below. Refs:
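As a minimal, untested sketch of that idea (it assumes the x.zarr array from the example above; the boolean selection is just a stand-in for whatever operation produces unknown chunk sizes):

```python
import dask.array as da

# Load the array lazily (assumes x.zarr from the example above)
x = da.from_zarr("x.zarr")

# Boolean indexing leaves chunk sizes unknown (NaN), which blocks a direct write
y = x[x != 1]
print(y.chunks)  # chunk sizes are NaN at this point

# Resolve the unknown chunk sizes right before writing to disk;
# note that this triggers a computation pass, which is exactly the
# performance concern mentioned above
y.compute_chunk_sizes()
print(y.chunks)  # now concrete integers; writing to zarr (possibly after
                 # rechunking) could follow here
```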
Self-reminder: neither of the previous comments deals with regions, which would increase complexity a bit further.
Interesting. The main use cases I would foresee are:
I don't foresee using this approach for masked image overwriting (I'm also not expecting masked image writing to a new image, but I'm less sure there).
What does it mean that they don't deal with regions? What are regions in this context? Let's discuss the details tomorrow :)
Briefly (and then let's clarify tomorrow):
Ah, these regions! Yes, the main idea is to use this feature for working with such regions.
After discussions with @jluethi, we have defined the expected behavior more precisely.
Because 3c is complex to achieve by directly calling something like
Note that "relevant portion of" always refers to additional layers of partitioning, which in our case corresponds to organoid-bounding-box regions; see the sketch below.
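To make the "relevant portion" idea concrete, here is a minimal hypothetical sketch with plain zarr (the path labels.zarr and the bounding-box indices are made up): only one region is read into memory, updated, and written back.

```python
import zarr

# Hypothetical on-disk label array (path and indices are placeholders)
labels = zarr.open("labels.zarr", mode="r+")

# One "relevant portion": an organoid bounding box
s_z, e_z, s_y, e_y, s_x, e_x = 0, 10, 100, 200, 100, 200

# Read only this region into memory (returns a numpy array)
portion = labels[s_z:e_z, s_y:e_y, s_x:e_x]

# ... update `portion` here, e.g. insert new secondary labels ...

# Write only this region back; the rest of the on-disk array is untouched
labels[s_z:e_z, s_y:e_y, s_x:e_x] = portion
```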
To better explain the expected behavior, here is a pure-numpy example that goes in the intended direction - as far as I understand. Note that it does not include the concept of intermediate partitioning levels (like FOVs or bounding boxes), but only the logic of making temporary copies and then doing partial updates. Also: it can most likely be optimized further, but that's not the point here. @jluethi Does the logic in this example resemble what we discussed this morning? Do you see some crucial missing steps (apart from the obvious reading-from-disk and writing-to-disk, and apart from adding zarr+dask where needed)?

Script:

```python
import numpy as np


def segmentation(img_data, shift):
    return np.random.randint(1, 4, size=img_data.shape) + shift


# Image data
img = np.array([[111, 111, 222, 222],
                [111, 111, 222, 222],
                [222, 222, 222, 222],
                [222, 222, 222, 222]])

# Primary (organoid) labels
organoid_labels = np.array([[1, 1, 2, 2],
                            [1, 1, 2, 2],
                            [2, 2, 2, 2],
                            [2, 2, 2, 2]])

# Secondary (nuclear) labeling, looping over organoids
shift = 0  # useful for relabeling
nuclei_labels = np.zeros_like(organoid_labels)
for organoid_label_value in np.sort(np.unique(organoid_labels)):
    print(f"\n--- Start processing {organoid_label_value=} ---\n")
    organoid_mask = organoid_labels == organoid_label_value
    background_mask = organoid_labels != organoid_label_value
    print("Organoid mask:")
    print(organoid_mask)
    print()
    img_to_process = img.copy()
    img_to_process[background_mask] = 0
    print("Image data to process:")
    print(img_to_process)
    print()
    tmp_nuclei_labels = segmentation(img_to_process, shift=shift)
    print("New nuclei labels: "
          "(note that values outside the organoid will be discarded)")
    print(tmp_nuclei_labels)
    print()
    nuclei_labels[organoid_mask] = tmp_nuclei_labels[organoid_mask]
    shift = np.amax(nuclei_labels)
    print("Current status of nuclei labels:")
    print(nuclei_labels)
    print()
```

Output:
We only load a bounding box over a given object, e.g. a given organoid. The bounding box for the next organoid probably has very small overlap with that, so I'm not sure what you're referencing here that already contains data. Or is it the placeholder object that exists and will be overwritten?

Regarding the numpy code example: it obviously needs to work in combination with regions, i.e. only load the bounding box and only write the masked bounding box back, because otherwise too much data would need to be loaded into memory at once (especially in 3D cases). But the core logic is sound! A distilled sketch of that write-back rule follows below.
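As a distilled, hypothetical sketch of the "write the masked bounding box back" rule (tiny made-up arrays stand in for the regions that would actually be loaded from disk; the real prototype follows in the next comment):

```python
import numpy as np

# Placeholders for one bounding box worth of data
label_value = 1
primary_roi = np.array([[1, 1, 2],
                        [1, 2, 2],
                        [2, 2, 2]])          # primary labels in the bbox
old_secondary_roi = np.array([[0, 0, 7],
                              [0, 7, 7],
                              [7, 7, 7]])    # secondary labels already on disk
new_secondary_roi = np.array([[3, 4, 9],
                              [3, 4, 9],
                              [9, 9, 9]])    # fresh segmentation of the bbox

inside = primary_roi == label_value
# Keep the new labels only inside the current organoid,
# and restore the pre-existing labels everywhere else
out_roi = np.where(inside, new_secondary_roi, old_secondary_roi)
print(out_roi)
# [[3 4 7]
#  [3 7 7]
#  [7 7 7]]
# `out_roi` would then be written back to the same bbox region on disk
```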
My bad. TL;DR: we should never talk about FOVs here (and we should also update the variable names in cellpose_segmentation, since they do refer to FOVs). I could get an (admittedly very hack-ish) prototype to work, in the following way:
The output seems reasonable, see the following screenshots of (1) the standard ("primary") labels, (2) both "primary" and "secondary" labels (note that I stopped processing at the 100-th ROI, so that not all nuclei also have the internal structure). Bounding boxes clearly overlap, but it seems like the background was not modified. Of course this could be by accident (it's hard to say whether random numbers come from one nucleus or from another one), and I shall make the test more compelling.

The relevant block of code (this is all in

```python
for i_ROI, indices in enumerate(list_indices):
    logger.info(f"[{well_id}] Now processing ROI {i_ROI+1}/{num_ROIs}")
    # Define region
    s_z, e_z, s_y, e_y, s_x, e_x = indices[:]
    # Prepare input for cellpose
    input_image_array = data_zyx[s_z:e_z, s_y:e_y, s_x:e_x].compute()
    # Load current primary labels (FIXME: do not hard-code path)
    organoid_labels = da.from_zarr(f"{zarrurl}labels/label_DAPI/0")[s_z:e_z, s_y:e_y, s_x:e_x].compute()
    # Set label_value (FIXME: is this correct?)
    label_value = int(ROI_table.obs.index[i_ROI]) + 1
    # Define organoid/background masks (FIXME: make variable names more general)
    organoid_mask = organoid_labels == label_value
    background_mask = organoid_labels != label_value
    # Load current secondary labels
    old_mask = da.from_zarr(f"{zarrurl}labels/{output_label_name}/0")[s_z:e_z, s_y:e_y, s_x:e_x].compute()
    # Compute new secondary labels for current ROI
    new_mask = np.zeros_like(input_image_array)
    new_mask[organoid_mask] = np.random.randint(0, 100, size=new_mask.shape)[organoid_mask]
    # IMPORTANT STEP: restore original background
    new_mask[background_mask] = old_mask[background_mask]
    # Compute and store 0-th level to disk
    region = (slice(s_z, e_z), slice(s_y, e_y), slice(s_x, e_x))
    da.array(new_mask).to_zarr(
        url=mask_zarr,
        region=region,
        compute=True,
    )
```

(Needless to say, this is far from being clean.)
A simple to-do that will clarify things [but not today ;)]:
Concerning previous comments: with a minor update to the prototype (and upon inspecting the labels in napari), I can confirm that "secondary" labels for neighboring "primary" labels are not overwritten (that is, only the non-background pixels are updated, and then the whole new array is written to disk). I think this means that the core feature is now well captured by the prototype. Now we should aim at making it a more polished

Discussion about
This version of example_06 starts to look promising, apart from:
Other than that, we can now perform (only through the example scripts) secondary labeling inside the organoids, even if the quality is terrible and it goes nowhere near identifying nuclei.
(branching from #45)