Review/optimize use of `get_overlapping_pairs_3D` in Cellpose task #779

tcompa · 2024-07-03T15:02:19Z

Branching from #764 (fixed in principle with #778)

After a fix, it'd be useful to have a rough benchmark of get_overlapping_pairs_3D for one of those many-labels cases (say 1000 labels for a given i_ROI). Since this function does nothing else than printing a warning, I would be very strict in how long a runtime we can accept.

If it turns out that it is slow, we can easily improve it in a trivial way (simply stop after the first overlap is found) or also in more systematic ways (the current approach is close to being the definition of a slow Python function: a quadratic-scaling for loop, which calls a pure-Python function and then even appends to a list).

If the function is not slow, then there's no special need for a refactor.

The text was updated successfully, but these errors were encountered:

jluethi · 2024-07-03T16:46:00Z

Another thing to consider here:
This (as well as the original implementation) will only check for potential bounding box overlaps within a given ROI that's being processed, right?

A future improvement may be to check for overlaps across ROIs as well, as in theory, they could also overlap. Given that this is just for warnings, not sure how relevant this is though. Our masking should still work for this after all.

tcompa · 2024-07-04T06:44:06Z

A future improvement may be to check for overlaps across ROIs as well, as in theory, they could also overlap. Given that this is just for warnings, not sure how relevant this is though. Our masking should still work for this after all.

For the record, making this change is quite straightforward, since we should only move the check from within the ROI loop to its end (EDIT: this diff is not working, but the idea remains valid):

diff --git a/fractal_tasks_core/tasks/cellpose_segmentation.py b/fractal_tasks_core/tasks/cellpose_segmentation.py
index ea78661a..403630a1 100644
--- a/fractal_tasks_core/tasks/cellpose_segmentation.py
+++ b/fractal_tasks_core/tasks/cellpose_segmentation.py
@@ -541,15 +541,6 @@ def cellpose_segmentation(
 
             bbox_dataframe_list.append(bbox_df)
 
-            overlap_list = get_overlapping_pairs_3D(
-                bbox_df, full_res_pxl_sizes_zyx
-            )
-            if len(overlap_list) > 0:
-                logger.warning(
-                    f"ROI {indices} has "
-                    f"{len(overlap_list)} bounding-box pairs overlap"
-                )
-
         # Compute and store 0-th level to disk
         da.array(new_label_img).to_zarr(
             url=mask_zarr,
@@ -582,6 +573,15 @@ def cellpose_segmentation(
             bbox_dataframe_list = [empty_bounding_box_table()]
         # Concatenate all ROI dataframes
         df_well = pd.concat(bbox_dataframe_list, axis=0, ignore_index=True)
+
+        overlap_list = get_overlapping_pairs_3D(
+            bbox_dataframe_list, full_res_pxl_sizes_zyx
+        )
+        if len(overlap_list) > 0:
+            logger.warning(
+                f"{len(overlap_list)} bounding-box pairs overlap"
+            )
+
         df_well.index = df_well.index.astype(str)
         # Extract labels and drop them from df_well
         labels = pd.DataFrame(df_well["label"].astype(str))

However we should not do this without actual benchmark, or we risk introducing a very bad scaling. If there are N organoids with M labels each (under the fake assumption that each organoid has the same number of internal objects), we have:

Current main: (N*(N-1)/2) * (M*(M-1)/2) overlap checks
After Fix wrong repeated overlap check for bounding-boxes in cellpose task #778: NM(M-1)/2 overlap checks
After the patch described in this comment: (NM)(N*M-1)/2 overlap checks

E.g. for N=138 organoids (ref #764) and 100 labels (random guess), we'd have:

46*10^6 checks
683k checks
95*10^6 checks

Note that:

If it turns out that it is slow, we can easily improve it in a trivial way (simply stop after the first overlap is found) or also in more systematic ways (the current approach is close to being the definition of a slow Python function: a quadratic-scaling for loop, which calls a pure-Python function and then even appends to a list).

jluethi · 2024-07-04T07:55:17Z

Agreed on not doing this now. More something to test if this gets benchmarked further :)

I'd say we go with the fix now as you implemented it here. And consider most of this benchmarking in the context of a future Cellpose task refactor to use new OME-Zarr reader/writer strategy

jluethi · 2024-10-17T09:06:53Z

I just debugged this with some users who (mistakenly) had output ROI tables set for single cell segmentation where they have tens of thousands of segmented objects. In that case, this performance became super prohibitive, even in the slightly optimized version.

I'd say we should remove this in the next Cellpose refactor. The warning is not very useful, but sometimes highly performance critcial.

For the extreme example, for 1 user, their whole segmentation task ran in under a minute, but then it took almost 9 hours (!!) to do these overlap checks for 28 thousand label objects:

2024-10-16 10:28:52,329; INFO; START cellpose_segmentation task
[... Detail logs] 
2024-10-16 10:28:53,449; INFO; [segment_ROI] START | x: <class 'numpy.ndarray'>, (1, 1, 5400, 5120) | do_3D=False | model.diam_mean=17.0 | diameter=10.0 | advanced_cellpose_model_params.flow_threshold=0.4 | normalize.type='default'
2024-10-16 10:29:27,081; INFO; [segment_ROI] END   | Elapsed: 33.627 s | mask.shape=(1, 5400, 5120), mask.dtype=dtype('uint16') (then <class 'numpy.uint32'>), np.max(mask)=28360 | model.diam_mean=17.0 | diameter=10.0 | advanced_cellpose_model_params.flow_threshold=0.4
2024-10-16 10:29:27,134; INFO; ROI had num_labels_roi=28360, num_labels_tot={'num_labels_tot': 28360}
[This is were the cellpose segmentation has ended]
2024-10-16 19:20:10,278; WARNING; ROI [0, 1, 0, 5400, 0, 5120] has 18940 bounding-box pairs overlap

tcompa added Backlog Backlog issues we may eventually fix, but aren't a priority testing performance labels Jul 3, 2024

jluethi added this to Fractal Project Management Jul 3, 2024

github-project-automation bot moved this to TODO in Fractal Project Management Jul 3, 2024

tcompa mentioned this issue Jul 3, 2024

Fix wrong repeated overlap check for bounding-boxes in cellpose task #778

Merged

1 task

jluethi mentioned this issue Oct 17, 2024

ngio for fractal-tasks-core: Overview #850

Open

64 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Review/optimize use of `get_overlapping_pairs_3D` in Cellpose task #779

Review/optimize use of `get_overlapping_pairs_3D` in Cellpose task #779

tcompa commented Jul 3, 2024

jluethi commented Jul 3, 2024

tcompa commented Jul 4, 2024 •

edited

Loading

jluethi commented Jul 4, 2024

jluethi commented Oct 17, 2024

Review/optimize use of get_overlapping_pairs_3D in Cellpose task #779

Review/optimize use of get_overlapping_pairs_3D in Cellpose task #779

Comments

tcompa commented Jul 3, 2024

jluethi commented Jul 3, 2024

tcompa commented Jul 4, 2024 • edited Loading

jluethi commented Jul 4, 2024

jluethi commented Oct 17, 2024

Review/optimize use of `get_overlapping_pairs_3D` in Cellpose task #779

Review/optimize use of `get_overlapping_pairs_3D` in Cellpose task #779

tcompa commented Jul 4, 2024 •

edited

Loading