Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review/optimize use of get_overlapping_pairs_3D in Cellpose task #779

Open
tcompa opened this issue Jul 3, 2024 · 4 comments
Open

Review/optimize use of get_overlapping_pairs_3D in Cellpose task #779

tcompa opened this issue Jul 3, 2024 · 4 comments
Labels
Backlog Backlog issues we may eventually fix, but aren't a priority performance testing

Comments

@tcompa
Copy link
Collaborator

tcompa commented Jul 3, 2024

Branching from #764 (fixed in principle with #778)

After a fix, it'd be useful to have a rough benchmark of get_overlapping_pairs_3D for one of those many-labels cases (say 1000 labels for a given i_ROI). Since this function does nothing else than printing a warning, I would be very strict in how long a runtime we can accept.

If it turns out that it is slow, we can easily improve it in a trivial way (simply stop after the first overlap is found) or also in more systematic ways (the current approach is close to being the definition of a slow Python function: a quadratic-scaling for loop, which calls a pure-Python function and then even appends to a list).

If the function is not slow, then there's no special need for a refactor.

@jluethi
Copy link
Collaborator

jluethi commented Jul 3, 2024

Another thing to consider here:
This (as well as the original implementation) will only check for potential bounding box overlaps within a given ROI that's being processed, right?

A future improvement may be to check for overlaps across ROIs as well, as in theory, they could also overlap. Given that this is just for warnings, not sure how relevant this is though. Our masking should still work for this after all.

@tcompa
Copy link
Collaborator Author

tcompa commented Jul 4, 2024

A future improvement may be to check for overlaps across ROIs as well, as in theory, they could also overlap. Given that this is just for warnings, not sure how relevant this is though. Our masking should still work for this after all.

For the record, making this change is quite straightforward, since we should only move the check from within the ROI loop to its end (EDIT: this diff is not working, but the idea remains valid):

diff --git a/fractal_tasks_core/tasks/cellpose_segmentation.py b/fractal_tasks_core/tasks/cellpose_segmentation.py
index ea78661a..403630a1 100644
--- a/fractal_tasks_core/tasks/cellpose_segmentation.py
+++ b/fractal_tasks_core/tasks/cellpose_segmentation.py
@@ -541,15 +541,6 @@ def cellpose_segmentation(
 
             bbox_dataframe_list.append(bbox_df)
 
-            overlap_list = get_overlapping_pairs_3D(
-                bbox_df, full_res_pxl_sizes_zyx
-            )
-            if len(overlap_list) > 0:
-                logger.warning(
-                    f"ROI {indices} has "
-                    f"{len(overlap_list)} bounding-box pairs overlap"
-                )
-
         # Compute and store 0-th level to disk
         da.array(new_label_img).to_zarr(
             url=mask_zarr,
@@ -582,6 +573,15 @@ def cellpose_segmentation(
             bbox_dataframe_list = [empty_bounding_box_table()]
         # Concatenate all ROI dataframes
         df_well = pd.concat(bbox_dataframe_list, axis=0, ignore_index=True)
+
+        overlap_list = get_overlapping_pairs_3D(
+            bbox_dataframe_list, full_res_pxl_sizes_zyx
+        )
+        if len(overlap_list) > 0:
+            logger.warning(
+                f"{len(overlap_list)} bounding-box pairs overlap"
+            )
+
         df_well.index = df_well.index.astype(str)
         # Extract labels and drop them from df_well
         labels = pd.DataFrame(df_well["label"].astype(str))

However we should not do this without actual benchmark, or we risk introducing a very bad scaling. If there are N organoids with M labels each (under the fake assumption that each organoid has the same number of internal objects), we have:

  1. Current main: (N*(N-1)/2) * (M*(M-1)/2) overlap checks
  2. After Fix wrong repeated overlap check for bounding-boxes in cellpose task #778: NM(M-1)/2 overlap checks
  3. After the patch described in this comment: (NM)(N*M-1)/2 overlap checks

E.g. for N=138 organoids (ref #764) and 100 labels (random guess), we'd have:

  1. 46*10^6 checks
  2. 683k checks
  3. 95*10^6 checks

Note that:

If it turns out that it is slow, we can easily improve it in a trivial way (simply stop after the first overlap is found) or also in more systematic ways (the current approach is close to being the definition of a slow Python function: a quadratic-scaling for loop, which calls a pure-Python function and then even appends to a list).

@jluethi
Copy link
Collaborator

jluethi commented Jul 4, 2024

Agreed on not doing this now. More something to test if this gets benchmarked further :)

I'd say we go with the fix now as you implemented it here. And consider most of this benchmarking in the context of a future Cellpose task refactor to use new OME-Zarr reader/writer strategy

@jluethi
Copy link
Collaborator

jluethi commented Oct 17, 2024

I just debugged this with some users who (mistakenly) had output ROI tables set for single cell segmentation where they have tens of thousands of segmented objects. In that case, this performance became super prohibitive, even in the slightly optimized version.

I'd say we should remove this in the next Cellpose refactor. The warning is not very useful, but sometimes highly performance critcial.

For the extreme example, for 1 user, their whole segmentation task ran in under a minute, but then it took almost 9 hours (!!) to do these overlap checks for 28 thousand label objects:

2024-10-16 10:28:52,329; INFO; START cellpose_segmentation task
[... Detail logs] 
2024-10-16 10:28:53,449; INFO; [segment_ROI] START | x: <class 'numpy.ndarray'>, (1, 1, 5400, 5120) | do_3D=False | model.diam_mean=17.0 | diameter=10.0 | advanced_cellpose_model_params.flow_threshold=0.4 | normalize.type='default'
2024-10-16 10:29:27,081; INFO; [segment_ROI] END   | Elapsed: 33.627 s | mask.shape=(1, 5400, 5120), mask.dtype=dtype('uint16') (then <class 'numpy.uint32'>), np.max(mask)=28360 | model.diam_mean=17.0 | diameter=10.0 | advanced_cellpose_model_params.flow_threshold=0.4
2024-10-16 10:29:27,134; INFO; ROI had num_labels_roi=28360, num_labels_tot={'num_labels_tot': 28360}
[This is were the cellpose segmentation has ended]
2024-10-16 19:20:10,278; WARNING; ROI [0, 1, 0, 5400, 0, 5120] has 18940 bounding-box pairs overlap

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Backlog Backlog issues we may eventually fix, but aren't a priority performance testing
Projects
Development

No branches or pull requests

2 participants