Improve efficiency of compute_statistic by minimizing data access #2147
Conversation
…by computing minimal sub-cube for mask
glue/core/data.py (outdated)
subarray_slices = []
for idim in range(mask.ndim):
    collapse_axes = tuple(index for index in range(mask.ndim) if index != idim)
    valid = mask.any(axis=collapse_axes)
This step can be arbitrarily expensive, no? It will loop over all of the mask data? Or is mask already sliced down at a previous step?
We already call np.any on the mask above, so this won't change much, and we typically operate in chunks for big arrays.
When we use the sub-cube approach, we need to pad out the result so that its shape matches the result without the optimization. I've added a regression test for this, but I still need to push up a fix once I have time.
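The padding issue described above can be illustrated with a minimal sketch (this is an illustration of the idea, not the actual glue code; the array shapes and the use of NaN padding are assumptions): a statistic collapsed over a sub-cube yields a shorter profile than the full-cube result, so it must be padded back to the full shape.

```python
import numpy as np

# Toy data and a subset mask covering only part of the cube.
data = np.arange(24, dtype=float).reshape(6, 4)
mask = np.zeros((6, 4), dtype=bool)
mask[2:5, 1:3] = True

# Minimal bounding box of the mask along axis 0.
rows = np.where(mask.any(axis=1))[0]
sub = np.where(mask[rows[0]:rows[-1] + 1],
               data[rows[0]:rows[-1] + 1], np.nan)

# Collapsing over axis 1 gives a profile of length 3, not 6.
profile = np.nanmean(sub, axis=1)

# Pad with NaN so the shape matches the unoptimized result.
padded = np.full(data.shape[0], np.nan)
padded[rows[0]:rows[-1] + 1] = profile
print(padded.shape)  # (6,)
```

Without the padding step, downstream code (e.g. the profile viewer) would receive an array whose length depends on the subset, which is what the regression test guards against.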
Codecov Report

@@            Coverage Diff             @@
##           master    #2147      +/-   ##
==========================================
- Coverage   87.87%   87.86%   -0.02%
==========================================
  Files         246      246
  Lines       22596    22708     +112
==========================================
+ Hits        19857    19952      +95
- Misses       2739     2756      +17

Continue to review the full report at Codecov.
This improves the efficiency of compute_statistic, especially in the context of the profile viewer when subsets are applied, by first finding the minimal bounding box for the selection and then extracting data using only this bounding box. In simple tests, this can improve performance by 30x or more. It works especially well when loading CASA datasets, since those are very sensitive to disk access.
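The bounding-box step can be sketched as follows (a minimal self-contained illustration under stated assumptions, not the actual glue implementation; the function name `subcube_slices` is hypothetical). For each dimension, all other axes are collapsed with `any`, and the extent of True values gives the minimal slice along that dimension:

```python
import numpy as np

def subcube_slices(mask):
    # For each dimension, collapse every other axis and find the
    # range of indices containing any selected element.
    slices = []
    for idim in range(mask.ndim):
        collapse_axes = tuple(i for i in range(mask.ndim) if i != idim)
        valid = mask.any(axis=collapse_axes)
        indices = np.where(valid)[0]
        slices.append(slice(indices[0], indices[-1] + 1))
    return tuple(slices)

mask = np.zeros((5, 5), dtype=bool)
mask[1:3, 2:4] = True
print(subcube_slices(mask))  # (slice(1, 3, None), slice(2, 4, None))
```

Indexing the data with these slices (`data[subcube_slices(mask)]`) touches only the bounding box of the selection, which is where the disk-access savings for formats like CASA come from.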
This needs tests and a changelog entry, and the runtime errors need to be addressed.
cc @keflavich