Bulk stats bug fix #437
Conversation
Linting results by Pylint: Your code has been rated at 8.70/10 (previous run: 8.70/10, +0.00)
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff            @@
##           RC_v1.5.x     #437   +/- ##
==========================================
  Coverage      60.91%   60.91%
==========================================
  Files             23       23
  Lines           3541     3544       +3
==========================================
+ Hits            2157     2159       +2
- Misses          1384     1385       +1

Flags with carried forward coverage won't be shown.
Thanks @JuliaKukulies ! I want to do a bit more testing (particularly with #354 being in the 1.6.0 branch now), but at this stage I'm happy for this to be merged.
"Feature labels are not unique which may cause unexpected results for the computation of bulk statistics." | ||
) | ||
# extra warning when feature labels are not unique in timestep | ||
uniques = features.groupby("time")[id_column].value_counts().values |
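For context, a minimal sketch of what this check detects, using a hypothetical toy dataframe (column names chosen to match the snippet above):

```python
import pandas as pd

# Toy features dataframe: feature ID 1 appears twice at the same timestep
features = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2020-01-01 00:00", "2020-01-01 00:00", "2020-01-01 00:05"]
        ),
        "feature": [1, 1, 1],
    }
)
id_column = "feature"

# Counts per (timestep, label); any count > 1 means a label is duplicated
# within a single timestep, which is what the extra warning guards against
uniques = features.groupby("time")[id_column].value_counts().values
print((uniques > 1).any())  # True for this toy example
```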
Should we group by "time" or by "frame"? I actually don't know the correct answer here; perhaps this opens a philosophical can of worms that we don't want to open.
I think by "time" would make more sense for this warning because we actually use the time dimension to perform the bulk statistics for each unique feature on:
tobac/tobac/utils/bulk_statistics.py, lines 288 to 298 in 698d53b:
    # get bulk statistics for each timestep
    step_statistics = []
    for tt in pd.to_datetime(segmentation_mask.time):
        # select specific timestep
        segmentation_mask_t = segmentation_mask.sel(time=tt).data
        fields_t = (
            field.sel(time=tt).values if "time" in field.coords else field.values
            for field in fields
        )
        features_t = features.loc[features.time == tt].copy()
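For comparison, a hedged sketch of the "frame" alternative discussed above (not what the PR implements), assuming the features dataframe carries tobac's usual integer `frame` column:

```python
import pandas as pd

# Toy dataframe with an integer "frame" index alongside "time"
features = pd.DataFrame(
    {
        "frame": [0, 0, 1],
        "time": pd.to_datetime(["2020-01-01 00:00"] * 2 + ["2020-01-01 00:05"]),
        "feature": [1, 1, 1],
    }
)
id_column = "feature"

# Grouping by the integer frame index sidesteps timestamp-precision
# mismatches, but assumes the "frame" column is present
uniques_by_frame = features.groupby("frame")[id_column].value_counts().values
print((uniques_by_frame > 1).any())  # True: label 1 is duplicated in frame 0
```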
But you have a good point. The latter could be done by "frame", too. So let's just briefly discuss this in the next dev meeting, maybe!
Personally I prefer using "time" here, but I can also see why "frame" would be appropriate (e.g. in edge cases like Sean mentioned, where dataframe times are saved to ms accuracy rather than ns, causing mismatches). Maybe we can leave it as is and revisit if it causes any issues in future.
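To illustrate the precision edge case mentioned here, a small self-contained sketch (the values are made up):

```python
import numpy as np

# The same instant stored at ms vs ns precision: once sub-millisecond
# information is truncated away, an exact equality comparison fails
t_ns = np.datetime64("2020-01-01T00:00:00.000000001", "ns")
t_ms = t_ns.astype("datetime64[ms]")  # drops the trailing 1 ns
print(t_ns == t_ms)  # False, so a selection like features.time == tt misses rows
```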
Thanks for your quick review @freemansw1. And totally agree: we should add more tests for this in …
Great, thanks for catching this! I am happy for this to be merged as is, with the possibility of revisiting the "time" vs "frame" issue in future if need be.
"Feature labels are not unique which may cause unexpected results for the computation of bulk statistics." | ||
) | ||
# extra warning when feature labels are not unique in timestep | ||
uniques = features.groupby("time")[id_column].value_counts().values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I prefer using "time"
here, but can see also why "frame"
would also be appropriate (e.g. in edge cases like Sean mentioned where dataframe times are saved to ms accuracy rather than ns causing mismatches. Maybe we can leave it as is and revisit if it causes any issues in future
@JuliaKukulies are you happy to merge?
A few small fixes for the bulk statistics method `get_statistics_from_mask`:

- At some locations, the hardcoded column name "feature" was not updated to the string variable `id_column`, which users can set to compute the bulk statistics with an ID stored under another column name. This caused the bulk statistics to fail when the input dataframe used a different column name for the feature ID.
- Added some test parameters in `test_utils_bulk_statistics` to account for the above case. Also added `index` here, since the tests had not previously covered this parameter, which allows users to compute bulk statistics for specified feature regions only.
- Corrected the warning message that accidentally included a `raise` statement, which turned the warning into an error (`TypeError: exceptions must derive from BaseException`).
- Added another warning to make users aware when feature labels are non-unique even within the same timestep. There is a common use case of working with storm IDs that stay the same over multiple timesteps (for one track); for this case, the code still works correctly because the computed statistics are assigned for each timestep independently. However, no feature ID should occur multiple times within the same timestep, as that would lead to unexpected results when the bulk statistics are added to the output dataframe.
- Finally, added a line to make sure that feature IDs are integers. This is already controlled by how we output the feature dataframe, but it does not hurt to double-check here, since users might modify the feature dataframe or use the output from a different tracking algorithm/dataset. (A sketch of these checks follows below.)
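A hedged sketch of the defensive checks described above, gathered into one hypothetical helper (the actual PR performs these steps inline in `get_statistics_from_mask`; the helper name and the second warning's wording are illustrative):

```python
import warnings

import pandas as pd


def check_feature_ids(features: pd.DataFrame, id_column: str = "feature") -> pd.DataFrame:
    """Hypothetical helper mirroring the checks described in this PR.

    Assumes the features dataframe has a "time" column and a feature-ID
    column named by ``id_column``.
    """
    # ensure feature IDs are integers, in case the dataframe was modified
    # or produced by a different tracking algorithm/dataset
    features[id_column] = features[id_column].astype(int)

    if features[id_column].duplicated().any():
        # IDs repeated across timesteps are a supported use case
        # (e.g. persistent storm IDs along a track)
        warnings.warn(
            "Feature labels are not unique which may cause unexpected "
            "results for the computation of bulk statistics."
        )
        # but repeats within a single timestep give unexpected results
        uniques = features.groupby("time")[id_column].value_counts().values
        if (uniques > 1).any():
            # illustrative wording, not the exact message in the PR
            warnings.warn(
                "Feature labels are not unique within the same timestep."
            )
    return features
```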