
Bulk stats bug fix #437

Merged

Conversation

Member

@JuliaKukulies JuliaKukulies commented Jul 23, 2024

A few small fixes for the bulk statistics method get_statistics_from_mask:

  • At some locations the hardcoded column name "feature" was not replaced with the string variable id_column, which users can set when they want to compute the bulk statistics with a feature ID stored under a different column name. This caused the bulk statistics to fail when the input dataframe used a different column name for the feature ID.

  • Added some test parameters in test_utils_bulk_statistics to account for the above case. Also added the index parameter here, since the tests did not previously cover it. This parameter allows users to compute bulk statistics for specified feature regions only.

  • Corrected the warning message that accidentally included a raise statement, so that triggering the warning raised an error instead (TypeError: exceptions must derive from BaseException)

  • Added another warning that alerts users when feature labels are non-unique even within the same timestep. A common use case is working with storm IDs that stay the same over multiple timesteps (for one track); our code still works correctly in that case because we assign the computed statistics for each timestep independently. However, no feature ID should occur multiple times within the same timestep, as this would lead to unexpected results when the bulk statistics are added to the output dataframe.

  • Finally, I added a line to make sure that feature IDs are integers (this is already controlled by how we output the feature dataframe, but it does not hurt to double-check here, since users might modify the feature dataframe or use the output from a different tracking algorithm/dataset)
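For context on the raise-statement fix above: a minimal sketch (not the actual tobac code) of why prefixing warnings.warn() with raise breaks, and the corrected pattern. The helper name check_unique and its toy input are made up for illustration.

```python
import warnings


def check_unique(ids):
    """Warn (without raising) when feature labels are not unique."""
    if len(ids) != len(set(ids)):
        # Correct pattern: emit the warning and continue.
        # Writing `raise warnings.warn(...)` fails because warnings.warn()
        # returns None, and raising None gives
        # "TypeError: exceptions must derive from BaseException".
        warnings.warn(
            "Feature labels are not unique which may cause unexpected results"
            " for the computation of bulk statistics."
        )


# Capture the warning to show it is emitted, not raised.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_unique([1, 1, 2])

print(len(caught))  # → 1
```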

  • Have you followed our guidelines in CONTRIBUTING.md?
  • Have you self-reviewed your code and corrected any misspellings?
  • Have you written documentation that is easy to understand?
  • Have you written descriptive commit messages?
  • [NA] Have you added NumPy docstrings for newly added functions?
  • Have you formatted your code using black?
  • If you have introduced a new functionality, have you added adequate unit tests?
  • Have all tests passed in your local clone?
  • [NA] If you have introduced a new functionality, have you added an example notebook?
  • Have you kept your pull request small and limited so that it is easy to review?
  • Have the newest changes from this branch been merged?


github-actions bot commented Jul 23, 2024

Linting results by Pylint:

Your code has been rated at 8.70/10 (previous run: 8.70/10, +0.00)
The linting score is an indicator that reflects how well your code version follows Pylint’s coding standards and quality metrics with respect to the RC_v1.5.x branch.
A decrease usually indicates your new code does not fully meet style guidelines or has potential errors.


codecov bot commented Jul 24, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 60.91%. Comparing base (57612ec) to head (2ede64b).
Report is 115 commits behind head on RC_v1.5.x.

Files with missing lines Patch % Lines
tobac/utils/bulk_statistics.py 66.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           RC_v1.5.x     #437   +/-   ##
==========================================
  Coverage      60.91%   60.91%           
==========================================
  Files             23       23           
  Lines           3541     3544    +3     
==========================================
+ Hits            2157     2159    +2     
- Misses          1384     1385    +1     
Flag Coverage Δ
unittests 60.91% <66.66%> (+<0.01%) ⬆️


@JuliaKukulies JuliaKukulies added the label bug (Code that is failing or producing the wrong result) on Jul 26, 2024
Member

@freemansw1 freemansw1 left a comment


Thanks @JuliaKukulies ! I want to do a bit more testing (particularly with #354 being in the 1.6.0 branch now), but at this stage I'm happy for this to be merged.

```python
        "Feature labels are not unique which may cause unexpected results for the computation of bulk statistics."
    )
# extra warning when feature labels are not unique in timestep
uniques = features.groupby("time")[id_column].value_counts().values
```
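To illustrate what this groupby/value_counts line detects, here is a small self-contained sketch with a toy dataframe (the data are invented for illustration): a feature ID repeated across timesteps is fine, but a count greater than 1 within a single timestep flags a duplicate.

```python
import pandas as pd

# Toy feature dataframe: ID 1 persists across two timesteps (one track),
# which is fine, but ID 2 appears twice within t1, which should warn.
features = pd.DataFrame(
    {
        "time": ["t0", "t1", "t1", "t1"],
        "feature": [1, 1, 2, 2],
    }
)

# Counts of each feature ID within each timestep; any count > 1 means
# the same ID occurs multiple times inside a single timestep.
counts = features.groupby("time")["feature"].value_counts().values
print((counts > 1).any())  # → True
```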
Member

should we group by time or by frame? I actually don't know the correct answer here; perhaps this opens a philosophical can of worms that we don't want to do.

Member Author

I think "time" would make more sense for this warning, because we actually iterate over the time dimension to perform the bulk statistics for each unique feature:

```python
# get bulk statistics for each timestep
step_statistics = []
for tt in pd.to_datetime(segmentation_mask.time):
    # select specific timestep
    segmentation_mask_t = segmentation_mask.sel(time=tt).data
    fields_t = (
        field.sel(time=tt).values if "time" in field.coords else field.values
        for field in fields
    )
    features_t = features.loc[features.time == tt].copy()
```
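As a side note on the "time" vs "frame" question: in a well-formed feature dataframe each frame index maps to exactly one time value, so grouping by either column gives the same duplicate counts. A minimal sketch with invented data (not tobac code):

```python
import numpy as np
import pandas as pd

# Toy dataframe where each "frame" corresponds to exactly one "time".
features = pd.DataFrame(
    {
        "frame": [0, 0, 1, 1],
        "time": pd.to_datetime(
            ["2024-07-23 00:00", "2024-07-23 00:00",
             "2024-07-23 00:05", "2024-07-23 00:05"]
        ),
        "feature": [1, 2, 1, 2],
    }
)

# Per-group feature counts agree whether we group by "time" or "frame";
# they would diverge only if times and frames got out of sync
# (e.g. through precision mismatches in saved timestamps).
by_time = features.groupby("time")["feature"].value_counts().values
by_frame = features.groupby("frame")["feature"].value_counts().values
print(np.array_equal(by_time, by_frame))  # → True
```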

Member Author

But you have a good point. The latter could be done by "frame", too. So let's just briefly discuss this in the next dev meeting!

Member

Personally I prefer using "time" here, but I can also see why "frame" would be appropriate (e.g. in edge cases like Sean mentioned, where dataframe times are saved to ms accuracy rather than ns, causing mismatches). Maybe we can leave it as is and revisit if it causes any issues in future.

@JuliaKukulies
Member Author

> Thanks @JuliaKukulies ! I want to do a bit more testing (particularly with #354 being in the 1.6.0 branch now), but at this stage I'm happy for this to be merged.

Thanks for your quick review @freemansw1. And totally agree, we should add more tests for this in v1.6.0.

Member

@w-k-jones w-k-jones left a comment

Great, thanks for catching this! I am happy for this to be merged as is, with the possibility of revisiting the "time" vs "frame" issue in future if need be.


@freemansw1
Member

@JuliaKukulies are you happy to merge?

@JuliaKukulies JuliaKukulies merged commit dadad41 into tobac-project:RC_v1.5.x Aug 15, 2024
21 checks passed