Bulk stats bug fix #437
Conversation
Linting results by Pylint: Your code has been rated at 8.70/10 (previous run: 8.70/10, +0.00)
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff            @@
##           RC_v1.5.x     #437   +/- ##
==========================================
  Coverage      60.91%   60.91%
==========================================
  Files             23       23
  Lines           3541     3544       +3
==========================================
+ Hits            2157     2159       +2
- Misses          1384     1385       +1

Flags with carried forward coverage won't be shown.
Thanks @JuliaKukulies ! I want to do a bit more testing (particularly with #354 being in the 1.6.0 branch now), but at this stage I'm happy for this to be merged.
"Feature labels are not unique which may cause unexpected results for the computation of bulk statistics." | ||
) | ||
# extra warning when feature labels are not unique in timestep | ||
uniques = features.groupby("time")[id_column].value_counts().values |
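For context, a minimal sketch of what this check detects, using a hypothetical toy dataframe (column names chosen to match the snippet above):

```python
import pandas as pd

# Toy features dataframe: feature ID 1 appears twice at the same timestep
features = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2020-01-01 00:00", "2020-01-01 00:00", "2020-01-01 00:05"]
        ),
        "feature": [1, 1, 1],
    }
)
id_column = "feature"

# Counts per (timestep, label); any count > 1 means a label is duplicated
# within a single timestep, which is what the extra warning guards against
uniques = features.groupby("time")[id_column].value_counts().values
print((uniques > 1).any())  # True for this toy example
```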
Should we group by "time" or by "frame"? I actually don't know the correct answer here; perhaps this opens a philosophical can of worms that we don't want to open.
I think by "time" would make more sense for this warning because we actually use the time dimension to perform the bulk statistics for each unique feature on:
tobac/tobac/utils/bulk_statistics.py, lines 288 to 298 in 698d53b:
    # get bulk statistics for each timestep
    step_statistics = []
    for tt in pd.to_datetime(segmentation_mask.time):
        # select specific timestep
        segmentation_mask_t = segmentation_mask.sel(time=tt).data
        fields_t = (
            field.sel(time=tt).values if "time" in field.coords else field.values
            for field in fields
        )
        features_t = features.loc[features.time == tt].copy()
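For comparison, a hedged sketch of the "frame" alternative discussed above (not what the PR implements), assuming the features dataframe carries tobac's usual integer `frame` column:

```python
import pandas as pd

# Toy dataframe with an integer "frame" index alongside "time"
features = pd.DataFrame(
    {
        "frame": [0, 0, 1],
        "time": pd.to_datetime(["2020-01-01 00:00"] * 2 + ["2020-01-01 00:05"]),
        "feature": [1, 1, 1],
    }
)
id_column = "feature"

# Grouping by the integer frame index sidesteps timestamp-precision
# mismatches, but assumes the "frame" column is present
uniques_by_frame = features.groupby("frame")[id_column].value_counts().values
print((uniques_by_frame > 1).any())  # True: label 1 is duplicated in frame 0
```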
But you have a good point. The latter could be done by "frame", too. So let's just briefly discuss this in the next dev meeting, maybe!
Personally I prefer using "time" here, but I can also see why "frame" would be appropriate (e.g. in edge cases like Sean mentioned, where dataframe times are saved to ms accuracy rather than ns, causing mismatches). Maybe we can leave it as is and revisit if it causes any issues in future.
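To illustrate the precision edge case mentioned here, a small self-contained sketch (the values are made up):

```python
import numpy as np

# The same instant stored at ms vs ns precision: once sub-millisecond
# information is truncated away, an exact equality comparison fails
t_ns = np.datetime64("2020-01-01T00:00:00.000000001", "ns")
t_ms = t_ns.astype("datetime64[ms]")  # drops the trailing 1 ns
print(t_ns == t_ms)  # False, so a selection like features.time == tt misses rows
```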
Thanks for your quick review @freemansw1. And totally agree: we should add more tests for this in …
Great, thanks for catching this! I am happy for this to be merged as is, with the possibility of revisiting the "time" vs "frame" issue in future if need be.
"Feature labels are not unique which may cause unexpected results for the computation of bulk statistics." | ||
) | ||
# extra warning when feature labels are not unique in timestep | ||
uniques = features.groupby("time")[id_column].value_counts().values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I prefer using "time"
here, but can see also why "frame"
would also be appropriate (e.g. in edge cases like Sean mentioned where dataframe times are saved to ms accuracy rather than ns causing mismatches. Maybe we can leave it as is and revisit if it causes any issues in future
@JuliaKukulies are you happy to merge?
A few small fixes for the bulk statistics method `get_statistics_from_mask`:

- At some locations, the hardcoded column name "feature" was not updated to the string variable `id_column`, which users can set to compute the bulk statistics with an ID stored under another column name. This caused the bulk statistics to fail when the input dataframe used a different column name for the feature ID.
- Added some test parameters in `test_utils_bulk_statistics` to account for the above case. Also added `index` here, since the tests had not previously covered this parameter, which allows users to compute bulk statistics for specified feature regions only.
- Corrected the warning message that accidentally included a `raise` statement, which turned the warning into an error (`TypeError: exceptions must derive from BaseException`).
- Added another warning to make users aware when feature labels are non-unique even within the same timestep. There is a common use case of working with storm IDs that stay the same over multiple timesteps (for one track); for this case, the code still works correctly because the computed statistics are assigned for each timestep independently. However, no feature ID should occur multiple times within the same timestep, as that would lead to unexpected results when the bulk statistics are added to the output dataframe.
- Finally, added a line to make sure that feature IDs are integers. This is already controlled by how we output the feature dataframe, but it does not hurt to double-check here, since users might modify the feature dataframe or use the output from a different tracking algorithm/dataset. (A sketch of these checks follows below.)
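A hedged sketch of the defensive checks described above, gathered into one hypothetical helper (the actual PR performs these steps inline in `get_statistics_from_mask`; the helper name and the second warning's wording are illustrative):

```python
import warnings

import pandas as pd


def check_feature_ids(features: pd.DataFrame, id_column: str = "feature") -> pd.DataFrame:
    """Hypothetical helper mirroring the checks described in this PR.

    Assumes the features dataframe has a "time" column and a feature-ID
    column named by ``id_column``.
    """
    # ensure feature IDs are integers, in case the dataframe was modified
    # or produced by a different tracking algorithm/dataset
    features[id_column] = features[id_column].astype(int)

    if features[id_column].duplicated().any():
        # IDs repeated across timesteps are a supported use case
        # (e.g. persistent storm IDs along a track)
        warnings.warn(
            "Feature labels are not unique which may cause unexpected "
            "results for the computation of bulk statistics."
        )
        # but repeats within a single timestep give unexpected results
        uniques = features.groupby("time")[id_column].value_counts().values
        if (uniques > 1).any():
            # illustrative wording, not the exact message in the PR
            warnings.warn(
                "Feature labels are not unique within the same timestep."
            )
    return features
```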