Issue 229 new ecnt #231

Merged
merged 7 commits into from
Nov 9, 2022

Conversation

TatianaBurek
Collaborator

Pull Request Testing

  • Describe testing already performed for these changes:

  • compared results on the added methods with the statistics calculated by MET

  • Recommend testing for the reviewer(s) to perform, including the location of input datasets, and any additional instructions:

  • Do these changes include sufficient documentation updates, ensuring that no errors or warnings exist in the build of the documentation? [Yes or No]

  • Do these changes include sufficient testing updates? [Yes or No]

  • Will this PR result in changes to the test suite? [Yes or No]

    If yes, describe the new output and/or changes to the existing output:

  • Please complete this pull request review by [Fill in date].

Pull Request Checklist

See the METplus Workflow for details.

  • Review the source issue metadata (required labels, projects, and milestone).
  • Complete the PR definition above.
  • Ensure the PR title matches the feature or bugfix branch name.
  • Define the PR metadata, as permissions allow.
    Select: Reviewer(s)
    Select: Organization level software support Project or Repository level development cycle Project
    Select: Milestone as the version that will include these changes
  • After submitting the PR, select Development with the original issue number.
  • After the PR is approved, merge your changes. If permissions do not allow this, request that the reviewer do the merge.
  • Close the linked issue and delete your feature or bugfix branch from GitHub.

TatianaBurek added this to the METcalcpy-2.0 milestone on Nov 7, 2022
warnings.filterwarnings('error')
try:
    n_ge_obs = sum_column_data_by_name(input_data, columns_names, 'n_ge_obs')
    me_ge_obs = sum_column_data_by_name(input_data, columns_names, 'me_ge_obs') / n_ge_obs
Contributor

I don't think this logic is correct. We need to compute an aggregated ME_GE_OBS as a weighted average where N_GE_OBS defines the weight. That's different from what's being computed here.

Collaborator Author

In the common case, when total is used as the weight, we take the sum of the total values and the sum of the column values, then divide the column sum by the total sum (the weighted_average method).
Here I did the same but replaced total with n_ge_obs. Why is this incorrect? How should it be computed?

warnings.filterwarnings('error')
try:
    n_lt_obs = sum_column_data_by_name(input_data, columns_names, 'n_lt_obs')
    me_lt_obs = sum_column_data_by_name(input_data, columns_names, 'me_lt_obs') / n_lt_obs
Contributor

I don't think this logic is correct. We need to compute an aggregated ME_LT_OBS as a weighted average where N_LT_OBS defines the weight. That's different from what's being computed here.

Contributor

JohnHalleyGotway left a comment

@TatianaBurek thanks for working on these updates. I made a handful of comments that require attention. I'll mark this as "Request Changes". Please just re-request my review once you're finished with the next round of updates.

"""
warnings.filterwarnings('error')
try:
    total = get_total_values(input_data, columns_names, aggregation)
    crps_emp_fair = sum_column_data_by_name(input_data, columns_names, 'crps_emp_fair') / total
    result = round_half_up(crps_emp_fair, PRECISION)
    statistic = sum_column_data_by_name(input_data, columns_names, column_name) / total
Contributor

@TatianaBurek, yes, good point. I suspect there's an issue here as well, though it's possible I just don't understand how this code is working. It looks to me like you're summing up the STATISTIC values and the TOTAL counts and dividing the first sum by the second.

Using R to provide an example of aggregating MAE values, where the weight is defined by the total column:

R
MAE = c(5, 10, 8)
TOTAL = c(1000, 1500, 1250)
c("Bad aggregated value using this logic = ", sum(MAE) / sum(TOTAL))
# Prints incorrect value of 0.00613
c("Correct weighted aggregation = ", sum(MAE*TOTAL)/sum(TOTAL))
# Prints correct value of 8

Those values are so different that I'm assuming this IS working, but I'm just not grasping how.

Collaborator Author

The preprocessing of the data (calculating the additional statistics, multiplying by the weight, renaming columns) happens in the agg_stat.py script. For example, this is where I prepare the ECNT data:

def _prepare_ecnt_data(self, data_for_prepare):

The calculate_ methods use data that has already been multiplied by the weight.
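
To make the two-step flow concrete, here is a minimal sketch using the MAE and TOTAL values from the R example above. The helper names `prepare_rows` and `calculate_weighted` are hypothetical stand-ins for illustration only; they are not the actual functions in agg_stat.py or METcalcpy.

```python
def prepare_rows(stats, weights):
    """Preparation step (assumed): multiply each statistic by its weight once."""
    return [s * w for s, w in zip(stats, weights)]


def calculate_weighted(prepared, weights):
    """Calculation step (assumed): sum of pre-weighted values over sum of weights."""
    return sum(prepared) / sum(weights)


mae = [5, 10, 8]
total = [1000, 1500, 1250]

# Pre-weighted column, as it would look after preparation.
prepared = prepare_rows(mae, total)  # [5000, 15000, 10000]

# Summing the prepared column and dividing by the summed weight
# yields the weighted average.
aggregated = calculate_weighted(prepared, total)

# Applying the same sum/sum division to the raw, unweighted column
# gives the misleading value from the R example.
naive = sum(mae) / sum(total)

print(aggregated)        # 8.0
print(round(naive, 5))   # 0.00613
```

Seen in isolation, the division in a calculate_ method looks like the naive form, which is why the weighting is easy to miss during review.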

Collaborator Author

The decision to separate preparing the data from calculating the statistics was made to speed up the bootstrapping process. It makes more sense to precalculate the base values once and reuse them for each of the n replications during bootstrapping.
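
A rough sketch of that idea, assuming a simple resample-with-replacement bootstrap: the multiplication by the weight happens once, and each replication only sums and divides. The function name `bootstrap_weighted_mean`, the seed, and the replication count are illustrative assumptions, not the METcalcpy implementation.

```python
import random


def bootstrap_weighted_mean(prepared, weights, n_replications, seed=0):
    """Resample rows with replacement and aggregate each replicate.

    `prepared` holds statistic * weight, computed once up front, so the
    loop below never repeats the per-row multiplication.
    """
    rng = random.Random(seed)
    n = len(prepared)
    results = []
    for _ in range(n_replications):
        idx = [rng.randrange(n) for _ in range(n)]
        numerator = sum(prepared[i] for i in idx)
        denominator = sum(weights[i] for i in idx)
        results.append(numerator / denominator)
    return results


mae = [5, 10, 8]
total = [1000, 1500, 1250]

# One-time preparation step.
prepared = [s * w for s, w in zip(mae, total)]

replicates = bootstrap_weighted_mean(prepared, total, n_replications=1000)
print(len(replicates), min(replicates), max(replicates))
```

Each replicate is itself a weighted average of the resampled rows, so every value stays between the smallest and largest input statistic.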

Contributor

OK, as long as you're confident that the weighted averages are being computed properly, I'll go ahead and approve. Those details just aren't immediately obvious when reviewing the code.

Contributor

JohnHalleyGotway left a comment

I approve of these changes after Tatiana double-checked and confirmed that the weighted averages are being computed properly.

@TatianaBurek
Collaborator Author

I checked that me_ge_obs and me_lt_obs are calculated the same way as the other weighted average stats, but instead of total as the weight they use n_ge_obs and n_lt_obs.
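
As an illustration of why the choice of weight matters, the sketch below aggregates a made-up me_ge_obs column once with n_ge_obs as the weight and once with total. All numbers and the `aggregate` helper are hypothetical; only the column names come from this PR.

```python
def aggregate(stat_values, weight_values):
    """Weighted average: sum(stat * weight) / sum(weight).

    Returns None when the weights sum to zero, since the average is
    undefined in that case (an assumption made for this sketch).
    """
    weight_sum = sum(weight_values)
    if weight_sum == 0:
        return None
    prepared = [s * w for s, w in zip(stat_values, weight_values)]
    return sum(prepared) / weight_sum


# Hypothetical rows: two cases with equal totals but different counts
# of observations at or above the threshold.
me_ge_obs = [2.0, 4.0]
n_ge_obs = [10, 30]
total = [100, 100]

print(aggregate(me_ge_obs, n_ge_obs))  # 3.5
print(aggregate(me_ge_obs, total))     # 3.0
print(aggregate(me_ge_obs, [0, 0]))    # None
```

With n_ge_obs as the weight, the second row counts three times as much as the first, which total alone would not capture.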
