
_assimilate_histogram and _regenerate_histogram refactor into standalone functions #820

Open
ksneab7 opened this issue May 15, 2023 · 11 comments
Labels
New Feature A feature addition not currently in the library

Comments

@ksneab7
Contributor

ksneab7 commented May 15, 2023

Is your feature request related to a problem? Please describe.
The _assimilate_histogram and _regenerate_histogram functions do not use self for anything of substance, so they can be extracted into standalone functions.

Describe the outcome you'd like:
Move these two functions out of the class and into histogram_utils.py.

Additional context:
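For illustration, the refactor pattern being proposed can be sketched as follows. The class and function names below are simplified placeholders, not dataprofiler's real signatures; the point is only that a method which never reads `self` can move to module scope unchanged:

```python
# Sketch of the proposed refactor: a method that never reads `self`
# can become a module-level function. Names here are hypothetical
# stand-ins, not the library's actual API.

# Before: method defined on a profiler class, `self` unused.
class NumericProfiler:
    def _double_counts(self, bin_counts):
        # Only operates on its arguments, never on instance state.
        return [2 * c for c in bin_counts]

# After: standalone function (e.g. in histogram_utils.py), callable
# without constructing a profiler instance.
def double_counts(bin_counts):
    return [2 * c for c in bin_counts]
```

The method and the free function behave identically, which is what makes the move safe; the class can then delegate to the standalone function if backward compatibility is needed.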

@ksneab7 ksneab7 added the New Feature A feature addition not currently in the library label May 15, 2023
@junholee6a
Contributor

I presume this would also involve writing unit tests for _assimilate_histogram and _regenerate_histogram in test_histogram_utils.py? Neither has any direct unit tests at the moment, though they may be exercised indirectly by other tests.

@taylorfturner
Contributor

Yes, we would definitely want to ensure test coverage, @junholee6a

@junholee6a
Contributor

Working on this. It'll take a bit to understand the functions and write test cases

@junholee6a
Contributor

junholee6a commented Aug 9, 2023

In _assimilate_histogram(), it seems like each bin in from_hist maps to at most two bins in dest_hist. This means that when from_hist has much larger bins than dest_hist, its values aren't distributed evenly:

Test code:

```python
def test_assimilate_histogram(self):
    ret_value = histogram_utils._assimilate_histogram(
        from_hist_entity_count_per_bin=np.array([50, 0]),
        from_hist_bin_edges=np.array([0, 5, 10]),
        dest_hist_entity_count_per_bin=np.zeros(10),
        dest_hist_bin_edges=np.array((0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)),
        dest_hist_num_bin=10,
    )
    print('=============== _assimilate_histogram result ===============')
    print(ret_value)
```

Output (reformatted by hand):

```
=============== _assimilate_histogram result ===============
(
    {
        'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
        'bin_counts': array([10., 40.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
    },
    80.0
)
```

Expected output: bin_counts should be distributed more evenly?

```
=============== _assimilate_histogram result ===============
(
    {
        'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
        'bin_counts': array([10., 10., 10., 10., 10.,  0.,  0.,  0.,  0.,  0.])
    },
    some_loss_value
)
```

Is this a bug?
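For what it's worth, the even redistribution expected above could be sketched with a proportional-overlap rule. This is a hypothetical helper, not the library's implementation, and it assumes values are uniformly distributed within each source bin:

```python
import numpy as np

def redistribute_by_overlap(from_counts, from_edges, dest_edges):
    """Hypothetical sketch: spread each source bin's count across
    destination bins in proportion to interval overlap, assuming
    values are uniform within each source bin."""
    dest_counts = np.zeros(len(dest_edges) - 1)
    for i, count in enumerate(from_counts):
        lo, hi = from_edges[i], from_edges[i + 1]
        width = hi - lo
        if count == 0 or width <= 0:
            continue
        for j in range(len(dest_counts)):
            # Length of the intersection of [lo, hi) with dest bin j.
            overlap = min(hi, dest_edges[j + 1]) - max(lo, dest_edges[j])
            if overlap > 0:
                dest_counts[j] += count * overlap / width
    return dest_counts
```

Applied to the example above (counts [50, 0] over edges [0, 5, 10], destination edges 0 through 10), this yields [10, 10, 10, 10, 10, 0, 0, 0, 0, 0] while preserving the total count of 50.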

@junholee6a
Contributor

Potentially related issue: #838

@ksneab7
Contributor Author

ksneab7 commented Aug 9, 2023

#838 is related. If we merged the capabilities and then made the merged function standalone, that would essentially solve both issues, but we separated these out because they accomplish different goals and are hefty tasks on their own for one developer.

@ksneab7
Contributor Author

ksneab7 commented Aug 9, 2023

> In _assimilate_histogram(), it seems like each bin in the from_hist maps to at most two bins in the dest_bin. [...] Is this a bug?

Looking into this now

@ksneab7
Contributor Author

ksneab7 commented Aug 9, 2023

There are a couple of things in this scenario that we can't necessarily make assumptions about. While I agree that the actual result:

```
{
    'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
    'bin_counts': array([10., 40.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
}
```

is not super intuitive, I also don't like the evenly distributed version:

```
{
    'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
    'bin_counts': array([10., 10., 10., 10., 10.,  0.,  0.,  0.,  0.,  0.])
}
```

because there is a pretty big assumption there as well. In both scenarios we are technically misrepresenting the data: making the bins smaller forces us to make those assumptions. This is an open question within the team, and we are definitely open to suggestions. My personal thought is that whatever we choose, there needs to be a warning about losing accuracy when the bin sizes go from larger to smaller (fewer bins -> more bins).

@taylorfturner @micdavis @JGSweets @junholee6a thoughts?
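The warning suggested above could look something like the following sketch. The function name and the width-based check are assumptions for illustration, not an existing part of dataprofiler:

```python
import warnings
import numpy as np

def warn_if_bins_narrow(from_bin_edges, dest_bin_edges):
    """Hypothetical guard: warn when the destination histogram has
    narrower bins than the source, since redistributing counts into
    finer bins requires assuming a distribution within each source
    bin and therefore loses accuracy."""
    from_width = np.min(np.diff(from_bin_edges))
    dest_width = np.min(np.diff(dest_bin_edges))
    if dest_width < from_width:
        warnings.warn(
            "Destination bins are narrower than source bins; "
            "assimilated counts are approximate.",
            RuntimeWarning,
        )
```

A call such as `warn_if_bins_narrow(np.array([0, 5, 10]), np.arange(11))` would trigger the warning, while going from fine bins to coarse bins would not.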

@junholee6a
Contributor

> because of the fact that there is a pretty big assumption there as well.

Definitely, this assumes that the values in from_hist are evenly distributed within each bin, but I don't think there's a more intuitive alternative without more information about the data represented by the histograms.

> My personal thought is whatever we choose there needs to be a warning about losing accuracy if the bin sizes go from larger to smaller (less bins -> more bins).

Maybe this is covered by returning a higher hist_loss value? Accuracy can be lost when using _assimilate_histogram even if the bin sizes don't change, e.g. when the bins are only shifted sideways from from_hist to dest_hist.
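To make the "loss even without resizing" point concrete, here is an illustrative way to quantify the discrepancy between two binnings of the same data. This is not DataProfiler's actual hist_loss formula, just a sketch that compares the two histograms as piecewise-uniform densities on a fine common grid:

```python
import numpy as np

def histogram_shift_loss(orig_counts, orig_edges, new_counts, new_edges):
    """Illustrative (not DataProfiler's) loss: integrate the absolute
    difference between the two histograms' piecewise-uniform densities.
    Identical histograms give 0; shifted bins give a positive loss even
    when bin widths are unchanged."""
    grid = np.linspace(min(orig_edges[0], new_edges[0]),
                       max(orig_edges[-1], new_edges[-1]), 1001)
    mid = (grid[:-1] + grid[1:]) / 2
    step = grid[1] - grid[0]

    def density(counts, edges, x):
        # Density of the bin containing each x; 0 outside the histogram.
        idx = np.clip(np.searchsorted(edges, x, side="right") - 1,
                      0, len(counts) - 1)
        d = counts[idx] / np.diff(edges)[idx]
        d[(x < edges[0]) | (x >= edges[-1])] = 0.0
        return d

    diff = density(orig_counts, orig_edges, mid) - density(new_counts, new_edges, mid)
    return np.sum(np.abs(diff)) * step
```

With this metric, a histogram compared against itself scores 0, while the same counts shifted one unit sideways score a clearly positive loss, matching the observation that shifting bins alone already discards information.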

@ksneab7
Contributor Author

ksneab7 commented Aug 9, 2023

After thinking about it, I may be leaning more towards your suggested expected outcome. I think you're right that there is no more intuitive approach without some other significant details being shared.
If @taylorfturner and @micdavis are in agreement, I can throw a bug report up to implement that fix instead of the current behavior, which IMO is worse than assuming the even distribution.

@ksneab7
Contributor Author

ksneab7 commented Aug 10, 2023

Thank you @junholee6a for your input. This callout is VERY much appreciated, and I am hopeful this will be looked into as time allows!
