
_assimilate_histogram and _regenerate_histogram refactor into standalone functions #820

Open
ksneab7 opened this issue May 15, 2023 · 11 comments
Labels
New Feature A feature addition not currently in the library

Comments

@ksneab7
Contributor

ksneab7 commented May 15, 2023

Is your feature request related to a problem? Please describe.
The _assimilate_histogram and _regenerate_histogram functions do not use self for anything of substance, so they can be extracted into standalone functions.

Describe the outcome you'd like:
Move these two functions out of the class and into histogram_utils.py.

Additional context:
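For illustration, the refactor pattern being proposed can be sketched as follows. The class and function names below are simplified placeholders, not dataprofiler's real signatures; the point is only that a method which never reads `self` can move to module scope unchanged:

```python
# Sketch of the proposed refactor: a method that never reads `self`
# can become a module-level function. Names here are hypothetical
# stand-ins, not the library's actual API.

# Before: method defined on a profiler class, `self` unused.
class NumericProfiler:
    def _double_counts(self, bin_counts):
        # Only operates on its arguments, never on instance state.
        return [2 * c for c in bin_counts]

# After: standalone function (e.g. in histogram_utils.py), callable
# without constructing a profiler instance.
def double_counts(bin_counts):
    return [2 * c for c in bin_counts]
```

The method and the free function behave identically, which is what makes the move safe; the class can then delegate to the standalone function if backward compatibility is needed.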

@ksneab7 ksneab7 added the New Feature A feature addition not currently in the library label May 15, 2023
@junholee6a
Contributor

I presume this would also involve writing unit tests for _assimilate_histogram and _regenerate_histogram in test_histogram_utils.py? Neither has any direct unit tests at the moment, though they may be exercised indirectly by other tests.

@taylorfturner
Contributor

Yes, we would definitely want to ensure test coverage, @junholee6a

@junholee6a
Contributor

Working on this. It'll take a bit to understand the functions and write test cases

@junholee6a
Contributor

junholee6a commented Aug 9, 2023

In _assimilate_histogram(), it seems like each bin in from_hist maps to at most two bins in dest_hist. This means that when from_hist has much larger bins than dest_hist, its values aren't distributed evenly:

Test code:

```python
def test_assimilate_histogram(self):
    ret_value = histogram_utils._assimilate_histogram(
        from_hist_entity_count_per_bin=np.array([50, 0]),
        from_hist_bin_edges=np.array([0, 5, 10]),
        dest_hist_entity_count_per_bin=np.zeros(10),
        dest_hist_bin_edges=np.array((0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)),
        dest_hist_num_bin=10,
    )
    print('=============== _assimilate_histogram result ===============')
    print(ret_value)
```

Output (reformatted by hand):

```
=============== _assimilate_histogram result ===============
(
    {
        'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
        'bin_counts': array([10., 40.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
    },
    80.0
)
```

Expected output: bin_counts should be distributed more evenly?

```
=============== _assimilate_histogram result ===============
(
    {
        'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
        'bin_counts': array([10., 10., 10., 10., 10.,  0.,  0.,  0.,  0.,  0.])
    },
    some_loss_value
)
```

Is this a bug?
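For what it's worth, the even redistribution expected above could be sketched with a proportional-overlap rule. This is a hypothetical helper, not the library's implementation, and it assumes values are uniformly distributed within each source bin:

```python
import numpy as np

def redistribute_by_overlap(from_counts, from_edges, dest_edges):
    """Hypothetical sketch: spread each source bin's count across
    destination bins in proportion to interval overlap, assuming
    values are uniform within each source bin."""
    dest_counts = np.zeros(len(dest_edges) - 1)
    for i, count in enumerate(from_counts):
        lo, hi = from_edges[i], from_edges[i + 1]
        width = hi - lo
        if count == 0 or width <= 0:
            continue
        for j in range(len(dest_counts)):
            # Length of the intersection of [lo, hi) with dest bin j.
            overlap = min(hi, dest_edges[j + 1]) - max(lo, dest_edges[j])
            if overlap > 0:
                dest_counts[j] += count * overlap / width
    return dest_counts
```

Applied to the example above (counts [50, 0] over edges [0, 5, 10], destination edges 0 through 10), this yields [10, 10, 10, 10, 10, 0, 0, 0, 0, 0] while preserving the total count of 50.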

@junholee6a
Contributor

Potentially related issue: #838

@ksneab7
Contributor Author

ksneab7 commented Aug 9, 2023

#838 is related. If we merged the capabilities and then made the merged function standalone, that would essentially solve both issues, but we separated these out because they accomplish different goals and are hefty tasks on their own for one developer.

@ksneab7
Contributor Author

ksneab7 commented Aug 9, 2023

> In _assimilate_histogram(), it seems like each bin in the from_hist maps to at most two bins in the dest_bin. [...] Is this a bug?

Looking into this now

@ksneab7
Contributor Author

ksneab7 commented Aug 9, 2023

There are a couple of things in this scenario that we can't necessarily make assumptions about. While I agree that the actual result:

```
{
    'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
    'bin_counts': array([10., 40.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
}
```

is not super intuitive, I also don't like the evenly distributed version:

```
{
    'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
    'bin_counts': array([10., 10., 10., 10., 10.,  0.,  0.,  0.,  0.,  0.])
}
```

because there is a pretty big assumption there as well. In both scenarios we are technically misrepresenting the data: making the bins smaller forces us to make those assumptions. This is an open question within the team, and we are definitely open to suggestions. My personal thought is that whatever we choose, there needs to be a warning about losing accuracy when the bin sizes go from larger to smaller (fewer bins -> more bins).

@taylorfturner @micdavis @JGSweets @junholee6a thoughts?
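The warning suggested above could look something like the following sketch. The function name and the width-based check are assumptions for illustration, not an existing part of dataprofiler:

```python
import warnings
import numpy as np

def warn_if_bins_narrow(from_bin_edges, dest_bin_edges):
    """Hypothetical guard: warn when the destination histogram has
    narrower bins than the source, since redistributing counts into
    finer bins requires assuming a distribution within each source
    bin and therefore loses accuracy."""
    from_width = np.min(np.diff(from_bin_edges))
    dest_width = np.min(np.diff(dest_bin_edges))
    if dest_width < from_width:
        warnings.warn(
            "Destination bins are narrower than source bins; "
            "assimilated counts are approximate.",
            RuntimeWarning,
        )
```

A call such as `warn_if_bins_narrow(np.array([0, 5, 10]), np.arange(11))` would trigger the warning, while going from fine bins to coarse bins would not.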

@junholee6a
Contributor

> because of the fact that there is a pretty big assumption there as well.

Definitely, this assumes that the values in from_hist are evenly distributed within each bin, but I don't think there's a more intuitive alternative without more information about the data represented by the histograms.

> My personal thought is whatever we choose there needs to be a warning about losing accuracy if the bin sizes go from larger to smaller (less bins -> more bins).

Maybe this is covered by returning a higher hist_loss value? Accuracy can be lost when using _assimilate_histogram even if the bin sizes don't change, e.g. when the bins are only shifted sideways from from_hist to dest_hist.
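To make the "loss even without resizing" point concrete, here is an illustrative way to quantify the discrepancy between two binnings of the same data. This is not DataProfiler's actual hist_loss formula, just a sketch that compares the two histograms as piecewise-uniform densities on a fine common grid:

```python
import numpy as np

def histogram_shift_loss(orig_counts, orig_edges, new_counts, new_edges):
    """Illustrative (not DataProfiler's) loss: integrate the absolute
    difference between the two histograms' piecewise-uniform densities.
    Identical histograms give 0; shifted bins give a positive loss even
    when bin widths are unchanged."""
    grid = np.linspace(min(orig_edges[0], new_edges[0]),
                       max(orig_edges[-1], new_edges[-1]), 1001)
    mid = (grid[:-1] + grid[1:]) / 2
    step = grid[1] - grid[0]

    def density(counts, edges, x):
        # Density of the bin containing each x; 0 outside the histogram.
        idx = np.clip(np.searchsorted(edges, x, side="right") - 1,
                      0, len(counts) - 1)
        d = counts[idx] / np.diff(edges)[idx]
        d[(x < edges[0]) | (x >= edges[-1])] = 0.0
        return d

    diff = density(orig_counts, orig_edges, mid) - density(new_counts, new_edges, mid)
    return np.sum(np.abs(diff)) * step
```

With this metric, a histogram compared against itself scores 0, while the same counts shifted one unit sideways score a clearly positive loss, matching the observation that shifting bins alone already discards information.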

@ksneab7
Contributor Author

ksneab7 commented Aug 9, 2023

After thinking about it, I may be leaning more towards your suggested expected outcome. I think you're right that there is no more intuitive approach without some other significant details being shared.
If @taylorfturner and @micdavis are in agreement, I can throw a bug report up to implement that fix instead of the current behavior, which IMO is worse than assuming the even distribution.

@ksneab7
Contributor Author

ksneab7 commented Aug 10, 2023

Thank you @junholee6a for your input. This callout is VERY much appreciated, and I am hopeful this will be looked into as time allows!
