-
Notifications
You must be signed in to change notification settings - Fork 248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[query] add less biased unsmoothed pdf to ggplot #13608
[query] add less biased unsmoothed pdf to ggplot #13608
Conversation
… and make the default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making such an effort to make this more readable. I have a couple of small suggestions then we can merge this.
p = update_grid_size(p) | ||
return p | ||
|
||
def compute_single_error(s, failure_prob=failure_prob): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you intend to pull these inner functions out or can we just get failure_prob
from the enclosing scope (ie remove the parameter)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parameter is there because it gets called with different values. This function computes the error bound on an estimated rank for a single value, with a given probability of exceeding the bound. We sometimes want an error bound that applies to all possible values simultaneously, with a given probability of any rank estimate exceeding the bound. Computing that involves computing an error bound for a single value with an appropriately smaller failure probability.
hail/python/hail/ggplot/geoms.py
Outdated
xi, yi = point_on_bound(i, upper) | ||
return (yi - fy) / (xi - fx) | ||
|
||
def update_min_max_slopes(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be clearer to return a tuple and unpack that at the call site rather than use side effects:
min_slope, max_slope = min_max_slopes()
max_slope = slope_from_fixed(ui, upper=True) | ||
|
||
def fix_point_on_result(i, upper): | ||
nonlocal fx, fy, new_y, keep |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here. I guess I'm not a fan of this pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally I'm not either, but here it felt like the best way to abstract out repeated steps of the algorithm. Maybe it would be clearer if this were a class, and the mutable state fields on the class?
I found this more readable, chunking up common updates to the state of the algorithm rather than repeating more low level changes, but readability is subjective and I'm happy to inline these if you think that's clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, there's not need to change this one. You raised a good point that if this were a class and these variables were attributes of the class, then perhaps I wouldn't object - which is certainly true. I don't think it's worth re-writing this as a class. I think this is fine as you're indexing into and mutating state variables, just please define the variables before you use them in this function so it's clear what you're referencing.
… and make the default
Examples
Histogram without setting min/max. Requires two passes over data, has low resolution over interesting part of the distribution:
Histogram with manual min/max. Most accurate, but different choices of number of bins cause different artifacts. Requires knowledge of distribution beforehand.
Smoothed
approx_cdf
based pdf (old default). Smoothing causes large distortion.New unsmoothed `approx_cdf_ based pdf. Single pass, works well for any distribution, needs no tuning or foreknowledge of distribution.