Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding New Rootogram Plot #81

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

Conversation

imperorrp
Copy link
Collaborator

@imperorrp imperorrp commented Aug 12, 2024

Pushed draft of the new Rootogram implementation code (#52 ) .

  • The existing hist functions from the currently still-open plot_dist hist addition PR [WIP] Histogram support addition to distplot.py #47 were used here as well as some code from the plot_ppc PR [WIP] Adding PPC plot to Arviz-Plots  #55.

  • The hist visual element backend interface docstring was modified with 'y' now referring to the 'top' y-coordinate explicitly and not height of the bars (as suggested by @OriolAbril). The matplotlib implementation of the hist visual element (which uses bar behind the scenes) was also updated internally with y=y-bottom to reflect this. (This has to be updated in [WIP] Histogram support addition to distplot.py #47 now as well). The bars were not being plotted in the expected positions otherwise.

  • WIP work includes appropriate binning for each facetted subset of the data to be plotted.

  • The observed data is used for binning for each subset and the bin heights here are required for setting the 'top's of the predictive bars so this is computed always, regardless of whether observed is passed as True or False (The True/False condition only determines whether it is plotted or not)

Current plot output when the rugby default Arviz datatree is passed (should work with any datatree with posterior predictive and observed groups like plot_ppc though):

azp.plot_rootogram(data)
image

azp.plot_rootogram(data, plot_kwargs={"predictive": False})
image

azp.plot_rootogram(data, observed=False)
image

pc = azp.plot_rootogram(data, backend="bokeh")
pc.show()

image

azp.plot_rootogram(data, observed_rug=True)
image

^If we want a rugplot for rootograms as well like plot_ppc.


📚 Documentation preview 📚: https://arviz-plots--81.org.readthedocs.build/en/81/

# use get_bins func from arviz-stats on observed data and then use those bins for
# computing histograms for predictive data as well
# WIP: currently only the bins for one variable (without any facetting) is retrieved and used
bins = array_stats.get_bins(obs_distribution["home_points"].values)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder for later: Use xarray aware get_bins function once available from Arviz-Stats

@imperorrp
Copy link
Collaborator Author

Todo: Shift rug lower, below the bottom of the bars

@imperorrp
Copy link
Collaborator Author

Rugplots now render below the bottom of the rootogram bars.

To allow for some tolerance gap for visibility, I've currently set this algorithm:

min_histogram_bottom = min(histogram_bottom)
min_bottom[var_name] = min_histogram_bottom - (0.2 * (0 - min_histogram_bottom))

(It may be that picking a specific number to subtract from min_histogram is better as that'd keep this gap a constant number of units, but then the chart size may vary and so would the perceivable gap)

Output looks like this now when the rugplot is plotted:

image

@imperorrp
Copy link
Collaborator Author

Binning works as expected now using the 'get_bins' arviz-stats branch:

image

src/arviz_plots/backend/__init__.py Outdated Show resolved Hide resolved
Comment on lines 191 to 192
pc_kwargs["aes"] = pc_kwargs.get("aes", {}).copy()
pc_kwargs["aes"].setdefault("overlay", sample_dims) # setting overlay dim
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed, we only want facetting in rootgram

Comment on lines 22 to 23
observed=None,
observed_rug=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove these two. Not including the observed data doesn't really make sense for the rootgram. We could skip the y shift in such cases but then the resulting plot would be a histogram directly, if users want a "regular" histogram they should use plot_dist.

Moreover, as you might have noticed, the rug plot doesn't really give much information because the data is discrete, so there are many lines overlapping at 1, many lines overlapping at 2... so the rugplot doesn't really show how many observations do we have at 1 or at 2, it seems each has only a single observation. And playing with alpha can only help so much, at most you'd get a qualitative information about the number of variables at each value which is less info than the observed line/scatter provide. So I would remove the argument and the rugplot.

Comment on lines 146 to 148
for group_name in (predictive_data_group, "observed_data"):
if group_name not in dt.children:
raise TypeError(f'`data` argument must have the group "{group_name}" for ppcplot')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After removing observed just keep this outside the if.

Comment on lines 251 to 269
# new_obs_hist with histogram->y and left_edge/right_edge midpoint->x
new_obs_hist = xr.Dataset()

for var_name in list(obs_hist.keys()):
left_edges = obs_hist[var_name].sel(plot_axis="left_edges").values
right_edges = obs_hist[var_name].sel(plot_axis="right_edges").values

left_edges = np.array(left_edges)
right_edges = np.array(right_edges)

x = (left_edges + right_edges) / 2
y = obs_hist[var_name].sel(plot_axis="histogram").values

stacked_data = np.stack((x, y), axis=-1)
new_var = xr.DataArray(
stacked_data, dims=["hist_dim", "plot_axis"], coords={"plot_axis": ["x", "y"]}
)

new_obs_hist[var_name] = new_var
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# new_obs_hist with histogram->y and left_edge/right_edge midpoint->x
new_obs_hist = xr.Dataset()
for var_name in list(obs_hist.keys()):
left_edges = obs_hist[var_name].sel(plot_axis="left_edges").values
right_edges = obs_hist[var_name].sel(plot_axis="right_edges").values
left_edges = np.array(left_edges)
right_edges = np.array(right_edges)
x = (left_edges + right_edges) / 2
y = obs_hist[var_name].sel(plot_axis="histogram").values
stacked_data = np.stack((x, y), axis=-1)
new_var = xr.DataArray(
stacked_data, dims=["hist_dim", "plot_axis"], coords={"plot_axis": ["x", "y"]}
)
new_obs_hist[var_name] = new_var
new_obs_hist = xr.concat(
(ds.sel(plot_axis=["right_edges", "left_edges"]).sum("plot_axis") / 2, obs_hist.sel(plot_axis="histogram", drop=True)),
dim="plot_axis",
).assign_coords(plot_axis=["x", "y"])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that the code in the PR would only have worked when all variables get the same number of bins (which doesn't always happen) because the dimension name is hardcoded to hist_dim in all variables of the dataset. As I mentioned in slack, the behaviour has changed in the get_bins PR to support this effect. Now each variable gets independent dimensions hist_dim_mu, hist_dim_tau...

# new_pp_hist dataset
new_pp_hist = xr.Dataset()

for var_name in list(pp_hist.keys()):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also update and get rid of the loop

…ooping logic and removed 'none' backend hist element duplication
@imperorrp
Copy link
Collaborator Author

Tests are failing due to plot_dist modification requirement since plot_rootogram currently depends on the unmerged arviz-stats 'get_bins' branch. This modification will be required globally once this Arviz-Stats branch is merged too

@imperorrp
Copy link
Collaborator Author

Added tests for plot_rootogram

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 93.60000% with 8 lines in your changes missing coverage. Please review.

Project coverage is 85.39%. Comparing base (f4a39af) to head (a67bb2d).

Files with missing lines Patch % Lines
src/arviz_plots/plots/rootogramplot.py 92.72% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #81      +/-   ##
==========================================
+ Coverage   84.80%   85.39%   +0.59%     
==========================================
  Files          21       22       +1     
  Lines        2336     2451     +115     
==========================================
+ Hits         1981     2093     +112     
- Misses        355      358       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@aloctavodia
Copy link
Contributor

aloctavodia commented Sep 13, 2024

An example of a "hanging" rootogram and a "suspended" rootogram, including the uncertainty for the predictions
arviz-devs/Exploratory-Analysis-of-Bayesian-Models#60. We do not necessarily need to follow those examples, they are just suggestions. Actually, once we have a working rootograms that example will use whatever arviz implements.

@imperorrp
Copy link
Collaborator Author

imperorrp commented Sep 16, 2024

An example of a "hanging" rootogram and a "suspended" rootogram, including the uncertainty for the predictions arviz-devs/Exploratory-Analysis-of-Bayesian-Models#60. We do not necessarily need to follow those examples, they are just suggestions. Actually, once we have a working rootograms that example will use whatever arviz implements.

If we follow those examples, the line_y visual element function from #85 can be used for the thin lines to represent the hanging and suspended lines. For the hanging case, we can use the existing logic and for suspended we could modify the logic in this portion of the code:

histogram_bottom = new_obs_hist.sel(plot_axis="y") - pp_hist.sel(plot_axis="histogram")
histogram_bottom = histogram_bottom.expand_dims(plot_axis=["histogram_bottom"])
# print(f" diff = {a}\n")

new_pp_hist = xr.concat(
    (
        new_obs_hist.sel(plot_axis="y"),  # getting tops of histogram (observed values)
        pp_hist.sel(plot_axis="left_edges"),
        pp_hist.sel(plot_axis="right_edges"),
        histogram_bottom,
    ),
    dim="plot_axis",
).assign_coords(plot_axis=["histogram", "left_edges", "right_edges", "histogram_bottom"])

And instead pass a histogram_top along the histogram dimension (root of predictive count minus observed count) and 0 for the histogram_bottom dimension:

histogram_top = pp_hist.sel(plot_axis="histogram") - new_obs_hist.sel(plot_axis="y")
histogram_top = histogram_top.expand_dims(plot_axis=["histogram"])

For the uncertainty representation of 94% HDI, what would the process be?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants