plot_cap default args. not working for categorical regression #673

Closed


@GStechschulte (Collaborator) commented May 20, 2023

This draft PR addresses issue #669 by adding logic to _plot_cap_numeric() (used by plot_cap) to index the N-dimensional y_hat_bounds array so that ax.fill_between() works for $K$ outcome classes.

Before plotting begins in _plot_cap_numeric(), a new variable y_hat_bounds_dim = y_hat_bounds.ndim records the number of dimensions. At plotting time, an if / else statement checks whether the number of dimensions is greater than 2: if it is, we loop over the outcome classes and call ax.fill_between() for each one; otherwise no loop is needed.

The added logic lets ax.fill_between() scale to $K$ classes, but it requires copying the if / else statement for each color and panel combination inside _plot_cap_numeric, which in my opinion is ugly and doesn't adhere to DRY. But it works.
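For reference, here is a minimal standalone sketch of the indexing idea using synthetic data. The (classes, lower/upper, points) layout assumed for y_hat_bounds and the (points, classes) layout for y_hat_mean are illustrative assumptions, not the exact arrays built inside _plot_cap_numeric:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 50)
n_classes = 3
y_hat_mean = np.column_stack([x * (k + 1) for k in range(n_classes)])  # (points, classes)
y_hat_bounds = np.stack(
    [np.stack([y_hat_mean[:, k] - 0.1, y_hat_mean[:, k] + 0.1]) for k in range(n_classes)]
)  # (classes, 2, points)

fig, ax = plt.subplots()
if y_hat_bounds.ndim > 2:
    # categorical response: one line and credible band per outcome class
    for k in range(y_hat_bounds.shape[0]):
        ax.plot(x, y_hat_mean[:, k], color=f"C{k}")
        ax.fill_between(x, y_hat_bounds[k, 0], y_hat_bounds[k, 1], alpha=0.4, color=f"C{k}")
else:
    # univariate response: a single band, no loop needed
    ax.plot(x, y_hat_mean, color="C0")
    ax.fill_between(x, y_hat_bounds[0], y_hat_bounds[1], alpha=0.4, color="C0")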

Below are a few examples. I also noticed that the class names do not appear in the legend, because the legend currently only looks at the covariates for unique names.

To Do:

  • run tests
  • run black
  • fix legend to display outcome variable class names
import arviz as az
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import bambi as bmb
from bambi.plots import plot_cap

length = [
    1.3, 1.32, 1.32, 1.4, 1.42, 1.42, 1.47, 1.47, 1.5, 1.52, 1.63, 1.65, 1.65, 1.65, 1.65,
    1.68, 1.7, 1.73, 1.78, 1.78, 1.8, 1.85, 1.93, 1.93, 1.98, 2.03, 2.03, 2.31, 2.36, 2.46,
    3.25, 3.28, 3.33, 3.56, 3.58, 3.66, 3.68, 3.71, 3.89, 1.24, 1.3, 1.45, 1.45, 1.55, 1.6,
    1.6, 1.65, 1.78, 1.78, 1.8, 1.88, 2.16, 2.26, 2.31, 2.36, 2.39, 2.41, 2.44, 2.56, 2.67,
    2.72, 2.79, 2.84
]
choice = [
    "I", "F", "F", "F", "I", "F", "I", "F", "I", "I", "I", "O", "O", "I", "F", "F",
    "I", "O", "F", "O", "F", "F", "I", "F", "I", "F", "F", "F", "F", "F", "O", "O",
    "F", "F", "F", "F", "O", "F", "F", "I", "I", "I", "O", "I", "I", "I", "F", "I",
    "O", "I", "I", "F", "F", "F", "F", "F", "F", "F", "O", "F", "I", "F", "F"
]

sex = ["Male"] * 32 + ["Female"] * 31
data = pd.DataFrame({"choice": choice, "length": length, "sex": sex})
data["choice"]  = pd.Categorical(
    data["choice"].map({"I": "Invertebrates", "F": "Fish", "O": "Other"}),
    ["Other", "Invertebrates", "Fish"],
    ordered=True
)

# Assumed model specification so the example runs end to end; the original
# comment did not include the model-fitting code (the model comes from issue #669).
model = bmb.Model("choice ~ length + sex", data, family="categorical")
idata = model.fit()

fig, ax = plot_cap(
    model=model,
    idata=idata,
    covariates="length",
    pps=False,
    legend=True, # not working with response classes
)
fig.set_size_inches(7, 3)

[screenshot: plot_cap output for the categorical model, one fitted line and band per outcome class]

Note that there is no legend for the class names, which makes it hard to distinguish the lines.

fig, ax = plot_cap(
    model=model,
    idata=idata,
    covariates={"horizontal": "length", "color": "sex", "panel": "sex"},
    fig_kwargs={"figsize": (16, 5), "sharey": True},
    pps=False
);

[screenshot: plot_cap output with color and panel both mapped to sex]

Again, there is no legend for the class names. The legend correctly identifies the sex, but it would be more informative if the panel title showed the sex and the legend listed the classes.

@GStechschulte (Collaborator, Author)

The legend can now show the class / outcome names when the data type of the response variable is categorical. If the response is categorical, the default legend=True bool is overwritten with a dict of the form {response_name: <class names>}. Inside the plotting function, if legend is not a bool, the dict's key and values are used as the legend title and labels.
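Roughly, the override works like this (a standalone sketch; response_name, class_names, and is_categorical_response are placeholders, not the actual variables in the code):

import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

# placeholders standing in for what plot_cap infers from the model
response_name = "choice"
class_names = ["Other", "Invertebrates", "Fish"]
is_categorical_response = True

legend = True  # the plot_cap default
if is_categorical_response:
    # overwrite the bool with {response_name: <class names>}
    legend = {response_name: class_names}

fig, ax = plt.subplots()
if not isinstance(legend, bool):
    # use the dict key as the legend title and the values as the labels
    (title, labels), = legend.items()
    handles = [Line2D([], [], color=f"C{i}") for i in range(len(labels))]
    ax.legend(handles, labels, title=title)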

Using the same code as above:

[screenshot: the same plot as above, now with a legend titled by the response name and listing the class names]

However, there is now a problem when both color and panel are mapped to sex: the color is fixed to C{i}, where i is determined from the unique values of color. To solve this, we would either need to override the color on each iteration of the loop or set a default color mapping based on the unique class names (which is still problematic, since the colors are assigned and plotted sequentially in the for loop). Any thoughts on this?

[screenshot: the color/panel-by-sex plot, where the per-class colors clash with the C{i} colors derived from the color covariate]
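For illustration, the second option mentioned above, a fixed color mapping keyed by the class names, could look roughly like this; all names here are placeholders:

import matplotlib.pyplot as plt

# placeholder class names; in practice these would come from the response
class_names = ["Other", "Invertebrates", "Fish"]
class_colors = {name: f"C{i}" for i, name in enumerate(class_names)}

fig, axes = plt.subplots(1, 2, figsize=(8, 3))  # e.g. one panel per sex
for ax in axes:
    for name in class_names:
        # look up the color by class name instead of relying on the loop index
        ax.plot([], [], color=class_colors[name], label=name)
axes[0].legend(title="choice")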

@GStechschulte marked this pull request as ready for review May 22, 2023 14:05
@@ -262,12 +266,17 @@ def plot_cap(
def _plot_cap_numeric(covariates, cap_data, y_hat_mean, y_hat_bounds, transforms, legend, axes):
main = covariates.get("horizontal")
transform_main = transforms.get(main, identity)
y_hat_bounds_dim = y_hat_bounds.ndim

Review comment from a Collaborator:

Nit: Can we call it y_ndim or y_hat_bounds_ndim?

@tomicapretto (Collaborator)

@GStechschulte thanks for the great PR, as always.

To your last point ("To solve this, we would also need to override the color each time in the loop or set a default colour mapping based on the unique class names (which is still problematic since the color is set and plotted sequentially in the for loop). Any thoughts on this?"):

One approach would be to allow users to map the dimension of the response to the color or panel layer of the plot. By default, when there are multiple response levels, the color is mapped to the level of the response, but it can be overridden, as you did in your example, and then it's up to the user to override things in a meaningful way. For example, I think it would make sense to map the response level to the panel when you want one panel per level.

The question then is how this is specified in the covariates argument. I think one possible choice could be "{response_name}_dim".

So in your example one would be able to do:

fig, ax = plot_cap(
    model=model,
    idata=idata,
    covariates={"horizontal": "length", "color": "sex", "panel": "choice_dim"},
    fig_kwargs={"figsize": (16, 5), "sharey": True},
    pps=False
);

or

fig, ax = plot_cap(
    model=model,
    idata=idata,
    covariates={"horizontal": "length", "color": "choice_dim", "panel": "sex"},
    fig_kwargs={"figsize": (16, 5), "sharey": True},
    pps=False
);

And by default the behavior would be as if you pass {"color": "choice_dim"}.

Do you think this makes sense? I'm not married to it of course.

@GStechschulte (Collaborator, Author)


I will circle back to this when the core functionality of comparisons, predictions, and slopes is completed. If anyone else wants to take this PR on, feel free.

@tomicapretto (Collaborator)

I agree with the approach. Let's do one at a time. Thanks for being explicit, by the way :)

@tomicapretto (Collaborator)

@GStechschulte after all the magic in #684, is this still needed?

@GStechschulte (Collaborator, Author) commented Jul 17, 2023

@tomicapretto I believe so. I did not change any of the code in plot_types.py in #684, which is where this bug now lives (previously it was directly in plot_cap.py).

@GStechschulte (Collaborator, Author)

Closing, as this PR will no longer resolve the issue. See the updated issue #723.

@GStechschulte deleted the plot-cap-categorical branch January 21, 2024 20:19