
Feature request: Add bin_prop computed variable to stat_bin for proportion-based histograms #6478


Open
kieran-mace opened this issue May 23, 2025 · 1 comment · May be fixed by #6477


kieran-mace commented May 23, 2025

Summary

stat_bin currently has no counterpart to the after_stat(prop) variable that stat_count provides, which makes proportion-based visualizations of continuous data difficult. This feature request proposes adding a bin_prop computed variable to stat_bin to close that gap.

Problem Description

Currently, users can create proportion-based bar charts with discrete data using stat_count:

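# This works with discrete data
# (group = group makes prop a per-group share; without it every bar is its own group and prop is 1)
ggplot(data, aes(x = discrete_var, y = after_stat(prop), fill = group, group = group)) +
  geom_bar(position = "dodge")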

However, there's no equivalent for continuous data with stat_bin:

# This doesn't work - no prop variable available
ggplot(data, aes(x = continuous_var, y = after_stat(prop), fill = group)) +
  geom_histogram(position = "dodge", bins = 10)
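
One way to check which computed variables a histogram layer exposes is to inspect the data frame the stat returns. The snippet below is a small sketch that uses the built-in `faithful` dataset as a stand-in (an assumption; any continuous column would do) and ggplot2's `layer_data()` helper.

```r
library(ggplot2)

# `faithful` ships with R's datasets package; `waiting` is a continuous column.
p <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(bins = 10)

# layer_data() returns the data frame computed for the first layer of the plot.
computed <- names(layer_data(p))

# Show which of the variables discussed in this issue are actually present.
intersect(
  c("count", "density", "ncount", "ndensity", "width", "prop", "bin_prop"),
  computed
)
```

Any name missing from the result cannot be used inside after_stat() for that layer.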

Use Case Example

Consider analyzing weight distribution by sex. Users want to see the proportion of each sex within weight bins:

# Desired functionality (currently not possible)
ggplot(people_data, aes(x = weight, y = after_stat(bin_prop), fill = sex)) +
  stat_bin(geom = "col", bins = 8, position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Proportion within bin")

This would show insights such as:

  • Lower weight bins: ~100% female
  • Middle weight bins: Mixed proportions
  • Higher weight bins: ~100% male

Proposed Solution

Add a bin_prop computed variable to stat_bin that calculates the proportion of each group within each bin:

  • bin_prop = count_in_group / total_count_in_bin
  • Handles multiple groups and respects weights
  • For a single group: bin_prop = 1 in every non-empty bin (backwards compatible)
  • For empty bins: bin_prop = 0
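
The rules above can be sketched in base R, outside of ggplot2. This only illustrates the arithmetic and is not the code from the linked PR; the helper name `bin_prop_sketch`, its arguments, and the toy data are all hypothetical.

```r
# Hypothetical helper: illustrates the proposed arithmetic only.
bin_prop_sketch <- function(x, group, breaks, weight = rep(1, length(x))) {
  bin <- cut(x, breaks = breaks, include.lowest = TRUE)
  group <- factor(group)

  # Weighted count per (bin, group) cell; cells with no data come back as NA.
  counts <- tapply(weight, list(bin = bin, group = group), sum)
  counts[is.na(counts)] <- 0

  # Total weighted count per bin, across all groups.
  totals <- rowSums(counts)

  # Each row is divided by its own bin total; empty bins would give NaN (0 / 0),
  # so they are set to 0 as the proposal suggests.
  prop <- counts / totals
  prop[totals == 0, ] <- 0
  prop
}

# Toy data: the last bin (30, 40] is empty.
x      <- c(1, 2, 3, 11, 12, 25)
group  <- c("f", "f", "m", "m", "m", "m")
breaks <- c(0, 10, 20, 30, 40)

res_unweighted <- bin_prop_sketch(x, group, breaks)
res_weighted   <- bin_prop_sketch(x, group, breaks, weight = c(1, 1, 2, 1, 1, 1))
res_single     <- bin_prop_sketch(x, rep("all", length(x)), breaks)
```

In the unweighted call the first bin holds two "f" points and one "m" point, so its proportions are 2/3 and 1/3; giving the "m" point a weight of 2 makes that bin an even split.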

Benefits

  1. Feature parity with stat_count
  2. Enables proportion-based histograms for continuous data
  3. Useful for demographic analysis and group comparisons
  4. Backwards compatible - doesn't break existing code

Alternatives Considered

  1. Manual calculation: Users could manually calculate proportions, but this is cumbersome and error-prone
  2. Using stat_count with discretized data: Loses the benefits of proper binning algorithms
  3. Custom stat function: Would require users to write their own implementation
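
For comparison, the manual-calculation alternative might look roughly like the sketch below. The simulated `people_data` is a stand-in for the hypothetical data frame in the use case, and the equal-width breaks are chosen by hand, so they will not necessarily match the bins stat_bin would pick.

```r
library(ggplot2)

# Simulated stand-in for the issue's hypothetical `people_data`.
set.seed(1)
people_data <- data.frame(
  sex    = rep(c("female", "male"), each = 200),
  weight = c(rnorm(200, mean = 62, sd = 9), rnorm(200, mean = 80, sd = 11))
)

# Step 1: bin by hand into 8 equal-width bins spanning the data range.
breaks <- seq(min(people_data$weight), max(people_data$weight), length.out = 9)
people_data$bin <- cut(people_data$weight, breaks = breaks, include.lowest = TRUE)

# Step 2: count per (bin, sex), then divide by the total count of each bin.
counts <- as.data.frame(table(bin = people_data$bin, sex = people_data$sex))
counts$bin_prop <- counts$Freq / ave(counts$Freq, counts$bin, FUN = sum)
counts$bin_prop[is.nan(counts$bin_prop)] <- 0  # empty bins: 0 / 0

# Step 3: plot the precomputed proportions as dodged columns.
p <- ggplot(counts, aes(x = bin, y = bin_prop, fill = sex)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Weight bin", y = "Proportion within bin")
```

The binning, counting, and normalizing steps all live outside the plot call here, which is the overhead the proposed computed variable is meant to remove.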

Expected API

# Documentation would include:
#' @eval rd_computed_vars(
#'   count    = "number of points in bin.",
#'   density  = "density of points in bin, scaled to integrate to 1.",
#'   ncount   = "count, scaled to a maximum of 1.",
#'   ndensity = "density, scaled to a maximum of 1.",
#'   width    = "widths of bins.",
#'   bin_prop = "proportion of points in bin that belong to each group."
#' )

This would enable the intuitive usage:

aes(y = after_stat(bin_prop))

Additional Context

This feature would be particularly valuable for:

  • Demographic analysis (age/income by group)
  • Scientific data (measurements by treatment group)
  • Market research (customer segments by behavior)
  • Any scenario where you want to show group composition within continuous ranges

The implementation should handle edge cases like empty bins, single groups, and weighted data appropriately.

@kieran-mace

PR: #6477
