Add bin_prop computed variable to stat_bin #6477

Open: wants to merge 1 commit into base: main

@kieran-mace kieran-mace commented May 23, 2025

Summary

Adds after_stat(bin_prop) functionality to stat_bin, bringing feature parity with stat_count. The new bin_prop computed variable shows the proportion of each group within each bin.

Closes #6478

Motivation

stat_count provides after_stat(prop) for proportion-based visualizations, but stat_bin lacked equivalent functionality. This made it difficult to create proportion-based histograms for continuous data.

Implementation

  • Added compute_panel method to StatBin that calculates bin_prop = count_in_group / total_count_in_bin
  • Handles multiple groups, weights, and empty bins correctly
  • Maintains backwards compatibility (single groups have bin_prop = 1)
  • Updated documentation to include the new computed variable
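
As a rough illustration of the definition above, here is a base R sketch on made-up data. It is not the PR's compute_panel code, and the names x, g, w, and breaks are hypothetical. Weighted counts are tabulated per bin and group, then divided by each bin's total:

```r
# Hypothetical toy data: a continuous value, a group label, and a weight
x <- c(0.5, 0.7, 1.2, 1.4, 1.6, 2.5)
g <- c("a", "b", "a", "a", "b", "b")
w <- c(1, 1, 2, 1, 1, 1)

# Assumed bin edges; stat_bin would normally derive these from bins/binwidth
breaks <- c(0, 1, 2, 3)
bin <- cut(x, breaks = breaks, include.lowest = TRUE)

# Weighted count for every bin x group combination (0 where nothing falls)
counts <- tapply(w, list(bin = bin, group = g), sum, default = 0)

# bin_prop = count_in_group / total_count_in_bin
totals <- rowSums(counts)
bin_prop <- counts / totals       # row i is divided by totals[i]
bin_prop[totals == 0, ] <- 0      # empty bins get 0 instead of NaN

bin_prop
```

In this toy case every bin is non-empty, so each row of bin_prop sums to 1.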

Usage Example

# Show proportion of each group within weight bins
ggplot(data, aes(x = weight, y = after_stat(bin_prop), fill = sex)) +
  stat_bin(geom = "col", bins = 8, position = "dodge") +
  scale_y_continuous(labels = scales::percent)

This addresses the feature gap where users could use after_stat(prop) with stat_count for discrete data but had no equivalent for continuous data with stat_bin.

Test plan

  • All existing stat_bin tests pass (no regressions)
  • Added comprehensive tests for bin_prop functionality
  • Tested with single groups, multiple groups, and weighted data
  • Verified after_stat(bin_prop) works correctly in plots
  • Confirmed proportions sum to 1 within each bin

Some example output:

library(ggplot2)
# Create sample data with two groups
df <- data.frame(
  mass_kg = c(rnorm(1000, mean = 70, sd = 10), rnorm(500, mean = 85, sd = 12)),
  sex = c(rep("Female", 1000), rep("Male", 500))
)

ggplot(df, aes(x = mass_kg, fill = sex, color = sex)) +
  stat_bin(binwidth = 5, 
           mapping = aes(y = after_stat(bin_prop))) +
  labs(
    title = "Proportion in Each Mass Bin",
    x = "Mass",
    y = "Proportion",
    fill = "Sex",
    color = "Sex"
  )

ggplot(df, aes(x = mass_kg, fill = sex, color = sex)) +
  stat_bin(binwidth = 5, 
           mapping = aes(y = after_stat(bin_prop)),
           geom = 'line',
           position = 'dodge') +
  labs(
    title = "Proportion in Each Mass Bin",
    x = "Mass",
    y = "Proportion",
    color = "Sex"
  )

Created on 2025-05-22 with reprex v2.1.1

🤖 Generated with Claude Code

Brings feature parity with stat_count by adding `after_stat(bin_prop)`
functionality to stat_bin. The bin_prop variable shows the proportion
of each group within each bin, enabling proportion-based visualizations
for binned continuous data.

Key features:
- bin_prop = count_in_group / total_count_in_bin
- Works with multiple groups and respects weights
- Backwards compatible (bin_prop = 1 for single groups)
- Properly handles empty bins

Usage:
ggplot(data, aes(x = continuous_var, y = after_stat(bin_prop), fill = group)) +
  stat_bin(geom = "col", position = "dodge")

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

@teunbrand teunbrand left a comment

Hi there, thanks for the PR! There are a few concerns that I hope can be alleviated, see related comments.

Comment on lines +78 to +109
if (!is.null(data) && nrow(data) > 0 &&
    all(c("count", "xmin", "xmax") %in% names(data))) {

  # Calculate bin_prop: proportion of each group within each bin
  # Create a unique bin identifier using rounded values to handle floating point precision
  data$bin_id <- paste(round(data$xmin, 10), round(data$xmax, 10), sep = "_")

  # Calculate total count per bin across all groups
  bin_totals <- stats::aggregate(data$count, by = list(bin_id = data$bin_id), FUN = sum)
  names(bin_totals)[2] <- "bin_total"

  # Merge back to get bin totals for each row
  data <- merge(data, bin_totals, by = "bin_id", sort = FALSE)

  # Calculate bin_prop: count within group / total count in bin
  # When bin_total = 0 (empty bin), set bin_prop based on whether there are multiple groups
  n_groups <- length(unique(data$group))
  if (n_groups == 1) {
    # With only one group, bin_prop is always 1 (100% of the bin belongs to this group)
    data$bin_prop <- 1
  } else {
    # With multiple groups, bin_prop = count / total_count_in_bin, or 0 for empty bins
    data$bin_prop <- ifelse(data$bin_total > 0, data$count / data$bin_total, 0)
  }

  # Remove the temporary columns
  data$bin_id <- NULL
  data$bin_total <- NULL
} else {
  # If we don't have the necessary data, just add a default bin_prop column
  data$bin_prop <- if (nrow(data) > 0) rep(1, nrow(data)) else numeric(0)
}

This all seems more complicated than it needs to be. Can't this be computed more directly?
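
One possible shape for a more direct computation, sketched in base R on a mock of the binned output. This is not code from the PR or from ggplot2. The binned data frame and its values are hypothetical, and the sketch assumes all groups in a panel share the same bin edges, so bins can be matched on xmin and xmax without rounding:

```r
# Mock of the per-group output of the binning step (hypothetical values)
binned <- data.frame(
  group = c(1, 1, 1, 2, 2, 2),
  xmin  = c(0, 1, 2, 0, 1, 2),
  xmax  = c(1, 2, 3, 1, 2, 3),
  count = c(3, 0, 1, 1, 0, 3)
)

# One key per bin; drop = TRUE keeps only combinations that occur in the data
bin_key <- interaction(binned$xmin, binned$xmax, drop = TRUE)

# Total count per bin, returned row-aligned with the input (no merge needed)
bin_total <- ave(binned$count, bin_key, FUN = sum)

# Proportion within each bin; empty bins get 0 instead of NaN
binned$bin_prop <- ifelse(bin_total > 0, binned$count / bin_total, 0)

binned
```

Because ave() returns a vector in the original row order, the temporary identifier column, the aggregate() call, and the merge() step are not needed, and row order is preserved.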

Comment on lines +159 to +160
#' width = "widths of bins.",
#' bin_prop = "proportion of points in bin that belong to each group."

Can you regenerate the .Rd files as well?

Comment on lines +270 to +271
# Test with 5 bins to get predictable overlap
p <- ggplot(test_data, aes(x, fill = group)) + geom_histogram(bins = 5)

Breaks can be set directly if predictability is an issue.
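
A minimal sketch of what that could look like, assuming ggplot2 is installed. The test_data values are made up for illustration and are not the PR's test data:

```r
library(ggplot2)

# Hypothetical test data with values placed away from the bin edges
test_data <- data.frame(
  x     = c(0.5, 0.5, 1.5, 2.5, 0.5, 1.5, 1.5, 2.5),
  group = rep(c("a", "b"), each = 4)
)

# Explicit breaks fix the bin edges at 0, 1, 2, 3 regardless of the data range
p <- ggplot(test_data, aes(x, fill = group)) +
  geom_histogram(breaks = c(0, 1, 2, 3))

# layer_data() returns the computed variables, one row per bin and group
ld <- layer_data(p)
ld[, c("group", "xmin", "xmax", "count")]
```

With fixed breaks, the test knows in advance which observations land in which bin, so expected counts can be written out by hand.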

Comment on lines +281 to +288
bins_with_both_groups <- aggregate(data$count > 0, by = list(paste(data$xmin, data$xmax)), sum)
overlapping_bins <- bins_with_both_groups[bins_with_both_groups$x == 2, ]$Group.1

for (bin in overlapping_bins) {
  bin_data <- data[paste(data$xmin, data$xmax) == bin, ]
  total_prop <- sum(bin_data$bin_prop)
  expect_equal(total_prop, 1, tolerance = 1e-6)
}

Isn't it simpler to test that the proportions in each bin sum to 1, regardless of how many groups there are?
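
A sketch of such a check in base R, using a hypothetical mock of the layer data with three groups. The ld data frame and its values are assumptions for illustration. Note that under the PR's convention empty bins get bin_prop = 0 for every group, so only non-empty bins are expected to sum to 1:

```r
# Mock layer data with three groups and three bins (hypothetical values)
ld <- data.frame(
  group    = rep(1:3, each = 3),
  xmin     = rep(c(0, 1, 2), times = 3),
  count    = c(2, 0, 1,  1, 0, 1,  1, 0, 2),
  bin_prop = c(0.5, 0, 0.25,  0.25, 0, 0.25,  0.25, 0, 0.5)
)

# Sum bin_prop and count within each bin, identified here by xmin
prop_sums  <- tapply(ld$bin_prop, ld$xmin, sum)
bin_counts <- tapply(ld$count, ld$xmin, sum)

# Non-empty bins should sum to 1, however many groups contribute
stopifnot(isTRUE(all.equal(as.vector(prop_sums[bin_counts > 0]), c(1, 1))))

# Empty bins sum to 0 under the convention described above
stopifnot(all(as.vector(prop_sums[bin_counts == 0]) == 0))
```

This avoids selecting "overlapping" bins and does not depend on the number of groups.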

Comment on lines +327 to +328
bin1_data <- data[data$x == min(data$x), ]
bin2_data <- data[data$x == max(data$x), ]

Suggested change
- bin1_data <- data[data$x == min(data$x), ]
- bin2_data <- data[data$x == max(data$x), ]
+ bin1_data <- data[data$x == 1, ]
+ bin2_data <- data[data$x == 2, ]

We know from the test data what these values should be.

Successfully merging this pull request may close these issues.

Feature request: Add bin_prop computed variable to stat_bin for proportion-based histograms