Add bin_prop computed variable to stat_bin #6477

kieran-mace · 2025-05-23T03:32:47Z

Summary

Adds after_stat(bin_prop) functionality to stat_bin, bringing feature parity with stat_count. The new bin_prop computed variable shows the proportion of each group within each bin.

Closes #6478

Motivation

stat_count provides after_stat(prop) for proportion-based visualizations, but stat_bin lacked equivalent functionality. This made it difficult to create proportion-based histograms for continuous data.

Implementation

Added compute_panel method to StatBin that calculates bin_prop = count_in_group / total_count_in_bin
Handles multiple groups, weights, and empty bins correctly
Maintains backwards compatibility (single groups have bin_prop = 1)
Updated documentation to include the new computed variable

Usage Example

# Show proportion of each group within weight bins
ggplot(data, aes(x = weight, y = after_stat(bin_prop), fill = sex)) +
  stat_bin(geom = "col", bins = 8, position = "dodge") +
  scale_y_continuous(labels = scales::percent)

This addresses the feature gap where users could use after_stat(prop) with stat_count for discrete data but had no equivalent for continuous data with stat_bin.

Test plan

All existing stat_bin tests pass (no regressions)
Added comprehensive tests for bin_prop functionality
Tested with single groups, multiple groups, and weighted data
Verified after_stat(bin_prop) works correctly in plots
Confirmed proportions sum to 1 within each bin

Some example output:

library(ggplot2)
# Create sample data with two groups
df <- data.frame(
  mass_kg = c(rnorm(1000, mean = 70, sd = 10), rnorm(500, mean = 85, sd = 12)),
  sex = c(rep("Female", 1000), rep("Male", 500))
)

ggplot(df, aes(x = mass_kg, fill = sex, color = sex)) +
  stat_bin(binwidth = 5, 
           mapping = aes(y = after_stat(bin_prop))) +
  labs(
    title = "Proportion in Each Mass bin",
    x = "Mass",
    y = "Proportion",
    color = "Sex"
  )

ggplot(df, aes(x = mass_kg, fill = sex, color = sex)) +
  stat_bin(binwidth = 5, 
           mapping = aes(y = after_stat(bin_prop)),
           geom = 'line',
           position = 'dodge') +
  labs(
    title = "Proportion in Each Mass bin",
    x = "Mass",
    y = "Proportion",
    color = "Sex"
  )

Created on 2025-05-22 with reprex v2.1.1

🤖 Generated with Claude Code

Brings feature parity with stat_count by adding `after_stat(bin_prop)` functionality to stat_bin. The bin_prop variable shows the proportion of each group within each bin, enabling proportion-based visualizations for binned continuous data.

Key features:
- bin_prop = count_in_group / total_count_in_bin
- Works with multiple groups and respects weights
- Backwards compatible (bin_prop = 1 for single groups)
- Properly handles empty bins

Usage:
ggplot(data, aes(x = continuous_var, y = after_stat(bin_prop), fill = group)) +
  stat_bin(geom = "col", position = "dodge")

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

teunbrand

Hi there, thanks for the PR! There are a few concerns that I hope can be alleviated, see related comments.

teunbrand · 2025-05-23T09:46:56Z

R/stat-bin.R

+    if (!is.null(data) && nrow(data) > 0 && 
+        all(c("count", "xmin", "xmax") %in% names(data))) {
+
+      # Calculate bin_prop: proportion of each group within each bin
+      # Create a unique bin identifier using rounded values to handle floating point precision
+      data$bin_id <- paste(round(data$xmin, 10), round(data$xmax, 10), sep = "_")
+
+      # Calculate total count per bin across all groups
+      bin_totals <- stats::aggregate(data$count, by = list(bin_id = data$bin_id), FUN = sum)
+      names(bin_totals)[2] <- "bin_total"
+
+      # Merge back to get bin totals for each row
+      data <- merge(data, bin_totals, by = "bin_id", sort = FALSE)
+
+      # Calculate bin_prop: count within group / total count in bin
+      # When bin_total = 0 (empty bin), set bin_prop based on whether there are multiple groups
+      n_groups <- length(unique(data$group))
+      if (n_groups == 1) {
+        # With only one group, bin_prop is always 1 (100% of the bin belongs to this group)
+        data$bin_prop <- 1
+      } else {
+        # With multiple groups, bin_prop = count / total_count_in_bin, or 0 for empty bins
+        data$bin_prop <- ifelse(data$bin_total > 0, data$count / data$bin_total, 0)
+      }
+
+      # Remove the temporary columns
+      data$bin_id <- NULL
+      data$bin_total <- NULL
+    } else {
+      # If we don't have the necessary data, just add a default bin_prop column
+      data$bin_prop <- if (nrow(data) > 0) rep(1, nrow(data)) else numeric(0)
+    }


This all seems more complicated than it needs to be. Can't this be computed more directly?
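
For illustration only, here is one way the computation might be written more directly in base R. It is a sketch, not the change the reviewer had in mind or the code that was eventually merged. The data frame `binned` is a hypothetical stand-in for the per-panel output of the binning step, and the sketch assumes every group in a panel shares the same breaks, so that `xmin` alone identifies a bin.

```r
# Hypothetical stand-in for binned per-panel data: one row per group per bin.
binned <- data.frame(
  group = rep(1:2, each = 3),
  xmin  = rep(c(0, 5, 10), times = 2),
  xmax  = rep(c(5, 10, 15), times = 2),
  count = c(4, 0, 6, 12, 0, 2)
)

# Total count per bin across groups, returned aligned with the rows of `binned`.
# Assumption: all groups share the same breaks, so `xmin` identifies the bin.
bin_total <- ave(binned$count, binned$xmin, FUN = sum)

# Share of each group within its bin; empty bins get 0 rather than NaN.
binned$bin_prop <- ifelse(bin_total > 0, binned$count / bin_total, 0)
```

`ave()` avoids the `aggregate()` plus `merge()` round trip and the temporary `bin_id` column. Note one behavioural difference from the diff above: with a single group, this sketch gives 0 for empty bins instead of 1.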

teunbrand · 2025-05-23T09:47:10Z

R/stat-bin.R

+#'   width    = "widths of bins.",
+#'   bin_prop = "proportion of points in bin that belong to each group."


Can you regenerate the .Rd files as well?

teunbrand · 2025-05-23T09:48:17Z

tests/testthat/test-stat-bin.R

+  # Test with 5 bins to get predictable overlap
+  p <- ggplot(test_data, aes(x, fill = group)) + geom_histogram(bins = 5)


Breaks can be set directly if predictability is an issue
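
As a rough sketch of that suggestion: passing `breaks` to `geom_histogram()` fixes the bin edges, so the test no longer depends on how `bins = 5` happens to split the range. The `test_data` below is hypothetical, since the PR's actual test data is not shown in this thread.

```r
library(ggplot2)

# Hypothetical test data; values sit away from the bin edges on purpose.
test_data <- data.frame(
  x     = c(0.5, 1.5, 1.5, 2.5, 1.5, 2.5, 2.5, 3.5),
  group = rep(c("A", "B"), each = 4)
)

# Explicit breaks pin the bins to [0,1], (1,2], (2,3], (3,4] for every group.
p <- ggplot(test_data, aes(x, fill = group)) +
  geom_histogram(breaks = 0:4)

# Extract the computed layer data to inspect the bins.
binned <- layer_data(p)
```

With the edges fixed, the expected count in each bin can be read straight off the test data.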

teunbrand · 2025-05-23T09:51:33Z

tests/testthat/test-stat-bin.R

+  bins_with_both_groups <- aggregate(data$count > 0, by = list(paste(data$xmin, data$xmax)), sum)
+  overlapping_bins <- bins_with_both_groups[bins_with_both_groups$x == 2, ]$Group.1
+
+  for (bin in overlapping_bins) {
+    bin_data <- data[paste(data$xmin, data$xmax) == bin, ]
+    total_prop <- sum(bin_data$bin_prop)
+    expect_equal(total_prop, 1, tolerance = 1e-6)
+  }


Isn't it simpler to test that the sum over bins is 1, regardless of how many groups?
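
A minimal sketch of what such a check could look like, written in base R so it runs without testthat. The `binned` data frame is made up for illustration and has three groups. The check assumes every bin is non-empty; an empty bin would sum to 0 under the PR's definition and would need to be excluded first.

```r
# Hypothetical layer data: three groups, two bins, bin_prop already computed.
binned <- data.frame(
  group    = rep(c("A", "B", "C"), each = 2),
  xmin     = rep(c(0, 1), times = 3),
  bin_prop = c(0.5, 0.2, 0.3, 0.8, 0.2, 0.0)
)

# Sum bin_prop within each bin, across however many groups there are.
per_bin <- tapply(binned$bin_prop, binned$xmin, sum)

# Every non-empty bin should total 1 (up to floating point tolerance).
sums_to_one <- isTRUE(all.equal(as.vector(per_bin), rep(1, length(per_bin))))
```

Inside a testthat test, the last line would presumably become an `expect_equal()` call on the same two vectors.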

teunbrand · 2025-05-23T10:04:43Z

tests/testthat/test-stat-bin.R

+  bin1_data <- data[data$x == min(data$x), ]
+  bin2_data <- data[data$x == max(data$x), ]


Suggested change

-  bin1_data <- data[data$x == min(data$x), ]
-  bin2_data <- data[data$x == max(data$x), ]
+  bin1_data <- data[data$x == 1, ]
+  bin2_data <- data[data$x == 2, ]

We know from the test data what these values should be

This was referenced May 23, 2025

Feature request: Add bin_prop computed variable to stat_bin for proportion-based histograms #6478

Open

Computing stats between groups #6476

Closed

teunbrand requested changes May 23, 2025
