Add bin_prop computed variable to stat_bin #6477
Brings feature parity with stat_count by adding `after_stat(bin_prop)` functionality to stat_bin. The bin_prop variable shows the proportion of each group within each bin, enabling proportion-based visualizations for binned continuous data.

Key features:
- bin_prop = count_in_group / total_count_in_bin
- Works with multiple groups and respects weights
- Backwards compatible (bin_prop = 1 for single groups)
- Properly handles empty bins

Usage:
ggplot(data, aes(x = continuous_var, y = after_stat(bin_prop), fill = group)) +
  stat_bin(geom = "col", position = "dodge")

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Hi there, thanks for the PR! There are a few concerns that I hope can be alleviated, see related comments.
if (!is.null(data) && nrow(data) > 0 &&
    all(c("count", "xmin", "xmax") %in% names(data))) {

  # Calculate bin_prop: proportion of each group within each bin
  # Create a unique bin identifier using rounded values to handle floating point precision
  data$bin_id <- paste(round(data$xmin, 10), round(data$xmax, 10), sep = "_")

  # Calculate total count per bin across all groups
  bin_totals <- stats::aggregate(data$count, by = list(bin_id = data$bin_id), FUN = sum)
  names(bin_totals)[2] <- "bin_total"

  # Merge back to get bin totals for each row
  data <- merge(data, bin_totals, by = "bin_id", sort = FALSE)

  # Calculate bin_prop: count within group / total count in bin
  # When bin_total = 0 (empty bin), set bin_prop based on whether there are multiple groups
  n_groups <- length(unique(data$group))
  if (n_groups == 1) {
    # With only one group, bin_prop is always 1 (100% of the bin belongs to this group)
    data$bin_prop <- 1
  } else {
    # With multiple groups, bin_prop = count / total_count_in_bin, or 0 for empty bins
    data$bin_prop <- ifelse(data$bin_total > 0, data$count / data$bin_total, 0)
  }

  # Remove the temporary columns
  data$bin_id <- NULL
  data$bin_total <- NULL
} else {
  # If we don't have the necessary data, just add a default bin_prop column
  data$bin_prop <- if (nrow(data) > 0) rep(1, nrow(data)) else numeric(0)
}
This all seems more complicated than it needs to be. Can't this be computed more directly?
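For illustration, here is a minimal sketch of what a more direct computation might look like. It is not necessarily what the reviewer had in mind. It assumes every group is binned with the same breaks, so `xmin` alone identifies a bin, and it swaps base R's `ave()` in for the aggregate/merge round trip. The `binned` data frame is a hypothetical stand-in for the stat's output.

```r
# Hypothetical stand-in for binned output: two bins, two groups.
binned <- data.frame(
  group = c(1, 1, 2, 2),
  xmin  = c(0, 1, 0, 1),
  xmax  = c(1, 2, 1, 2),
  count = c(3, 0, 1, 0)
)

# Total count per bin, repeated on every row of that bin.
# Assumption: all groups share breaks, so xmin identifies the bin.
bin_total <- ave(binned$count, binned$xmin, FUN = sum)

# Share of each group within its bin; empty bins get 0 instead of NaN.
binned$bin_prop <- ifelse(bin_total > 0, binned$count / bin_total, 0)
```

Because `ave()` returns a vector aligned with the input rows, no temporary columns or row reordering are involved.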
#'   width = "widths of bins.",
#'   bin_prop = "proportion of points in bin that belong to each group."
Can you regenerate the .Rd files as well?
# Test with 5 bins to get predictable overlap
p <- ggplot(test_data, aes(x, fill = group)) + geom_histogram(bins = 5)
Breaks can be set directly if predictability is an issue
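As a sketch of that suggestion: passing `breaks` to `geom_histogram()` pins the bin edges, so the test no longer depends on how `bins = 5` happens to divide the data range. The `test_data` values and the break points below are hypothetical, chosen only so that each observation falls strictly inside a bin.

```r
library(ggplot2)

# Hypothetical test data; each value sits strictly inside one bin.
test_data <- data.frame(
  x     = c(0.5, 1.5, 1.5, 2.5),
  group = c("a", "a", "b", "b")
)

# Explicit breaks fix the bin edges at 0, 1, 2 and 3.
p <- ggplot(test_data, aes(x, fill = group)) +
  geom_histogram(breaks = c(0, 1, 2, 3))

# layer_data() returns the computed layer data, including xmin, xmax and count.
binned <- layer_data(p)
```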
bins_with_both_groups <- aggregate(data$count > 0, by = list(paste(data$xmin, data$xmax)), sum)
overlapping_bins <- bins_with_both_groups[bins_with_both_groups$x == 2, ]$Group.1

for (bin in overlapping_bins) {
  bin_data <- data[paste(data$xmin, data$xmax) == bin, ]
  total_prop <- sum(bin_data$bin_prop)
  expect_equal(total_prop, 1, tolerance = 1e-6)
}
Isn't it more simple to test that the sum over bins is 1, regardless of how many groups?
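A sketch of the simpler check, written in base R so it stands alone. The `binned` data frame is a hypothetical stand-in for the layer data, with three groups to show that the check does not depend on the number of groups. In the test file the final assertion would presumably be a testthat `expect_equal()` call instead.

```r
# Hypothetical stat output: three groups over two non-empty bins.
binned <- data.frame(
  group    = c(1, 2, 3, 1, 2, 3),
  xmin     = c(0, 0, 0, 1, 1, 1),
  bin_prop = c(0.5, 0.25, 0.25, 0.2, 0.3, 0.5)
)

# Sum bin_prop within each bin, keyed by the bin's lower edge.
prop_sums <- tapply(binned$bin_prop, binned$xmin, sum)

# Every non-empty bin should sum to 1, however many groups there are.
stopifnot(all(abs(prop_sums - 1) < 1e-6))
```

Under the PR's convention that empty bins get `bin_prop = 0` when there are multiple groups, empty bins would have to be excluded before this check.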
bin1_data <- data[data$x == min(data$x), ]
bin2_data <- data[data$x == max(data$x), ]
Suggested change:
- bin1_data <- data[data$x == min(data$x), ]
- bin2_data <- data[data$x == max(data$x), ]
+ bin1_data <- data[data$x == 1, ]
+ bin2_data <- data[data$x == 2, ]
We know from the test data what these values should be
Summary
Adds `after_stat(bin_prop)` functionality to `stat_bin`, bringing feature parity with `stat_count`. The new `bin_prop` computed variable shows the proportion of each group within each bin.

Closes #6478

Motivation
`stat_count` provides `after_stat(prop)` for proportion-based visualizations, but `stat_bin` lacked equivalent functionality. This made it difficult to create proportion-based histograms for continuous data.

Implementation
- […] `compute_panel` method to `StatBin` that calculates `bin_prop = count_in_group / total_count_in_bin`
- […] (`bin_prop = 1`)

Usage Example
This addresses the feature gap where users could use `after_stat(prop)` with `stat_count` for discrete data but had no equivalent for continuous data with `stat_bin`.

Test plan
- […] `stat_bin` tests pass (no regressions)
- […] `bin_prop` functionality
- […] `after_stat(bin_prop)` works correctly in plots

Some example output:
Created on 2025-05-22 with reprex v2.1.1
🤖 Generated with Claude Code