Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop non-constant aesthetics more thoroughly #4917

Conversation

yutannihilation
Copy link
Member

@yutannihilation yutannihilation commented Jul 24, 2022

Fix #4394

ggplot2 drops an aesthetic when the values are not constant within the group. Currently, this happens per group. Even when the dropping happens, if any of the groups still have the aesthetics, the result keeps the column because vec_rbind() kindly fills the mismatch with NA.

data_new <- vec_rbind(!!!stats)

This creates incomplete result like below. As discussed on #4394, this should drop the aesthetic instead of just showing a warning (which I tried in #4916).

library(ggplot2)

d <- data.frame(
  id = c("a", "a", "b", "b", "c"),
  attr = 1:5
)

ggplot(d, aes(id, fill = attr)) +
  geom_bar()

Created on 2022-07-27 by the reprex package (v2.0.1)

This pull request inlines uniquecols() to catch the dropped columns and record it to an empty env so that we can refer to it. I feel this is better rather than inferring from the results what columns are dropped, but I'm open to any better idea.

@thomasp85 thomasp85 requested a review from clauswilke July 26, 2022 15:33
@clauswilke
Copy link
Member

@thomasp85 A few more days without proper internet for me. Should be able to look this over carefully early next week.

@thomasp85
Copy link
Member

@clauswilke no problem at all, was just going through open PRs and this seemed to belong in your review bucket given the issue discussion — no stress

@yutannihilation yutannihilation added this to the ggplot2 3.4.0 milestone Jul 26, 2022
@yutannihilation yutannihilation mentioned this pull request Jul 26, 2022
@thomasp85
Copy link
Member

@clauswilke will you have time for this one within the next couple of weeks? Sorry to keep pinging you — if you are floored by other work just let me know

@clauswilke
Copy link
Member

@thomasp85 I'll look it over today.

@yutannihilation
Copy link
Member Author

Thanks. Honestly, I'm not sure if this is the best way to record the dropped columns. Any suggestion is welcome.

@clauswilke
Copy link
Member

@yutannihilation Yeah, that's why I've been stalling. I'm not entirely sure what to think. The process of making an environment, adding things to it, and then listing all the contents seems very awkward. Also difficult to reason about. Have you considered working with a simple vector of names and using set functions (union()) to combine sets of names?

@yutannihilation
Copy link
Member Author

I think the idea of using an env as a hashmap is not so uncommon. I came up with several ideas including yours, and I just couldn't figure out which one is most efficient (and didn't have time to benchmark them).

So, do you mean something like this?

    dropped_columns <- list()
    stats <- mapply(function(new, old) {
      ...

      dropped_columns <- append(dropped_columns, names(old)[!unique_idx])

      ...
    }, stats, groups, SIMPLIFY = FALSE)

    ...

    dropped <- Reduce(dropped_columns, union)

@clauswilke
Copy link
Member

Would this work?

dropped_columns <- character(0)
stats <- mapply(function(new, old) {
    ...
    dropped_columns <- c(dropped_columns, names(old)[!unique_idx])
    ...
}, stats, groups, SIMPLIFY = FALSE)
...
dropped <- unique(dropped_columns)

Either way, much simpler to read than the code using environments. For code that is not performance critical, I'd always go with the code that is easier to read than the one that (potentially) is a little faster.

@clauswilke
Copy link
Member

Or alternatively:

dropped <- character(0)
stats <- mapply(function(new, old) {
    ...
    dropped <- union(dropped, names(old)[!unique_idx])
    ...
}, stats, groups, SIMPLIFY = FALSE)
...

@yutannihilation
Copy link
Member Author

Thanks, makes sense. I'll use the first one.

Another one I attempted was returning a list rather than a data.frame (and I forgot to remove result <-...).

    stats <- mapply(function(new, old) {
      ...
      dropped <- union(dropped, names(old)[!unique_idx])
      
      result <- vec_cbind(
        new,
        old[rep(1, nrow(new)), unique_idx, drop = FALSE]
      )
      
      list(
        result = result,
        dropped = dropped
      )
    }, stats, groups, SIMPLIFY = FALSE)

    data_new <- vec_rbind(!!!lapply(stats, function(x) x[["result"]])

    dropped <- vec_unique(vctrs_c(!!!lapply(stats, function(x) x[["dropped"]])))

@clauswilke
Copy link
Member

@yutannihilation Thanks! I have two more questions:

  1. What is unique0()?
  2. What happens when different columns get dropped in different groups. How does the data frame get assembled again? Would it make sense to add a test for this case, to make sure things work as expected?

@yutannihilation
Copy link
Member Author

What is unique0()?

It's defined here. After the vctrs integration (#4868), most of the code uses unique0(), so I followed it.

unique0 <- function(x, ...) if (is.null(x)) x else vec_unique(x, ...)

Would it make sense to add a test for this case, to make sure things work as expected?

Sure. I'll add tests.

@yutannihilation
Copy link
Member Author

Added tests.

2. What happens when different columns get dropped in different groups. How does the data frame get assembled again?

What do you mean by "what happens" here? All get dropped. Do you expect anything more? Sorry, I'm not sure if I understand the intention of this question.

@clauswilke
Copy link
Member

@yutannihilation I was just trying to understand how the code works. There's something that I'm still missing. You're creating separate data frames for each group, and then you're row-binding them all together. That's fine, but what happens when the data frames for different groups have different columns. Shouldn't row-binding fail in this case? But now looking it up, it seems like vec_rbind() can handle this scenario, so I guess that's why it works.

R/stat-.r Outdated
if (length(dropped) > 0) {
cli::cli_warn(c(
"The following aesthetics were dropped during statistical transformation: {.field {glue_collapse(dropped, sep = ', ')}}",
"i" = "This can happen when ggplot fails to infer the correct grouping structure in the data.",
"i" = "Did you forget to specify a {.code group} aesthetic or to convert a numerical variable into a factor?"
))
}
data_new
data_new[, !names(data_new) %in% dropped, drop = FALSE]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the logic here is quite right. Shouldn't line 155 come before line 147? We want to drop all columns in dropped, not just the ones that are not in self$dropped_aes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking this again, it seems you are right. Thanks for catching.

@yutannihilation
Copy link
Member Author

But now looking it up, it seems like vec_rbind() can handle this scenario, so I guess that's why it works.

Yes, that's why we see this kind of incomplete fill/colour aesthetic. This pull request doesn't change anything here. The code inside mapply() looks different, but it's basically just inlining uniquecols() and record dropped.

@yutannihilation
Copy link
Member Author

I added many comments and rewrote the code a bit. I hope this will make it a bit easier to understand the logic...

Copy link
Member

@clauswilke clauswilke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks great now!

@yutannihilation
Copy link
Member Author

Thanks so much for reviewing this complicated pull request! I hope I can do better next time...!

@yutannihilation yutannihilation merged commit 32534e9 into tidyverse:main Aug 24, 2022
@yutannihilation yutannihilation deleted the fix/issue-4394-drop-partially-dropped-aes branch August 24, 2022 14:50
@clauswilke
Copy link
Member

I think you did great. Just because there's room for improvement doesn't mean the first iteration was bad.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a warning in geom_bar when mapping of a continuous variable to the fill aesthetic is not possible?
3 participants