Skip to content

Drop plot_env from ggplot2 objects? #3994

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
clauswilke opened this issue May 13, 2020 · 22 comments
Open

Drop plot_env from ggplot2 objects? #3994

clauswilke opened this issue May 13, 2020 · 22 comments

Comments

@clauswilke
Copy link
Member

This comment #3619 (comment) prompted me to look into plot_env, and as far as I can see it's stored and handed around just so it can be eventually given to the function combine_vars(), which then doesn't use it:

ggplot2/R/facet-.r

Lines 544 to 587 in e9b9946

combine_vars <- function(data, env = emptyenv(), vars = NULL, drop = TRUE) {
possible_columns <- unique(unlist(lapply(data, names)))
if (length(vars) == 0) return(new_data_frame())
# For each layer, compute the facet values
values <- compact(lapply(data, eval_facets, facets = vars, possible_columns = possible_columns))
# Form the base data.frame which contains all combinations of faceting
# variables that appear in the data
has_all <- unlist(lapply(values, length)) == length(vars)
if (!any(has_all)) {
missing <- lapply(values, function(x) setdiff(names(vars), names(x)))
missing_txt <- vapply(missing, var_list, character(1))
name <- c("Plot", paste0("Layer ", seq_len(length(data) - 1)))
abort(glue(
"At least one layer must contain all faceting variables: {var_list(names(vars))}.\n",
glue_collapse(glue("* {name} is missing {missing_txt}"), "\n", last = "\n")
))
}
base <- unique(rbind_dfs(values[has_all]))
if (!drop) {
base <- unique_combs(base)
}
# Systematically add on missing combinations
for (value in values[!has_all]) {
if (empty(value)) next;
old <- base[setdiff(names(base), names(value))]
new <- unique(value[intersect(names(base), names(value))])
if (drop) {
new <- unique_combs(new)
}
base <- unique(rbind(base, df.grid(old, new)))
}
if (empty(base)) {
abort("Faceting variables must have at least one value")
}
base
}

Is this a hold-over from the times before tidy eval? Can we remove this/or assign NULL here?
plot_env = environment

It is already marked as deprecated in the documentation:

#' @param environment DEPRECATED. Used prior to tidy evaluation.

@clauswilke
Copy link
Member Author

Setting plot_env to NULL doesn't work, which means it is still used somewhere as an environment. However, setting it to an empty environment seems to work without issues.

library(ggplot2)

p <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  facet_wrap(~Species)

p$plot_env <- NULL
p
#> Error in exists(name, envir = env, mode = mode): use of NULL environment is defunct

p$plot_env <- rlang::new_environment()
p

Created on 2020-05-13 by the reprex package (v0.3.0)

It looks like the other place where the environment is used is for scale lookup. Maybe it would be better to use the global environment that is active when the plot is printed, rather than when the plot is created? That would be more predictable anyways, I assume.

ggplot2/R/scale-type.R

Lines 28 to 39 in 5a686c3

find_global <- function(name, env, mode = "any") {
if (exists(name, envir = env, mode = mode)) {
return(get(name, envir = env, mode = mode))
}
nsenv <- asNamespace("ggplot2")
if (exists(name, envir = nsenv, mode = mode)) {
return(get(name, envir = nsenv, mode = mode))
}
NULL
}

@clauswilke
Copy link
Member Author

On the last point, just setting plot_env = rlang::new_environment() means we cannot globally change color scales anymore. So we'd have to fix this appropriately.

library(ggplot2)

p <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  facet_wrap(~Species)

scale_colour_discrete <- scale_colour_viridis_d
p

p$plot_env <- rlang::new_environment()
p

Created on 2020-05-13 by the reprex package (v0.3.0)

@yutannihilation
Copy link
Member

Thanks for investigating this. I didn't know the mechanism around plot_env.

At first, ggplot2 only looks up in global environment, but started to check the package environment by this commit:

a9f5e0a

Then, find_global() was changed to use the parent environment instead of global environment by this commit:

64666ab

I agree it's nice if we can get rid of plot_env, but the fix might be tricky if we want to preserve the current behavior. For example, I'm afraid looking up only in global environment would break this:

library(ggplot2)

do_plot <- function() {
  p <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
    geom_point() +
    facet_wrap(~Species)
  
  scale_colour_discrete <- scale_colour_viridis_d
  
  p
}

p <- do_plot()
p

p$plot_env <- rlang::new_environment()
p

Created on 2020-05-14 by the reprex package (v0.3.0)

@clauswilke
Copy link
Member Author

I get that, but the question is do we want to encourage this usage pattern? Does anybody use it? Is it even logical? See the next example. Would you have expected the plot to use the brewer color scale?

library(ggplot2)

do_plot <- function() {
  p <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
    geom_point() +
    facet_wrap(~Species)
  
  scale_colour_discrete <- scale_colour_viridis_d
  
  p
}

p <- do_plot()

scale_colour_discrete <- scale_colour_brewer
p

Created on 2020-05-14 by the reprex package (v0.3.0)

My sense is that using the environment that is active when the plot is being printed is totally reasonable, and that doesn't require saving the environment in the plot itself. I also feel that if people write functions that return plots, then they should explicitly add the scales they want.

@yutannihilation
Copy link
Member

do we want to encourage this usage pattern? Does anybody use it?

I really don't know, sorry. I didn't even know (or completely forget) that we can override the default scale in such a way... I think your claim is 99% valid, but I just don't see the context why ggplot2 needed to be changed to use parent environment instead of global environment.

@yutannihilation
Copy link
Member

By the way, is this related to #2691? If we can set the default scale by setting the default theme, probably we don't need to tweak globals (or some environments) anymore.

@clauswilke
Copy link
Member Author

clauswilke commented May 15, 2020

@hadley Do you remember why scale lookup in the parent environment was added? 64666ab
Would it be terrible to get rid of this feature?

@hadley
Copy link
Member

hadley commented May 15, 2020

I have a vague recollection it was something to do with scale customisation. Maybe so that you could write your scale_colour_continuous() to override the defaults?

@thomasp85
Copy link
Member

I’m pretty sure that was the reason. I’m all for a better way of allowing users to set a default scale

@clauswilke
Copy link
Member Author

Sure, but the question is: Is it sufficient to look either in the global environment or the parent environment of the print() call, or do we need to capture the local environment that is active when the plot is defined?

In other words, in the reprex in #3994 (comment), would it be fine if the color scale being used upon printing was the brewer scale rather than the viridis scale?

@thomasp85
Copy link
Member

I think if we went ahead and nuked the plot_env we might as well come up with a better solution for setting scale defaults than looking for specifically named functions

@clauswilke
Copy link
Member Author

we might as well come up with a better solution for setting scale defaults

That creates a whole new set of problems, though. (#3962 (comment)) I should probably create an issue just for that. We will need some mechanism to override breaks, labels, expansion, etc. without adding a new scale function.

@combiz
Copy link

combiz commented Jun 9, 2020

After implementing the p$plot_env <- rlang::new_environment() quick fix, the issue seemed resolved. However, after a recent update (or, possibly, a result of adopting the best-practices suggested in https://ggplot2.tidyverse.org/dev/articles/ggplot2-in-packages.html) I'm finding that other components of a ggplot seem to hold copies of environment objects. Consequently, saving a small list of ggplots with saveRDS or qs::qsave writes >100Gb of data to disk in my case. I've used butcher::weigh to better understand object sizes in the ggplot object: -

> butcher::weigh(p)
# A tibble: 137 x 2
   object                          size
   <chr>                          <dbl>
 1 mapping.x                 3639.     
 2 mapping.y                 3639.     
 3 layers2                   3639.     
 4 scales                       0.727  
 5 layers1                      0.574  
 6 coordinates                  0.255  
 7 facet                        0.178  
 8 data.clusters                0.00404
 9 theme.text.margin            0.00076
10 theme.axis.title.x.margin    0.00076
# … with 127 more rows
> 
> p$layers
[[1]]
geom_bar: width = NULL, na.rm = FALSE, orientation = NA
stat_identity: na.rm = FALSE
position_stack 

[[2]]
mapping: ymin = ~mean - se, ymax = ~mean + se 
geom_errorbar: na.rm = FALSE, orientation = NA, width = 0.2
stat_identity: na.rm = FALSE
position_dodge 

> p$mapping$x
<quosure>
expr: ^.data[["clusters"]]
env:  0x561f534a4500
> p$mapping$y
<quosure>
expr: ^mean
env:  0x561f534a4500
> 

It appears that quosures within the mapping and layers elements are responsible. Has there been a recent change?
Each quosure environment contains large objects unrelated to the plot or quosure expression. Any suggestions? A simple ggplot of data from a 10 row / 3 column tibble is now occupying ~12Gb on disk.

@yutannihilation
Copy link
Member

It appears that quosures within the mapping and layers elements are responsible. Has there been a recent change?

I don't think so. Quosures capture environments in nature, and layers are environment itself.

@clauswilke
Let me clarify one thing, is the motivation of this issue to clean up the unused internals, not to enable saving plots on disk, right?

@clauswilke
Copy link
Member Author

Let me clarify one thing, is the motivation of this issue to clean up the unused internals, not to enable saving plots on disk, right?

That's correct. I think saving plots on disk is a fundamentally flawed workflow. I don't think it will ever work properly, and I don't see any good reason to do so in the first place.

@yutannihilation
Copy link
Member

Thanks, I totally agree with you.

@yutannihilation
Copy link
Member

Btw, as #3828 gets merged, I guess overriding scale_colour_discrete on the global environment is now superseded. Maybe we can take these two steps:

  1. Warn when scale_colour_discrete is found on different environment than ggplot2's namespace, and encourage users to use options(ggplot2.discrete.fill).
  2. Just drop plot_env.

@clauswilke
Copy link
Member Author

Just to be clear: I think there are two separate issues here. One is how to globally set color scales. The other is whether any scales (not just color) should be searched in the environment that was active when the plot was defined or in the current global environment. I think the latter choice (search in the current global environment, thus keeping a copy of the environment is not needed) is fine.

@yutannihilation
Copy link
Member

not just color

I see. Sorry for my confusion.

@woodwards
Copy link

Let me clarify one thing, is the motivation of this issue to clean up the unused internals, not to enable saving plots on disk, right?

That's correct. I think saving plots on disk is a fundamentally flawed workflow. I don't think it will ever work properly, and I don't see any good reason to do so in the first place.

One use case is to avoid having to redraw large and slow plots. If I can saveRDS the ggplot list object say then I can see whether it has changed and whether I need to redraw the entire plot. Also I can track this with git and avoid having to add large plots to my git repo. However, if I understand the above discussion, it's not that simple, since the entire R environment is potentially accessed during drawing; changes to the R env may or may not affect the drawn plot; not all the information is contained in the ggplot list.

@bersbersbers
Copy link
Contributor

bersbersbers commented May 30, 2022

I think saving plots on disk is a fundamentally flawed workflow [...] and I don't see any good reason to do so

The use case we handle with this is a Shiny app that allows users to download a plot in a format such that it can easily be "reprinted" later. Different scientific journals may have different requirements regarding dimensions, fonts, etc., and we believe storing an rds file that contains both the plot and the data is good way to have both in one file while being able to modify some aspects later (do correct me if that is not correct, or if there are better alternatives). Other options (bitmaps, PDF files, ...) are all pretty complex to postprocess.

We certainly improved things by replacing plot_env by a new environment, but replacing mapping and layers by something meaningful turns out to be more difficult. Any ideas on this one?

@hadley
Copy link
Member

hadley commented May 31, 2022

@bersbersbers this only works by coincidence and not by design. We make no guarantees about the ggplot2 internals and making it possible replaying saved plots from previous versions of ggplot2 would be a huge amount of work that we do not plan to do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants