Issue with column names in dplyr workflows #35
It's actually impossible(*) to know from static code inspection whether a symbol in R is a global variable or not. That is only possible to know at run time. This is regardless of parallel framework - it is the nature of the R language and the non-standard evaluation it supports.

(*) By impossible I mean that you need to be able to perfectly emulate how all code is executed, which basically means you need to run it. Here is an example illustrating it:

fcn1 <- function(x) 2 * x
fcn2 <- function(x) as.character(substitute(x))

There is no way to know from static code inspection that [...]

So, with this limitation we can only make clever guesses. Also, depending on your objectives, you can take a conservative approach where you're saying "I'm ok with missing a few global variables, but I absolutely do not want to find false globals", or you can take a liberal approach where you're saying "I prefer to find all globals for the price of finding a few false ones". In parallel processing, where you need to make sure your external R worker has access to all objects it needs, you have to take the liberal approach. That is the default setup of the 'globals' package and what the 'future' framework needs. So, in your code snippet, you will get that:

> expr <- quote({ my_df %>% group_by(id_var) %>% left_join(other_df) %>% summarise(n = sum(a)) })
> expr
{
my_df %>% group_by(id_var) %>% left_join(other_df) %>% summarise(n = sum(a))
}
> vars <- globals::findGlobals(expr)
> vars
[1] "{" "%>%" "my_df" "group_by" "id_var" "left_join"
[7] "other_df" "summarise" "sum" "a"

and there you see that it lists [...] Now, does this exist or not? That completely depends on what objects exist in your environment, e.g.

> library(dplyr)
> my_df <- data.frame(a = 1:10, id_var = rep(c("a", "b"), each = 5))
> other_df <- data.frame(a = 1:10, b = 10:1)
> globals <- globals::cleanup(globals::globalsOf(expr, mustExist = FALSE))
> str(globals)
List of 7
$ %>% :function (lhs, rhs)
$ my_df :'data.frame': 10 obs. of 2 variables:
..$ a : int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ id_var: Factor w/ 2 levels "a","b": 1 1 1 1 1 2 2 2 2 2
$ group_by :function (.data, ..., add = FALSE)
$ left_join:function (x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
$ other_df :'data.frame': 10 obs. of 2 variables:
..$ a: int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ b: int [1:10] 10 9 8 7 6 5 4 3 2 1
$ summarise:function (.data, ...)
$ a : num 1
[...]

As you see, only some of the objects identified by [...] Now, which might be what you observed (your claim didn't come with much background info), if you happen to have a global variable [...]

> id_var <- "dummy"
> globals <- globals::cleanup(globals::globalsOf(expr, mustExist = FALSE))
> str(globals)
List of 8
$ %>% :function (lhs, rhs)
$ my_df :'data.frame': 10 obs. of 2 variables:
..$ a : int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ id_var: Factor w/ 2 levels "a","b": 1 1 1 1 1 2 2 2 2 2
$ group_by :function (.data, ..., add = FALSE)
$ id_var : chr "dummy"
$ left_join:function (x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
$ other_df :'data.frame': 10 obs. of 2 variables:
..$ a: int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ b: int [1:10] 10 9 8 7 6 5 4 3 2 1
$ summarise:function (.data, ...)
$ a : num 1
[...]

So, in this case, using a liberal, optimistic approach, a future that looks like:

out %<-% { my_df %>% group_by(id_var) %>% left_join(other_df) %>% summarise(n = sum(a)) }

will send off [...]

out <- { my_df %>% group_by(id_var) %>% left_join(other_df) %>% summarise(n = sum(a)) }

Hope this clarifies.
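To make the fcn1/fcn2 point concrete, here is a minimal sketch using only base R plus the 'codetools' package that ships with R. The environment `env` and the result names are made up for illustration. It calls both functions with an undefined symbol `a`: one call needs `a` at run time and the other does not, yet static inspection reports `a` in both cases.

```r
fcn1 <- function(x) 2 * x                        # evaluates its argument
fcn2 <- function(x) as.character(substitute(x))  # only captures the symbol

# Environment that holds the two functions but no variable 'a'
env <- new.env(parent = emptyenv())
env$fcn1 <- fcn1
env$fcn2 <- fcn2

# fcn1(a) needs 'a' at run time, so this call fails with an error
res1 <- tryCatch(eval(quote(fcn1(a)), envir = env),
                 error = function(e) conditionMessage(e))

# fcn2(a) never evaluates 'a', so this call succeeds
res2 <- eval(quote(fcn2(a)), envir = env)

# Static inspection reports 'a' as a global for both calls
g1 <- codetools::findGlobals(function() fcn1(a))
g2 <- codetools::findGlobals(function() fcn2(a))

print(list(res1 = res1, res2 = res2, g1 = g1, g2 = g2))
```

The two calls look identical to a static analyzer; only running them reveals that `a` matters for `fcn1()` and not for `fcn2()`.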
Hi, thanks for the detailed answer.
With an example closer to what I'm actually doing:
=> no issue

Case B -- works fine

Case C -- doesn't work
=> throws a "simpleError" saying it can't find global object

A tiny bit more context:

Thanks again !!
Hi, thanks for further troubleshooting and details. I think the following provides a minimal reproducible example using only base R:

library("future")
plan(sequential)
# Non-nested futures
y <- value(future( subset(iris, Species == "setosa") ))
print(head(y, 3L))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# Nested futures
f <- future( value(future( subset(iris, Species == "setosa") )) )
y2 <- value(f)
# Error: Identified global objects via static code inspection (subset(iris, Species == "setosa")).
# Failed to locate global object in the relevant environments: 'Species'
I assume this is the type of error you are seeing(*). I will investigate; it could be an issue with the 'globals' package but it may be due to code in the 'future' package.

PS. (*) Please avoid "paraphrasing" error messages when you report issues. Instead, cut'n'pasted error messages are much preferred - that'll reduce ambiguity, lower the risk of misunderstandings, and cut corners when it comes to troubleshooting. Knowing the exact error message often allows you to find the line (or at least lines) in the code where the error occurs.
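For illustration only, the difference between tolerating and requiring a statically identified name can be mimicked in base R. The helper `locate_globals()` and its `on_missing` argument are hypothetical; this is not how 'globals' or 'future' are implemented, just a rough sketch of the two behaviours, using `all.vars()` as a crude stand-in for static inspection.

```r
# Hypothetical helper (not part of 'globals' or 'future'): look up every
# variable name found by static inspection; skip or fail on missing ones.
locate_globals <- function(expr, envir, on_missing = c("ignore", "error")) {
  on_missing <- match.arg(on_missing)
  found <- list()
  for (name in all.vars(expr)) {
    if (exists(name, envir = envir, inherits = FALSE)) {
      found[[name]] <- get(name, envir = envir, inherits = FALSE)
    } else if (on_missing == "error") {
      stop("Failed to locate global object: ", sQuote(name))
    }
  }
  found
}

# 'id_var' is only a column of 'my_df', not an object in 'env'
env <- new.env(parent = emptyenv())
env$my_df <- data.frame(a = 1:4, id_var = c("a", "a", "b", "b"))
expr <- quote(subset(my_df, id_var == "a"))

# Liberal lookup: the false positive 'id_var' is silently dropped
ignored <- names(locate_globals(expr, env, on_missing = "ignore"))

# Strict lookup: the false positive triggers an error
strict <- tryCatch(locate_globals(expr, env, on_missing = "error"),
                   error = function(e) conditionMessage(e))

# If an unrelated object called 'id_var' exists, it gets picked up as well
env$id_var <- "dummy"
with_dummy <- names(locate_globals(expr, env, on_missing = "ignore"))

print(ignored)
print(strict)
print(with_dummy)
```

The strict variant corresponds to the kind of failure seen with the nested futures above, where a column name is treated as a global that must exist.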
…ables to exist, even when they were was false positives. H/T HenrikBengtsson/globals#35
This has been fixed in the develop branch of future - it'll be part of the next release, i.e. future 1.8.0. In the meanwhile, you can work around the problem by doing:

library("future")
options(future.globals.onMissing = "ignore")

I'm closing this one, but if the above doesn't fix it, please report back. Thanks again for the report.
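If you would rather not change the option for the whole session, one possible pattern is to set it temporarily and restore the previous value afterwards. The wrapper `with_missing_globals_ignored()` below is a hypothetical helper, not something provided by 'future'; only the option name comes from the workaround above. The sketch uses base R only and just reads the option back instead of creating futures.

```r
# Hypothetical helper: evaluate 'code' with the workaround option set,
# then restore whatever value the option had before.
with_missing_globals_ignored <- function(code) {
  old <- options(future.globals.onMissing = "ignore")
  on.exit(options(old), add = TRUE)
  force(code)  # 'code' is evaluated lazily, i.e. after the option is set
}

before <- getOption("future.globals.onMissing")
inside <- with_missing_globals_ignored(getOption("future.globals.onMissing"))
after  <- getOption("future.globals.onMissing")

print(inside)
print(identical(before, after))
```

In real use, the code passed to the wrapper would be the code that creates the futures.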
Hi,
I'm having issues with globals in dplyr workflows.
For example, running:
globals (used from the future package) identifies "id_var" as a global object to be exported, when it is only a column name from "my_df".
I'm going around it by explicitly listing globals (e.g. "my_df", and "other_df"), but shouldn't globals be able to pick up on that?
Thanks!