Issue with column names in dplyr workflows #35

Closed
gregleleu opened this issue Mar 20, 2018 · 4 comments
@gregleleu

Hi,
I'm having issues with globals in dplyr workflows.
For example, running:

out %<-% {my_df %>% group_by(id_var) %>% left_join(other_df) %>% summarise(n = sum(a))}

globals (used from the future package) identifies "id_var" as a global object to be exported, when it is only a column name from "my_df".

I'm working around it by explicitly listing the globals (e.g. "my_df" and "other_df"), but shouldn't globals be able to pick up on that?

Thanks!

@HenrikBengtsson
Owner

It's actually impossible(*) to know from static code inspection whether a symbol in R is a global variable or not. That is only possible to know at run time. This is regardless of parallel framework - it is the nature of the R language and the non-standard evaluation it supports. (*) By impossible I mean that you need to be able to perfectly emulate how all code is executed, which basically means you need to run it. Here is an example illustrating it:

fcn1 <- function(x) 2 * x
fcn2 <- function(x) as.character(substitute(x))

There is no way to know from static code inspection that fcn1(foo) needs the value of foo whereas fcn2(foo) only cares about the name ("foo").
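A small base-R sketch of that difference, assuming no object named foo exists anywhere on the search path (res1 and res2 are illustrative names only):

```r
fcn1 <- function(x) 2 * x
fcn2 <- function(x) as.character(substitute(x))

# Assumption for this sketch: 'foo' is not defined anywhere
stopifnot(!exists("foo"))

# fcn1() evaluates its argument, so the missing 'foo' triggers an error;
# we capture the error message instead of stopping
res1 <- tryCatch(fcn1(foo), error = function(e) conditionMessage(e))

# fcn2() never evaluates 'foo'; it only looks at the unevaluated symbol
res2 <- fcn2(foo)

print(res1)
print(res2)
```

Both calls look identical to a static code walker; only running them reveals that one needs the object and the other needs just its name.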

So, with this limitation we can only make clever guesses. Also, depending on your objectives, you can take a conservative approach where you're saying "I'm ok with missing a few global variables, but I absolutely do not want to find false globals", or you can take a liberal approach where you're saying "I prefer to find all globals for the price of finding a few false ones".

In parallel processing where you need to make sure your external R worker has access to all objects it needs, you have to take the liberal approach. That is the default setup of the 'globals' package and what the 'future' framework needs.
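To get a feel for the two strategies, here is a hedged sketch using the method argument of globals::findGlobals(), which accepts "conservative" and "liberal" among its values. The expression and the variable names x, y and z are made up for illustration, and which names each method returns may differ between versions of the globals package, so no output is shown:

```r
# Sketch only; assumes the 'globals' package is installed.
# 'x', 'y' and 'z' are hypothetical names used purely for illustration.
expr <- quote({
  x <- 1
  y <- x + z
})

# Conservative: prefer missing a global over reporting a false one
cons <- globals::findGlobals(expr, method = "conservative")

# Liberal: prefer reporting a false global over missing a real one
lib <- globals::findGlobals(expr, method = "liberal")

print(cons)
print(lib)

# Names reported only by the liberal search, if any
print(setdiff(lib, cons))
```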

So, in your code snippet, you will get that:

> expr <- quote({ my_df %>% group_by(id_var) %>% left_join(other_df) %>% summarise(n = sum(a)) })
> expr
{
    my_df %>% group_by(id_var) %>% left_join(other_df) %>% summarise(n = sum(a))
}
> vars <- globals::findGlobals(expr)
> vars
 [1] "{"         "%>%"       "my_df"     "group_by"  "id_var"    "left_join"
 [7] "other_df"  "summarise" "sum"       "a"        

and there you see that it lists id_var as a global variable name.

Now, does this exist or not? That depends completely on what objects exist in your environment, e.g.

> library(dplyr)
> my_df <- data.frame(a = 1:10, id_var = rep(c("a", "b"), each = 5))
> other_df <- data.frame(a = 1:10, b = 10:1)
> globals <- globals::cleanup(globals::globalsOf(expr, mustExist = FALSE))
> str(globals)
List of 7
 $ %>%      :function (lhs, rhs)  
 $ my_df    :'data.frame':	10 obs. of  2 variables:
  ..$ a     : int [1:10] 1 2 3 4 5 6 7 8 9 10
  ..$ id_var: Factor w/ 2 levels "a","b": 1 1 1 1 1 2 2 2 2 2
 $ group_by :function (.data, ..., add = FALSE)  
 $ left_join:function (x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)  
 $ other_df :'data.frame':	10 obs. of  2 variables:
  ..$ a: int [1:10] 1 2 3 4 5 6 7 8 9 10
  ..$ b: int [1:10] 10 9 8 7 6 5 4 3 2 1
 $ summarise:function (.data, ...)  
 $ a        : num 1
[...]

As you see, only some of the objects identified by findGlobals() are actually picked up by globalsOf(). In other words, id_var is not a global variable exported to the external R worker.
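The same comparison can be sketched with base R data only, which avoids the dplyr dependency. This assumes that the globals package is installed and that no object called Species exists in the calling environment; candidates, found and false_positives are illustrative names:

```r
# Sketch only; assumes the 'globals' package is installed and that
# no object named 'Species' exists in the calling environment.
expr <- quote(subset(iris, Species == "setosa"))

# Static inspection: every symbol that *might* be a global
candidates <- globals::findGlobals(expr)

# Lookup: keep only the candidates that can actually be located
found <- globals::globalsOf(expr, mustExist = FALSE)

# Names flagged by static inspection but never located
false_positives <- setdiff(candidates, names(found))
print(false_positives)
```

Here Species plays the role of id_var: it is a column name that static inspection flags, but the lookup step cannot locate it and therefore drops it.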

Now, if you happen to have a global variable id_var sitting in your environment (which might be what you observed, but your report didn't come with much background info), then, yes, it'll be picked up, e.g.

> id_var <- "dummy"
> globals <- globals::cleanup(globals::globalsOf(expr, mustExist = FALSE))
> str(globals)
List of 8
 $ %>%      :function (lhs, rhs)  
 $ my_df    :'data.frame':	10 obs. of  2 variables:
  ..$ a     : int [1:10] 1 2 3 4 5 6 7 8 9 10
  ..$ id_var: Factor w/ 2 levels "a","b": 1 1 1 1 1 2 2 2 2 2
 $ group_by :function (.data, ..., add = FALSE)  
 $ id_var   : chr "dummy"
 $ left_join:function (x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)  
 $ other_df :'data.frame':	10 obs. of  2 variables:
  ..$ a: int [1:10] 1 2 3 4 5 6 7 8 9 10
  ..$ b: int [1:10] 10 9 8 7 6 5 4 3 2 1
 $ summarise:function (.data, ...)  
 $ a        : num 1
[...]

So, in this case, using a liberal, optimistic approach, a future that looks like:

out %<-% { my_df %>% group_by(id_var) %>% left_join(other_df) %>% summarise(n = sum(a)) }

will send id_var (= "dummy") to the worker/environment where the future expression is evaluated. However, this will not be used in the evaluation, just as it won't be used if you evaluate it locally:

out <- { my_df %>% group_by(id_var) %>% left_join(other_df) %>% summarise(n = sum(a)) }

Hope this clarifies.

@gregleleu
Author

Hi, thanks for the detailed answer.
I understand the complexity; so I went back to my code, and as per your code it turns out:

  • it all works fine when run "directly" (case A below),
  • also within a function called "directly" (case B)
  • but not when the function is called through future_lapply (case C)

With an example closer to what I'm actually doing:
Case A -- works fine

my_df_sub %<-% { my_df %>% filter(id_var %in% id_var_vec) }

=> no issue

Case B -- works fine

my_func <- function(id_var_vec) { 
  my_df_sub %<-% { my_df %>% filter(id_var %in% id_var_vec) }
  ## Expensive calculations not shown
  return(out)
}
res <- my_func(c("a"))

my_df is in .GlobalEnv
=> no issue

Case C -- doesn't work

my_func <- function(id_var_vec) { 
  my_df_sub %<-% { my_df %>% filter(id_var %in% id_var_vec) }
  ## Expensive calculations not shown
  return(out)
}
res <- c("a") %>% future_lapply(my_func)
# Or even just res %<-% my_func(c("a"))

=> throws a "simpleError" saying it can't find global object id_var.
I get that this is creating a second level of futures, and I imagine this is where it breaks; I just can't understand why. I fixed it by declaring the right %globals% for the %<-% assignment inside the function.
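For readers hitting the same error, a minimal sketch of that kind of workaround follows. It uses the globals argument of future() rather than the %globals% operator, base subset() in place of dplyr's filter(), and a toy my_df, so it is an approximation of the setup above, not the actual code:

```r
# Sketch only; assumes the 'future' package is installed.
# 'my_df' is toy data standing in for the real data frame.
library(future)
plan(sequential)

my_df <- data.frame(a = 1:10, id_var = rep(c("a", "b"), each = 5))

my_func <- function(id_var_vec) {
  # Name the true globals explicitly, so the column name 'id_var'
  # is never searched for as a global object
  f <- future(
    subset(my_df, id_var %in% id_var_vec),
    globals = c("my_df", "id_var_vec")
  )
  value(f)
}

# Outer future wrapping the function call, as in case C
res <- value(future(my_func("a"), globals = c("my_func", "my_df")))
print(res)
```

Listing globals by name skips the automatic search entirely, so it is then up to the caller to keep that list complete.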

A tiny bit more context:
The computations are quite expensive in the function (geospatial analysis).
Sometimes I run just one "id_var" and I need the "single-id" version (case B) to be parallelized (but it only uses 2-4 of my 8 cores) and sometimes I need to run multiple "id_var"s, hence the 2nd layer of futures.
By the way, in the latter case I understand the deeper layer is run sequentially. My id_var subsets are not well balanced so I tried tweaking the plan to plan(list(tweak(multiprocess, workers = 4), tweak(multiprocess, workers = 2))) but I couldn't make it work: I always got a "bad error" message...

Thanks again !!

@HenrikBengtsson
Owner

Hi, thanks for further troubleshooting and details. I think the following provides a minimal reproducible example using only base R:

library("future")
plan(sequential)

# Non-nested futures
y <- value(future( subset(iris, Species == "setosa") ))
print(head(y, 3L))
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa

# Nested futures
f <- future( value(future( subset(iris, Species == "setosa") )) )
y2 <- value(f)
# Error: Identified global objects via static code inspection (subset(iris, Species == "setosa")).
# Failed to locate global object in the relevant environments: 'Species'


I assume this is the type of error you are seeing(*).

I will investigate; it could be an issue with the 'globals' package but it may be due to code in the 'future' package.

PS. (*) Please avoid "paraphrasing" error messages when you report issues. Instead, cut'n'pasted error messages are much preferred - that'll reduce ambiguity, lower the risk of misunderstandings, and save time when it comes to troubleshooting. Knowing the exact error message often allows you to find the line (or at least lines) in the code where the error occurs.

HenrikBengtsson added a commit to futureverse/future that referenced this issue Mar 24, 2018
…ables to exist,

even when they were was false positives.

H/T HenrikBengtsson/globals#35
@HenrikBengtsson
Owner

This has been fixed in the develop branch of future - it'll be part of the next release, i.e. future 1.8.0. In the meantime, you can work around the problem by doing:

library("future")
options(future.globals.onMissing = "ignore")

I'm closing this one, but if the above doesn't fix it, please report back. Thanks again for the report.
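As a rough check, the option can be combined with the nested base-R example from earlier in the thread. This is a sketch assuming the future package is installed; with future 1.8.0 or later the option should no longer be needed, going by the comment above:

```r
# Sketch only; assumes the 'future' package is installed.
library(future)
plan(sequential)

# Do not fail on globals that static inspection flags but that
# cannot be located (such as the column name 'Species')
options(future.globals.onMissing = "ignore")

# Nested futures: the case that previously failed on 'Species'
f <- future( value(future( subset(iris, Species == "setosa") )) )
y2 <- value(f)
print(head(y2, 3L))
```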
