furrr stumbles over grouped dataframes #28

Closed
jschelbert opened this issue Aug 2, 2018 · 6 comments

@jschelbert

I've run into some weird behavior when using dplyr's mutate() on a grouped data frame: calculations with future_map take far longer than with purrr::map. This is cumbersome for my workflow, which resembles
df %>% group_by(some_var) %>% nest() %>% mutate(results = future_map(data, some_expensive_calculation))

I've attached a (hopefully helpful) reprex:
Example from github works as expected.

library(rsample)
#> Loading required package: broom
#> Loading required package: tidyr
#> 
#> Attaching package: 'rsample'
#> The following object is masked from 'package:tidyr':
#> 
#>     fill
data("attrition")
names(attrition)
#>  [1] "Age"                      "Attrition"               
#>  [3] "BusinessTravel"           "DailyRate"               
#>  [5] "Department"               "DistanceFromHome"        
#>  [7] "Education"                "EducationField"          
#>  [9] "EnvironmentSatisfaction"  "Gender"                  
#> [11] "HourlyRate"               "JobInvolvement"          
#> [13] "JobLevel"                 "JobRole"                 
#> [15] "JobSatisfaction"          "MaritalStatus"           
#> [17] "MonthlyIncome"            "MonthlyRate"             
#> [19] "NumCompaniesWorked"       "OverTime"                
#> [21] "PercentSalaryHike"        "PerformanceRating"       
#> [23] "RelationshipSatisfaction" "StockOptionLevel"        
#> [25] "TotalWorkingYears"        "TrainingTimesLastYear"   
#> [27] "WorkLifeBalance"          "YearsAtCompany"          
#> [29] "YearsInCurrentRole"       "YearsSinceLastPromotion" 
#> [31] "YearsWithCurrManager"

set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
rs_obj
#> #  20-fold cross-validation repeated 10 times 
#> # A tibble: 200 x 3
#>    splits       id       id2   
#>    <list>       <chr>    <chr> 
#>  1 <S3: rsplit> Repeat01 Fold01
#>  2 <S3: rsplit> Repeat01 Fold02
#>  3 <S3: rsplit> Repeat01 Fold03
#>  4 <S3: rsplit> Repeat01 Fold04
#>  5 <S3: rsplit> Repeat01 Fold05
#>  6 <S3: rsplit> Repeat01 Fold06
#>  7 <S3: rsplit> Repeat01 Fold07
#>  8 <S3: rsplit> Repeat01 Fold08
#>  9 <S3: rsplit> Repeat01 Fold09
#> 10 <S3: rsplit> Repeat01 Fold10
#> # ... with 190 more rows

mod_form <- as.formula(Attrition ~ JobSatisfaction + Gender + MonthlyIncome)

library(broom)
## splits will be the `rsplit` object with the 90/10 partition
holdout_results <- function(splits, ...) {
    # Fit the model to the 90%
    mod <- glm(..., data = analysis(splits), family = binomial)
    # Save the 10%
    holdout <- assessment(splits)
    # `augment` will save the predictions with the holdout data set
    res <- broom::augment(mod, newdata = holdout)
    # Class predictions on the assessment set from class probs
    lvls <- levels(holdout$Attrition)
    predictions <- factor(ifelse(res$.fitted > 0, lvls[2], lvls[1]),
                          levels = lvls)
    # Calculate whether the prediction was correct
    res$correct <- predictions == holdout$Attrition
    # Return the assessment data set with the additional columns
    res
}


# old example ---------------------------------------------------------------------------------
library(purrr)
library(tictoc)
set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
tic()
rs_obj$results <- map(rs_obj$splits, holdout_results, mod_form)
toc()
#> 2.87 sec elapsed

library(furrr)
#> Loading required package: future
plan(multiprocess, workers = 4)
set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
tic()
rs_obj$results <- future_map(rs_obj$splits, holdout_results, mod_form)
toc()
#> 1.78 sec elapsed

plan(multiprocess, workers = 8)
set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
tic()
rs_obj$results <- future_map(rs_obj$splits, holdout_results, mod_form)
toc()
#> 1.251 sec elapsed

Using dplyr's mutate for adding the new columns.

# using dplyr ---------------------------------------------------------------------------------
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
tic()
rs_obj <- rs_obj %>% mutate(results = map(splits, holdout_results, mod_form))
toc()
#> 3.073 sec elapsed

set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
tic()
rs_obj <- rs_obj %>% mutate(results = future_map(splits, holdout_results, mod_form))
toc()
#> 1.088 sec elapsed

set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
tic()
rs_obj <- rs_obj %>% mutate(results = future_map(splits, function(x) holdout_results(x, mod_form)))
toc()
#> 0.793 sec elapsed

Now for the grouped dataframes:

# grouped data.frame --------------------------------------------------------------------------
set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10) %>% mutate(g_id = row_number()) %>% group_by(g_id)
tic()
rs_obj <- rs_obj %>% mutate(results = map(splits, holdout_results, mod_form))
toc()
#> 2.883 sec elapsed

set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10) %>% mutate(g_id = row_number()) %>% group_by(g_id)
tic()
rs_obj <- rs_obj %>% mutate(results = future_map(splits, holdout_results, mod_form))
toc()
#> 12.228 sec elapsed

set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10) %>% mutate(g_id = row_number()) %>% group_by(g_id)
tic()
rs_obj <- rs_obj %>% mutate(results = future_map(splits, function(x) holdout_results(x, mod_form)))
toc()
#> 11.633 sec elapsed

The calculation with furrr on the grouped data frame takes considerably longer than with purrr (about 12 seconds versus about 2.9 seconds), which I would not expect.

Created on 2018-08-02 by the reprex package (v0.2.0).

@DavisVaughan
Collaborator

This is because, the way you have it set up, future_map() can't do what it is good at: sharding the splits over the cores of your computer.

With a grouped data frame like this one, with 200 groups, future_map() is called 200 times, each time with 1 split object. By contrast, with the ungrouped version future_map() is called 1 time, with 200 splits, and it nicely shards them over your computer's workers.

If you don't believe me, run debugonce(future_map) right before you call it on the grouped version, and see what is in the .x variable. It should just be 1 split. This is not good.

So, yes! It is going to be slower this way, and hopefully in your real example you can think of another way to do it so that future_map() can actually see all splits at once.
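The per-group behavior described above can be observed without any parallel backend. The following is a minimal sketch, assuming a small made-up tibble and a hypothetical wrapper `counting_map()` (not part of purrr or furrr) that records how many elements it receives on each call:

```r
library(dplyr)
library(purrr)

# Hypothetical helper for illustration only: records the length of `.x`
# on every invocation, then delegates to purrr::map().
calls <- integer(0)
counting_map <- function(.x, .f, ...) {
  calls <<- c(calls, length(.x))
  map(.x, .f, ...)
}

# Made-up data: one row per group, mirroring the g_id = row_number() setup.
df <- tibble(g_id = 1:5, x = as.list(1:5))

# Grouped: mutate() evaluates the expression once per group.
calls <- integer(0)
res_grouped <- df %>%
  group_by(g_id) %>%
  mutate(y = counting_map(x, function(v) v * 2))
grouped_calls <- calls

# Ungrouped: mutate() evaluates the expression once over the whole column.
calls <- integer(0)
res_ungrouped <- df %>%
  mutate(y = counting_map(x, function(v) v * 2))
ungrouped_calls <- calls

print(grouped_calls)    # one entry per call, each the size of a single group
print(ungrouped_calls)  # a single entry covering all rows
```

Swapping `future_map()` in for `counting_map()` follows the same call pattern, so in the grouped case each call has only one element to distribute across workers.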

@jschelbert
Author

Hi @DavisVaughan,
thank you for your extraordinarily fast answer. I already suspected something like that. I also noticed that nest() removes groups, thus, as you already suggested, I will not have to use groups... Thank you again for the answer.
Maybe a hint on the behavior with grouped dataframes would be nice somewhere in the documentation.

Anyway, keep up the good work.

@DavisVaughan
Collaborator

I've added a new issue for documentation updates. Thanks!

@krltrl

krltrl commented Mar 6, 2020

Wouldn't it be a good idea for furrr to ungroup() the dataframe as a first standard step?

It would align better with the tidyverse.

It is quite common to use grouped dataframes with nest() and map() (see broom and dplyr)

Also see nest vignette:

nest() specifies which variables should be nested inside; an alternative is to use dplyr::group_by() to describe which variables should be kept outside.

mtcars_nested <- mtcars %>% 
  group_by(cyl) %>% 
  nest()

mtcars_nested
#> # A tibble: 3 x 2
#> # Groups:   cyl [3]
#>     cyl            data
#>   <dbl> <list<df[,10]>>
#> 1     6        [7 × 10]
#> 2     4       [11 × 10]
#> 3     8       [14 × 10]

I think nesting is easiest to understand in connection to grouped data: each row in the output corresponds to one group in the input.

I found it quite surprising that future_map() is effectively executed sequentially when used on nested and grouped data frames. It took me a while to find out what the problem was.
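For the group_by() plus nest() workflow discussed in this thread, one possible workaround is an explicit ungroup() before the mutate() call, so that future_map() receives the whole list column in one call. This is a minimal sketch, assuming the mtcars example from the vignette quote, an illustrative `lm(mpg ~ wt)` model, and a `multisession` plan with 2 workers (all three are choices made here, not taken from the thread):

```r
library(dplyr)
library(tidyr)
library(furrr)

# Assumed backend for illustration; pick whatever plan suits your machine.
plan(multisession, workers = 2)

fits <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  # Drop any grouping that survives nest(), so the next mutate()
  # evaluates future_map() once over all rows instead of once per group.
  ungroup() %>%
  mutate(model = future_map(data, function(d) lm(mpg ~ wt, data = d)))

# Return to sequential processing and release the workers.
plan(sequential)
```

The ungroup() call is harmless if nest() has already dropped the grouping, so it can be left in regardless of the tidyr version in use.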

@DavisVaughan
Collaborator

I don't think so; that would be very different from what purrr does.

@wbvguo

wbvguo commented Jan 12, 2024

@DavisVaughan it is indeed frustrating to troubleshoot why furrr::future_map() is slow when the input data frame is grouped, especially for someone using the function for the first time. Perhaps it would be helpful for future_map() to detect whether the input data frame is grouped before dispatching to the workers? If the data frame is grouped, it could throw a warning to make the user aware of this potential pitfall.
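Inside mutate(), future_map() is handed only the column slice for the current group, not the data frame itself, so a check like the one suggested above is easier to do on the user side. The following is a minimal sketch of such a guard; `warn_if_grouped()` is a hypothetical helper written for this illustration and is not part of furrr or dplyr:

```r
library(dplyr)

# Hypothetical helper (an assumption, not an existing furrr feature):
# warns when a grouped data frame is about to enter a pipeline step,
# then returns the data unchanged so it can sit inside a pipe.
warn_if_grouped <- function(df) {
  if (is_grouped_df(df)) {
    warning(
      "Data frame is grouped: a map call inside mutate() will run once per group. ",
      "Consider ungroup() first.",
      call. = FALSE
    )
  }
  df
}

# Usage: place the guard right before the mutate() that calls future_map().
checked <- mtcars %>%
  group_by(cyl) %>%
  ungroup() %>%
  warn_if_grouped()
```

Removing the ungroup() line from the usage example would make the guard emit its warning.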
