furrr stumbles over grouped dataframes #28

jschelbert · 2018-08-02T14:45:48Z

I've run into some weird behavior when using dplyr's mutate() on a grouped data frame. Calculations with future_map() take far longer than with purrr::map(). This is cumbersome because my workflow looks like this:
df %>% group_by(some_var) %>% nest() %>% mutate(results = future_map(data, some_expensive_calculation))

I've attached a (hopefully helpful) reprex.
The example from GitHub works as expected:

library(rsample)
#> Loading required package: broom
#> Loading required package: tidyr
#> 
#> Attaching package: 'rsample'
#> The following object is masked from 'package:tidyr':
#> 
#>     fill
data("attrition")
names(attrition)
#>  [1] "Age"                      "Attrition"               
#>  [3] "BusinessTravel"           "DailyRate"               
#>  [5] "Department"               "DistanceFromHome"        
#>  [7] "Education"                "EducationField"          
#>  [9] "EnvironmentSatisfaction"  "Gender"                  
#> [11] "HourlyRate"               "JobInvolvement"          
#> [13] "JobLevel"                 "JobRole"                 
#> [15] "JobSatisfaction"          "MaritalStatus"           
#> [17] "MonthlyIncome"            "MonthlyRate"             
#> [19] "NumCompaniesWorked"       "OverTime"                
#> [21] "PercentSalaryHike"        "PerformanceRating"       
#> [23] "RelationshipSatisfaction" "StockOptionLevel"        
#> [25] "TotalWorkingYears"        "TrainingTimesLastYear"   
#> [27] "WorkLifeBalance"          "YearsAtCompany"          
#> [29] "YearsInCurrentRole"       "YearsSinceLastPromotion" 
#> [31] "YearsWithCurrManager"

set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
rs_obj
#> #  20-fold cross-validation repeated 10 times 
#> # A tibble: 200 x 3
#>    splits       id       id2   
#>    <list>       <chr>    <chr> 
#>  1 <S3: rsplit> Repeat01 Fold01
#>  2 <S3: rsplit> Repeat01 Fold02
#>  3 <S3: rsplit> Repeat01 Fold03
#>  4 <S3: rsplit> Repeat01 Fold04
#>  5 <S3: rsplit> Repeat01 Fold05
#>  6 <S3: rsplit> Repeat01 Fold06
#>  7 <S3: rsplit> Repeat01 Fold07
#>  8 <S3: rsplit> Repeat01 Fold08
#>  9 <S3: rsplit> Repeat01 Fold09
#> 10 <S3: rsplit> Repeat01 Fold10
#> # ... with 190 more rows

mod_form <- as.formula(Attrition ~ JobSatisfaction + Gender + MonthlyIncome)

library(broom)
## splits will be the `rsplit` object with the 90/10 partition
holdout_results <- function(splits, ...) {
    # Fit the model to the 90%
    mod <- glm(..., data = analysis(splits), family = binomial)
    # Save the 10%
    holdout <- assessment(splits)
    # `augment` will save the predictions with the holdout data set
    res <- broom::augment(mod, newdata = holdout)
    # Class predictions on the assessment set from class probs
    lvls <- levels(holdout$Attrition)
    predictions <- factor(ifelse(res$.fitted > 0, lvls[2], lvls[1]),
                          levels = lvls)
    # Calculate whether the prediction was correct
    res$correct <- predictions == holdout$Attrition
    # Return the assessment data set with the additional columns
    res
}


# old example ---------------------------------------------------------------------------------
library(purrr)
library(tictoc)
set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
tic()
rs_obj$results <- map(rs_obj$splits, holdout_results, mod_form)
toc()
#> 2.87 sec elapsed

library(furrr)
#> Loading required package: future
plan(multiprocess, workers = 4)
set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
tic()
rs_obj$results <- future_map(rs_obj$splits, holdout_results, mod_form)
toc()
#> 1.78 sec elapsed

plan(multiprocess, workers = 8)
set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
tic()
rs_obj$results <- future_map(rs_obj$splits, holdout_results, mod_form)
toc()
#> 1.251 sec elapsed

Using dplyr's mutate() to add the new column:

# using dplyr ---------------------------------------------------------------------------------
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
tic()
rs_obj <- rs_obj %>% mutate(results = map(splits, holdout_results, mod_form))
toc()
#> 3.073 sec elapsed

set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
tic()
rs_obj <- rs_obj %>% mutate(results = future_map(splits, holdout_results, mod_form))
toc()
#> 1.088 sec elapsed

set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
tic()
rs_obj <- rs_obj %>% mutate(results = future_map(splits, function(x) holdout_results(x, mod_form)))
toc()
#> 0.793 sec elapsed

Now for the grouped dataframes:

# grouped data.frame --------------------------------------------------------------------------
set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10) %>% mutate(g_id = row_number()) %>% group_by(g_id)
tic()
rs_obj <- rs_obj %>% mutate(results = map(splits, holdout_results, mod_form))
toc()
#> 2.883 sec elapsed

set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10) %>% mutate(g_id = row_number()) %>% group_by(g_id)
tic()
rs_obj <- rs_obj %>% mutate(results = future_map(splits, holdout_results, mod_form))
toc()
#> 12.228 sec elapsed

set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10) %>% mutate(g_id = row_number()) %>% group_by(g_id)
tic()
rs_obj <- rs_obj %>% mutate(results = future_map(splits, function(x) holdout_results(x, mod_form)))
toc()
#> 11.633 sec elapsed

With furrr, the calculation on the grouped data frame takes considerably longer than with purrr (about 12 seconds versus about 2.9 seconds), which I would not expect.

Created on 2018-08-02 by the reprex package (v0.2.0).


DavisVaughan · 2018-08-02T15:04:35Z

This happens because, the way you have it set up, future_map() can't do what it is good at: sharding the splits over the cores of your computer.

With a grouped data frame like this one, which has 200 groups, future_map() is called 200 times, each time with a single split object. With the ungrouped version, in contrast, future_map() is called once with all 200 splits, and it shards them nicely over your computer's workers.

If you don't believe me, run debugonce(future_map) right before you call it on the grouped version, and see what is in the .x variable. It should just be 1 split. This is not good.
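The same effect can be observed without the debugger. The following is a minimal sketch, not taken from the thread: `counting_map()` is a hypothetical helper that wraps purrr::map() and counts its own invocations, and the toy data frame stands in for the resampling object. It only illustrates how often mutate() evaluates the mapped expression with and without groups.

```r
library(dplyr)
library(purrr)

# Hypothetical helper (not part of purrr or furrr): behaves like map(),
# but records how many times it has been called.
counter <- new.env()
counting_map <- function(.x, .f, ...) {
  counter$n <- counter$n + 1
  map(.x, .f, ...)
}

# Toy data: five rows with a list column, standing in for `splits`.
df <- tibble(id = 1:5, x = as.list(1:5))
double_it <- function(v) v * 2

# Ungrouped: mutate() evaluates the expression over the whole column.
counter$n <- 0
res_ungrouped <- df %>% mutate(y = counting_map(x, double_it))
calls_ungrouped <- counter$n

# One group per row: mutate() evaluates the expression separately for
# each group, so every call sees a single element.
counter$n <- 0
res_grouped <- df %>% group_by(id) %>% mutate(y = counting_map(x, double_it))
calls_grouped <- counter$n

c(ungrouped = calls_ungrouped, grouped = calls_grouped)
```

With future_map() in place of the counting helper, each of those per-group calls would have to set up its own futures for a single element, which is where the overhead comes from.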

So, yes! It is going to be slower this way, and hopefully in your real example you can think of another way to do it so that future_map() can actually see all splits at once.
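One way to let future_map() see all elements at once is to drop the grouping before the mutate() call. The sketch below assumes a nest-then-map workflow on mtcars; `fit_one()` and the choice of `plan(multisession, workers = 2)` are illustrative assumptions, not something prescribed in this thread.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(furrr)

# Assumed backend for illustration; pick whatever plan suits your machine.
plan(multisession, workers = 2)

# Hypothetical per-group computation.
fit_one <- function(d) lm(mpg ~ wt, data = d)

nested <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  # Depending on the tidyr version, nest() may or may not keep the grouping,
  # so ungroup() explicitly before mapping.
  ungroup()

# mutate() now evaluates future_map() a single time over the whole list
# column, so the elements can be distributed across the workers.
nested <- nested %>% mutate(model = future_map(data, fit_one))

# Return to sequential evaluation when done.
plan(sequential)
```

Calling ungroup() on a data frame that is already ungrouped is harmless, so the explicit call is a cheap safeguard.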

jschelbert · 2018-08-02T15:19:13Z

Hi @DavisVaughan,
thank you for your extraordinarily fast answer. I already suspected something like that. I also noticed that nest() removes the groups, so, as you suggested, I will not have to use groups at all. Thank you again for the answer.
Maybe a hint on the behavior with grouped dataframes would be nice somewhere in the documentation.

Anyway, keep up the good work.

DavisVaughan · 2018-08-02T15:24:31Z

I've added a new issue for documentation updates. Thanks!

krltrl · 2020-03-06T15:48:10Z

Wouldn't it be a good idea for furrr to ungroup() the dataframe as a first standard step?

It would align better with the tidyverse.

It is quite common to use grouped dataframes with nest() and map() (see broom and dplyr)

Also see nest vignette:

nest() specifies which variables should be nested inside; an alternative is to use dplyr::group_by() to describe which variables should be kept outside.

mtcars_nested <- mtcars %>% 
  group_by(cyl) %>% 
  nest()

mtcars_nested
#> # A tibble: 3 x 2
#> # Groups:   cyl [3]
#>     cyl            data
#>   <dbl> <list<df[,10]>>
#> 1     6        [7 × 10]
#> 2     4       [11 × 10]
#> 3     8       [14 × 10]

I think nesting is easiest to understand in connection to grouped data: each row in the output corresponds to one group in the input.

I found it quite surprising that future_map() is effectively executed sequentially when used on nested and grouped data frames. It took me a while to find out what the problem was.

DavisVaughan · 2020-03-06T16:26:08Z

I don't think so; that would be very different from what purrr does.

wbvguo · 2024-01-12T05:09:48Z

@DavisVaughan it is indeed frustrating to troubleshoot why furrr::future_map() is slow when the input data frame is grouped, especially for a first-time user of the function. Perhaps it would be helpful for future_map() to detect whether the input data frame is grouped before spawning work on the workers? If the data frame is grouped, it could throw a warning to make the user aware of this potential pitfall.
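Until something like that exists, a user-side check can approximate it. This is a minimal sketch under stated assumptions: `warn_if_grouped()` is a hypothetical helper, not a furrr or dplyr function, and it inspects the data frame in the pipeline rather than inside future_map(), which only receives a column and not the data frame itself.

```r
library(dplyr)

# Hypothetical helper (not part of furrr): warn when a data frame is still
# grouped, then pass it through unchanged so it can sit inside a pipeline.
warn_if_grouped <- function(df) {
  if (dplyr::is_grouped_df(df)) {
    warning(
      "Data frame is grouped: mapping functions inside mutate() will be ",
      "called once per group. Consider ungroup() first.",
      call. = FALSE
    )
  }
  df
}

# Ungrouped input passes through silently.
plain <- mtcars %>% warn_if_grouped()

# Grouped input emits the warning and is returned unchanged.
grouped <- mtcars %>% group_by(cyl) %>% warn_if_grouped()
```

Placed directly before the mutate() call, the helper flags the slow path without changing any results.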

jschelbert closed this as completed Aug 2, 2018

DavisVaughan mentioned this issue Aug 2, 2018

Documentation updates: #29

Closed
