Make balance = observations work with strata #364

Merged
merged 5 commits into from
Sep 16, 2022

@mikemahoney218 mikemahoney218 commented Sep 8, 2022

Here's a draft implementation of how stratification might work with balance = "observations" in group_vfold_cv().

library(rsample)
set.seed(11)

group_table <- tibble::tibble(
  group = 1:100,
  outcome = sample(c(rep(0, 89), rep(1, 11)))
)
observation_table <- tibble::tibble(
  group = sample(1:100, 1e5, replace = TRUE),
  observation = 1:1e5
)
sample_data <- dplyr::full_join(group_table, observation_table, by = "group")
group_vfold_cv(sample_data, group, v = 5, strata = outcome, balance = "observations", pool = 0.1)
#> # Group 5-fold cross-validation 
#> # A tibble: 5 × 2
#>   splits                id       
#>   <list>                <chr>    
#> 1 <split [79958/20042]> Resample1
#> 2 <split [80731/19269]> Resample2
#> 3 <split [79898/20102]> Resample3
#> 4 <split [79299/20701]> Resample4
#> 5 <split [80114/19886]> Resample5

# Share of outcome == 1 in each analysis set, over 100 stratified resamples
strata_rate <- purrr::map(
  1:100,
  \(i) {
    rs4 <- group_vfold_cv(sample_data, group, v = 5, strata = outcome, balance = "observations", pool = 0.1)
    purrr::map_dbl(
      rs4$splits,
      function(split) {
        # as.data.frame() on an rsplit returns the analysis set by default
        dat <- as.data.frame(split)$outcome
        mean(dat == 1)
      }
    )
  }
) |> unlist()

# Same calculation without the strata argument, as a baseline
base_rate <- purrr::map(
  1:100,
  \(i) {
    rs4 <- group_vfold_cv(sample_data, group, v = 5, balance = "observations")
    purrr::map_dbl(
      rs4$splits,
      function(split) {
        dat <- as.data.frame(split)$outcome
        mean(dat == 1)
      }
    )
  }
) |> unlist()

# Mean absolute error of strata proportions, versus expected
mean(abs(0.10 - strata_rate))
#> [1] 0.009782044
mean(abs(0.10 - base_rate))
#> [1] 0.01529062

Created on 2022-09-16 with reprex v2.0.2

I'm not 100% convinced this is the right way to do things; let me know if anything here seems wonky.
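
For readers skimming the thread, the general idea of balancing folds by observation count within strata can be sketched in base R. This is not the code in R/vfold.R: `assign_folds()`, its column names, and the toy data are hypothetical, and the greedy "smallest fold next" rule is an assumption about one plausible approach rather than a description of what rsample does.

```r
# Hypothetical helper, not part of rsample.
# data: one row per observation, with columns `group` and `stratum`
# (the stratum must be constant within each group).
assign_folds <- function(data, v = 5) {
  # One row per group: its stratum and its number of observations
  group_info <- aggregate(n ~ group + stratum, data = transform(data, n = 1L), FUN = sum)
  group_info$fold <- NA_integer_

  for (s in unique(group_info$stratum)) {
    idx <- which(group_info$stratum == s)
    idx <- idx[sample.int(length(idx))] # visit groups in random order
    stratum_size <- integer(v)          # observations per fold, this stratum only
    for (i in idx) {
      # Whole groups go to the fold that currently holds the fewest observations
      target <- which.min(stratum_size)
      group_info$fold[i] <- target
      stratum_size[target] <- stratum_size[target] + group_info$n[i]
    }
  }
  group_info
}

set.seed(1)
toy <- data.frame(group = sample(1:40, 2000, replace = TRUE))
toy$stratum <- ifelse(toy$group <= 8, 1, 0)

folds <- assign_folds(toy, v = 5)
# Observations per fold within each stratum
xtabs(n ~ stratum + fold, data = folds)
```

Because each stratum is spread over the folds separately, every fold ends up with a similar number of observations from each stratum, while groups are never split across folds.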

Addresses #317 (I'll say "fixed" when we have balance_prop as well).

@mikemahoney218 mikemahoney218 marked this pull request as ready for review September 8, 2022 17:15

@hfrick hfrick left a comment

Thank you! 🙌 Looks good!

I think the reason it does "slightly worse" in your reprex is because both are unstratified? The call to group_vfold_cv() for calculating strata_rate is missing the strata argument. When we add that, we get a more fitting result.
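
A quick way to see the difference on a single pair of resamples, rather than over 100 repeats, is to compare per-fold outcome rates directly. The sketch below is only illustrative: the toy data (30% positive groups, chosen so both strata are comfortably larger than the default pool threshold) and the `fold_rates()` helper are assumptions, not part of the PR.

```r
library(rsample)
set.seed(11)

# Toy data: 100 groups, outcome constant within each group
groups <- data.frame(group = 1:100, outcome = sample(c(rep(0, 70), rep(1, 30))))
obs <- data.frame(group = sample(1:100, 1e4, replace = TRUE))
toy <- merge(groups, obs, by = "group")

# Hypothetical helper: share of outcome == 1 in each assessment set
fold_rates <- function(rset) {
  vapply(rset$splits, function(s) mean(assessment(s)$outcome == 1), numeric(1))
}

stratified <- group_vfold_cv(toy, group, v = 5, strata = outcome, balance = "observations")
unstratified <- group_vfold_cv(toy, group, v = 5, balance = "observations")

fold_rates(stratified)
fold_rates(unstratified)

# Sanity check: no group appears in both the analysis and assessment sets
no_leak <- vapply(
  stratified$splits,
  function(s) length(intersect(analysis(s)$group, assessment(s)$group)) == 0,
  logical(1)
)
all(no_leak)
```

The stratified rates should generally sit closer to the overall rate than the unstratified ones, although any single draw can vary.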

Co-authored-by: Hannah Frick <hfrick@users.noreply.github.com>

mikemahoney218 commented Sep 16, 2022

> I think the reason it does "slightly worse" in your reprex is because both are unstratified

... 🤦 I edited my text, that is actively embarrassing 😆

@mikemahoney218

@hfrick CI is passing now 😄

It strikes me that balance_prop_strata() could work pretty much exactly the same way as this function. I'm only mostly recovered and my candidacy exam starts Monday, so I am probably not going to send a PR for that over the weekend... but we shall see 😆

hfrick commented Sep 16, 2022

How about you don't send a PR until you're past your exam and fully recovered? 😄 🤝

@hfrick hfrick merged commit c7874be into main Sep 16, 2022
@hfrick hfrick deleted the mike/balance_obs branch September 16, 2022 19:34
github-actions bot commented Oct 1, 2022

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Oct 1, 2022