Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected behaviour from mice.mids when argument newdata is used #313

Closed
gerkovink opened this issue Feb 25, 2021 · 9 comments
Closed

Unexpected behaviour from mice.mids when argument newdata is used #313

gerkovink opened this issue Feb 25, 2021 · 9 comments
Assignees

Comments

@gerkovink
Copy link
Member

gerkovink commented Feb 25, 2021

@stefvanbuuren @prockenschaub

@paulinavonstackelberg and I seem to have stumbled upon some inconsistency in the usage of the newdata argument in mice.mids(). Take e.g.

library(mice)
set.seed(123)
which <- sample(1:25, 10)
train <- nhanes2[which, ]
test <-  nhanes2[-which, ]

# set point of departure with 4 iterations
imp0 <- mice(nhanes2, 
             ignore = 1:25 %in% which, 
             m = 4, 
             print = FALSE)

# add additional iteration, take imp0 as basis
imp1 <- mice.mids(imp0, maxit = 1)
#> 
#>  iter imp variable
#>   6   1  bmi  hyp  chl
#>   6   2  bmi  hyp  chl
#>   6   3  bmi  hyp  chl
#>   6   4  bmi  hyp  chl
imp1b <- mice.mids(imp0, maxit = 1)
#> 
#>  iter imp variable
#>   6   1  bmi  hyp  chl
#>   6   2  bmi  hyp  chl
#>   6   3  bmi  hyp  chl
#>   6   4  bmi  hyp  chl
identical(imp1, imp1b)
#> [1] TRUE

# now feed new data with imp0 as basis
imp2 <- mice.mids(imp0, maxit = 1, newdata = train)
#> 
#>  iter imp variable
#>   6   1  bmi  hyp  chl
#>   6   2  bmi  hyp  chl
#>   6   3  bmi  hyp  chl
#>   6   4  bmi  hyp  chl
imp2b <- mice.mids(imp0, maxit = 1, newdata = train)
#> 
#>  iter imp variable
#>   6   1  bmi  hyp  chl
#>   6   2  bmi  hyp  chl
#>   6   3  bmi  hyp  chl
#>   6   4  bmi  hyp  chl
identical(imp2, imp2b)
#> [1] FALSE

train # training set
#>      age  bmi  hyp chl
#> 15 20-39 29.6   no  NA
#> 19 20-39 35.3   no 218
#> 14 40-59 28.7  yes 204
#> 3  20-39   NA   no 187
#> 10 40-59   NA <NA>  NA
#> 18 40-59 26.3  yes 199
#> 11 20-39   NA <NA>  NA
#> 5  20-39 20.4   no 113
#> 23 20-39 27.5   no 131
#> 6  60-99   NA <NA> 184
complete(imp1)[which, ] # imputed set with ignore
#>      age  bmi hyp chl
#> 15 20-39 29.6  no 206
#> 19 20-39 35.3  no 218
#> 14 40-59 28.7 yes 204
#> 3  20-39 27.2  no 187
#> 10 40-59 33.2  no 206
#> 18 40-59 26.3 yes 199
#> 11 20-39 30.1  no 187
#> 5  20-39 20.4  no 113
#> 23 20-39 27.5  no 131
#> 6  60-99 21.7  no 184
complete(imp2) # imputed by with.mids
#>       age  bmi hyp chl
#> 151 20-39 29.6  no 187
#> 191 20-39 35.3  no 218
#> 141 40-59 28.7 yes 204
#> 31  20-39 33.2  no 187
#> 101 40-59 27.2 yes 187
#> 181 40-59 26.3 yes 199
#> 111 20-39 24.9  no 187
#> 51  20-39 20.4  no 113
#> 231 20-39 27.5  no 131
#> 61  60-99 22.7 yes 184

Created on 2021-02-25 by the reprex package (v1.0.0)

There are two unexpected results and I would like to check if this is deliberate/expected behaviour:

  1. imp1 and imp2 are not identical, yet they start from the same basis and have the same data. One is entered through the data/ignore argument. One though the newdata argument in mice.mids().
  2. imp2 and imp2b are not identical, yet imp1 and imp1b are.

It seems that mice.mids() does not inherit the seed in the mids object imp0. If we fix it manually before running mice.mids() this inconsistency disappears, yet the regular mice.mids object still differs from the one with the newdata argument:

library(mice)
set.seed(123)
which <- sample(1:25, 10)
train <- nhanes2[which, ]
test <-  nhanes2[-which, ]

# set point of departure with 4 iterations
imp0 <- mice(nhanes2, 
             ignore = 1:25 %in% which, 
             m = 4, 
             print = FALSE)

# add additional iteration, take imp0 as basis
set.seed(123)
imp3 <- mice.mids(imp0, maxit = 1)
#> 
#>  iter imp variable
#>   6   1  bmi  hyp  chl
#>   6   2  bmi  hyp  chl
#>   6   3  bmi  hyp  chl
#>   6   4  bmi  hyp  chl
set.seed(123)
imp3b <- mice.mids(imp0, maxit = 1)
#> 
#>  iter imp variable
#>   6   1  bmi  hyp  chl
#>   6   2  bmi  hyp  chl
#>   6   3  bmi  hyp  chl
#>   6   4  bmi  hyp  chl
identical(imp3, imp3b)
#> [1] TRUE

# now feed new data with imp0 as basis
set.seed(123)
imp4 <- mice.mids(imp0, maxit = 1, newdata = train)
#> 
#>  iter imp variable
#>   6   1  bmi  hyp  chl
#>   6   2  bmi  hyp  chl
#>   6   3  bmi  hyp  chl
#>   6   4  bmi  hyp  chl
set.seed(123)
imp4b <- mice.mids(imp0, maxit = 1, newdata = train)
#> 
#>  iter imp variable
#>   6   1  bmi  hyp  chl
#>   6   2  bmi  hyp  chl
#>   6   3  bmi  hyp  chl
#>   6   4  bmi  hyp  chl
identical(imp4, imp4b)
#> [1] TRUE
identical(imp3, imp4)
#> [1] FALSE

Created on 2021-02-25 by the reprex package (v1.0.0)

As it is currently programmed, using the newdata argument currently seems to yield some 'all-bets-are-offapproach to replicability. This makes one wonder if the inferences can then be trusted whennewdata` is used.

@prockenschaub
Copy link
Contributor

@gerkovink definitely not deliberate/expected behaviour. I'll have a look at it and post an update when I find the culprit.

@prockenschaub
Copy link
Contributor

prockenschaub commented Feb 26, 2021

Having looked at this in some more detail, maybe it isn't quite as unexpected. The main problem arises from when random numbers are generated, and how many of them.

Difference in results between imp2 and imp2b

The problem in this case is a call to initialize.imp() that picks starting values for all missing entries in newdata. Currently, this is done at the very beginning of mice.mids() and before .Random.seed is overwritten with obj$lastSeedValue. The initial repacements for the missing values in newdata therefore randomly differ between imp2 and imp2b, leading to differences in the imputation values picked during the iteration.

library(mice)
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind
set.seed(123)
which <- sample(1:25, 10)
train <- nhanes2[which, ]
test <-  nhanes2[-which, ]

# set point of departure with 4 iterations
imp0 <- mice(nhanes2, 
             ignore = 1:25 %in% which, 
             m = 4, 
             print = FALSE)

# merely initializing the newdata already leads to differences
imp2 <- mice.mids(imp0, maxit = 0, newdata = train)
imp2b <- mice.mids(imp0, maxit = 0, newdata = train)
identical(imp2, imp2b)
#> [1] FALSE

Created on 2021-02-26 by the reprex package (v0.3.0)

This also explains why manually setting a seed before calling mice.mids() leads to identical results. This problem is notably absent for imp1 and imp1b in the original example, since the initial values were already chosen when creating imp0 and are therefore only picked once.

Solution: A simple fix for this would be to move the re-setting of .Random.seed in line 98 of mice.mids() to the beginning of the function.

Potential problems: This does come with a downside, however. In a prediction setting, it might be the case that I don't just want to impute values in a single test dataset (as would be done during model evaluation) but instead want to use this prospectively as data come in. By setting the .Random.seed always to the value it had after the last training iteration, future observations that have the same covariate pattern will always have the same imputation, which might be undesirable. There is a way around this by manually changing imp_train$lastSeedValue before calling mice.mids(), so maybe this isn't a big concern.

Differences between imp3 and imp4

This problem goes a bit deeper and might not be quite as easy to fix. The root of the problem (I think) is that the two objects have a different underlying data:

library(mice)
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind
set.seed(123)
which <- sample(1:25, 10)
train <- nhanes2[which, ]
test <-  nhanes2[-which, ]

# set point of departure with 4 iterations
imp0 <- mice(nhanes2, 
             ignore = 1:25 %in% which, 
             m = 4, 
             print = FALSE)

# Without newdata, the number of rows remains the same as during training
imp3 <- mice.mids(imp0, maxit = 0)
nrow(imp3$data)
#> [1] 25

# If newdata is used, it is appended to the data available during training
imp4 <- mice.mids(imp0, maxit = 0, newdata = train)
nrow(imp4$data)
#> [1] 35

Created on 2021-02-26 by the reprex package (v0.3.0)

newdata is appended to the imp object used during training (which is needed to run the imputation iterations), and after imputation only the subset pertaining to newdata is returned. This will even lead to different results when the data is initialized the same. Reason for this is that the two objects generate a different number of random numbers during iterations (for the missing data of 25 rows in imp3 and for the missing data of 35 rows in imp4). Despite having the same test data, they are therefore different objects.

An example from base mice were we see something similar happen is if we shuffle the data. Similar to above, by drawing random numbers in a different order, the results change. Having more rows in imp4 changes when a specific value is drawn and thus also leads to different results.

library(mice)
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
set.seed(123)
which <- sample(1:25, 10)
train <- nhanes2[which, ]
test <-  nhanes2[-which, ]

# Shuffle nhanes2. The data is perturbed, but the 
# total information in it remains the same
nhanes2_shuffled <- nhanes2[sample(nrow(nhanes2)), ]
identical(nhanes2, nhanes2_shuffled)
#> [1] FALSE
setequal(nhanes2, nhanes2_shuffled)
#> [1] TRUE

# set point of departure with 4 iterations
set.seed(321)
imp0 <- mice(nhanes2, 
             ignore = 1:25 %in% which, 
             m = 4, 
             print = FALSE)

# Shuffling the data leads to imputations being drawn in a different order
set.seed(321)
imp0_shuffled <- mice(nhanes2_shuffled, 
                      ignore = rownames(nhanes2_shuffled) %in% as.character(which), 
                      m = 4, 
                      print = FALSE)

setequal(complete(imp0), complete(imp0_shuffled))
#> [1] FALSE

Created on 2021-02-26 by the reprex package (v0.3.0)

Solution: I don't have an easy solution for this. Neither ensuring that the initialisation in mice() and mice.mids() leads to the same results for imp3 and imp4 nor somehow fixing in which order random variables are drawn seems straightforward. Removing all rows with ignore == TRUE if newdata is provided might solve the latter but this would still lead the initialisation issue.

@prockenschaub
Copy link
Contributor

@stefvanbuuren @gerkovink what's your view on this? Would you be happy for me to create a PR, moving the setting of .Random.seed in mice.mids (thus fixing imp2 and imp2b)? Are there any other options that you would want to investigate regarding the differences between imp3 and imp4, or are we satisfied that the results will differ in that scenario?

@stefvanbuuren
Copy link
Member

The original proposal foresees the use of two streams of random numbers (in steps 1 and 5). These weren't implemented (because it's tricky), but perhaps that idea could help to achieve exact reproduction.

But should we do this? As long as the statistical properties do not change, it's probably not hugely important to have exact reproducibility of the imputed values. Do we have a case where "the problem" affects practice and interpretation?

@gerkovink
Copy link
Member Author

But should we do this?

I vote that we don't do this. Everything is exactly reproducible in simulation given a session-dependent seed; this flags simulation purposes. For inferences, I agree with Stef: exact reproducibility of the code can be achieved again by setting a seed for the session and not being able to exactly reproduce the imputations by means of the newdata argument is statistically redundant with respect to the validity of any obtained inferences.

@stefvanbuuren
Copy link
Member

Agree. Perhaps we can re-open if we obtain evidence that the issue affects practice.

@prockenschaub
Copy link
Contributor

This is also my preferred option.

The only change I would advocate for is to move the setting of the .Random.seed further up in the code of mice.mids(), thereby fixing the difference between imp2 and imp2b.

I will further try to finally try and create a short vignette for ingore and newdata, highlighting the (expected) difference between imp3 and imp4, since this is something that will likely continue to confuse many users.

@stefvanbuuren
Copy link
Member

Yes, agree to move up .Random.seed instead of having a free-flying mice() initialisation.

@stefvanbuuren
Copy link
Member

Now merged. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants