# Imputation (Missing Data)
## Introduction to Missing Data
Missing data is a common problem in statistical analyses and data science, impacting the quality and reliability of insights derived from datasets. One widely used approach to address this issue is **imputation**, where missing data is replaced with *reasonable estimates*.
### Types of Imputation
Imputation can be categorized into:
1. **Unit Imputation**: Replacing an entire missing observation (i.e., all features for a single data point are missing).
2. **Item Imputation**: Replacing missing values for specific variables (features) within a dataset.
While imputation offers a means to make use of incomplete datasets, it has historically been viewed skeptically. This skepticism arises from:
1. Frequent **misapplication** of imputation techniques, which can introduce significant **bias** to estimates.
2. Limited **applicability**, as imputation works well only under certain assumptions about the missing data mechanism and research objectives.
**Biases in imputation** can arise from various factors, including:
- **Imputation method**: The chosen method can influence the results and introduce biases.
- **Missing data mechanism**: The nature of the missing data---whether it is [Missing Completely at Random (MCAR)](#missing-completely-at-random-mcar) or [Missing at Random (MAR)](#missing-at-random-mar)---affects the accuracy of imputation.
- **Proportion of missing data**: The amount of missing data significantly impacts the reliability of the imputation.
- **Available information in the dataset**: Limited information reduces the robustness of the imputed values.
### When and Why to Use Imputation
The appropriateness of imputation depends on the nature of the missing data and the research goal:
- **Missing Data in the Outcome Variable** ($y$): Imputation in such cases is generally problematic, as it can distort statistical models and lead to misleading conclusions. For example, imputing outcomes in regression or classification problems can alter the underlying relationship between the dependent and independent variables.
- **Missing Data in Predictive Variables** ($x$): Imputation is more commonly applied here, especially for **non-random missing data**. Properly handled, imputation can enable the use of incomplete datasets while minimizing bias.
#### Objectives of Imputation
The utility of imputation methods differs substantially depending on whether the goal of the analysis is *inference/explanation* or *prediction*. Each goal has distinct priorities and tolerances for bias, variance, and assumptions about the missing data mechanism:
##### Inference/Explanation
In causal inference or explanatory analyses, the primary objective is to ensure valid statistical inference, emphasizing unbiased estimation of parameters and accurate representation of uncertainty. The treatment of missing data must align closely with the assumptions about the mechanism behind the missing data---whether it is Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR):
- **Bias Sensitivity:** Inference analyses require that imputed data preserve the integrity of the relationships among variables. Poorly executed imputation can introduce bias, even when it addresses missingness superficially.
- **Variance and Confidence Intervals:** For inference, the quality of the standard errors, confidence intervals, and test statistics is critical. Naive imputation methods (e.g., mean imputation) often fail to appropriately reflect the uncertainty due to missingness, leading to overconfidence in parameter estimates.
- **Mechanism Considerations:** Imputation methods, such as multiple imputation (MI), attempt to generate values consistent with the observed data distribution while accounting for missing data uncertainty. However, MI's performance depends heavily on the validity of the MAR assumption. If the missingness mechanism is MNAR and not addressed adequately, the imputed data could yield biased parameter estimates, undermining the purpose of inference.
##### Prediction
In predictive modeling, the primary goal is to maximize model accuracy (e.g., minimizing mean squared error for continuous outcomes or maximizing classification accuracy). Here, the focus shifts to optimizing predictive performance rather than ensuring unbiased parameter estimates:
- **Loss of Information:** Missing data reduces the amount of usable information in a dataset. Imputation allows the model to leverage all available data, rather than excluding incomplete cases via listwise deletion, which can significantly reduce sample size and model performance.
- **Impact on Model Fit:** In predictive contexts, imputation can reduce standard errors of the predictions and stabilize model coefficients by incorporating plausible estimates for missing values.
- **Flexibility with Mechanism:** Predictive models are less sensitive to the missing data mechanism than inferential models, as long as the imputed values help reduce variability and align with patterns in the observed data. Methods like K-Nearest Neighbors (KNN), iterative imputation, or even machine learning models (e.g., random forests for imputation) can be valuable, regardless of strict adherence to MAR or MCAR assumptions.
- **Trade-offs:** Overimputation, where too much noise or complexity is introduced in the imputation process, can harm prediction by introducing artifacts that degrade model generalizability.
##### Key Takeaways
The usefulness of imputation depends on whether the goal of the analysis is **inference** or **prediction**:
- **Inference/Explanation:** The primary concern is valid statistical inference, where biased estimates are unacceptable. Imputation is often of limited value for this purpose, as it may not address the underlying missing data mechanism appropriately [@Rubin_1996].
- **Prediction:** Imputation can be more useful in predictive modeling, as it reduces the loss of information from incomplete cases. By leveraging observed data, imputation can lower standard errors and improve model accuracy.
------------------------------------------------------------------------
### Importance of Missing Data Treatment in Statistical Modeling
Proper handling of missing data ensures:
- **Unbiased Estimates:** Avoiding distortions in parameter estimates.
- **Accurate Standard Errors:** Ensuring valid hypothesis testing and confidence intervals.
- **Adequate Statistical Power:** Maximizing the use of available data.
Ignoring or mishandling missing data can lead to:
1. **Bias:** Systematic errors in parameter estimates, especially under MAR or MNAR mechanisms.
2. **Loss of Power:** Reduced sample size leads to larger standard errors and weaker statistical significance.
3. **Misleading Conclusions:** Over-simplistic imputation methods (e.g., mean substitution) can distort relationships among variables.
------------------------------------------------------------------------
### Prevalence of Missing Data Across Domains
Missing data affects virtually all fields:
- **Business:** Non-responses in customer surveys, incomplete sales records, and transactional errors.
- **Healthcare:** Missing data in electronic health records (EHRs) due to incomplete patient histories or inconsistent data entry.
- **Social Sciences:** Non-responses or partial responses in large-scale surveys, leading to biased conclusions.
------------------------------------------------------------------------
### Practical Considerations for Imputation
- **Diagnostic Checks:** Always examine the patterns and mechanisms of missing data before applying imputation ([Diagnosing the Missing Data Mechanism]).
- **Model Selection:** Align the imputation method with the missing data mechanism and research goal.
- **Validation:** Assess the impact of imputation on results through sensitivity analyses or cross-validation.
------------------------------------------------------------------------
## Theoretical Foundations of Missing Data
### Definition and Classification of Missing Data {#definition-and-classification-of-missing-data}
Missing data refers to the absence of values for some variables in a dataset. The mechanisms underlying missingness significantly impact the validity of statistical analyses and the choice of handling methods. These mechanisms are classified into three categories:
1. [Missing Completely at Random (MCAR)]: Missingness occurs entirely by chance and is unrelated to any observed or unobserved variables. In this case, the likelihood of a value being missing is the same for all observations. For example, if survey respondents fail to answer a question because of an unrelated technical glitch, the data may be classified as MCAR.
2. [Missing at Random (MAR)]: Missingness is systematically related to observed variables but not to the missing values themselves. This means the probability of missing data can be explained using other observed information in the dataset. For instance, if younger participants are more likely to skip a specific survey question, but age is recorded, the missingness is MAR.
3. [Missing Not at Random (MNAR)](#missing-not-at-random-mnar): Missingness depends on unobserved variables or the missing values themselves.
#### Missing Completely at Random (MCAR) {#missing-completely-at-random-mcar}
MCAR occurs when the probability of missingness is entirely random and unrelated to either observed or unobserved variables. Under this mechanism, missing data do not introduce bias in parameter estimates when ignored, although statistical efficiency is reduced due to the smaller sample size.
**Mathematical Definition:** The missingness is independent of all data, both observed and unobserved:
$$
P(Y_{\text{missing}} | Y, X) = P(Y_{\text{missing}})
$$
**Characteristics of MCAR:**
- Missingness is completely unrelated to both observed and unobserved data.
- Analyses remain unbiased even if missing data are ignored, though they may lack efficiency due to reduced sample size.
- The missing data points represent a random subset of the overall data.
**Examples:**
- A sensor randomly fails at specific time points, unrelated to environmental or operational conditions.
- Survey participants randomly omit responses to certain questions without any systematic pattern.
**Methods for Testing MCAR:**
1. **Little's MCAR Test:** A formal statistical test to assess whether data are MCAR. A significant result suggests deviation from MCAR.
2. **Mean Comparison Tests:**
- T-tests or similar approaches compare observed and missing data groups on other variables. Significant differences indicate potential bias.
- Failure to reject the null hypothesis of no difference does not confirm MCAR but suggests consistency with the MCAR assumption.
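To make these checks concrete, the following sketch simulates a dataset in which missingness in `y` is generated completely at random; the simulated variables are assumptions made purely for illustration. Little's test via `naniar::mcar_test()` and a t-test on a missingness indicator should then show no evidence against MCAR.
```{r}
library(naniar)

# Simulate data and delete 20% of y completely at random (independent of x and y)
set.seed(42)
n <- 500
sim <- data.frame(x = rnorm(n), y = rnorm(n))
sim$y[rbinom(n, 1, 0.2) == 1] <- NA

# Little's MCAR test: a non-significant result is consistent with MCAR
mcar_test(sim)

# Mean comparison: does x differ between cases with observed vs. missing y?
t.test(x ~ is.na(y), data = sim)
```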
**Handling MCAR:**
Since MCAR data introduce no bias, they can be handled using the following techniques:
1. **Complete Case Analysis (Listwise Deletion):**
- Analyses are performed only on cases with complete data. While unbiased under MCAR, this method reduces sample size and efficiency.
2. **Universal Singular Value Thresholding (USVT):**
- This technique is effective for MCAR data recovery but can only recover the mean structure, not the entire true distribution [@chatterjee2015matrix].
3. **SoftImpute:**
- A matrix completion method useful for some missing data problems but less effective when missingness is not MCAR [@hastie2015matrix].
4. **Synthetic Nearest Neighbor Imputation:**
- A robust method for imputing missing data. While primarily designed for MCAR, it can also handle certain cases of missing not at random (MNAR) [@agarwal2023causal]. Available on GitHub: [syntheticNN](https://github.com/deshen24/syntheticNN).
**Notes:**
- The "missingness" on one variable can be correlated with the "missingness" on another variable without violating the MCAR assumption.
- Absence of evidence for bias (e.g., failing to reject a t-test) does not confirm that the data are MCAR.
#### Missing at Random (MAR) {#missing-at-random-mar}
Missing at Random (MAR) occurs when missingness depends on observed variables but not the missing values themselves. This mechanism assumes that observed data provide sufficient information to explain the missingness. In other words, there is a systematic relationship between the propensity of missing values and the observed data, but not the missing data.
**Mathematical Definition**:
The probability of missingness is conditional only on observed data:
$$
P(Y_{\text{missing}} | Y, X) = P(Y_{\text{missing}} | X)
$$
This implies that whether an observation is missing is unrelated to the missing values themselves but is related to the observed values of other variables.
**Characteristics of MAR**:
- Missingness is systematically related to observed variables.
- The propensity for a data point to be missing is not related to the missing data but is related to some of the observed data.
- Analyses must account for observed data to mitigate bias.
**Examples**:
- Women are less likely to disclose their weight, but their gender is recorded. In this case, weight is MAR.
- Missing income data is correlated with education, which is observed. For example, individuals with higher education levels might be less likely to reveal their income.
**Challenges in MAR**:
- MAR is a weaker assumption than Missing Completely at Random (MCAR): MCAR implies MAR, but not the reverse.
- It is impossible to directly test for MAR. Evidence for MAR relies on domain expertise and indirect statistical checks rather than direct tests.
**Handling MAR**:
Common methods for handling MAR include:
- **Multiple Imputation by Chained Equations (MICE):** Iteratively imputes missing values based on observed data.
- **Maximum Likelihood Estimation:** Estimates model parameters directly while accounting for MAR assumptions.
- **Regression-Based Imputation:** Predicts missing values using observed covariates.
These methods assume that observed variables fully explain the missingness. Effective handling of MAR requires careful modeling and often domain-specific knowledge to validate the assumptions underlying the analysis.
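As a brief illustration of the first of these options, the sketch below imputes the built-in `airquality` data with the `mice` defaults (an assumed example, not a prescription) and pools a regression fit across the imputed datasets; multiple imputation is treated in more detail later in this chapter.
```{r}
library(mice)

# Multiple imputation by chained equations under a MAR assumption
imp <- mice(airquality, m = 5, seed = 2024, printFlag = FALSE)

# Fit the analysis model on each imputed dataset and pool with Rubin's rules
fit <- with(imp, lm(Ozone ~ Wind + Temp))
summary(pool(fit))
```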
#### Missing Not at Random (MNAR) {#missing-not-at-random-mnar}
Missing Not at Random (MNAR) is the most complex missing data mechanism. Here, missingness depends on unobserved variables or the values of the missing data themselves. This makes MNAR particularly challenging, as ignoring this dependency introduces significant bias in analyses.
**Mathematical Definition**:
The probability of missingness depends on the missing values:
$$
P(Y_{\text{missing}} | Y, X) \neq P(Y_{\text{missing}} | X)
$$
**Characteristics of MNAR**:
- Missingness cannot be fully explained by observed data.
- The cause of missingness is directly related to the unobserved values.
- Ignoring MNAR introduces significant bias in parameter estimates, often leading to invalid conclusions.
**Examples**:
- High-income individuals are less likely to disclose their income, and income itself is unobserved.
- Patients with severe symptoms drop out of a clinical study, leaving their health outcomes unrecorded.
**Challenges in MNAR**:
- MNAR is the most difficult missingness mechanism to address because the missing data mechanism must be explicitly modeled.
- Identifying MNAR often requires domain knowledge and auxiliary information beyond the observed dataset.
**Handling MNAR**:
MNAR requires explicit modeling of the missingness mechanism. Common approaches include:
- **Heckman Selection Models:** These models explicitly account for the selection process leading to missing data, adjusting for potential bias [@Heckman_1976].
- **Instrumental Variables:** Variables predictive of missingness but unrelated to the outcome can be used to mitigate bias [@sun2018semiparametric; @tchetgen2017general].
- **Pattern-Mixture Models**: These models separate the data into groups (patterns) based on missingness and model each group separately. They are particularly useful when the relationship between missingness and missing values is complex.
- **Sensitivity Analysis:** Examines how conclusions change under different assumptions about the missing data mechanism.
- **Use of Auxiliary Data**: Auxiliary data refers to external data sources or variables that can help explain the missingness mechanism.
- **Surrogate Variables**: Adding variables that correlate with missing data can improve imputation accuracy and mitigate the MNAR challenge.
- **Linking External Datasets**: Merging datasets from different sources can provide additional context or predictors for missingness.
- **Applications in Business**: In marketing, customer demographics or transaction histories often serve as auxiliary data to predict missing responses in surveys.
Additionally, data collection strategies, such as follow-up surveys or targeted sampling, can help mitigate MNAR effects by collecting information that directly addresses the missingness mechanism. However, such approaches can be resource-intensive and require careful planning.
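As one concrete illustration of explicitly modeling the missingness mechanism, the sketch below fits a Heckman-type selection model with the `sampleSelection` package on simulated income data. The data-generating process, variable names, and the use of `contacted` as an instrument are all assumptions made for this example.
```{r, eval = FALSE}
library(sampleSelection)

# Simulated MNAR example: income is only observed when a person responds,
# and the propensity to respond depends on income itself
set.seed(123)
n <- 1000
education <- rnorm(n)                  # observed covariate
contacted <- rnorm(n)                  # instrument: affects response, not income
income    <- 1 + 0.8 * education + rnorm(n)
respond   <- as.numeric(0.5 * contacted - 0.5 * income + rnorm(n) > 0)
income_obs <- ifelse(respond == 1, income, NA)
dat <- data.frame(income_obs, education, contacted, respond)

# Heckman selection model: selection equation plus outcome equation
sel_fit <- selection(
  selection = respond ~ education + contacted,
  outcome   = income_obs ~ education,
  data      = dat,
  method    = "ml"
)
summary(sel_fit)
```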
### Missing Data Mechanisms {#missing-data-mechanisms}
| **Mechanism** | **Missingness Depends On** | **Implications** | **Examples** |
|------------------|------------------|-------------------|------------------|
| **MCAR** | Neither observed nor missing data | No bias; simplest to handle; decreases efficiency due to data loss. | Random sensor failure. |
| **MAR** | Observed data only | Requires observed data to explain missingness; common assumption in imputation methods. | Gender-based missingness of weight. |
| **MNAR** | Missing data itself or unobserved variables | Requires explicit modeling of the missingness mechanism; significant bias if ignored. | High-income individuals not disclosing income. |
### Relationship Between Mechanisms and Ignorability {#relationship-between-mechanisms-and-ignorability}
The concept of ignorability is central to determining whether the missingness process must be explicitly modeled. Ignorability impacts the choice of methods for handling missing data and whether the missing data mechanism can be safely disregarded or must be explicitly accounted for.
#### Ignorable Missing Data
Missing data is **ignorable** under the following conditions:
1. The missing data mechanism is [MAR](#missing-at-random-mar) or [MCAR](#missing-completely-at-random-mcar).
2. The parameters governing the missing data process are unrelated to the parameters of interest in the analysis.
In cases of ignorable missing data, there is no need to model the missingness mechanism explicitly unless you aim to improve the efficiency or precision of parameter estimates. Common imputation techniques, such as multiple imputation or maximum likelihood estimation, rely on the assumption of ignorability to produce unbiased parameter estimates.
**Practical Considerations for Ignorable Missingness**
Even though ignorable mechanisms simplify analysis, researchers must rigorously assess whether the missingness mechanism meets the MAR or MCAR criteria. Violations can lead to biased results, even if unintentionally overlooked.
For example: A survey on income may assume MAR if missingness is associated with respondent age (observed variable) but not income itself (unobserved variable). However, if income directly influences nonresponse, the assumption of MAR is violated.
------------------------------------------------------------------------
#### Non-Ignorable Missing Data {#non-ignorable}
Missing data is **non-ignorable** when:
1. The missingness mechanism depends on the values of the missing data themselves or on unobserved variables.
2. The missing data mechanism is related to the parameters of interest, resulting in bias if the mechanism is not modeled explicitly.
This type of missingness (i.e., [Missing Not at Random (MNAR)](#missing-not-at-random-mnar)) requires modeling the missing data mechanism directly to produce unbiased estimates.
**Characteristics of Non-Ignorable Missingness**
- **Dependence on Missing Values**: The likelihood of missingness is associated with the missing values themselves.
- Example: In a study on health, individuals with more severe conditions are more likely to drop out, leading to an underrepresentation of the sickest individuals in the data.
- **Bias in Complete Case Analysis**: Analyses based solely on complete cases can lead to substantial bias.
- Example: In income surveys, if wealthier individuals are less likely to report their income, the estimated mean income will be systematically lower than the true population mean.
- **Need for Explicit Modeling**: To address MNAR, the analyst must model the missing data mechanism. This often involves specifying relationships between observed data, missing data, and the missingness process itself.
#### Implications of Non-Ignorable Missingness
Non-ignorable mechanisms are often associated with sensitive or personal data:
- **Examples**:
- Individuals with lower education levels may omit their education information.
- Participants with controversial or stigmatized health conditions might opt out of surveys entirely.
- **Impact on Policy and Decision-Making**:
- Biases introduced by MNAR can have serious consequences for policymaking, such as underestimating the prevalence of poverty or mischaracterizing population health needs.
By explicitly addressing non-ignorable missingness, researchers can mitigate biases and ensure that findings accurately reflect the underlying population.
------------------------------------------------------------------------
## Diagnosing the Missing Data Mechanism
Understanding the mechanism behind missing data is critical to choosing the appropriate methods for handling it. The three main mechanisms for missing data are **MCAR (Missing Completely at Random)**, **MAR (Missing at Random)**, and **MNAR (Missing Not at Random)**. This section discusses methods for diagnosing these mechanisms, including descriptive and inferential approaches.
------------------------------------------------------------------------
### Descriptive Methods
#### Visualizing Missing Data Patterns
Visualization tools are essential for detecting patterns in missing data. Heatmaps and correlation plots can help identify systematic missingness and provide insights into the underlying mechanism.
```{r}
# Example: Visualizing missing data
library(Amelia)
missmap(
airquality,
main = "Missing Data Heatmap",
col = c("yellow", "black"),
legend = TRUE
)
```
- **Heatmaps**: Highlight where missingness occurs in a dataset.
- **Correlation Plots**: Show relationships between missingness indicators of different variables.
**Exploring Univariate and Multivariate Missingness**
- **Univariate Analysis**: Calculate the proportion of missing data for each variable.
```{r}
# Example: Proportion of missing values
missing_proportions <- colSums(is.na(airquality)) / nrow(airquality)
print(missing_proportions)
```
- **Multivariate Analysis**: Examine whether missingness in one variable is related to others. This can be visualized using scatterplots of observed vs. missing values.
```{r}
# Example: Missingness correlation
library(naniar)
vis_miss(airquality)
gg_miss_upset(airquality) # Displays a missingness upset plot
```
### Statistical Tests for Missing Data Mechanisms
#### Diagnosing MCAR: Little's Test
Little's test is a formal hypothesis test for whether the missing data mechanism is **MCAR**. It compares the means of the observed variables across the distinct missing-data patterns; under the null hypothesis that the data are MCAR, these pattern-specific means should not deviate systematically from the overall maximum likelihood estimates. The test statistic is
$$
d^2 = \sum_{j=1}^{J} n_j (\bar{y}_j - \hat{\mu}_j)' \hat{\Sigma}_j^{-1} (\bar{y}_j - \hat{\mu}_j)
$$
Where:
- $J$ = number of distinct missing-data patterns and $n_j$ = number of cases in pattern $j$
- $\bar{y}_j$ = mean of the variables observed in pattern $j$
- $\hat{\mu}_j$, $\hat{\Sigma}_j$ = maximum likelihood estimates of the mean vector and covariance matrix, restricted to the variables observed in pattern $j$
Under MCAR, $d^2$ is asymptotically $\chi^2$-distributed; a significant result suggests a deviation from MCAR.
```{r}
# Example: Little's test
naniar::mcar_test(airquality)
misty::na.test(airquality)
```
#### Diagnosing MCAR via Dummy Variables
Creating a binary indicator for missingness allows you to test whether the presence of missing data is related to observed data. For instance:
1. Create a dummy variable:
- 1 = Missing
- 0 = Observed
2. Conduct a chi-square test or t-test:
- Chi-square: Compare proportions of missingness across groups.
- T-test: Compare the means of other observed variables between cases with and without missing data.
```{r}
# Example: Chi-square test
airquality$missing_var <- as.factor(ifelse(is.na(airquality$Ozone), 1, 0))
# Across groups of months
table(airquality$missing_var, airquality$Month)
chisq.test(table(airquality$missing_var, airquality$Month))
# Example: T-test (of other variable)
t.test(Wind ~ missing_var, data = airquality)
```
### Assessing MAR and MNAR
#### Sensitivity Analysis
Sensitivity analysis involves simulating different scenarios of missing data and assessing how the results change. For example, imputing missing values under different assumptions can provide insight into whether the data are MAR or MNAR.
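One common way to operationalize this idea is a delta-adjustment analysis: impute under MAR, then shift the imputed values by increasingly large offsets to mimic departures toward MNAR and check whether the substantive conclusions change. The sketch below uses the `post` argument of `mice` for this purpose; the offsets and the analysis model are illustrative assumptions.
```{r}
library(mice)

# Delta-adjustment sensitivity analysis: shift imputed Ozone values by delta
deltas <- c(0, -10, -20)

sens_results <- lapply(deltas, function(delta) {
  post <- make.post(airquality)
  post["Ozone"] <- paste0("imp[[j]][, i] <- imp[[j]][, i] + ", delta)
  imp <- mice(airquality, m = 5, post = post, seed = 1, printFlag = FALSE)
  fit <- with(imp, lm(Temp ~ Ozone + Wind))
  summary(pool(fit))
})

names(sens_results) <- paste0("delta = ", deltas)
sens_results
```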
#### Proxy Variables and External Data
Using proxy variables or external data sources can help assess whether missingness depends on unobserved variables (MNAR). For example, in surveys, follow-ups with non-respondents can reveal systematic differences.
#### Practical Challenges in Distinguishing MAR from MNAR
Distinguishing between Missing at Random (MAR) and Missing Not at Random (MNAR) is a critical and challenging task in data analysis. Properly identifying the nature of the missing data has significant implications for the choice of imputation strategies, model robustness, and the validity of conclusions. While statistical tests can sometimes aid in this determination, the process often relies heavily on domain knowledge, intuition, and exploratory analysis. Below, we discuss key considerations and examples that highlight these challenges:
- **Sensitive Topics**: Missing data related to sensitive or stigmatized topics, such as income, drug use, or health conditions, are often MNAR. For example, individuals with higher incomes might deliberately choose not to report their earnings due to privacy concerns. Similarly, participants in a health survey may avoid answering questions about smoking if they perceive social disapproval. In such cases, the probability of missingness is directly related to the unobserved value itself, making MNAR likely.
- **Field-Specific Norms**: Understanding norms and typical data collection practices in a specific field can provide insights into missingness patterns. For instance, in marketing surveys, respondents may skip questions about spending habits if they consider the questions intrusive. Prior research or historical data from the same domain can help infer whether missingness is more likely MAR (e.g., random skipping due to survey fatigue) or MNAR (e.g., deliberate omission by higher spenders).
- **Analyzing Auxiliary Variables**: Leveraging auxiliary variables---those correlated with the missing variable---can help infer the missingness mechanism. For example, if missing income data strongly correlates with employment status, this suggests a MAR mechanism, as the missingness depends on observed variables. However, if missingness persists even after accounting for observable predictors, MNAR might be at play.
- **Experimental Design and Follow-Up**: In longitudinal studies, dropout rates can signal MAR or MNAR patterns. For example, if dropouts occur disproportionately among participants reporting lower satisfaction in early surveys, this indicates an MNAR mechanism. Designing follow-up surveys to specifically investigate dropout reasons can clarify missingness patterns.
- **Sensitivity Analysis**: To account for uncertainty in the missingness mechanism, researchers can conduct sensitivity analyses by comparing results under different assumptions (e.g., imputing data using both MAR and MNAR approaches). This process helps to quantify the potential impact of misclassifying the missingness mechanism on study conclusions.
- **Real-World Examples**:
- In customer feedback surveys, higher ratings might be overrepresented due to non-response bias. Customers with negative experiences might be less likely to complete surveys, leading to an MNAR scenario.
- In financial reporting, missing audit data might correlate with companies in financial distress, a classic MNAR case where the missingness depends on unobserved financial health metrics.
**Summary**
- **MCAR**: No pattern in missingness; use Little's test or dummy variable analysis.
- **MAR**: Missingness related to observed data; requires modeling assumptions or proxy analysis.
- **MNAR**: Missingness depends on unobserved data; requires external validation or sensitivity analysis.
## Methods for Handling Missing Data
### Basic Methods
#### Complete Case Analysis (Listwise Deletion)
Listwise deletion retains only cases with complete data for all features, discarding rows with any missing values.
**Advantages**:
- Universally applicable to various statistical tests (e.g., SEM, multilevel regression).
- When data are Missing Completely at Random (MCAR), parameter estimates and standard errors are unbiased.
- Under specific Missing at Random (MAR) conditions, such as when the probability of missing data depends only on independent variables, listwise deletion can still yield unbiased estimates. For instance, in the model $y = \beta_{0} + \beta_1X_1 + \beta_2X_2 + \epsilon$, if missingness in $X_1$ is independent of $y$ but depends on $X_1$ and $X_2$, the estimates remain unbiased [@Little_1992].
- This aligns with principles of stratified sampling, which does not bias estimates.
- In logistic regression, if missing data depend only on the dependent variable but not on independent variables, listwise deletion produces consistent slope estimates, though the intercept may be biased [@Vach_1994].
- For regression analysis, listwise deletion is more robust than Maximum Likelihood (ML) or Multiple Imputation (MI) when the MAR assumption is violated.
**Disadvantages**:
- Results in larger standard errors compared to advanced methods.
- If data are MAR but not MCAR, biased estimates can occur.
- In non-regression contexts, more sophisticated methods often outperform listwise deletion.
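A minimal sketch of listwise deletion on the built-in `airquality` data (most R modeling functions apply it implicitly through `na.action`):
```{r}
# Keep only rows with no missing values
complete_cases <- na.omit(airquality)
nrow(airquality)      # original sample size
nrow(complete_cases)  # sample size after listwise deletion

# Equivalent behavior inside a model call
fit_lw <- lm(Ozone ~ Wind + Temp, data = airquality, na.action = na.omit)
summary(fit_lw)
```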
------------------------------------------------------------------------
#### Available Case Analysis (Pairwise Deletion)
Pairwise deletion calculates estimates using all available data for each pair of variables, without requiring complete cases. It is particularly suitable for methods like linear regression, factor analysis, and SEM, which rely on correlation or covariance matrices.
**Advantages**:
- Under MCAR, pairwise deletion produces consistent and unbiased estimates in large samples.
- Compared to listwise deletion [@Glasser_1964]:
- When variable correlations are low, pairwise deletion provides more efficient estimates.
- When correlations are high, listwise deletion becomes more efficient.
**Disadvantages**:
- Yields biased estimates under MAR conditions.
- In small samples, covariance matrices might not be positive definite, rendering coefficient estimation infeasible.
- Software implementation varies in how sample size is handled, potentially affecting standard errors.
**Note**: Carefully review software documentation to understand how sample size is treated, as this influences standard error calculations.
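A quick sketch of pairwise deletion for correlation and covariance matrices, the form most relevant to factor analysis and SEM:
```{r}
# Pairwise deletion: each entry uses all cases observed for that pair of variables
cor(airquality, use = "pairwise.complete.obs")

# Contrast with listwise deletion, which drops any row containing a missing
# value before computing the matrix
cor(airquality, use = "complete.obs")
```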
------------------------------------------------------------------------
#### Indicator Method (Dummy Variable Adjustment)
Also known as the Missing Indicator Method, this approach introduces an additional variable to indicate missingness in the dataset.
**Implementation**:
1. Create an indicator variable:
$$
D =
\begin{cases}
1 & \text{if data on } X \text{ are missing} \\
0 & \text{otherwise}
\end{cases}
$$
2. Modify the original variable to accommodate missingness:
$$
X^* =
\begin{cases}
X & \text{if data are available} \\
c & \text{if data are missing}
\end{cases}
$$
**Note**: A common choice for $c$ is the mean of $X$.
**Interpretation**:
- The coefficient of $D$ represents the difference in the expected value of $Y$ between cases with missing data and those without.
- The coefficient of $X^*$ reflects the effect of $X$ on $Y$ for cases with observed data.
**Disadvantages**:
- Produces biased estimates of coefficients, even under MCAR conditions [@jones1996indicator].
- May lead to overinterpretation of the "missingness effect," complicating model interpretation.
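A minimal sketch of the dummy variable adjustment on `airquality`, using the mean of the observed values as the constant $c$ (an assumption for this example):
```{r}
# Indicator method: D flags missing Ozone, X* fills missing values with the mean
aq <- airquality
aq$D_ozone    <- ifelse(is.na(aq$Ozone), 1, 0)
aq$Ozone_star <- ifelse(is.na(aq$Ozone), mean(aq$Ozone, na.rm = TRUE), aq$Ozone)

# The coefficient on D_ozone is the shift in E[Temp] for cases with missing
# Ozone; the coefficient on Ozone_star applies to cases with observed Ozone
fit_ind <- lm(Temp ~ Ozone_star + D_ozone + Wind, data = aq)
summary(fit_ind)
```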
------------------------------------------------------------------------
#### Advantages and Limitations of Basic Methods
| **Method** | **Advantages** | **Disadvantages** |
|------------------|---------------------------|---------------------------|
| **Listwise Deletion** | Simple and universally applicable; unbiased under MCAR; robust in certain MAR scenarios. | Inefficient (larger standard errors); biased under MAR in many cases; discards potentially useful data. |
| **Pairwise Deletion** | Utilizes all available data; efficient under MCAR with low correlations; avoids discarding all cases. | Biased under MAR; prone to non-positive-definite covariance matrices in small samples. |
| **Indicator Method** | Simple implementation; explicitly models missingness effect. | Biased even under MCAR; complicates interpretation; may not reflect true underlying relationships. |
### Single Imputation Techniques
Single imputation methods replace missing data with a single value, generating a complete dataset that can be analyzed using standard techniques. While convenient, single imputation generally underestimates variability and risks biasing results.
------------------------------------------------------------------------
#### Deterministic Methods
##### Mean, Median, Mode Imputation
This method replaces missing values with the mean, median, or mode of the observed data.
**Advantages**:
- Simplicity and ease of implementation.
- Useful for quick exploratory data analysis.
**Disadvantages**:
- **Bias in Variances and Relationships**: Mean imputation reduces variance and disrupts relationships among variables, leading to biased estimates of variances and covariances [@haitovsky1968missing].
- **Underestimated Standard Errors**: Results in overly optimistic conclusions and increased risk of Type I errors.
- **Dependency Structure Ignored**: Particularly problematic in high-dimensional data, as it fails to capture dependencies among features.
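A minimal sketch of mean and median imputation on `airquality`, illustrating the variance shrinkage noted above:
```{r}
aq_mean <- airquality
aq_mean$Ozone[is.na(aq_mean$Ozone)] <- mean(airquality$Ozone, na.rm = TRUE)

aq_median <- airquality
aq_median$Ozone[is.na(aq_median$Ozone)] <- median(airquality$Ozone, na.rm = TRUE)

# Mean imputation shrinks the variance relative to the observed data
var(airquality$Ozone, na.rm = TRUE)
var(aq_mean$Ozone)
```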
##### Forward and Backward Filling (Time Series Contexts)
Used in time series analysis, this method replaces missing values using the preceding (forward filling) or succeeding (backward filling) values.
**Advantages**:
- Simple and preserves temporal ordering.
- Suitable for datasets where adjacent values are strongly correlated.
**Disadvantages**:
- Biased if missingness spans long gaps or occurs systematically.
- Cannot capture trends or changes in the underlying process.
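A short sketch using `tidyr::fill()` on an assumed daily series (`zoo::na.locf()` offers similar behavior):
```{r}
library(tidyr)

ts_data <- data.frame(
  day   = 1:8,
  value = c(10, NA, NA, 13, 14, NA, 16, 17)
)

# Forward fill (last observation carried forward), then backward fill
fill(ts_data, value, .direction = "down")
fill(ts_data, value, .direction = "up")
```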
------------------------------------------------------------------------
#### Statistical Prediction Models
##### Linear Regression Imputation
Missing values in a variable are imputed based on a linear regression model using observed values of other variables.
**Advantages**:
- Preserves relationships between variables.
- More sophisticated than mean or median imputation.
**Disadvantages**:
- Assumes linear relationships, which may not hold in all datasets.
- Fails to capture variability, leading to downwardly biased standard errors.
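A minimal sketch of deterministic regression imputation on `airquality`, where missing `Ozone` values are replaced by predictions from the fully observed covariates:
```{r}
# Fit the imputation model on cases with observed Ozone
imp_model <- lm(Ozone ~ Wind + Temp, data = airquality)

aq_reg <- airquality
missing_idx <- which(is.na(aq_reg$Ozone))

# Replace missing Ozone with deterministic predictions (no residual noise,
# which is why standard errors tend to be biased downward)
aq_reg$Ozone[missing_idx] <- predict(imp_model, newdata = airquality[missing_idx, ])
summary(aq_reg$Ozone)
```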
##### Logistic Regression for Categorical Variables
Similar to linear regression imputation but used for categorical variables. The missing category is predicted using a logistic regression model.
**Advantages**:
- Useful for binary or multinomial categorical data.
- Preserves relationships with other variables.
**Disadvantages**:
- Assumes the underlying logistic model is appropriate.
- Does not account for uncertainty in the imputed values.
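A minimal sketch for a binary variable using simulated data (the variables and the 0.5 cutoff are assumptions for illustration; drawing the category from the predicted probability instead would preserve more uncertainty):
```{r}
set.seed(7)
n <- 200
x <- rnorm(n)
group <- rbinom(n, 1, plogis(0.8 * x))   # binary variable related to x
group[sample(n, 40)] <- NA               # introduce missingness
dat <- data.frame(x, group)

# Fit a logistic model on the observed cases
logit_fit <- glm(group ~ x, data = dat, family = binomial)

# Impute the most likely category for the missing cases (deterministic variant)
miss_idx <- which(is.na(dat$group))
p_hat <- predict(logit_fit, newdata = dat[miss_idx, ], type = "response")
dat$group[miss_idx] <- as.integer(p_hat > 0.5)
table(dat$group, useNA = "ifany")
```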
------------------------------------------------------------------------
#### Non-Parametric Methods
##### Hot Deck Imputation
Hot Deck Imputation is a method of handling missing data where missing values are replaced with observed values from "donor" cases that are similar in other characteristics. This technique has been widely used in survey data, including by organizations like the U.S. Census Bureau, due to its flexibility and ability to maintain observed data distributions.
**Advantages of Hot Deck Imputation**
- **Retains observed data distributions**: Since missing values are imputed using actual observed data, the overall distribution remains realistic.
- **Flexible**: This method is applicable to both categorical and continuous variables.
- **Constrained imputations**: Imputed values are always feasible, as they come from observed cases.
- **Adds variability**: By randomly selecting donors, this method introduces variability, which can aid in robust standard error estimation.
**Disadvantages of Hot Deck Imputation**
- **Sensitivity to similarity definitions**: The quality of imputed values depends on the criteria used to define similarity between cases.
- **Computational intensity**: Identifying similar cases and randomly selecting donors can be computationally expensive, especially for large datasets.
- **Subjectivity**: Deciding how to define "similar" can introduce subjectivity or bias.
**Algorithm for Hot Deck Imputation**
Let $n_1$ represent the number of cases with complete data on the variable $Y$, and $n_0$ represent the number of cases with missing data on $Y$. The steps are as follows:
1. From the $n_1$ cases with complete data, take a random sample (with replacement) of $n_1$ cases.
2. From this sampled pool, take another random sample (with replacement) of size $n_0$.
3. Assign the values from the sampled $n_0$ cases to the cases with missing data in $Y$.
4. Repeat this process for every variable in the dataset.
5. For multiple imputation, repeat the above four steps multiple times to create multiple imputed datasets.
**Variations and Considerations**
- **Skipping Step 1**: If Step 1 is skipped, the variability of imputed values is reduced. This approach might not fully account for the uncertainty in missing data, which can underestimate standard errors.
- **Defining similarity**: A major challenge in this method is deciding what constitutes "similarity" between cases. Common approaches include matching based on distance metrics (e.g., Euclidean distance) or grouping cases by strata or clusters.
**Practical Example**
The U.S. Census Bureau employs an approximate Bayesian bootstrap variation of Hot Deck Imputation. In this approach:
- Similar cases are identified based on shared characteristics or grouping variables.
- A randomly chosen value from a similar individual in the sample is used to replace the missing value.
This method ensures imputed values are plausible while incorporating variability.
**Key Notes**
- **Good aspects**:
- Imputed values are constrained to observed possibilities.
- Random selection introduces variability, helpful for multiple imputation scenarios.
- **Challenges**:
- Defining and operationalizing "similarity" remains a critical step in applying this method effectively.
Below is an example code snippet illustrating Hot Deck Imputation in R:
```{r}
library(Hmisc)
# Example dataset with missing values
data <- data.frame(
ID = 1:10,
Age = c(25, 30, NA, 40, NA, 50, 60, NA, 70, 80),
Gender = c("M", "F", "F", "M", "M", "F", "M", "F", "M", "F")
)
# Perform Hot Deck Imputation using Hmisc::impute
data$Age_imputed <- impute(data$Age, "random")
# Display the imputed dataset
print(data)
```
This code randomly imputes missing values in the `Age` column based on observed data using the `Hmisc` package's `impute` function.
##### Cold Deck Imputation
Cold Deck Imputation is a systematic variant of Hot Deck Imputation where the donor pool is predefined. Instead of selecting donors dynamically from within the same dataset, Cold Deck Imputation relies on an external reference dataset, such as historical data or other high-quality external sources.
**Advantages of Cold Deck Imputation**
- **Utilizes high-quality external data**: This method is particularly useful when reliable external reference datasets are available, allowing for accurate and consistent imputations.
- **Consistency**: If the same donor pool is used across multiple datasets, imputations remain consistent, which can be advantageous in longitudinal studies or standardized processes.
**Disadvantages of Cold Deck Imputation**
- **Lack of adaptability**: External data may not adequately reflect the unique characteristics or variability of the current dataset.
- **Potential for systematic bias**: If the donor pool is significantly different from the target dataset, imputations may introduce bias.
- **Reduces variability**: Unlike Hot Deck Imputation, Cold Deck Imputation systematically selects values, which removes random variation. This can affect the estimation of standard errors and other inferential statistics.
**Key Characteristics**
- **Systematic Selection**: Cold Deck Imputation selects donor values systematically based on predefined rules or matching criteria, rather than using random sampling.
- **External Donor Pool**: Donors are typically drawn from a separate dataset or historical records.
**Algorithm for Cold Deck Imputation**
1. Identify an external reference dataset or predefined donor pool.
2. Define the matching criteria to find "similar" cases between the donor pool and the current dataset (e.g., based on covariates or stratification).
3. Systematically assign values from the donor pool to missing values in the current dataset based on the matching criteria.
4. Repeat the process for each variable with missing data.
**Practical Considerations**
- Cold Deck Imputation works well when external data closely resemble the target dataset. However, when there are significant differences in distributions or relationships between variables, imputations may be biased or unrealistic.
- This method is less useful for datasets without access to reliable external reference data.
Suppose we have a current dataset with missing values and a historical dataset with similar variables. The following example demonstrates how Cold Deck Imputation can be implemented:
```{r}
# Current dataset with missing values
current_data <- data.frame(
ID = 1:5,
Age = c(25, 30, NA, 45, NA),
Gender = c("M", "F", "F", "M", "M")
)
# External reference dataset (donor pool)
reference_data <- data.frame(
Age = c(28, 35, 42, 50),
Gender = c("M", "F", "F", "M")
)
# Perform Cold Deck Imputation
library(dplyr)
# Define a matching function: systematic imputation from same-gender donors
impute_cold_deck <- function(gender, reference_data) {
# Restrict the donor pool to cases with the same gender
possible_donors <- reference_data %>%
filter(Gender == gender)
# Return the mean age of the matching donors as an example of systematic imputation
mean(possible_donors$Age, na.rm = TRUE)
}
# Apply Cold Deck Imputation to the rows with missing Age
current_data <- current_data %>%
rowwise() %>%
mutate(
Age_imputed = ifelse(
is.na(Age),
impute_cold_deck(Gender, reference_data),
Age
)
) %>%
ungroup()
# Display the imputed dataset
print(current_data)
```
**Comparison to Hot Deck Imputation**
| Feature | Hot Deck Imputation | Cold Deck Imputation |
|------------------|---------------------------|---------------------------|
| **Donor Pool** | Internal (within the dataset) | External (predefined dataset) |
| **Selection** | Random | Systematic |
| **Variability** | Retained | Reduced |
| **Bias Potential** | Lower | Higher (if donor pool differs) |
This method suits situations where external reference datasets are trusted and representative. However, careful consideration is required to ensure alignment between the donor pool and the target dataset to avoid systematic biases.
##### Random Draw from Observed Distribution
This imputation method replaces missing values by randomly sampling from the observed distribution of the variable with missing data. It is a simple, non-parametric approach that retains the variability of the original data.
**Advantages**
- **Preserves variability**:
- By randomly drawing values from the observed data, this method ensures that the imputed values reflect the inherent variability of the variable.
- **Computational simplicity**:
- The process is straightforward and does not require model fitting or complex calculations.
**Disadvantages**
- **Ignores relationships among variables**:
- Since the imputation is based solely on the observed distribution of the variable, it does not consider relationships or dependencies with other variables.
- **May not align with trends**:
- Imputed values are random and may fail to align with patterns or trends present in the data, such as time series structures or interactions.
**Steps in Random Draw Imputation**
1. Identify the observed (non-missing) values of the variable.
2. For each missing value, randomly sample one value from the observed distribution with or without replacement.
3. Replace the missing value with the randomly sampled value.
The following example demonstrates how to use random draw imputation to fill in missing values:
```{r}
# Example dataset with missing values
set.seed(123)
data <- data.frame(
ID = 1:10,
Value = c(10, 20, NA, 30, 40, NA, 50, 60, NA, 70)
)
# Perform random draw imputation
random_draw_impute <- function(data, variable) {
observed_values <- data[[variable]][!is.na(data[[variable]])] # Observed values
data[[variable]][is.na(data[[variable]])] <- sample(observed_values,
sum(is.na(data[[variable]])),
replace = TRUE)
return(data)
}
# Apply the imputation
imputed_data <- random_draw_impute(data, variable = "Value")
# Display the imputed dataset
print(imputed_data)
```
**Considerations**
- **When to Use**:
- This method is suitable for exploratory analysis or as a quick way to handle missing data in univariate contexts.
- **Limitations**:
- Random draws may result in values that do not fit well in the broader context of the dataset, especially in cases where the variable has strong relationships with others.
| Feature | Random Draw from Observed Distribution | Regression-Based Imputation |
|----------------------|----------------------------|----------------------|
| **Complexity** | Simple | Moderate to High |
| **Preserves Variability** | Yes | Limited in deterministic forms |
| **Considers Relationships** | No | Yes |
| **Risk of Implausible Values** | Low (if observed values are plausible) | Moderate to High |
This method is a quick and computationally efficient way to address missing data but is best complemented by more sophisticated methods when relationships between variables are important.
#### Semi-Parametric Methods
##### Predictive Mean Matching (PMM)
Predictive Mean Matching (PMM) imputes missing values by finding observed values closest in predicted value (based on a regression model) to the missing data. The donor values are then used to fill in the gaps.
**Advantages**:
- Maintains observed variability in the data.
- Ensures imputed values are realistic since they are drawn from observed data.
**Disadvantages**:
- Requires a suitable predictive model.
- Computationally intensive for large datasets.
**Steps for PMM**:
1. Regress $Y$ on $X$ (matrix of covariates) for the $n_1$ (non-missing cases) to estimate coefficients $\hat{b}$ and residual variance $s^2$.
2. Draw from the posterior predictive distribution of residual variance: $$s^2_{[1]} = \frac{(n_1-k)s^2}{\chi^2},$$ where $\chi^2$ is a random draw from $\chi^2_{n_1-k}$.
3. Randomly sample from the posterior distribution of $\hat{b}$: $$b_{[1]} \sim MVN(\hat{b}, s^2_{[1]}(X'X)^{-1}).$$
4. Standardize residuals for $n_1$ cases: $$e_i = \frac{y_i - \hat{b}x_i}{\sqrt{s^2(1-k/n_1)}}.$$
5. Randomly draw a sample (with replacement) of $n_0$ residuals from Step 4.
6. Calculate imputed values for $n_0$ missing cases: $$y_i = b_{[1]}x_i + s_{[1]}e_i.$$
7. Repeat Steps 2--6 (except Step 4) to create multiple imputations.
**Notes**:
- PMM can handle heteroskedasticity.
- PMM works for multiple variables, imputing each variable using all the others as predictors.
**Example**:
Example from [Statistics Globe](https://statisticsglobe.com/predictive-mean-matching-imputation-method/)
```{r}
set.seed(1) # Seed
N <- 100 # Sample size
y <- round(runif(N,-10, 10)) # Target variable Y
x1 <- y + round(runif(N, 0, 50)) # Auxiliary variable 1
x2 <- round(y + 0.25 * x1 + rnorm(N,-3, 15)) # Auxiliary variable 2
x3 <- round(0.1 * x1 + rpois(N, 2)) # Auxiliary variable 3
# (categorical variable)
x4 <- as.factor(round(0.02 * y + runif(N))) # Auxiliary variable 4
# Insert 20% missing data in Y
y[rbinom(N, 1, 0.2) == 1] <- NA
data <- data.frame(y, x1, x2, x3, x4) # Store data in dataset
head(data) # First 6 rows of our data
library("mice") # Load mice package
##### Impute data via predictive mean matching (single imputation)#####
imp_single <- mice(data, m = 1, method = "pmm") # Impute missing values
data_imp_single <- complete(imp_single) # Store imputed data
# head(data_imp_single)
# Since single imputation underestimates standard errors,
# we use multiple imputation
##### Predictive mean matching (multiple imputation) #####
# Impute missing values multiple times
imp_multi <- mice(data, m = 5, method = "pmm")
data_imp_multi_all <-
# Store multiply imputed data
complete(imp_multi,
"repeated",
include = TRUE)
data_imp_multi <-
# Combine imputed Y and X1-X4 (for convenience)
data.frame(data_imp_multi_all[, 1:6], data[, 2:5])
head(data_imp_multi)
```
Example from [UCLA Statistical Consulting](https://stats.idre.ucla.edu/r/faq/how-do-i-perform-multiple-imputation-using-predictive-mean-matching-in-r/)
```{r}
library(mice)
library(VIM)
library(lattice)
library(ggplot2)
## set observations to NA
anscombe <- within(anscombe, {
y1[1:3] <- NA
y4[3:5] <- NA
})
## view
head(anscombe)
## check missing data patterns
md.pattern(anscombe)
## Number of observations per patterns for all pairs of variables
p <- md.pairs(anscombe)
p
```
- `rr` = number of observations where both the row and column variables are observed
- `rm` = number of observations where the row variable is observed and the column variable is missing
- `mr` = number of observations where the row variable is missing and the column variable is observed
- `mm` = number of observations where both the row and column variables are missing
```{r}
## Margin plot of y1 and y4
marginplot(anscombe[c(5, 8)], col = c("blue", "red", "orange"))
## 5 imputations for all missing values
imp1 <- mice(anscombe, m = 5)
## linear regression for each imputed data set - 5 regression are run
fitm <- with(imp1, lm(y1 ~ y4 + x1))
summary(fitm)
## pool coefficients and standard errors across all 5 regression models
pool(fitm)
## output parameter estimates
summary(pool(fitm))
```
##### Stochastic Imputation
Stochastic Imputation is an enhancement of regression imputation that introduces randomness into the imputation process by adding a random residual to the predicted values from a regression model. This approach aims to retain the variability of the original data while reducing the bias introduced by deterministic regression imputation.
Stochastic Imputation can be described as:
$$
\text{Imputed Value} = \text{Predicted Value (from regression)} + \text{Random Residual}
$$
This method is commonly used as a foundation for multiple imputation techniques.
**Advantages of Stochastic Imputation**
- **Retains all the benefits of regression imputation**:
- Preserves relationships between variables in the dataset.
- Utilizes information from observed data to inform imputations.
- **Introduces randomness**:
- Adds variability by including a random residual term, making imputed values more realistic and better representing the uncertainty of missing data.
- **Supports multiple imputation**:
- By generating different random residuals for each iteration, it facilitates the creation of multiple plausible datasets for robust statistical analysis.