forked from DickBrus/SpatialSamplingwithR
-
Notifications
You must be signed in to change notification settings - Fork 0
/
04-STSI.Rmd
883 lines (668 loc) · 62.8 KB
/
04-STSI.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
# Stratified simple random sampling {#STSI}
In stratified random sampling\index{Stratified random sampling} the population is divided into subpopulations, for instance, soil mapping units, areas with the same land use or land cover, administrative units, etc. The subareas are mutually exclusive, i.e., they do not overlap, and are jointly exhaustive, i.e., their union equals the entire population (study area). Within each subpopulation, referred to as a stratum\index{Stratum}, a probability sample is selected by some sampling design. If these probability samples are selected by simple random sampling, as described in the previous chapter, the design is stratified *simple* random sampling\index{Stratified random sampling!stratified simple random sampling}, the topic of this chapter. If sampling units are selected by cluster random sampling, then the design is stratified *cluster* random sampling.
Stratified simple random sampling is illustrated with Voorst (Figure \@ref(fig:SampleSTSI)). Tibble `grdVoorst` with simulated data contains variable `stratum`. The strata are combinations of soil class and land use, obtained by overlaying a soil map and a land use map. To select a stratified simple random sample, we set the total sample size $n$. The sampling units must be apportioned to the strata. I chose to apportion the units proportionally to the size (area, number of grid cells) of the strata (for details, see Section \@ref(STSIallocation)). The larger a stratum, the more units are selected from this stratum. The sizes of the strata, i.e., the total number of grid cells, are computed with function `tapply`.
```{r stsi}
N_h <- tapply(grdVoorst$stratum, INDEX = grdVoorst$stratum, FUN = length)
w_h <- N_h / sum(N_h)
n <- 40
print(n_h <- round(n * w_h))
```
The sum of the stratum sample sizes is 41; we want 40, so we reduce the largest stratum sample size by 1.
```{r}
n_h[1] <- n_h[1] - 1
```
The stratified simple random sample is selected with function `strata` of package **sampling** [@Tille2016]. Argument `size` specifies the stratum sample sizes.
```{block2, type = 'rmdcaution'}
The stratum sample sizes must be in the order the strata are encountered in tibble `grdVoorst`, which is determined first with function `unique`.
```
Within the strata, the grid cells are selected by simple random sampling *with replacement* (`method = "srswr"`), so that in principle more than one point can be selected within a grid cell, see Chapter \@ref(SI) for a motivation of this. Function `getdata` extracts the observations of the selected units from the sampling frame, as well as the spatial coordinates and the stratum of these units. The coordinates of the centres of the selected grid cells are jittered by an amount equal to half the side of the grid cells. In the next code chunk, this is done with function `mutate` of package **dplyr** [@dplyr] which is part of package **tidyverse** [@Wickham2019]. We have seen the pipe operator `%>%` of package **magrittr** [@magrittr] before in Subsection \@ref(CDF). If you are not familiar with **tidyverse** I recommend reading the excellent book \emph{R for Data Science} [@Wickham2017].
```{r}
library(sampling)
ord <- unique(grdVoorst$stratum)
set.seed(314)
units <- sampling::strata(
grdVoorst, stratanames = "stratum", size = n_h[ord], method = "srswr")
mysample <- getdata(grdVoorst, units) %>%
mutate(s1 = s1 %>% jitter(amount = 25 / 2),
s2 = s2 %>% jitter(amount = 25 / 2))
```
```{block2, type = 'rmdnote'}
The name of the package is added to function `strata` (`sampling::strata`), as `strata` is also a function of another package. Not adding the name of the package may result in an error message.
```
Figure \@ref(fig:SampleSTSI) shows the selected sample.
```{r SampleSTSI, echo = FALSE, out.width = '100%', fig.cap = "Stratified simple random sample of size 40 from Voorst. Strata are combinations of soil class and land use."}
ggplot(data = grdVoorst, mapping = aes(x = s1 / 1000, y = s2 / 1000, fill = stratum)) +
geom_raster() +
geom_point(data = mysample, size = 1.5) +
scale_fill_manual(
name = "",
values = c(BA = "darkgreen", EA = "brown", PA = "orange", RA = "green", XF = "grey")
) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed() +
theme(panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_blank())
```
## Estimation of population parameters {#EstimatorsSTSI}
With simple random sampling within strata, the estimator of the population mean for simple random sampling (Equation \@ref(eq:HTMeanSI)) is applied at the level of the strata. The estimated stratum means are then averaged, using the relative sizes or areas of the strata as weights:
\begin{equation}
\hat{\bar{z}}= \sum\limits_{h=1}^{H} w_{h}\,\hat{\bar{z}}_{h} \;,
(\#eq:HTMeanSTSI)
\end{equation}
where $H$ is the number of strata, $w_{h}$ is the relative size (area) of stratum $h$ (stratum weight)\index{Stratum weight}: $w_h = N_h/N$, and $\hat{\bar{z}}_{h}$ is the estimated mean of stratum $h$, estimated by the sample mean for stratum $h$:
\begin{equation}
\hat{\bar{z}}_{h}=\frac{1}{n_h}\sum_{k \in \mathcal{S}_h} z_k\;,
(\#eq:HTStratumMeanSI)
\end{equation}
with $\mathcal{S}_h$ the sample selected from stratum $h$.
The same estimator is found when the $\pi$ estimator is worked out for stratified simple random sampling. With stratified simple random sampling without replacement and different sampling fractions for the strata, the inclusion probabilities differ among the strata and equal $\pi_{k} = n_h/N_h$ for all $k$ in stratum $h$, with $n_h$ the sample size of stratum $h$ and $N_h$ the size of stratum $h$. Inserting this in the $\pi$ estimator of the population mean (Equation \@ref(eq:HTMean)) gives
\begin{equation}
\hat{\bar{z}}= \frac{1}{N}\sum\limits_{h=1}^{H}\sum\limits_{k \in \mathcal{S}_h} \frac{z_{k}}{\pi_{k}} = \frac{1}{N}\sum\limits_{h=1}^{H} \frac{N_h}{n_h}\sum\limits_{k \in \mathcal{S}_h} z_{k} = \sum\limits_{h=1}^{H} w_{h}\,\hat{\bar{z}}_{h} \;.
(\#eq:HTMeanSTSI2)
\end{equation}
```{block2, type = 'rmdnote'}
The sampling fractions are usually slightly different, even with proportional allocation (Section \@ref(STSIallocation)), because $n_h/N_h$ cannot be made exactly equal for all strata. Sample sizes necessarily are integers, so $n_h/N_h$ must be rounded to integers.
```
The sampling variance of the estimator of the population mean is estimated by first estimating the sampling variances of the estimated stratum means, followed by computing the weighted average of the estimated sampling variances of the estimated stratum means. Note that we must square the stratum weights:
\begin{equation}
\widehat{V}\!\left(\hat{\bar{z}}\right)=\sum\limits_{h=1}^{H}w_{h}^{2}\,\widehat{V}\!\left(\hat{\bar{z}}_{h}\right)\;,
(\#eq:EstVarMeanSTSI)
\end{equation}
where $\widehat{V}\!\left(\hat{\bar{z}}_{h}\right)$ is the estimated sampling variance of $\hat{\bar{z}}_{h}$:
\begin{equation}
\widehat{V}\!\left(\hat{\bar{z}}_{h}\right)= (1-\frac{n_h}{N_h}) \frac{\widehat{S^2}_h(z)}{n_h}\;,
(\#eq:EstVarstratummean)
\end{equation}
with $\widehat{S^2}_h(z)$ the estimated variance of $z$ within stratum $h$:
\begin{equation}
\widehat{S^2}_h(z)=\frac{1}{n_h-1}\sum\limits_{k \in \mathcal{S}_h}\left(z_{k}-\hat{\bar{z}}_{h}\right)^{2}\;.
(\#eq:EstStratumVar)
\end{equation}
For stratified simple random sampling with replacement of finite populations and stratified simple random sampling of infinite populations the fpcs $1-(n_h/N_h)$ can be dropped.
```{r}
mz_h <- tapply(mysample$z, INDEX = mysample$stratum, FUN = mean)
mz <- sum(w_h * mz_h)
S2z_h <- tapply(mysample$z, INDEX = mysample$stratum, FUN = var)
v_mz_h <- S2z_h / n_h
se_mz <- sqrt(sum(w_h^2 * v_mz_h))
```
```{r stratummeans, echo = FALSE}
N_h <- as.numeric(N_h)
nms <- names(n_h)
n_h_copy <- n_h
names(n_h_copy) <- NULL
names(mz_h) <- NULL
names(S2z_h) <- NULL
names(v_mz_h) <- NULL
df <- data.frame(nms, Nh = formatC(N_h, 0, format = "f", big.mark = ","), nh = n_h_copy, Mean = round(mz_h, 1), Var = formatC(S2z_h, 1, format = "f", big.mark = ","), se = round(sqrt(v_mz_h), 1))
knitr::kable(
df, caption = "Size (Nh), sample size (nh), estimated mean (Mean), estimated variance (Variance), and estimated standard error of estimator of mean (se) of the five strata in Voorst.",
col.names = c("Stratum", "Nh", "nh", "Mean", "Variance", "se"),
booktabs = TRUE,
linesep = ""
) %>%
kable_classic()
```
Table \@ref(tab:stratummeans) shows per stratum the estimated mean, variance, and sampling variance of the estimated mean of the SOM concentration. We can see large differences in the within-stratum variances\index{Within-stratum variance}. For the stratified sample of Figure \@ref(fig:SampleSTSI), the estimated population mean equals `r formatC(mz, 1, format = "f")` g kg^-1^, and the estimated standard error of this estimator equals `r formatC(se_mz, 1, format = "f")` g kg^-1^.
The population mean can also be estimated directly using the basic $\pi$ estimator (Equation \@ref(eq:HTMean)). The inclusion probabilities are included in `data.frame` `mysample`, obtained with function `getdata` (see code chunk above), as variable `Prob`.
```{r}
head(mysample)
```
The population total is estimated first, and by dividing this estimated total by the total number of population units $N$ an estimate of the population mean is obtained.
```{r}
tz <- sum(mysample$z / mysample$Prob)
print(mz <- tz / sum(N_h))
```
The two estimates of the population mean are not exactly equal. This is due to rounding errors in the inclusion probabilities. This can be shown by computing the sum of the inclusion probabilities over all population units. This sum should be equal to the sample size $n=40$, but as we can see below, this sum is slightly smaller.
```{r}
pi_h <- tapply(mysample$Prob, INDEX = mysample$stratum, FUN = unique)
print(sum(pi_h * N_h))
```
Now suppose we ignore that the sample data come from a stratified sampling design and we use the (unweighted) sample mean as an estimate of the population mean.
```{r}
print(mean(mysample$z))
```
The sample mean slightly differs from the proper estimate of the population mean (7.238). The sample mean is a *biased* estimator, but the bias is small. The bias is only small because the stratum sample sizes are about proportional to the sizes of the strata, so that the inclusion probabilities (sampling intensities) are about equal for all strata: `r pi_h`. The probabilities are not exactly equal because the stratum sample sizes are necessarily rounded to integers and because we reduced the largest sample size by one unit. The bias would have been substantially larger if an equal number of units would have been selected from each stratum, leading to much larger differences in the inclusion probabilities among the strata. Sampling intensity in stratum BA, for instance, then would be much smaller compared to the other strata, and so would be the inclusion probabilities of the units in this stratum as compared to the other strata. Stratum BA then would be underrepresented in the sample. This is not a problem as long as we account for the difference in inclusion probabilities of the units in the estimation of the population mean. The estimated mean of stratum BA then gets the largest weight, equal to the inverse of the inclusion probability. If we do not account for these differences in inclusion probabilities, the estimator of the mean will be seriously biased.
The next code chunk shows how the population mean and its standard error can be estimated with package **survey** [@Lumley2020]. Note that the stratum weights $N_h/n_h$ must be passed to function `svydesign` using argument `weight`. These are first attached to `data.frame` `mysample` by creating a look-up table `lut`, which is then merged with function `merge` to `data.frame` `mysample`.
```{r}
library(survey)
labels <- sort(unique(mysample$stratum))
lut <- data.frame(stratum = labels, weight = N_h / n_h)
mysample <- merge(x = mysample, y = lut)
design_stsi <- svydesign(
id = ~ 1, strata = ~ stratum, weight = ~ weight, data = mysample)
svymean(~ z, design_stsi)
```
### Population proportion, cumulative distribution function, and quantiles
The proportion of a population satisfying some condition can be estimated by Equations \@ref(eq:HTMeanSTSI) and \@ref(eq:HTStratumMeanSI), substituting for the study variable $z_k$ an 0/1 indicator $y_k$ with value 1 if for unit $k$ the condition is satisfied, and 0 otherwise (Subsection \@ref(PopProportion)). In general, with stratified simple random sampling the inclusion probabilities are not exactly equal, so that the estimated population proportion is not equal to the sample proportion.
These unequal inclusion probabilities must also be accounted for when estimating the cumulative distribution function (CDF) and quantiles (Subsection \@ref(CDF)), as shown in the next code chunk for the CDF.
```{r}
thresholds <- sort(unique(mysample$z))
cumfreq <- numeric(length = length(thresholds))
for (i in seq_len(length(thresholds))) {
ind <- mysample$z <= thresholds[i]
mh_ind <- tapply(ind, INDEX = mysample$stratum, FUN = mean)
cumfreq[i] <- sum(w_h * mh_ind)
}
df <- data.frame(x = thresholds, y = cumfreq)
```
Figure \@ref(fig:EstimatedCDFVoorstSTSI) shows the estimated CDF, estimated from the stratified simple random sample of 40 units from Voorst (Figure \@ref(fig:SampleSTSI)).
(ref:EstimatedCDFVoorstSTSIlabel) Estimated cumulative distribution function of the SOM concentration (g kg^-1^) in Voorst, estimated from the stratified simple random sample of 40 units.
```{r EstimatedCDFVoorstSTSI, echo = FALSE, fig.asp = 0.7, fig.width = 5, fig.cap = "(ref:EstimatedCDFVoorstSTSIlabel)"}
dx <- mean(diff(df$x))
dat <- data.frame(
x = c(df$x[1] - dx, df$x),
xend = c(df$x, df$x[nrow(df)] + dx),
yend = c(df$y, 1)
)
ggplot(dat) +
geom_segment(mapping = aes(x = x, y = yend, xend = xend, yend = yend)) +
scale_x_continuous(name = "SOM") +
scale_y_continuous(name = "F")
```
The estimated proportions, or cumulative frequencies, are used to estimate a quantile. These estimates are easily obtained with function `svyquantile` of package **survey**.
```{r}
svyquantile(~ z, design_stsi, quantile = c(0.5, 0.8))
```
### Why should we stratify? {#WhyStratify}
There can be two reasons for stratifying the population:
1. we are interested in the mean or total per stratum; or
2. we want to increase the precision of the estimated mean or total for the entire population.
Figure \@ref(fig:SamplingDistributionSTSI) shows boxplots of the approximated sampling distributions of the $\pi$ estimator of the mean SOM concentration for stratified simple random sampling and simple random sampling, both of size 40, obtained by repeating the random sampling with each design and estimation 10,000 times.
```{r, eval = FALSE, echo = FALSE}
number_of_samples <- 10000
mz_STSI <- mz_SI <- numeric(length = number_of_samples)
for (i in 1:number_of_samples) {
units <- sampling::strata(grdVoorst, stratanames = "stratum", size = n_h[unique(grdVoorst$stratum)], method = "srswr")
mysample <- getdata(grdVoorst, units)
mz_h <- tapply(mysample$z, INDEX = mysample$stratum, FUN = mean)
mz_STSI[i] <- sum(w_h * mz_h)
units <- sample(nrow(grdVoorst), size = n, replace = TRUE)
mz_SI[i] <- mean(grdVoorst$z[units])
}
save(mz_STSI, mz_SI, file = "results/STSI_Voorst.rda")
```
(ref:SamplingDistributionSTSIlabel) Approximated sampling distribution of the $\pi$ estimator of the mean SOM concentration (g kg^-1^) in Voorst for stratified simple random sampling (STSI) and simple random sampling (SI) of size 40.
```{r SamplingDistributionSTSI, echo = FALSE, fig.asp = .8, fig.width = 5, fig.cap = "(ref:SamplingDistributionSTSIlabel)"}
load(file = "results/STSI_Voorst.rda")
estimates <- data.frame(mz_STSI, mz_SI)
names(estimates) <- c("STSI", "SI")
df <- estimates %>% pivot_longer(cols = c("STSI", "SI"))
df$name <- factor(df$name, levels = c("STSI", "SI"), ordered = TRUE)
ggplot(data = df) +
geom_boxplot(mapping = aes(y = value, x = name)) +
geom_hline(yintercept = mean(grdVoorst$z), colour = "red") +
scale_x_discrete(name = "Sampling design") +
scale_y_continuous(name = "Estimated mean SOM")
```
```{r, echo = FALSE}
S2z_h_pop <- tapply(grdVoorst$z, INDEX = grdVoorst$stratum, FUN = var)
v_mz_h <- S2z_h_pop / n_h
v_mz_STSI <- sum(w_h^2 * v_mz_h)
v_mz_SI <- var(grdVoorst$z) / n
mz_h_pop <- tapply(grdVoorst$z, INDEX = grdVoorst$stratum, FUN = mean)
```
The approximated sampling distributions of the estimators of the population mean with the two designs are not very different. With stratified random sampling, the spread of the estimated means is somewhat smaller. The horizontal red line is the population mean (`r formatC(mean(grdVoorst$z), 1, format = "f")` g kg^-1^). The gain in precision due to the stratification, referred to as the stratification effect\index{Stratification effect}, can be quantified by the ratio of the variance with simple random sampling and the variance with stratified simple random sampling. So, when this variance ratio is larger than 1, stratified simple random sampling is more precise than simple random sampling. For Voorst the stratification effect with proportional allocation (Section \@ref(STSIallocation)) equals `r formatC(v_mz_SI/v_mz_STSI, 3, format = "f")`. This means that with simple random sampling we need `r formatC(v_mz_SI/v_mz_STSI, 3, format = "f")` more sampling units than with stratified simple random sampling to obtain an estimate of the same precision.
The stratification effect can be computed from the population variance $S^2(z)$ (Equation \@ref(eq:PopulationVariance)) and the variances within the strata $S^2_h(z)$. In the sampling experiment, these variances are known without error because we know the $z$-values for all units in the population. In practice, we only know the $z$-values for the sampled units. However, a design-unbiased estimator of the population variance is [@gru06]
\begin{equation}
\widehat{S^{2}}(z)= \widehat{\overline{z^{2}}}-\left(\hat{\bar{z}}\right)^{2}+
\widehat{V}\!\left(\hat{\bar{z}}\right) \;,
(\#eq:EstimatorPopulationVariancefromSTSI)
\end{equation}
where $\widehat{\overline{z^{2}}}$ denotes the estimated population mean of the study variable squared ($z^2$), obtained in the same way as $\hat{\bar{z}}$ (Equation \@ref(eq:HTMeanSTSI)), but using squared values, and $\widehat{V}\!\left(\hat{\bar{z}}\right)$ denotes the estimated variance of the estimator of the population mean (Equation \@ref(eq:EstVarMeanSTSI)).
The estimated population variance is then divided by the sum of the stratum sample sizes to get an estimate of the sampling variance of the estimator of the mean with simple random sampling of an equal number of units:
\begin{equation}
\widehat{V}(\hat{\bar{z}}_{\text{SI}}) = \frac{\widehat{S^2}(z)}{\sum_{h=1}^{H}n_h}\;.
(\#eq:stratificationeffect)
\end{equation}
```{r, echo = FALSE}
ord <- unique(grdVoorst$stratum)
set.seed(314)
units <- sampling::strata(grdVoorst, stratanames = "stratum", size = n_h[ord], method = "srswr")
mysample <- getdata(grdVoorst, units)
labels <- sort(unique(mysample$stratum))
lut <- data.frame(stratum = labels, weight = N_h / n_h)
mysample <- merge(x = mysample, y = lut)
```
The population variance can be estimated with function `s2` of package **surveyplanning** [@surveyplanning]. However, this function is an implementation of an alternative, consistent estimator\index{Consistent estimator} of the population variance [@sar92]:
\begin{equation}
\widehat{S^2}(z) = \frac{N-1}{N} \frac{n}{n-1} \frac{1}{N-1} \sum_{k \in \mathcal{S}} \frac{(z_k - \hat{\bar{z}}_{\pi})^2}{\pi_k} \;.
(\#eq:EstimatorPopulationVariance4AnyDesign)
\end{equation}
```{block2, type = 'rmdnote'}
An estimator is consistent if it converges in probability to the true value of the parameter as the sample size tends to infinity [@sar92].
```
```{r}
library(surveyplanning)
S2z <- s2(mysample$z, w = mysample$weight)
```
The design effect\index{Design effect} is defined as the variance of an estimator of the population mean with the sampling design under study divided by the variance of the $\pi$ estimator of the mean with simple random sampling of an equal number of units (Section \@ref(DesignEffect)). So, the design effect of stratified random sampling is the reciprocal of the stratification effect. For the stratified simple random sample of Figure \@ref(fig:SampleSTSI), the design effect can then be estimated as follows. Function `SE` extracts the estimated standard error of the estimator of the mean from the output of function `svymean`. The extracted standard error is then squared to obtain an estimate of the sampling variance of the estimator of the population with stratified simple random sampling. Finally, this variance is divided by the variance with simple random sampling of an equal number of units.
```{r}
v_mz_SI <- S2z / n
res <- svymean(~ z, design_stsi)
SE(res)^2 / v_mz_SI
```
The same value is obtained with argument `deff` of function `svymean`.
```{r designeffectSTSI}
design_stsi <- svydesign(
id = ~ 1, strata = ~ stratum, weight = ~ weight, data = mysample)
svymean(~ z, design_stsi, deff = "replace")
```
So, when using package **survey**, estimation of the population variance is not needed to estimate the design effect. I only added this to make clear how the design effect is computed with functions in package **survey**. In following chapters I will skip the estimation of the population variance.
The estimated design effect as estimated from the stratified sample is smaller than 1, showing that stratified simple random sampling is more efficient than simple random sampling. The reciprocal of the estimated design effect (1.448) is somewhat larger than the stratification effect as computed in the sampling experiment, but this is an estimate of the design effect from one stratified sample only. The estimated population variance varies among stratified samples, and so does the estimated design effect.
Stratified simple random sampling with proportional allocation (Section \@ref(STSIallocation)) is more precise than simple random sampling when the sum of squares of the stratum means is larger than the sum of squares within strata [@loh99]:
\begin{equation}
SSB > SSW\;,
(\#eq:STSImoreprecisewhen)
\end{equation}
with SSB the weighted sum-of-squares between the stratum means:
\begin{equation}
SSB = \sum_{h=1}^H N_h (\bar{z}_h-\bar{z})^2 \;,
(\#eq:SSB)
\end{equation}
and SSW the sum over the strata of the weighted variances within strata (weights equal to $1-N_h/N$):
\begin{equation}
SSW = \sum_{h=1}^H (1-\frac{N_h}{N})S^2_h\;.
(\#eq:SSW)
\end{equation}
In other words, the smaller the differences in the stratum means and the larger the variances within the strata, the smaller the stratification effect will be. Figure \@ref(fig:boxplotsSOMstrata) shows a boxplot of the SOM concentration per stratum (soil-land use combination). The stratum means are equal to `r formatC(mz_h_pop, 1, format = "f")` g kg^-1^. The stratum variances are `r formatC(S2z_h_pop, 1, format = "f")` (g kg^-1^)^2^. The large stratum variances explain the modest gain in precision realised by stratified simple random sampling compared to simple random sampling in this case.
(ref:boxplotsSOMstratalabel) Boxplots of the SOM concentration (g kg^-1^) for the five strata (soil-land use combinations) in Voorst.
```{r boxplotsSOMstrata, out.width = '100%', fig.asp = 0.5, echo = FALSE, fig.cap = "(ref:boxplotsSOMstratalabel)"}
ggplot(data = grdVoorst) +
geom_boxplot(aes(y = z, x = stratum)) +
scale_y_continuous(name = "SOM") +
scale_x_discrete(name = "Stratum")
```
## Confidence interval estimate {#CISTSI}
The $100(1-\alpha)$\% confidence interval for $\bar{z}$ is given by
\begin{equation}
\hat{\bar{z}} \pm t_{\alpha /2, df}\cdot
\sqrt{\widehat{V}\!\left(\hat{\bar{z}}\right)} \;,
(\#eq:CISTSI)
\end{equation}
where $t_{\alpha /2,df}$ is the $(1-\alpha /2)$ quantile of a *t* distribution with $df$ degrees of freedom. The degrees of freedom $df$ can be approximated by $n-H$, as proposed by @loh99. This is the number of the degrees of freedom if the variances within the strata are equal. With unequal variances within strata, $df$ can be approximated by Sattherwaite's method\index{Sattherwaite's method} [@nan04]:
\begin{equation}
df \approx \frac {\left(\sum_{h=1}^H w_h^2
\frac{\widehat{S^2}_h(z)}{n_h}\right)^2} {\sum_{h=1}^H w_h^4
\left(\frac{\widehat{S^2}_h(z)}{n_h}\right)^2 \frac {1}{n_h-1}} \;.
(\#eq:dfSattherwaite)
\end{equation}
A confidence interval estimate of the population mean can be extracted with method `confint` of package **survey**. It uses $n-H$ degrees of freedom.
```{r}
res <- svymean(~ z, design_stsi)
df_stsi <- degf(design_stsi)
confint(res, df = df_stsi, level = 0.95)
```
## Allocation of sample size to strata {#STSIallocation}
After we have decided on the total sample size $n$, we must decide how to apportion the units to the strata. It is reasonable to allocate more sampling units to large strata and fewer to small strata. The simplest way to achieve this is proportional allocation\index{Allocation!proportional allocation}:
\begin{equation}
n_{h}=n \cdot \frac{N_{h}}{\sum N_{h}}\;,
(\#eq:propallocation)
\end{equation}
with $N_h$ the total number of population units (size) of stratum $h$. With infinite populations $N_h$ is replaced by the area $A_h$. The sample sizes computed with this equation are rounded to the nearest integers.
If we have prior information on the variance of the study variable within the strata, then it makes sense to account for differences in variance. Heterogeneous strata should receive more sampling units than homogeneous strata, leading to Neyman allocation\index{Allocation!Neyman allocation}:
\begin{equation}
n_{h}= n \cdot \frac{N_{h}\,S_{h}(z)}{\sum\limits_{h=1}^{H} N_{h}\,S_{h}(z)} \;,
(\#eq:Neymanallocation)
\end{equation}
with $S_h(z)$ the standard deviation (square root of variance) of the study variable $z$ in stratum $h$.
Finally, the costs of sampling may differ among the strata. It can be relatively expensive to sample nearly inaccessible strata, and we do not want to sample many units there. This leads to optimal allocation\index{Allocation!optimal allocation}:
\begin{equation}
n_{h}= n \cdot \frac{\frac{N_{h}\,S_{h}(z)}{\sqrt{c_{h}}}}{\sum\limits_{h=1}^{H} \frac{N_{h}\,S_{h}(z)}{\sqrt{c_{h}}}} \;,
(\#eq:optallocation)
\end{equation}
with $c_h$ the costs per sampling unit in stratum $h$. Optimal means that given the total costs this allocation type leads to minimum sampling variance, assuming a linear costs model\index{Linear costs model}:
\begin{equation}
C = c_0 + \sum_{h=1}^H n_h c_h \;,
(\#eq:linearcostmodel)
\end{equation}
with $c_0$ overhead costs. So, the more variable a stratum and the lower the costs, the more units will be selected from this stratum.
```{r allocation}
S2z_h <- tapply(X = grdVoorst$z, INDEX = grdVoorst$stratum, FUN = var)
n_h_Neyman <- round(n * N_h * sqrt(S2z_h) / sum(N_h * sqrt(S2z_h)))
```
These optimal sample sizes can be computed with function `optsize` of package **surveyplanning**.
```{r}
labels <- sort(unique(mysample$stratum))
res <- optsize(labels, n, N_h, S2z_h)
round(res$nh, 0)
```
Table \@ref(tab:tableallocation) shows the proportional and optimal sample sizes for the five strata of the study area Voorst, for a total sample size of 40. Stratum XF is the one-but-smallest stratum and therefore receives only seven sampling units. However, the standard deviation in this stratum is the largest, and as a consequence with optimal allocation the sample size in this stratum is increased by three points, at the cost of stratum EA which is relatively homogeneous.
```{r tableallocation, echo = FALSE}
strata <- sort(unique(grdVoorst$stratum))
taballoc <- data.frame(Stratum = strata, Nh = formatC(as.numeric(N_h), 0, format = "f", big.mark = ","), Sh = as.numeric(round(sqrt(S2z_h), 1)), nhprop = as.numeric(n_h), nhNeyman = as.numeric(n_h_Neyman))
knitr::kable(
taballoc, caption = "Proportional and Neyman sample sizes in stratified simple random sampling of Voorst with a total sample size of 40.",
booktabs = TRUE,
linesep = ""
) %>%
kable_classic() %>%
add_footnote("Nh: stratum size; Sh: stratum standard deviation.", notation = "none")
```
```{r sdmean_allocation, echo = FALSE}
n <- 20:50
S2z_h <- tapply(X = grdVoorst$z, INDEX = grdVoorst$stratum, FUN = var)
se_mz_pro <- se_mz_opt <- numeric(length = length(n))
for (i in seq_len(length(n))) {
n_h_pro <- n[i] * w_h
n_h_Neyman <- n[i] * N_h * sqrt(S2z_h) / sum(N_h * sqrt(S2z_h))
se_mz_pro[i] <- sqrt(sum(w_h^2 * S2z_h / n_h_pro))
se_mz_opt[i] <- sqrt(sum(w_h^2 * S2z_h / n_h_Neyman))
}
sdmean_SI <- sqrt(var(grdVoorst$z) / n)
df <- tibble(n = n, SI = sdmean_SI, STSIpro = se_mz_pro, STSIopt = se_mz_opt)
```
Figure \@ref(fig:plotsdmeanallocation) shows the standard error of the $\pi$ estimator of the mean SOM concentration as a function of the total sample size, for simple random sampling and for stratified simple random sampling with proportional and Neyman allocation. A small extra gain in precision can be achieved using Neyman allocation instead of proportional allocation. However, in practice often Neyman allocation is not achievable, because we do not know the standard deviations of the study variable within the strata. If a quantitative covariate $x$ is used for stratification (see Sections \@ref(cumrootf) and \@ref(Ospats)), the standard deviations $S_h(z)$ are approximated by $S_h(x)$, resulting in approximately optimal stratum sample sizes. The gain in precision compared to proportional allocation is then partly or entirely lost.
(ref:plotsdmeanallocationlabel) Standard error of the $\pi$ estimator of the mean SOM concentration (g kg^-1^) as a function of the total sample size, for simple random sampling (SI) and for stratified simple random sampling with proportional (STSI(prop)) and Neyman allocation (STSI(Neyman)) for Voorst.
```{r plotsdmeanallocation, echo = FALSE, out.width = "100%", fig.asp = 0.7, fig.cap = "(ref:plotsdmeanallocationlabel)"}
df_lf <- df %>% pivot_longer(cols = c("SI", "STSIpro", "STSIopt"))
ggplot(df_lf) +
geom_point(mapping = aes(x = n, y = value, shape = name), size = 2.5) +
scale_shape_manual(values = c(1, 0, 2), name = "Design", labels = c("SI", "STSI(Neyman)", "STSI(prop)")) +
scale_x_continuous(name = "Sample size") +
scale_y_continuous(name = "Standard error")
```
Optimal allocation and Neyman allocation assume univariate stratification\index{Univariate stratification}, i.e., the stratified simple random sample is used to estimate the mean of a single study variable. If we have multiple study variables, optimal allocation becomes more complicated. In Bethel allocation\index{Allocation!Bethel allocation}, the total sampling costs, assuming a linear costs model (Equation \@ref(eq:linearcostmodel)), are minimised given a constraint on the precision of the estimated mean for each study variable [@Bethel1989], see Section \@ref(MultivariateStratification). Bethel allocation can be computed with function `bethel` of package **SamplingStrata** [@Barcaroli2020].
#### Exercises {-}
1. Use function `strata` of package **sampling** to select a stratified simple random sample with replacement of size 40 from Voorst, using proportional allocation. Check that the sum of the stratum sample sizes is 40.
+ Estimate the population mean and the standard error of the estimator.
+ Compute the true standard error of the estimator. Hint: compute the population variances of the study variable $z$ per stratum, and divide these by the stratum sample sizes.
+ Compute a 95% confidence interval estimate of the population mean, using function `confint` of package **survey**.
2. Looking at Figure \@ref(fig:boxplotsSOMstrata), which strata do you expect can be merged without losing much precision of the estimated population mean?
3. Use function `fct_collapse` of package **forcats** [@forcats] to merge the strata EA and PA.
+ Compute the true sampling variance of the estimator of the mean for this new stratification, for a total sample size of 40 and proportional allocation.
+ Compare this true sampling variance with the true sampling variance using the original five strata (same sample size, proportional allocation). What is your conclusion about the new stratification?
4. Proof that the sum of the inclusion probabilities over all population units with stratified simple random sampling equals the sample size $n$.
## *Cum-root-f* stratification {#cumrootf}
When we have a quantitative covariate $x$ related to the study variable $z$ and $x$ is known for all units in the population, strata can be constructed with the *cum-root-f* method using this covariate as a stratification variable, see @Dalenius1959 and @coc77. Population units with similar values for the covariate (stratification variable) are grouped into a stratum. Strata are computed as follows:
1. Compute a frequency histogram of the stratification variable using a large number of bins.
2. Compute the square root of the frequencies.
3. Cumulate the square root of the frequencies, i.e., compute $\sqrt{f_1}$, $\sqrt{f_1} + \sqrt{f_2}$, $\sqrt{f_1} + \sqrt{f_2} + \sqrt{f_3}$, etc.
4. Divide the cumulative sum of the last bin by the number of strata, multiply this value by $1,2, \dots, H-1$, with *H* the number of strata, and select the boundaries of the histogram bins closest to these values.
In *cum-root-f* stratification\index{\emph{Cum-root-f} stratification}, it is assumed that the covariate values are nearly perfect predictions of the study variable, so that the prediction errors do not affect the stratification. Under this assumption the stratification is optimal.
*Cum-root-f* stratification is illustrated with the data of Xuancheng in China. We wish to estimate the mean organic matter concentration in the topsoil (SOM, g kg^-1^) of this area. Various covariates are available that are correlated with SOM, such as elevation, yearly average temperature, slope, and various other terrain attributes. Elevation, the name of this variable in the tibble is dem, is used as as a single stratification variable, see Figure \@ref(fig:DEMXuancheng). The correlation coefficient of SOM and elevation in a sample of 183 observations is 0.59. The positive correlation can be explained as follows. Temperature is decreasing with elevation, leading to a smaller decomposition rate of organic matter in the soil.
(ref:DEMXuanchenglabel) Elevation used as a stratification variable in *cum-root-f* stratification of Xuancheng.
```{r DEMXuancheng, echo = FALSE, out.width = "100%", fig.cap = "(ref:DEMXuanchenglabel)"}
ggplot(data = grdXuancheng) +
geom_raster(mapping = aes(x = s1 / 1000, y = s2 / 1000, fill = dem)) +
scale_fill_viridis_c(name = "Elevation") +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
```
The strata can be constructed with the package **stratification** [@Baillargeon2011]. Care should be taken that the data are sorted in ascending order by the variable used for stratification, see help of function `strata.cumrootf`. Argument `n` of this function is the total sample size, but this value has no effect on the stratification. Argument `Ls` is the number of strata. I arbitrarily chose to construct five strata. Argument `nclass` is the number of bins of the histogram. The output object of function `strata.cumrootf` is a list containing amongst others a numeric vector with the stratum bounds (`bh`) and a factor with the stratum levels of the grid cells (`stratumID`). Finally, note that the values of the stratification variable must be positive. The minimum elevation is -5 m, so I added the absolute value of this minimum to elevation.
```{r, echo = FALSE, eval = FALSE}
library(stratification)
grdXuancheng <- grdXuancheng[order(grdXuancheng$dem), ]
dem_new <- grdXuancheng$dem + abs(min(grdXuancheng$dem))
crfstrata <- strata.cumrootf(x = dem_new, n = 100, Ls = 5, nclass = 500)
bh <- crfstrata$bh
grdXuancheng$crfstrata <- crfstrata$stratumID
```
```{r}
library(stratification)
grdXuancheng <- grdXuancheng %>%
arrange(dem) %>%
mutate(dem_new = dem + abs(min(dem)))
crfstrata <- strata.cumrootf(
x = grdXuancheng$dem_new, n = 100, Ls = 5, nclass = 500)
bh <- crfstrata$bh
grdXuancheng$crfstrata <- crfstrata$stratumID
```
Stratum bounds are threshold values of the stratification variable elevation; these stratum bounds\index{Stratum bound} are equal to `r formatC(bh, 1, format = "f")`. Note that the number of stratum bounds is one less than the number of strata. The resulting stratification is shown in Figure \@ref(fig:optstrata). Note that most strata are not single polygons, but are made up of many smaller polygons. This may be even more so if the stratification variable shows a noisy spatial pattern. This is not a problem at all, as a stratum is just a collection of population units (raster cells) and need not be spatially contiguous.
(ref:optstratalabel) Stratification of Xuancheng obtained with the *cum-root-f* method, using elevation as a stratification variable.
```{r optstrata, echo = FALSE, out.width = "100%", fig.cap = "(ref:optstratalabel)"}
labels <- as.character(round(bh, 0))
first <- paste("<", labels[1])
second <- paste(labels[1], "-", labels[2])
third <- paste(labels[2], "-", labels[3])
fourth <- paste(labels[3], "-", labels[4])
fifth <- paste(">", labels[4])
labs <- c(first, second, third, fourth, fifth)
ggplot(data = grdXuancheng) +
geom_raster(mapping = aes(x = s1 / 1000, y = s2 / 1000, fill = factor(crfstrata))) +
scale_fill_viridis_d(name = "Elevation", labels = labs) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
```
#### Exercises {-}
5. Write an **R** script to compute five *cum-root-f* strata for Eastern Amazonia (`grdAmazonia` in package **sswr**) to estimate the population mean of aboveground biomass (AGB), using log-transformed short-wave infrared radiation (SWIR2) as a stratification variable.
+ Compute ten *cum-root-f* strata, using function `strata` of package **sampling**. Sort the units first in ascending order on lnSWIR2. Use the stratum sample sizes as computed by function `strata.cumrootf`. What allocation is used for computing the stratum sample sizes?
+ Select a stratified simple random sample of 100 units. First, compute the stratum sample sizes for proportional allocation.
+ Estimate the population mean of AGB and its sampling variance.
+ Compute the true sampling variance of the estimator of the mean for this sampling design (see Exercise 1 for a hint).
+ Compute the stratification effect (gain in precision). Hint: compute the sampling variance for simple random sampling by computing the population variance of AGB, and divide this by the total sample size.
## Stratification with multiple covariates {#kmeansstratification}
If we have multiple variables that are possibly related to the study variable, we may want to use them all or a subset of them as stratification variables. Using the quantitative variables one-by-one in *cum-root-f* stratification, followed by overlaying the maps with univariate strata, may lead to numerous cross-classification strata.
A simple solution is to construct homogeneous groups, referred to as clusters, of population units (raster cells). The units within a cluster are more similar to each other than to the units in other clusters. Various clustering techniques are available. Here, I use hard k-means.
This is illustrated again with the Xuancheng case study. Five quantitative covariates are used for constructing the strata. Besides elevation, which was used as a single stratification variable in the previous section, now also temperature, slope, topographic wetness index (twi), and profile curvature are used to construct clusters that are used as strata in stratified simple random sampling. To speed up the computations, a subgrid with a spacing of 0.4 km is selected, using function `spsample` of package **sp**, see Chapter \@ref(SY) [@Bivand2013].
```{r}
library(sp)
gridded(grdXuancheng) <- ~ s1 + s2
subgrd <- spsample(
grdXuancheng, type = "regular", cellsize = 400, offset = c(0.5, 0.5))
subgrd <- data.frame(coordinates(subgrd), over(subgrd, grdXuancheng))
```
```{r, echo = FALSE}
grdXuancheng <- as.data.frame(grdXuancheng)
```
Five clusters are computed with k-means using as clustering variables the five covariates mentioned above. The scale of these covariates is largely different, and for this reason they must be scaled before being used in clustering. The k-means algorithm is a deterministic algorithm, i.e., the same initial clustering will end in the same final, optimised clustering. This final clustering can be suboptimal, and therefore it is recommended to repeat the clustering as many times as feasible, with different initial clusterings. Argument `nstart` is the number of initial clusterings. The best clustering, i.e., the one with the smallest within-cluster sum-of-squares, is kept.
```{r}
x <- c("dem", "temperature", "slope", "profile.curvature", "twi")
set.seed(314)
myClusters <- kmeans(
scale(subgrd[, x]), centers = 5, iter.max = 1000, nstart = 100)
subgrd$cluster <- myClusters$cluster
```
Figure \@ref(fig:kmeansstrataXuancheng) shows the five clusters obtained by k-means clustering\index{k-means clustering} of the raster cells. These clusters can be used as strata in random sampling.
(ref:kmeansstrataXuanchenglabel) Five clusters obtained by k-means clustering of the raster cells of Xuancheng, using five scaled covariates in clustering.
```{r kmeansstrataXuancheng, echo = FALSE, out.width = "100%", fig.cap = "(ref:kmeansstrataXuanchenglabel)"}
ggplot(subgrd) +
geom_raster(mapping = aes(x = x1 / 1000, y = x2 / 1000, fill = as.character(cluster))) +
scale_fill_viridis_d(name = "Stratum") +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
```
The size of the clusters used as strata is largely different (Table \@ref(tab:tablekmeansstrata)). This table also shows means of the unscaled covariates used in clustering.
```{r tablekmeansstrata, echo = FALSE}
N_h <- tapply(subgrd$dem, INDEX = list(subgrd$cluster), FUN = length)
mean.dem <- tapply(subgrd$dem, INDEX = list(subgrd$cluster), FUN = mean)
mean.temp <- tapply(subgrd$temperature, INDEX = list(subgrd$cluster), FUN = mean)
mean.slope <- tapply(subgrd$slope, INDEX = list(subgrd$cluster), FUN = mean)
mean.pc <- tapply(subgrd$profile.curvature, INDEX = list(subgrd$cluster), FUN = mean)
mean.twi <- tapply(subgrd$twi, INDEX = list(subgrd$cluster), FUN = mean)
df <- data.frame(Stratum = seq(1:5), Nh = formatC(N_h, 0, format = "f", big.mark = ","), Elevation = round(mean.dem, 0), Temperature = round(mean.temp, 2), Slope = round(mean.slope, 2), Profilecurv = round(mean.pc, 5), Twi = round(mean.twi, 2))
knitr::kable(
df, caption = "Size (Nh) and means of clustering variables of the five strata of Xuancheng obtained with k-means clustering of raster cells.",
booktabs = TRUE,
linesep = ""
) %>%
kable_classic()
```
Categorical variables can be accommodated in clustering using the technique proposed by @Huang1998, implemented in package **clustMixType** [@clustMixType].
In the situation that we already have some data of the study variable, an alternative solution is to calibrate a model for the study variable, for instance a multiple linear regression model, using the covariates as predictors, and to use the predictions of the study variable as a single stratification variable in *cum-root-f* stratification or in optimal spatial stratification, see Section \@ref(Ospats).
## Geographical stratification {#geostrata}
When no covariate is available, we may still decide to apply a *geographical stratification*\index{Geographical stratification}. For instance, a square study area can be divided into 4 $\times$ 4 equal-sized subsquares that are used as strata. When we select one or two points per subsquare, we avoid strong spatial clustering of the sampling points. Geographical stratification improves the *spatial coverage*\index{Spatial coverage}. When the study variable is spatially structured, think for instance of a spatial trend, then geographical stratification will lead to more precisely estimated means (smaller sampling variances).
A simple method for constructing geographical strata is k-means clustering [@bru99]. See Section \@ref(SpatialCoverage) for a simple illustrative example of how geographical strata are computed with k-means clustering. In this approach, the study area is discretised by a large number of grid cells. These grid cells are the objects that are clustered. The clustering variables are simply the spatial coordinates of the centres of the grid cells. This method leads to compact geographical strata\index{Compact geographical stratum}, briefly referred to as geostrata\index{Geostratum|see{Compact geographical stratum}}. Geostrata can be computed with function `kmeans`, as shown in Section \@ref(kmeansstratification). The two clustering variables have the same scale, so they should not be scaled because this would lead to an arbitrary distortion of geographical distances. The geostrata generally will not have the same number of grid cells. Geostrata of equal size can be attractive, as then the sample becomes selfweighting, i.e., the sample mean is an unbiased estimator of the population mean.
Geostrata of the same size can be computed with function `stratify` of the package **spcosa** (@spcosa, @walvoort2010), with argument `equalArea = TRUE`.
If the total number of grid cells divided by the number of strata is an integer, the stratum sizes are exactly equal; otherwise, the difference is one grid cell. @walvoort2010 describe the k-means algorithms implemented in this package in detail. Argument `object` of function `stratify` specifies a spatial object of the population units. In the **R** code below `grdVoorst` is converted to a `SpatialPixelsDataFrame` with function `gridded` of the package **sp**. The spatial object can also be of class `SpatialPolygons`. In that case, either argument `nGridCells` or argument `cellSize` must be set, so that the vector map in `object` can be discretised by a finite number of grid cells. Argument `nTry` specifies the number of initial stratifications in k-means clustering, and therefore is comparable with argument `nstart` of function `kmeans`. For more details on spatial stratification using k-means clustering, see Section \@ref(SpatialCoverage). The k-means algorithm used with `equalArea = TRUE` takes much more computing time than the one used with `equalArea = FALSE`.
```{r spcosa, eval = FALSE}
library(spcosa)
library(sp)
set.seed(314)
gridded(subgrd) <- ~ x1 + x2
mygeostrata <- stratify(
object = subgrd, nStrata = 50, nTry = 1, equalArea = TRUE)
```
```{r, eval = FALSE, echo = FALSE}
write_rds(mygeostrata, file = "results/geostrata_Xuancheng.rds")
```
```{r, echo = FALSE}
subgrd <- as(subgrd, "data.frame")
mygeostrata <- read_rds(file = "results/geostrata_Xuancheng.rds")
```
Function `spsample` of package **spcosa** is used to select from each geostratum a simple random sample of two points.
```{r}
set.seed(314)
mysample <- spcosa::spsample(mygeostrata, n = 2)
```
Figure \@ref(fig:GeoStrata) shows fifty compact geostrata of equal size for Xuancheng with the selected sampling points. Note that the sampling points are reasonably well spread throughout the study area^[The compact geostrata and the sample are plotted with package **ggplot2**. A simple alternative is to use method `plot` of **spcosa**: `plot(mygeostrata, mysample)`.].
```{r GeoStrata, echo = FALSE, out.width = "100%", fig.cap = "Compact geostrata of equal size for Xuancheng and stratified simple random sample of two points per stratum."}
ggplot(as(mygeostrata, "data.frame")) +
geom_raster(mapping = aes(x = x1 / 1000, y = x2 / 1000, fill = factor(stratumId))) +
scale_fill_viridis_d(name = "geostratum") +
geom_point(data = as(mysample, "data.frame"), mapping = aes(x = x1 / 1000, y = x2 / 1000), size = 2, colour = "orange") +
coord_fixed() +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
theme(legend.position = "none")
```
Once the observations are done, the population mean can be estimated with function `estimate`. For Xuancheng I simulated data from a normal distribution, just to illustrate estimation with function `estimate`. Various statistics can be estimated, among which the population mean (spatial mean), the standard error, and the CDF. The CDF is estimated by transforming the data into indicators (Subsection \@ref(CDF)).
```{r}
library(spcosa)
mysample <- spcosa::spsample(mygeostrata, n = 2)
mydata <- data.frame(z = rnorm(100, mean = 10, sd = 2))
mean <- estimate("spatial mean", mygeostrata, mysample, data = mydata)
se <- estimate("standard error", mygeostrata, mysample, data = mydata)
cdf <- estimate("scdf", mygeostrata, mysample, data = mydata)
```
The estimated population mean equals `r formatC(mean, 1, format = "f")` with an estimated standard error of `r formatC(se, 1, format = "f")`.
#### Exercises {-}
6. Why is it attractive to select at least two points per geostratum?
7. The alternative to 50 geostrata and two points per geostratum is 100 geostrata and one point per geostratum. Which sampling strategy will be more precise?
8. The geostrata in Figure \@ref(fig:GeoStrata) have equal size (area), which can be enforced by argument `equalArea = TRUE`. Why are equal sizes attractive? Work out the estimator of the population mean for strata of equal size.
9. Write an **R** script to construct 20 compact geographical strata of equal size for agricultural field Leest. The geopackage file of this field can be read with **sf** function `read_sf(system.file("extdata/leest.gpkg", package = "sswr"))`. Remove the projection attributes with `st_set_crs(NA_crs_)`, and convert the simple feature object to a spatial object with method `as_Spatial`. Select two points per geostratum, using function `spsample` of package **spcosa**. Repeat this with 40 strata of equal size, and randomly select one point per stratum.
+ If only one point per stratum is selected, the sampling variance can be approximated by the collapsed strata\index{Collapsed strata} estimator. In this method, pairs of strata are formed, and the two strata of a pair are joined. In each new stratum we now have two points. With an odd number of strata there will be one group of three strata and three points. The sample is then analysed as if it were a random sample from the new collapsed strata. Suppose we group the strata on the basis of the measurements of the study variable. Do you think this is a proper way of grouping?
+ In case you think this is not a proper way of grouping the strata, how would you group the strata?
+ Is the sampling variance estimator unbiased? If not, is the sampling variance overestimated or underestimated?
10. Laboratory costs for measuring the study variable can be saved by bulking the soil aliquots\index{Soil aliquot} (composite sampling\index{Composite sampling}). There are two options: bulking all soil aliquots from the same stratum (bulking within strata) or bulking by selecting one aliquot from each stratum (bulking across strata). In **spcosa** bulking across strata is implemented. Write an **R** script to construct 20 compact geographical strata for study area Voorst. Use argument `equalArea = TRUE`. Select four points per stratum using argument `type = "composite"`, and convert the resulting object to `SpatialPoints`. Extract the $z$-values in `grdVoorst` at the selected sampling points using function `over`. Add a variable to the resulting data frame indicating the composite (points 1 to 4 are from the first stratum, points 5 to 8 from the second stratum, etc.), and estimate the means for the four composites using function `tapply`. Finally, estimate the population mean and its standard error.
+ Can the sampling variance of the estimator of the mean be estimated for bulking within the strata?
+ The alternative to analysing the concentration of four composite samples obtained by bulking across strata is to analyse all 20 $\times$ 4 aliquots separately. The strata have equal size, so the inclusion probabilities are equal. As a consequence, the sample mean is an unbiased estimator of the population mean. Is the precision of this estimated population mean equal to that of the estimated population mean with composite sampling? If not, is it smaller or larger, and why?
+ If you use argument `equalArea = FALSE` in combination with argument `type = "composite"`, you get an error message. Why does this combination of arguments not work?
## Multiway stratification
In Section \@ref(kmeansstratification) multiple continuous covariates are used to construct clusters of raster cells using k-means. These clusters are then used as strata. This section considers the case where we have multiple categorical and/or continuous variables that we would like to use as stratification variables. The continuous stratification variables are first used to compute strata based on that stratification variable, e.g., using the *cum-root-f* method. What could be done then is to compute the cross-classification of each unit and use these cross-classifications as strata in random sampling. However, this may lead to numerous strata, maybe even more than the intended sample size. To reduce the total number of strata, we may aggregate cross-classification strata\index{Cross-classification stratum} with similar means of the study variable, based on our prior knowledge.
An alternative to aggregation of cross-classification strata is to use the separate strata, i.e., the strata based on an individual stratification variable, as *marginal* strata in random sampling. How this works is explained in Subsection \@ref(Multiwaystratification).
## Multivariate stratification {#MultivariateStratification}
Another situation is where we have multiple study variables and would like to optimise the stratification and allocation for estimating the population means of all study variables. Optimal stratification for multiple study variables is only relevant if we would like to use different stratification variables for the study variables\index{Multivariate stratification}. In many cases, we do not have reliable prior information about the different study variables justifying the use of multiple stratification variables. We are already happy to have one stratification variable that may serve to increase the precision of the estimated means of all study variables.
However, in case we do have multiple stratification variables tailored to different study variables, the objective is to partition the population in strata, so that for a given allocation, the total sampling costs, assuming a linear costs model (Equation \@ref(eq:linearcostmodel)), are minimised given a constraint on the precision of the estimated mean for each study variable.
Package **SamplingStrata** [@Barcaroli2020] can be used to optimise multivariate strata. @Barcaroli2014 gives details about the objective function and the algorithm used for optimising the strata. Sampling units are allocated to the strata by Bethel allocation\index{Allocation!Bethel allocation} [@Bethel1989]. The required precision is specified in terms of a coefficient of variation\index{Coefficient of variation}, one per study variable.
Multivariate stratification is illustrated with the Meuse data set of package **gstat** [@peb04]. The prior data of heavy metal concentrations of Cd and Zn are used in spatial prediction to create maps of these two study variables.
The maps of natural logs of the two metal concentrations are created by kriging with an external drift, using the square root of the distance to the Meuse river as a predictor for the mean, see Section \@ref(IntroKED) for how this spatial prediction method works.
```{r, echo = FALSE}
library(sp)
library(gstat)
data("meuse")
coordinates(meuse) <- ~x + y
lcdr_vgm <- variogram(log(cadmium) ~ sqrt(dist), meuse)
lcdr_fit <- fit.variogram(lcdr_vgm, model = vgm(1, "Exp", 300, 1))
```
```{r, echo = FALSE}
lznr_vgm <- variogram(log(zinc) ~ sqrt(dist), meuse)
lznr_fit <- fit.variogram(lznr_vgm, model = vgm(1, "Exp", 300, 1))
```
Figure \@ref(fig:PredictedCd) shows the map with the predicted log Cd and log Zn concentrations.
```{r, echo = FALSE, results = 'hide'}
data(meuse.grid)
gridded(meuse.grid) <- ~ x + y
#kriging of Cd
lcd_kriged <- krige(
formula = log(cadmium) ~ sqrt(dist),
locations = meuse,
newdata = meuse.grid,
model = lcdr_fit,
debug.level = 0) %>% as("data.frame")
#kriging of Zn
lzn_kriged <- krige(
log(zinc) ~ sqrt(dist),
meuse, meuse.grid,
model = lznr_fit,
debug.level = 0) %>% as("data.frame")
```
```{r PredictedCd, echo = FALSE, out.width = "100%", fig.cap = "Kriging predictions of natural logs of Cd and Zn concentrations in the study area Meuse, used as stratification variables in bivariate stratification."}
df <- data.frame(x = lcd_kriged$x, y = lcd_kriged$y, lnCd = lcd_kriged$var1.pred, lnZn = lzn_kriged$var1.pred)
plt1 <- ggplot(data = df) +
geom_raster(mapping = aes(x = x / 1000, y = y / 1000, fill = lnCd)) +
scale_fill_viridis_c(name = "lnCd") +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
plt2 <- ggplot(data = df) +
geom_raster(mapping = aes(x = x / 1000, y = y / 1000, fill = lnZn)) +
scale_fill_viridis_c(name = "lnZn") +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
grid.arrange(plt1, plt2, nrow = 1)
```
The predicted log concentrations of the two heavy metals are used as stratification variables in designing a new sample for design-based estimation of the population means of Cd and Zn. For the log of Cd, there are negative predicted concentrations (Figure \@ref(fig:PredictedCd)). This leads to an error when running function ` optimStrata`. The minimum predicted log Cd concentration is -1.7, so I added 2 to the predictions. A variable indicating the domains of interest is added to the data frame. The value of this variable is 1 for all grid cells, so that a sample is designed for estimating the mean of the entire population. As a first step, function ` buildFrameDF` is used to create a data frame that can be handled by function `optimStrata`. Argument `X` specifies the stratification variables, and argument `Y` the study variables. In our case, the stratification variables and the study variables are the same. This is typical for the situation where the stratification variables are obtained by mapping the study variables.
```{r}
library(SamplingStrata)
df <- data.frame(cd = lcd_kriged$var1.pred + 2,
zn = lzn_kriged$var1.pred,
dom = 1,
id = seq_len(nrow(lcd_kriged)))
frame <- buildFrameDF(
df = df, id = "id",
X = c("cd", "zn"), Y = c("cd", "zn"),
domainvalue = "dom")
```
Next, a data frame with the precision requirements for the estimated means is created. The precision requirement is given as a coefficient of variation, i.e., the standard error of the estimated population mean, divided by the estimated mean. The study variables as specified in `Y` are used to compute the estimated means and the standard errors for a given stratification and allocation.
```{r}
cv <- as.data.frame(list(DOM = "DOM1", CV1 = 0.02, CV2 = 0.02, domainvalue = 1))
```
Finally, the multivariate stratification is optimised by searching for the optimal stratum bounds using a genetic algorithm [@ger99].
```{r, eval = FALSE}
set.seed(314)
res <- optimStrata(
method = "continuous", errors = cv, framesamp = frame, nStrata = 5,
iter = 50, pops = 20, showPlot = FALSE)
```
```{r, echo = FALSE, eval = FALSE}
save(res, file = "results/multivariatestrata_Meuse.rda")
```
```{r, echo = FALSE}
load(file = "results/multivariatestrata_Meuse.rda")
```
A summary of the strata can be obtained with function `summaryStrata`.
```{r}
smrstrata <- summaryStrata(res$framenew, res$aggr_strata, progress = FALSE)
```
```{r, echo = FALSE}
smrstrata <- smrstrata[, c(2, 3, 4, 6, 7, 8, 9)]
smrstrata[, c(4, 5, 6, 7)] <- round(smrstrata[, c(4, 5, 6, 7)], 3)
print(smrstrata)
```
Column `Population` contains the sizes of the strata, i.e., the number of grid cells. The total sample size equals `r sum(smrstrata$Allocation)`. The sample size per stratum is computed with Bethel allocation, see Section \@ref(STSIallocation). The last four columns contain the lower and upper bounds of the orthogonal intervals.
Figure \@ref(fig:2dplotbivariatestrata) shows a 2D-plot of the bivariate strata. The strata can be plotted as a series of nested rectangles. All population units in the smallest rectangle belong to stratum 1; all units in the one-but-smallest rectangle that are not in the smallest rectangle belong to stratum 2, etc. If we have more than two stratification variables, the strata form a series of nested hyperrectangles or boxes. The strata are obtained as the Cartesian product of orthogonal intervals.
```{r 2dplotbivariatestrata, fig.cap = "2D-plot of optimised bivariate strata of the study area Meuse."}
plt <- plotStrata2d(res$framenew, res$aggr_strata,
domain = 1, vars = c("X1", "X2"), labels = c("Cd", "Zn"))
```
```{block2, type = 'rmdnote'}
It may happen that after the optimisation of the stratum bounds in some resulting strata, no units are contained. If the stratification with a smaller number of strata requires fewer sampling units so that the sampling costs are lower (and still the precision requirement is met), then this is retained as the optimal stratification.
```
Figure \@ref(fig:OptimisedstrataMeuse) shows a map of the optimised strata.
```{r OptimisedstrataMeuse, echo = FALSE, fig.width = 5, fig.cap = "Map of optimised bivariate strata of the study area Meuse."}
dfnew <- data.frame(coordinates(meuse.grid), res$framenew)
ggplot(dfnew) +
geom_raster(mapping = aes(x = x / 1000, y = y / 1000, fill = factor(STRATO))) +
scale_fill_viridis_d(name = "Stratum") +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
```
The expected coefficient of variation can be extracted with function `expected_CV`.
```{r}
expected_CV(res$aggr_strata)
```
The coefficient of variation of Cd is indeed equal to the desired level of 0.02, for Zn it is smaller. So, in this case Cd is the study variable that determines the total sample size of 26 units.
Note that these coefficients of variation are computed from the stratification variables, which are predictions of the study variable. Errors in these predictions are not accounted for. It is well known that kriging is a smoother, so that the variance of the predicted values within a stratum is smaller than the variance of the true values. As a consequence, the coefficients of variation of the predictions underestimate the coefficients of variation of the study variables. See Section \@ref(Ospats) for how prediction errors and spatial correlation of prediction errors can be accounted for in optimal stratification. An additional problem is that I added a value of 2 to the log Cd concentrations. This does not affect the standard error of the estimated mean, but does affect the estimated mean, so that also for this reason the coefficient of variation of the study variable Cd is underestimated.
```{r, echo = FALSE}
rm(list = ls())
```