\setchapterpreamble[u]{\margintoc}
\chapter{Statistical inference}
\label{chap:statistical_inference}
In this chapter, we introduce the three main \emph{statistical inference} problems: \emph{estimation}, \emph{confidence intervals} and \emph{tests}.
Each problem will be instantiated with the simple Bernoulli model, where we have iid samples $X_1, \ldots, X_n$ distributed as $\ber(\theta)$ with $\theta \in (0, 1)$.
Let us start with the first inference problem: \emph{estimation}.
\section{Estimation} % (fold)
\label{sec:estimation}
We want to \emph{infer} $\theta$, or \emph{estimate} it, by finding a statistic, namely a measurable function of $(X_1, \ldots, X_n)$%
\sidenote[][*4]{Once again, since we are doing statistics, the only thing we are allowed to use is the data.}
or simply a measurable function of $S_n = \sum_{i=1}^n X_i$, since $S_n$ is sufficient (see Section~\ref{sec:statistics}).
We will denote such a statistic as
\begin{equation*}
\wh \theta_n = \wh \theta_n(X_1, \ldots, X_n).
\end{equation*}
This function \emph{does not depend} on $\theta$, but of course its distribution does.
Ideally, we want $\wh \theta_n$ to be ``close'' to $\theta$, since we want a good estimator, so that the first thing we need to do is to quantify ``closeness''.
For instance, we could ask that $|\wh \theta_n - \theta|$ be close to $0$ with a large probability: recall that $\wh \theta_n$ is a random variable, as a function of the data $(X_1, \ldots, X_n)$.
The most natural distance is arguably the Euclidean one, in this context the $L^2$ distance,
which leads to the \emph{quadratic risk}.%
\sidenote{Although the quadratic risk corresponds to a \emph{squared} $L^2$ norm.}
\begin{definition}[Quadratic risk]
\label{def:quadratic_risk}
Consider a statistical model with data $X$ and set of parameters $\Theta \subset \R$ and an estimator $\wh \theta(X)$.
The quadratic risk of $\wh \theta$ is given by
\begin{equation*}
R(\wh \theta, \theta) = \E_\theta[ (\wh \theta - \theta)^2 ] = \int_E (\wh \theta(x) - \theta)^2 P_\theta(dx).
\end{equation*}
We consider the quadratic risk as a function $\Theta \goes \R^+$ of the parameter given by $\theta \mapsto R(\wh \theta, \theta)$.
\end{definition}
At this point, it's useful to recall some classical inequalities on the tails of random variables.
The Markov inequality tells us that if $Y$ is a real random variable such that $\E |Y|^p < +\infty$ for some $p > 0$ then
\begin{equation*}
\P[|Y| > t] \leq \frac{\E |Y|^p}{t^p}
\end{equation*}
for any $t > 0$.
This tells us that the more moments $Y$ has,%
\sidenote{We say that $Y$ has moments up to order $p$ if $\E |Y|^p < +\infty$.
Note that this entails $\E |Y|^q < +\infty$ for any $0 < q < p$ since $\E |Y|^p = \E [(|Y|^q)^{p/q}] \geq (\E |Y|^q)^{p/q}$ using Jensen's inequality.}
the lighter the tail of $Y$ is (it goes to $0$ faster as $t \goes +\infty$).
Markov's inequality with $p=2$ entails
\begin{equation}
\label{eq:l2_entrails_proba}
\P[|\wh \theta - \theta| > t] \leq \frac{R(\wh \theta, \theta)}{t^2}
\end{equation}
which tells us that whenever the quadratic risk is small, then $\wh \theta$ is close to $\theta$ with a large probability.
Whenever $R(\wh \theta_n, \theta) \rightarrow 0$ as $n \rightarrow +\infty$, we write $\wh \theta_n \goqr \theta$, which stands for convergence in $L^2$ norm.
Because of Inequality~\eqref{eq:l2_entrails_proba}, this entails $\wh \theta_n \gopro \theta$, which stands for convergence in probability.\sidenote{More precisely, in $\P_\theta$-probability, namely $\P_\theta[|\wh \theta_n - \theta| > \eps] \rightarrow 0$ as $n \rightarrow +\infty$ for any $\eps > 0$, but we will write $\wh \theta_n \gopro \theta$ in order to keep the notations as simple as possible.}%
\begin{definition}
\label{def:consistent}
We say that $\wh \theta_n$ is \emph{consistent} whenever $\P_\theta[|\wh \theta_n - \theta| > \eps] \rightarrow 0$ as $n \rightarrow +\infty$ for any $\eps > 0$ and any $\theta \in \Theta$.
We say that it is strongly consistent whenever $\P_\theta[\wh \theta_n \rightarrow \theta] = 1$ for any $\theta \in \Theta$.
\end{definition}
In Definitions~\ref{def:quadratic_risk} and~\ref{def:consistent} above, if $\Theta \subset \R^d$, it suffices to replace $|\cdot|$ by the Euclidean norm $\norm{\cdot}_2$, where $\norm{x}_2 = (x^\top x)^{1/2} = (\sum_{j=1}^d x_j^2)^{1/2}$.
\paragraph{Bias-variance decomposition.} % (fold)
The \emph{bias-variance decomposition} splits the quadratic risk into two terms: a bias term denoted $b(\wh \theta, \theta)$ (squared in the formula) and a variance term:
\begin{equation}
\label{eq:bias-variance-decomposition}
\begin{split}
R(\wh \theta, \theta) &= \E_\theta[(\wh \theta - \theta)^2] = (\E_\theta [\wh \theta] - \theta)^2 + \var_\theta[ \wh \theta ] \\
&= b(\wh \theta, \theta)^2 + \var_\theta [\wh \theta].
\end{split}
\end{equation}
When $b(\wh \theta, \theta) = 0$ for all $\theta \in \Theta$ we say that the estimator $\wh \theta$ is \emph{unbiased}.
This means that, on average, this estimator neither over-estimates nor under-estimates $\theta$, since its expectation equals $\theta$.
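The decomposition~\eqref{eq:bias-variance-decomposition} follows from a one-line computation, which we spell out as a sanity check. Writing $m = \E_\theta[\wh \theta]$ (a notation used only here) and expanding the square gives
\begin{align*}
\E_\theta[(\wh \theta - \theta)^2]
&= \E_\theta[(\wh \theta - m)^2] + 2 (m - \theta) \E_\theta[\wh \theta - m] + (m - \theta)^2 \\
&= \var_\theta[\wh \theta] + (m - \theta)^2,
\end{align*}
since $m - \theta$ is deterministic and $\E_\theta[\wh \theta - m] = 0$, so that the cross term vanishes.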
\paragraph{Back to Bernoulli.} % (fold)
Going back to the $\ber(\theta)$ model, we consider the estimator $\wh \theta_n = S_n / n = n^{-1} \sum_{i=1}^n X_i$.
We already know many things about this estimator:
\begin{enumerate}
\item We have $\E_\theta [\wh \theta_n] = \theta$ which means that $\wh \theta_n$ is unbiased;
\item The bias-variance decomposition gives
\begin{equation}
\label{eq:bernoulli-quadratic-risk}
R(\wh \theta_n, \theta) = \var_\theta[\wh \theta_n] = \frac{\theta (1 - \theta)}{n} \leq
\frac{1}{4 n} \rightarrow 0
\end{equation}
which means that $\wh \theta_n \goqr \theta$ and which entails that $\wh \theta_n$ is consistent;
\item The strong law of large numbers tells us that $\wh \theta_n \goas \theta$, hence $\wh \theta_n$ is strongly consistent;
\item The central limit theorem tells us that
\begin{equation}
\label{eq:tcl-bernoulli}
\sqrt n (\wh \theta_n - \theta) \leadsto \nor(0, \theta(1 - \theta)).
\end{equation}
\end{enumerate}
Points 2--4 above are all different ways of saying that when $n$ is large, $\wh \theta_n$ is close to $\theta$.
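For completeness, here is the short computation behind~\eqref{eq:bernoulli-quadratic-risk}. Since the $X_i$ are independent with $\var_\theta[X_i] = \theta(1 - \theta)$, we have
\begin{equation*}
\var_\theta[\wh \theta_n] = \frac{1}{n^2} \sum_{i=1}^n \var_\theta[X_i] = \frac{\theta (1 - \theta)}{n},
\quad \text{while} \quad
\theta (1 - \theta) = \frac 14 - \Big( \theta - \frac 12 \Big)^2 \leq \frac 14,
\end{equation*}
with equality if and only if $\theta = 1/2$: the fair coin is the hardest one to estimate in quadratic risk.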
In practice, an estimator leads to a value: for the Bernoulli experiment with $n=100$ and $42$ ones you end up with a single estimated value $0.42$.
But what if we want to include uncertainty in this estimation?
Namely how confident are we about this $0.42$ value?
Moreover, what do we mean by ``when $n$ is large enough''?
Can we quantify this somehow?
These questions can be answered by considering another inference problem: confidence intervals.
\section{Confidence intervals} % (fold)
\label{sec:confidence_intervals}
Here, we don't only want to build an estimator $\wh \theta_n$ but also to quantify the uncertainty associated with this estimation.
\subsection{Non-asymptotic coverage} % (fold)
\label{sub:non_asymptotic_coverage}
% subsection non_asymptotic_coverage (end)
Combining Inequalities~\eqref{eq:l2_entrails_proba} and~\eqref{eq:bernoulli-quadratic-risk} leads to
\begin{equation*}
\P_\theta[ |\wh \theta_n - \theta| > t] \leq \frac{1}{4 n t^2},
\end{equation*}
so that for $\alpha \in (0, 1)$ and the choice $t_\alpha = 1 / (2 \sqrt{n \alpha})$ we have
\begin{equation}
\label{eq:bernoulli-first-ci}
\P_\theta \big\{ \theta \in [ \wh \theta_n^L, \wh \theta_n^R ] \big\} \geq 1 - \alpha
\end{equation}
for any $\theta \in (0,1 )$, where
\begin{equation*}
\wh \theta_n^L := \wh \theta_n - \frac{1}{2 \sqrt{n \alpha}} \quad \text{ and }
\quad \wh \theta_n^R := \wh \theta_n + \frac{1}{2 \sqrt{n \alpha}}.
\end{equation*}
Therefore, if we choose $\alpha = 0.05 = 5\%$, we know that $\theta \in [\wh \theta_n^L, \wh \theta_n^R]$ with a probability larger than $95\%$.
We say in this case that the interval $[\wh \theta_n^L, \wh \theta_n^R]$ is a \emph{confidence interval} with \emph{coverage} $95\%$.%
\sidenote{If we toss the coin $1000$ times and get $420$ heads, the realization of this confidence interval at $95\%$ is $[0.35, 0.49]$.}
If $\alpha = 0$ we have no other choice than using the whole $\R$ as a confidence interval: $\alpha$ provides us with some slack, so that we can build a non-absurdly large confidence interval.
We have that $|\wh \theta_n^R - \wh \theta_n^L|$ increases as $\alpha$ decreases, since a smaller $\alpha$ means more confidence, hence a larger interval.
On the contrary, $|\wh \theta_n^R - \wh \theta_n^L|$ decreases with the sample size $n$.
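As an illustration, we can turn this into a rough sample size rule (the target half-width $\delta$ below is a notation introduced only for this example). The half-width of the interval is at most $\delta$ as soon as
\begin{equation*}
\frac{1}{2 \sqrt{n \alpha}} \leq \delta, \quad \text{namely} \quad n \geq \frac{1}{4 \alpha \delta^2}.
\end{equation*}
For instance, with $\alpha = 0.05$ and $\delta = 0.01$, this requires $n \geq 1 / (4 \times 0.05 \times 10^{-4}) = 50\,000$ tosses.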
\begin{definition}[Confidence interval]
Consider a statistical model with data $X$ and set of parameters $\Theta \subset \R$.
Fix $\alpha \in (0, 1)$ and consider two statistics $\wh \theta^L(X)$ and $\wh \theta^R(X)$. Whenever
\begin{equation}
\label{eq:coverage-property}
\P_\theta \big\{ \theta \in [\wh \theta^L(X), \wh \theta^R(X)] \big\} \geq 1 - \alpha
\end{equation}
for any $\theta \in \Theta$, we say that $[\wh \theta^L(X), \wh \theta^R(X)]$ is a \emph{confidence interval} at \emph{level} or \emph{coverage} $1 - \alpha$.
\end{definition}
Inequality~\eqref{eq:coverage-property} is called the \emph{coverage} property of the confidence interval.
More generally, when $\Theta \subset \R^d$, we will say that $S(X)$ is a \emph{confidence set} if it is a statistic satisfying the coverage property $\P_\theta[ \theta \in S(X) ] \geq 1 - \alpha$ for any $\theta \in \Theta$.
\begin{remark}
Whenever we need only an upper or lower bound on $\theta$ (for instance, when we need to check statistically that some toxicity level is below some threshold), we build a \emph{unilateral} or \emph{one-sided} confidence interval, where we choose either $\wh \theta^L = -\infty$ ($0$ for the Bernoulli model) or $\wh \theta^R = +\infty$ ($1$ for the Bernoulli model).
Indeed, at a fixed level $1 - \alpha$, the bound provided by a one-sided confidence interval is tighter than the bound of a two-sided interval.
\end{remark}
But we can do better for the Bernoulli model (or any model where samples are bounded almost surely) thanks to the following Hoeffding inequality.
\begin{theorem}[Hoeffding]
\label{thm:hoeffding}
Let $X_1, \ldots, X_n$ be independent random variables such that $X_i \in [a_i, b_i]$ almost surely and let $S = \sum_{i=1}^n X_i$. Then,
\begin{equation*}
\P[ S \geq \E S + t] \leq \exp\Big( - \frac{2 t^2}{\sum_{i=1}^n (b_i - a_i)^2} \Big)
\end{equation*}
holds for any $t >0$.
\end{theorem}
Theorem~\ref{thm:hoeffding} is a so-called \emph{deviation inequality}: it controls the probability that $S$ deviates from its mean.
It shows that bounded random variables are \emph{sub-Gaussian}, since the tail of $S - \E S$ is bounded by $\exp(-c t^2)$ for some constant $c$ (that depends on $n$).
The proof of Theorem~\ref{thm:hoeffding} is provided in Section~\ref{sec:chap03-proofs}.
\paragraph{Back to Bernoulli.} % (fold)
% paragraph back_to_bernoulli (end)
Let's apply Theorem~\ref{thm:hoeffding} to the Bernoulli model $X_i \sim \ber(\theta)$ so that $a_i = 0$, $b_i = 1$ and therefore $\P[ S \geq \E S + t] \leq e^{-2 t^2 / n}$.
Using Theorem~\ref{thm:hoeffding} again with $X_i$ replaced by $-X_i$, together with a union bound,%
\sidenote[][*-3]{Using Theorem~\ref{thm:hoeffding} with $X_i$ replaced by $-X_i$ gives $\P[-S + \E S \geq t] \leq e^{-2 t^2 / n}$, so that $\P[|S - \E S| \geq t] \leq \P[S - \E S \geq t] + \P[S - \E S \leq -t] \leq 2 e^{-2 t^2 / n}$.}
leads to $\P[ | S - \E S | \geq t] \leq 2 e^{-2 t^2 / n}$.
So, for some $\alpha \in (0, 1)$, we obtain another confidence interval, since the following coverage property holds:
\begin{equation*}
\P \bigg[ \wh \theta_n - \sqrt{\frac{\log(2 / \alpha)}{2n}} \leq \theta \leq \wh \theta_n
+ \sqrt{\frac{\log(2 / \alpha)}{2n}} \bigg] \geq 1 - \alpha.
\end{equation*}
This proves that $[\wh \theta_n \pm \sqrt{\log(2 / \alpha) / (2n)}]$ is a confidence interval at level $1 - \alpha$.%
\sidenote[][*-3]{For $1000$ tosses and $420$ heads, the realization of this interval at level $95\%$ is $[0.38, 0.46]$. It's a bit more precise than the previous one, which was based on Markov's inequality.}
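To see where the half-width $\sqrt{\log(2 / \alpha) / (2n)}$ comes from, recall that $S = n \wh \theta_n$ and $\E S = n \theta$, and choose $t > 0$ so that the right-hand side of the deviation bound equals $\alpha$:
\begin{equation*}
2 e^{-2 t^2 / n} = \alpha \quad \Longleftrightarrow \quad t = \sqrt{\frac{n \log(2 / \alpha)}{2}} \quad \Longleftrightarrow \quad \frac{t}{n} = \sqrt{\frac{\log(2 / \alpha)}{2n}}.
\end{equation*}
As a rough illustration (with a target half-width $\delta$, a notation used only here), the half-width is at most $\delta$ as soon as $n \geq \log(2 / \alpha) / (2 \delta^2)$. For $\alpha = 0.05$ and $\delta = 0.01$ we have $\log(40) / (2 \times 10^{-4}) \approx 18\,444.4$, so that $n \geq 18\,445$ tosses are enough.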
Let's compare the two confidence intervals we obtained so far for the Bernoulli model.
It can be seen that
\begin{equation*}
\frac{1}{2 \sqrt{n \alpha}} > \sqrt{\frac{\log(2 / \alpha)}{2n}}
\end{equation*}
for $\alpha < 0.23$, although both sides are $O(1 / \sqrt n)$.
Only the dependence on the level $\alpha$ is improved with the confidence interval obtained through Hoeffding's inequality, since it exploits the sub-Gaussianity of the Bernoulli distribution, while the first confidence interval~\eqref{eq:bernoulli-first-ci} only used Inequality~\eqref{eq:l2_entrails_proba} together with an upper bound on the variance.
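
Both claims can be checked by hand. Squaring both sides of the inequality above and multiplying by $2n$ shows that it is equivalent to
\begin{equation*}
\frac{1}{2 \alpha} > \log\Big( \frac{2}{\alpha} \Big),
\end{equation*}
which no longer involves $n$. The ratio between the two half-widths is also free of $n$:
\begin{equation*}
\sqrt{\frac{\log(2 / \alpha)}{2n}} \Big/ \frac{1}{2 \sqrt{n \alpha}} = \sqrt{2 \alpha \log(2 / \alpha)},
\end{equation*}
which is approximately $0.61$ for $\alpha = 0.05$ and $0.33$ for $\alpha = 0.01$, and goes to $0$ as $\alpha \goes 0$.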
There is yet another way to build a confidence interval, called an \emph{exact} confidence interval.
Let us denote by $F_{n, \theta}$ the distribution function of $\bin(n, \theta)$.
It is given by
\begin{equation}
\label{eq:binomial_distribution}
F_{n, \theta}(x) = \sum_{k=0}^{[x]}\binom{n}{k} \theta^k (1 - \theta)^{n - k}
\end{equation}
for $x \in [0, n]$, where $[x]$ stands for the integer part of $x$, while $F_{n, \theta}(x) = 0$ if $x < 0$ and $F_{n, \theta}(x) = 1$ if $x \geq n$.
We can consider the generalized inverse $F_{n, \theta}^{-1}$ of $F_{n, \theta}$, also called the \emph{quantile function} of $\bin(n, \theta)$, for which we know that $F_{n, \theta}^{-1}(\alpha) \leq F_{n, \theta'}^{-1}(\alpha)$ for any $\theta \leq \theta'$ and $\alpha \in (0, 1)$.%
\sidenote{See Proposition~\ref{prop:stochastic-ordering} below and its proof for details on this generalized inverse and its properties, together with Example~\ref{ex:coupling-binomial}.}
Because of this, we know that the set $\{ \theta \in (0, 1) : F_{n, \theta}^{-1}(\alpha / 2) \leq n \wh \theta_n \leq F_{n, \theta}^{-1}(1 - \alpha / 2) \}$ is an interval, so that defining
\begin{equation*}
\wh \theta^L = \inf\{ \theta \in (0, 1) : F_{n, \theta}^{-1}(1 - \alpha / 2) \geq n \wh \theta_n \}
\end{equation*}
and
\begin{equation*}
\wh \theta^R = \sup\{ \theta \in (0, 1) : F_{n, \theta}^{-1}(\alpha / 2) \leq n \wh \theta_n \}
\end{equation*}
leads to the coverage property
\begin{align*}
\P_\theta \big\{ \theta \in [\wh \theta^L, \wh \theta^R] \big\}
&\geq \P_\theta\big[ F_{n, \theta}^{-1}(\alpha / 2) \leq n \wh \theta_n \leq F_{n, \theta}^{-1}(1 - \alpha / 2) \big] \\
&\geq 1 - \alpha / 2 - \alpha / 2 = 1 - \alpha
\end{align*}
since $n \wh \theta_n \sim \bin(n, \theta)$.
We get inequalities rather than equalities because $\bin(n, \theta)$ is a discrete distribution, so that its quantiles cannot match the levels $\alpha / 2$ and $1 - \alpha / 2$ exactly.
This confidence interval is called ``exact'' since it uses the exact quantile function of $n \wh \theta_n$ instead of an upper bound on its tails. It is typically tighter than the previous ones.
% \todo{Value for 1000 tosses and 420 heads}
\subsection{Asymptotic coverage} % (fold)
\label{sub:asymptotic_coverage}
% subsection asymptotic_coverage (end)
% subsection subsection_name (end)
% \paragraph{Asymptotic coverage.}
For the previous confidence intervals, we adopted a \emph{non-asymptotic} approach: the coverage properties hold for any value of $n \geq 1$.
This was possible since the distribution of $S_n$ is a simple $\bin(n, \theta)$ distribution, for which many computations can be made explicit.
However, in general, the \emph{exact} distribution of an estimator $\wh \theta_n$ cannot always be exhibited, and in such cases, we often use Gaussian approximations, thanks to the central limit theorem.
Let's do this for the Bernoulli model.
We know from~\eqref{eq:tcl-bernoulli} that
\begin{equation}
\label{eq:portemanteau-bernoulli}
\P_\theta \bigg[ \sqrt{\frac{n}{\theta (1 - \theta)}} (\wh \theta_n - \theta) \in I
\bigg] \rightarrow \P[Z \in I]
\end{equation}
for any interval $I \subset \R$, where $Z \sim \nor(0, 1)$.%
\sidenote{This uses the portmanteau theorem, which says that $X_n \gosto X$ if and only if $\P[X_n \in A] \goes \P[X \in A]$ for any Borel set $A$ such that $\P[X \in \partial A] = 0$, where $\partial A$ stands for the boundary of $A$.}
Using $I = [-q_\alpha, q_\alpha]$ with $q_\alpha = \Phi^{-1}(1 - \alpha / 2)$ we end up%
\sidenote{We recall that $\Phi^{-1}$ is the \emph{quantile} function of $\nor(0, 1)$, namely the inverse of the distribution function $\Phi(x) = \P[Z \leq x]$ with $Z \sim \nor(0, 1)$.}
with
\begin{equation}
\label{eq:not-and-ic}
\P_\theta\bigg\{ \theta \in \Big[ \wh \theta_n \pm q_\alpha \sqrt{\frac{\theta(1 - \theta)}{n}} \Big] \bigg\} \rightarrow 1 - \alpha.
\end{equation}
This is interesting, but not enough to build a confidence interval, since the interval in~\eqref{eq:not-and-ic} depends on $\theta$ through the variance term $\theta(1 - \theta)$.
Indeed, a confidence interval must be something that does \emph{not} depend on $\theta$.
We need to work a little bit more in order to remove the dependence on $\theta$ from this interval.
We can do the same as before: we use the fact that $\theta (1 - \theta) \leq 1 / 4$ for any $\theta \in [0, 1]$, so that
\begin{equation}
\label{eq:binomial-ci-excess}
\liminf_n \P_\theta \bigg\{ \theta \in \Big [\wh \theta_n \pm \frac{q_\alpha}{2 \sqrt n} \Big] \bigg\} \geq 1 - \alpha.
\end{equation}
This is what we call a confidence interval \emph{asymptotically of level} $1 - \alpha$ constructed \emph{by excess}.
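Let us spell out the step from~\eqref{eq:not-and-ic} to~\eqref{eq:binomial-ci-excess} and give a numerical illustration. Since $\sqrt{\theta (1 - \theta)} \leq 1 / 2$, we have the inclusion
\begin{equation*}
\Big[ \wh \theta_n \pm q_\alpha \sqrt{\frac{\theta(1 - \theta)}{n}} \Big] \subset \Big[ \wh \theta_n \pm \frac{q_\alpha}{2 \sqrt n} \Big],
\end{equation*}
so that the probability that $\theta$ belongs to the larger interval is at least the probability that it belongs to the smaller one, which goes to $1 - \alpha$.
For $\alpha = 0.05$ we have $q_\alpha \approx 1.96$, so that with $1000$ tosses and $420$ heads the half-width is $1.96 / (2 \sqrt{1000}) \approx 0.031$ and the realization of this interval is approximately $[0.39, 0.45]$.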
In the asymptotic confidence interval~\eqref{eq:binomial-ci-excess}, we used the central limit theorem to approximate the distribution of $\sqrt n(S_n / n - \theta)$ by a Gaussian distribution.
This requires $n$ to be ``large enough'', but the central limit theorem does not tell us how large.
We can quantify this better by assessing how close the distribution function of $\sqrt n(S_n / n - \theta)$ is to the one of the Gaussian distribution, using the following theorem.
\begin{theorem}[Berry--Esseen]
\label{thm:berry-esseen}
Let $X_1, \ldots, X_n$ be iid random variables such that $\E [X_i] = 0$ and $\var[X_i] = \sigma^2$ and introduce the distribution function%
\begin{equation*}
F_n(x) = \P \bigg[ \frac{\sum_{i=1}^n X_i}{\sqrt{n \sigma^2}} \leq x \bigg]
\end{equation*}
for any $x \in \R$. Then, the following inequality holds:
\begin{equation*}
\sup_{x \in \R} |F_n(x) - \Phi(x)| \leq \frac{c \kappa}{\sigma^3 \sqrt n},
\end{equation*}
where $\kappa = \E |X_1|^3$ (assumed finite) and where $c$ is a purely numerical constant (the best known one is $c = 0.4748$).
\marginnote[*-3]{The best known constant $c = 0.4748$ is from~\cite{shevtsova2011absolute}, which almost matches the lower bound $c \geq 0.4097$ from~\cite{esseen1956moment}.
Note also that a similar result holds if the $X_i$ are independent but not identically distributed.}%
\end{theorem}
A nice proof of this theorem with worse constants, which relies on Fourier analysis and approximation by Schwartz functions, can be found in~\sidecite{tao-berry-esseen}.
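For the Bernoulli model, Theorem~\ref{thm:berry-esseen} must be applied to the centered variables $X_i - \theta$, so that $F_n$ is the distribution function of $\sqrt{n / (\theta (1 - \theta))} (\wh \theta_n - \theta)$.
We have $\sigma^3 = (\theta(1 - \theta))^{3/2}$ and $\kappa = \E_\theta |X_1 - \theta|^3 = \theta (1 - \theta) (\theta^2 + (1 - \theta)^2) \leq \theta (1 - \theta)$, so that
\begin{equation*}
\sup_{x \in \R} |F_n(x) - \Phi(x)| \leq \frac{c}{\sqrt{n \theta (1 - \theta)}}
\end{equation*}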
which shows that the approximation by the Gaussian distribution deteriorates whenever $\theta$ is close to $0$ or $1$, which is expected since in this case the sequence $X_1, \ldots, X_n$ is almost deterministically constant and equal to $0$ (when $\theta \approx 0$) or $1$ (when $\theta \approx 1$).
\paragraph{Reparametrization.} % (fold)
Another tool used in the construction of confidence intervals with asymptotic coverage is the idea of reparametrization.
Indeed, given a statistical model $\{ P_\theta : \theta \in \Theta \}$ and a bijective function $g : \Theta \goes \Lambda$, we can use instead the ``reparametrized'' model $\{ Q_\lambda : \lambda \in \Lambda \}$, where $Q_\lambda = P_{g^{-1}(\lambda)}$, for which the construction of a confidence interval $[\wh \lambda^L, \wh \lambda^R]$ for $\lambda$ may be easier.
If $g$ is a monotonic function, we can easily derive from $[\wh \lambda^L, \wh \lambda^R]$ a confidence interval for $\theta$.
In order to use this reparametrization idea, a natural question is to understand if the convergence in distribution (involved in the central limit theorem) is stable under such a reparametrization.
\begin{example}
\label{ex:expo}
Consider an iid dataset $X_1, \ldots, X_n$ with distribution $\expo(\theta)$ with rate parameter $\theta > 0$, namely the distribution $P_\theta(dx) = \theta e^{-\theta x} \ind{x \geq 0} dx$.
We have $\E(X_1) = 1 / \theta$ and $\var(X_1) = 1 / \theta^2$, so that using the law of large numbers and the central limit theorem we have
\marginnote{We recall that $\bar X_n = n^{-1} \sum_{i=1}^n X_i$.}
\begin{equation*}
\bar X_n \goas \theta^{-1} \quad \text{ and } \quad \sqrt n (\bar X_n - \theta^{-1}) \leadsto \nor(0, \theta^{-2})
\end{equation*}
when $n \rightarrow +\infty$.
Since $x \mapsto 1 / x$ is a continuous function on $(0, +\infty)$, we know that $(\bar X_n)^{-1} \goas \theta$ so that a strongly consistent estimator is given by $\wh \theta_n = (\bar X_n)^{-1}$.
But what can be said about the convergence in distribution of $\sqrt n (\wh \theta_n - \theta)$?
\end{example}
This is answered by the so-called $\Delta$-method, described in the next theorem.
\begin{theorem}[$\Delta$-method]
\label{thm:delta-method}
Let $(Z_n)_{n \geq 1}$ be a sequence of real random variables and assume that
$a_n(Z_n - z) \leadsto Z$, where $(a_n)_{n \geq 1}$ is a positive sequence such that $a_n \goes +\infty$, where $z \in \R$ and where $Z$ is a real random variable.
If $g$ is a function defined on a neighborhood of $z$ and differentiable at $z$, we have
\begin{equation}
a_n (g(Z_n) - g(z)) \leadsto g'(z) Z
\end{equation}
as $n \goes +\infty$.
\end{theorem}
The proof of Theorem~\ref{thm:delta-method} is given in Section~\ref{sec:chap03-proofs}.
It also holds for a sequence $(Z_n)$ of random vectors in $\R^d$ and a differentiable function $g : \R^d \goes \R^{d'}$, and it reads in this case
\begin{equation}
\label{eq:delta-method-vector}
a_n (g(Z_n) - g(z)) \leadsto J_g(z) Z,
\end{equation}
where $J_g(z)$ is the Jacobian matrix of $g$ at $z$.
A particularly useful case is when $Z$ is Gaussian.
For instance, if $\sqrt n (\wh \theta_n - \theta) \leadsto \nor(0, \sigma(\theta)^2)$, we have
\begin{equation*}
\sqrt n (g(\wh \theta_n) - g(\theta)) \leadsto
\nor(0, \sigma(\theta)^2 (g'(\theta))^2)
\end{equation*}
whenever $g$ satisfies the conditions of Theorem~\ref{thm:delta-method}.
Going back to the $\expo(\theta)$ case of Example~\ref{ex:expo}, we obtain with $g(x) = 1 / x$ and since $\wh \theta_n = g(\bar X_n)$ that $\sqrt n (\wh \theta_n - \theta) \leadsto \nor(0, \theta^2)$.
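In more detail, and as a sketch of how this limit can be used: here $z = \theta^{-1}$ and $g'(x) = -1 / x^2$, so that $g'(\theta^{-1}) = -\theta^2$ and the limiting variance is
\begin{equation*}
\theta^{-2} \times (g'(\theta^{-1}))^2 = \theta^{-2} \times \theta^4 = \theta^2.
\end{equation*}
Dividing by $\theta$ gives $\sqrt n (\wh \theta_n / \theta - 1) \leadsto \nor(0, 1)$, a limit which does not depend on $\theta$. With $q_\alpha = \Phi^{-1}(1 - \alpha / 2)$ and $n > q_\alpha^2$, this leads to
\begin{equation*}
\P_\theta \bigg\{ \theta \in \Big[ \frac{\wh \theta_n}{1 + q_\alpha / \sqrt n}, \frac{\wh \theta_n}{1 - q_\alpha / \sqrt n} \Big] \bigg\} \goes 1 - \alpha,
\end{equation*}
namely a confidence interval for $\theta$ with asymptotic coverage $1 - \alpha$.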
Another result which provides stability for the convergence in distribution, this time under continuous mappings, is the so-called Slutsky theorem.%
\sidenote{This theorem will be \emph{very} useful for the study of limit distributions. For instance, it entails that $Z_n \gopro z$ whenever $\sqrt n (Z_n - z)$ converges in distribution, and particular cases such as $X_n + Y_n \gosto X + y$ and $X_n Y_n \gosto X y$ will be used repeatedly, starting with the proof of Theorem~\ref{thm:delta-method} for instance.}
\begin{theorem}[Slutsky]
\label{thm:slutsky}
Let $(X_n)_{n \geq 1}$ and $(Y_n)_{n \geq 1}$ be sequences of random vectors in $\R^d$ and $\R^{d'}$ respectively, such that $X_n \leadsto X$ and $Y_n \leadsto y$ where $X \in \R^d$ is some random vector and $y \in \R^{d'}$.
Then, we have that $Y_n \gopro y$ and $(X_n, Y_n) \leadsto (X, y)$ as $n \goes +\infty$. In particular, we have $f(X_n, Y_n) \leadsto f(X, y)$ for any continuous function $f$.
\end{theorem}
The proof of Theorem~\ref{thm:slutsky} is given in Section~\ref{sec:chap03-proofs} below.
The $\Delta$-method provides stability for the convergence in distribution when a differentiable function is applied to a sequence, while the Slutsky theorem provides ``algebraic'' stability when combining two sequences converging respectively in distribution and probability.
\sidenote{Be careful with the convergence in distribution. Please keep in mind that this mode of convergence is about the convergence of the distributions and not the convergence of the random variables (hence its name). The notation $X_n \gosto X$ is rather misleading but convenient. In particular, nothing can be said in general about $f(X_n, Y_n)$ when we know that $X_n \gosto X$ and $Y_n \gosto Y$ (unless $X_n$ and $Y_n$ are independent sequences).}
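A classical counterexample makes this warning concrete. Let $Z \sim \nor(0, 1)$ and put $X_n = Z$ and $Y_n = -Z$ for all $n \geq 1$. Since $-Z$ and $Z$ have the same distribution, we have both $X_n \gosto Z$ and $Y_n \gosto Z$, but
\begin{equation*}
X_n + Y_n = 0 \quad \text{for all } n \geq 1, \quad \text{while} \quad Z + Z = 2 Z \sim \nor(0, 4),
\end{equation*}
so that $X_n + Y_n$ does not converge in distribution to $Z + Z$. This does not contradict Theorem~\ref{thm:slutsky}, since the limit of $Y_n$ is not deterministic here.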
\paragraph{Back again to Bernoulli.}
We have $\wh \theta_n \gopro \theta$, so that $(\wh \theta_n (1 - \wh \theta_n))^{1/2} \gopro (\theta (1 - \theta))^{1/2}$, since $x \mapsto (x(1-x))^{1/2}$ is continuous on $[0, 1]$.
Let us write
\begin{equation*}
\frac{\sqrt n (\wh \theta_n - \theta)}{\sqrt{\wh \theta_n (1 - \wh \theta_n)}}
= \frac{\sqrt n (\wh \theta_n - \theta)}{\sqrt{ \theta (1 - \theta)}} \times
\sqrt{\frac{ \theta (1 - \theta)}{\wh \theta_n (1 - \wh \theta_n)}} =: A_n \times B_n.
\end{equation*}
We know that $A_n \gosto \nor(0, 1)$ and that $B_n \gopro 1$.
Therefore, using Theorem~\ref{thm:slutsky}%
\sidenote{With $f(x, y) = xy$.}
leads to
\begin{equation*}
\sqrt{\frac{n}{\wh \theta_n (1 - \wh \theta_n)}} (\wh \theta_n - \theta) \gosto \nor(0, 1).
\end{equation*}
In other words, we just replaced $\theta$ by $\wh \theta_n$ in the variance term $\theta(1 - \theta)$ of the limit~\eqref{eq:portemanteau-bernoulli}, but proving this rigorously required Slutsky's theorem.
This provides us with another confidence interval with asymptotic coverage, given by
\begin{equation*}
\P_\theta \bigg\{ \theta \in \Big[ \wh \theta_n \pm q_\alpha \sqrt{\frac{\wh \theta_n (1 - \wh \theta_n)}{n}} \Big] \bigg\} \goes 1 - \alpha
\end{equation*}
as $n \goes +\infty$.%
\sidenote{With $1000$ tosses and $420$ heads, the realization of this confidence interval at level $95\%$ is $[0.39, 0.45]$.}%
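
The reparametrization idea gives an alternative route, which we sketch here as an illustration. Consider $g(x) = 2 \arcsin(\sqrt x)$, which is increasing from $(0, 1)$ onto $(0, \pi)$ and satisfies $g'(x) = (x (1 - x))^{-1/2}$. Combining~\eqref{eq:tcl-bernoulli} with Theorem~\ref{thm:delta-method} gives
\begin{equation*}
\sqrt n \big( g(\wh \theta_n) - g(\theta) \big) \gosto \nor\big( 0, \theta (1 - \theta) \, g'(\theta)^2 \big) = \nor(0, 1),
\end{equation*}
a limit which does not depend on $\theta$. Hence $[g(\wh \theta_n) \pm q_\alpha / \sqrt n]$ is a confidence interval for $g(\theta)$ with asymptotic coverage $1 - \alpha$, and since $g$ is increasing with $g^{-1}(\lambda) = \sin^2(\lambda / 2)$, applying $g^{-1}$ to both endpoints (whenever they belong to $[0, \pi]$) gives a confidence interval for $\theta$.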
\section{Tests} % (fold)
\label{sec:tests}
Let us consider, again, in this section, a statistical experiment with data $X$ and model $\{ P_\theta : \theta \in \Theta \}$.
Here, we want to decide between two hypotheses $H_0$ and $H_1$, where
\begin{equation*}
H_i \quad \text{means that} \quad \theta \in \Theta_i
\end{equation*}
for $i \in \{ 0, 1 \}$, where $\{ \Theta_0, \Theta_1 \}$ is a partition of the set of parameters~$\Theta$.
In order to understand the concept of statistical testing, let us consider the following unsettling example: imagine that you need to decide if a patient has cancer or not.
The patient has cancer if some parameter $\theta \in (0, 1)$ about him satisfies
$\theta \geq 0.42$.
We \emph{choose} $\Theta_0 = [0.42, 1)$ and $\Theta_1 = (0, 0.42)$, namely we decide that $H_0$ means that the patient has cancer, while $H_1$ means that the patient does not.
We need to construct a testing function $\varphi : E \goes \{ 0, 1 \}$ that maps $X \mapsto \varphi(X)$, our decision being given by the value of $\varphi(X)$.
The convention is to decide that $H_0$ is true whenever $\varphi(X) = 0$: in this case we say that we \emph{accept} $H_0$, while we \emph{reject} $H_0$ whenever $\varphi(X) = 1$.
The ``$1$'' in $\varphi(X) = 1$ matches the one in $H_1$: it always means that we \emph{reject} the \emph{null hypothesis} $H_0$.
\subsection{Type I and Type II errors} % (fold)
When $\theta \in \Theta_i$, we are correct if $\varphi(X) = i$ and incorrect if $\varphi(X) = 1 - i$.
We have two types of errors: the \emph{Type-I error}, also called the \emph{error of the first kind}, given by
\begin{equation}
\label{eq:type-1-error}
\P_\theta[ \varphi(X) = 1] = \E_\theta [\varphi(X)] \quad \text{for} \quad
\theta \in \Theta_0
\end{equation}
and the \emph{Type-II error}, also called the \emph{error of the second kind}, given by
\begin{equation}
\label{eq:type-2-error}
\P_\theta[ \varphi(X) = 0] = 1 - \E_\theta [\varphi(X)] \quad \text{for} \quad
\theta \in \Theta_1.
\end{equation}
For the cancer detection problem, the Type~I error corresponds to the \emph{probability of telling the patient that he does not have cancer while he does}.
The Type~II error corresponds to the \emph{probability of telling the patient that he has cancer while he does not}.
Note that these two types of errors are not symmetrical: we consider that the first one is more serious than the second (although this can be debated: the patient could become depressed, or start an invasive treatment for nothing).%
\sidenote{Of course this morbid example is highly unrealistic, and is used only to stress the asymmetry of errors in a statistical testing problem.}
The important point here is that $H_0$ and $H_1$ must be \emph{chosen} depending on the practical application considered.
They are not \emph{given} and they correspond to an important modeling choice.
We will see below that $H_0$ and $H_1$ must be chosen, in practice, so that the corresponding Type~I error is \emph{more serious}, for the considered application, than the Type~II error.
\begin{definition}
\label{def:power-function}
The function $\beta : \Theta \goes [0, 1]$ that maps $\theta \mapsto \beta(\theta) = \E_\theta [\varphi(X)]$ is called the \emph{power function} of the test $\varphi$.
\end{definition}
Ideally, we would like both the Type~I and Type~II errors to be small, namely $\beta(\theta) \approx 0$ for $\theta \in \Theta_0$ and $\beta(\theta) \approx 1$ for $\theta \in \Theta_1$.
But this is impossible: if $\Theta$ is a connected set, then $\Theta_0$ and $\Theta_1$ share a common boundary, so that $\beta$ would need to be discontinuous on it, while $\beta$ is in general a continuous function.
Therefore, it is hard to make both the Type~I and Type~II errors small at the same time.
\subsection{Desymmetrization of statistical tests} % (fold)
The way a statistical test is performed is through the \emph{Neyman-Pearson approach}, where we \emph{desymmetrize} the problem: choose the hypothesis $H_0$ using common sense, so that the Type~I error is more serious than the Type~II error.
The Type~I error is always the rejection of $H_0$ when it is true, while the Type~II error is always the acceptance of $H_0$ when it is false.
The only thing that we choose is which hypothesis plays the role of $H_0$ and which one plays the role of $H_1$.
Let us wrap up what we said before and introduce some additional vocabulary in the next definition.
\begin{definition}
\label{def:test-definitions}
Consider a statistical testing problem with hypotheses
\begin{equation*}
H_0 : \theta \in \Theta_0 \quad \text{versus} \quad H_1 : \theta \in \Theta_1
\end{equation*}
and a testing function $\varphi : E \goes \{ 0, 1 \}$.
We call $H_0$ the \emph{null} hypothesis and $H_1$ the \emph{alternative} hypothesis.
When $\varphi(X) = 0$ we say that the test \emph{accepts} $H_0$ or simply that it \emph{accepts}. When $\varphi(X) = 1$ the test \emph{rejects}.
The set
\marginnote{The random variable $X$ is valued in a measurable space $(E, \cE)$.}
\begin{equation*}
R = \{ x \in E : \varphi(x) = 1 \}
\end{equation*}
is called the \emph{rejection set} of the test $\varphi$, and we call its complement $R^\complement$ the \emph{acceptance region}.
The restriction $\beta : \Theta_0 \goes [0, 1]$
of the power function $\beta$ from Definition~\ref{def:power-function} is called the \emph{Type~I error}, while the restriction $\beta : \Theta_1 \goes [0, 1]$
is called the \emph{power} of the test. The function $1 - \beta : \Theta_1 \goes [0, 1]$ is called the \emph{Type~II error} or \emph{error of the second kind}.
Whenever
\begin{equation*}
\sup_{\theta \in \Theta_0} \beta(\theta) \leq \alpha
\end{equation*}
for some fixed $\alpha \in (0, 1)$, we say that the test \emph{has level} $\alpha$.
\end{definition}
The idea of desymmetrization is as follows: given a level $\alpha \in (0, 1)$ (something like $1\%$, $5\%$ or $10\%$) we build a test so that it has, \emph{by construction}, level $\alpha$.
Namely, a test is \emph{built so that the Type~I error is controlled}, while nothing is done directly about the Type~II error.
Given two statistical tests with level $\alpha$ (namely Type~I error $\leq \alpha$), we can simply compare their Type~II errors and choose the one that minimizes it (equivalently, the one with the largest power).
\paragraph{Back to Bernoulli.} % (fold)
Let us go back to the Bernoulli model where $X_1, \ldots, X_n$ are iid and distributed as $\ber(\theta)$.
We consider the problem of statistical testing with hypotheses:
\begin{equation}
\label{eq:chap03-tests-hypothesis}
H_0 : \theta \leq \theta_0 \quad \text{ against } \quad H_1 : \theta > \theta_0
\end{equation}
so that $\Theta = (0, 1)$, $\Theta_0 = (0, \theta_0]$ and $\Theta_1 = (\theta_0, 1)$.
We studied in Sections~\ref{sec:estimation} and~\ref{sec:confidence_intervals} the estimator $\wh \theta_n = S_n / n = \bar X_n$ and know that it is a good estimator.
A natural idea is therefore to reject $H_0$ if $\wh \theta_n$ is too large.
\begin{recipe}
We build a test by defining its rejection set $R$. The shape of the rejection set can be easily guessed by looking at the alternative hypothesis $H_1$.
\end{recipe}
Since we want to reject when $\theta > \theta_0$, we want to consider a rejection set $R = \{ \wh \theta_n > c \}$ for some constant $c$ chosen so that the Type~I error is controlled by $\alpha$.
Note that choosing $c = \theta_0$ is a bad idea: using the central limit theorem, we see that $\P_{\theta_0}[\wh \theta_n > \theta_0] \goes 1/2$.
We need to increase $c$ by some amount, so that the Type~I error can be indeed smaller than $\alpha$.%
\sidenote{It is very easy to build a statistical test with $\alpha = 0$, namely with zero Type~I error. For the cancer example from above, we just need to tell all the patients that they have cancer. By doing so, we never miss any cancer diagnosis, but on the other hand this test has zero power. Arguably, this is not a good strategy, so we need to give some slack in the construction of the test by considering a small but non-zero $\alpha$.}
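To make the claim about $c = \theta_0$ explicit, here is a short sketch of the computation, which relies only on the central limit theorem for $\wh \theta_n$ under $\P_{\theta_0}$ and on the fact that $0$ is a continuity point of the limiting distribution function $\Phi$:
\begin{equation*}
\P_{\theta_0}[\wh \theta_n > \theta_0] = \P_{\theta_0} \Big[ \sqrt{\frac{n}{\theta_0 (1 - \theta_0)}} (\wh \theta_n - \theta_0) > 0 \Big] \goes 1 - \Phi(0) = \frac{1}{2}
\end{equation*}
as $n \goes +\infty$.
With $c = \theta_0$, the Type~I error at $\theta = \theta_0$ is therefore close to $50\%$ for large $n$, far above any reasonable level $\alpha$.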
\subsection{Stochastic domination} % (fold)
% subsection stochastic_domination (end)
We understand at this point that $c$ will depend on $\alpha, \theta_0$ and the sample size $n$, among other things, and that in view of Definition~\ref{def:test-definitions} it needs to be such that $\sup_{\theta \leq \theta_0} \beta(\theta) = \sup_{\theta \leq \theta_0} \P_\theta[\wh \theta_n > c] \leq \alpha$.
But, we know that $n \wh \theta_n = S_n \sim \bin(n, \theta)$ under $\P_\theta$%
\sidenote{We write ``under $\P_\theta$'' here since Type~I error control must be performed under the null assumption $\theta \leq \theta_0$, so that we must specify under which distribution (which parameter $\theta$) we are working at this point.}%
, so that
\begin{equation}
\label{eq:power-control-binomial1}
\beta(\theta) = \P_\theta[S_n > n c] = \P [\bin(n, \theta) > nc]
\end{equation}
for any $\theta \in (0, 1)$.%
\sidenote{The notation $\P [\bin(n, \theta) > nc]$ stands for $\P [B > nc]$ where $B \sim \bin(n, \theta)$. Note also that we replaced $\P_\theta$ simply by $\P$ here: the notation $\P_\theta$ is required when we need to stress that the computation is performed under $\P_\theta$, while in the last equality we consider a generic probability space with probability $\P$ on which $B$ lives, so that the dependence on $\theta$ is now only through the distribution of $B$. These semantics are important and will prove useful for statistical computations.}%
%%%
In order to control the supremum of $\beta$, we need to study its variations: in view of~\eqref{eq:power-control-binomial1} and~\eqref{eq:binomial_distribution}, we know that
\begin{equation}
\label{eq:binomial-power}
\beta(\theta) = 1 - F_{n, \theta}(n c) = 1 - \sum_{k=0}^{[n c]}\binom{n}{k} \theta^k (1 - \theta)^{n - k}
\end{equation}
for $nc \in [0, n]$, where $[x]$ stands for the integer part of $x \geq 0$, so that a direct study of the variations of $\beta$ is somewhat tedious.
Intuitively, $\beta(\theta)$ should be increasing with $\theta$, since when $\theta$ increases, we get more ones, so that $S_n$ increases.
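As a quick sanity check of this intuition, consider the smallest non-trivial case (the values $n = 2$ and $nc = 1$ are chosen here only for illustration).
Formula~\eqref{eq:binomial-power} then gives
\begin{equation*}
\beta(\theta) = 1 - (1 - \theta)^2 - 2 \theta (1 - \theta) = \theta^2,
\end{equation*}
which is indeed increasing on $(0, 1)$, and which is nothing but $\P[\bin(2, \theta) = 2]$.
The same monotonicity is expected for general $n$ and $c$.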
This can be nicely formalized using the notion of \emph{stochastic domination}.
\begin{proposition}
\label{prop:stochastic-ordering}
Let $P$ and $Q$ be two probability measures on $\R$.
We say that \emph{$Q$ stochastically dominates $P$}, which we denote by $P \lest Q$, whenever one of the following equivalent points holds:
\begin{enumerate}
\item There are two real random variables $X \sim P$ and $Y \sim Q$ (on the same probability space) such that $\P[X \leq Y] = 1$;
\item We have $F_P(x) \geq F_Q(x)$ for any $x \in \R$, where $F_P$ and $F_Q$ are the distribution functions of $P$ and $Q$, or equivalently, $P[(x, +\infty)] \leq Q[(x, +\infty)]$ for any $x \in \R$;%
\marginnote[*-3]{We recall that the distribution function of $P$ is
$F_P(x) = P[ (-\infty, x]]$.}%
\item We have $F_P^-(p) \leq F_Q^-(p)$ for any $p \in [0, 1]$ where $F_P^-(p) = \inf \{ x \in \R : F_P(x) \geq p \}$ is the \emph{generalized inverse} of $F_P$ or \emph{quantile function} of $P$;%
\marginnote[*-3]{Since a distribution function is non-decreasing and c\`adl\`ag, its generalized inverse is well-defined and unique. See the proof of Proposition~\ref{prop:stochastic-ordering} for more details about it.}%
\item For any non-decreasing and bounded function $f$ we have $\int f dP \leq \int f dQ$.
\end{enumerate}
\end{proposition}
The proof of Proposition~\ref{prop:stochastic-ordering} is given in Section~\ref{sec:chap03-proofs} below and follows rather standard arguments.
However, the proof of $(3) \Rightarrow (1)$ deserves to be discussed here, since it uses a simple yet beautiful \emph{coupling} argument, which is a very powerful technique often used in probability theory~\sidecite{den2012probability}.
More precisely, we use something called a ``quantile coupling'': consider a random variable $U \sim \uni([0, 1])$%
\sidenote[][*-6]{We say that $X \sim \uni([a, b])$ for $a < b$ if it has density $x \mapsto (b - a)^{-1} \ind{[a, b]}(x)$ with respect to the Lebesgue measure, namely $\P_X(dx) = (b - a)^{-1} \ind{[a, b]}(x) dx$.}%
on some probability space and define $X = F_P^-(U)$ and $Y = F_Q^-(U)$.
We have by construction%
\sidenote{This comes from the fact that $\P[F_P^-(U) \leq x] = \P[U \leq F_P(x)] = F_P(x)$ since $U \sim \uni([0, 1])$ and since, by construction of the generalized inverse, we have that $F_P^-(u) \leq x$ is equivalent to $u \leq F_P(x)$ for any $u \in [0, 1]$ and $x \in \R$.}%
that $X \sim P$ and $Y \sim Q$, and that
\begin{equation*}
\P[X \leq Y] = \P[F_P^-(U) \leq F_Q^-(U)] = 1
\end{equation*}
since Point~3 tells us that $F_P^-(p) \leq F_Q^-(p)$ for any $p \in [0, 1]$. This proves Point~3 $\Rightarrow$ Point~1.
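As an illustration of the quantile coupling on a simple case (a sketch that is not needed for the proof), take $P = \ber(\theta_1)$ and $Q = \ber(\theta_2)$ with $\theta_1 \leq \theta_2$.
The distribution function of $\ber(\theta)$ equals $0$ on $(-\infty, 0)$, $1 - \theta$ on $[0, 1)$ and $1$ on $[1, +\infty)$, so that its generalized inverse is $u \mapsto \ind{u > 1 - \theta}$ for $u \in (0, 1]$.
The quantile coupling therefore reads
\begin{equation*}
X = \ind{U > 1 - \theta_1} \quad \text{and} \quad Y = \ind{U > 1 - \theta_2},
\end{equation*}
and we do have $X \leq Y$ since $1 - \theta_2 \leq 1 - \theta_1$, while $\P[X = 1] = \theta_1$ and $\P[Y = 1] = \theta_2$.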
The really nice feature of Proposition~\ref{prop:stochastic-ordering} is that it allows us to reformulate $P \lest Q$, which is a property regarding the \emph{distributions} $P$ and $Q$, as a property about \emph{random variables} $X \sim P$ and $Y \sim Q$.
Let us provide two examples.
\begin{example}
Whenever $\lambda_1 \leq \lambda_2$, we have $\expo(\lambda_2) \lest \expo(\lambda_1)$. This follows very easily from Point~2 of Proposition~\ref{prop:stochastic-ordering}.
\end{example}
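For completeness, here is a sketch of the computation behind this first example, assuming that $\expo(\lambda)$ is parametrized by its rate, namely that it has density $x \mapsto \lambda e^{-\lambda x}$ on $[0, +\infty)$ (this convention is an assumption here, since it is not restated in this section).
The distribution function of $\expo(\lambda)$ is then $x \mapsto (1 - e^{-\lambda x}) \ind{x \geq 0}$, so that for $\lambda_1 \leq \lambda_2$ and any $x \geq 0$ we have
\begin{equation*}
F_{\expo(\lambda_2)}(x) = 1 - e^{-\lambda_2 x} \geq 1 - e^{-\lambda_1 x} = F_{\expo(\lambda_1)}(x),
\end{equation*}
while both sides are equal to $0$ for $x < 0$.
This is exactly Point~2 of Proposition~\ref{prop:stochastic-ordering} with $P = \expo(\lambda_2)$ and $Q = \expo(\lambda_1)$.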
\begin{example}
\label{ex:coupling-binomial}
Whenever $\theta_1 \leq \theta_2$, we have $\bin(n, \theta_1) \lest \bin(n, \theta_2)$. This is obtained through Point~1 (namely a coupling argument).
\marginnote[*1]{The notation $\# E$ stands for the cardinality of a set $E$.}
Consider $U_1, \ldots, U_n$ iid $\uni([0, 1])$ and define $S_{n, i} = \# \{ k : U_k \leq \theta_i \}$ for $i \in \{ 1, 2 \}$. By construction we have $S_{n, i} \sim \bin(n, \theta_i)$, and obviously $\P[S_{n, 1} \leq S_{n, 2}] = 1$ since $\theta_1 \leq \theta_2$.
\end{example}
Thanks to Example~\ref{ex:coupling-binomial} together with Proposition~\ref{prop:stochastic-ordering}, we now know that $F_{n, \theta_2} \leq F_{n, \theta_1}$ whenever $\theta_1 \leq \theta_2$, so that combined with Equation~\eqref{eq:power-control-binomial1} this provides the following control of the Type~I error:
\begin{equation*}
\sup_{\theta \leq \theta_0} \P_\theta[ \wh \theta_n > c] = \sup_{\theta \leq \theta_0} (1 - F_{n, \theta}(n c)) \leq 1 - F_{n, \theta_0}(n c).
\end{equation*}
We can find out, given $\theta_0$, $\alpha$ and $n$, a constant $c$ as small as possible that satisfies $F_{n, \theta_0}(n c) \geq 1 - \alpha$, like we did in Section~\ref{sec:confidence_intervals} for the exact confidence interval.
Otherwise, we can use Theorem~\ref{thm:hoeffding} (but it leads to a slightly less powerful test) to obtain
\begin{equation*}
\P_{\theta_0} [\wh \theta_n > c] = \P_{\theta_0}[S_n - n \theta_0 > c'] \leq e^{-2 c'^2 / n} = \alpha,
\end{equation*}
where $c' = n (c - \theta_0)$, so that choosing $c' = \sqrt{n \log(1 / \alpha) / 2}$, namely $c = \theta_0 + \sqrt{\log(1 / \alpha) / (2 n)}$, gives $\sup_{\theta \leq \theta_0} \beta(\theta) \leq \alpha$, and proves that the test with rejection set
\begin{equation*}
R = \bigg\{ \wh \theta_n \geq \theta_0 + \sqrt{ \frac{\log(1 / \alpha)}{2n}} \bigg\}
\end{equation*}
is a test of level $\alpha$ for the hypotheses~\eqref{eq:chap03-tests-hypothesis}.
Note that we managed to quantify exactly by how much we need to increase $\theta_0$ in order to tune the test so that its Type~I error is smaller than $\alpha$.
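\begin{example}
To get a feeling for the orders of magnitude, take the values $\alpha = 5\%$, $n = 100$ and $\theta_0 = 1/2$ (chosen here purely for illustration).
The margin added to $\theta_0$ is
\begin{equation*}
\sqrt{\frac{\log(1 / \alpha)}{2 n}} = \sqrt{\frac{\log 20}{200}} \approx 0.122,
\end{equation*}
so that the test rejects $H_0$ when $\wh \theta_n \geq 0.622$ approximately, namely when $S_n \geq 63$.
\end{example}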
\subsection{Asymptotic approach}
We can also use an asymptotic approach by considering the test with rejection set
\begin{equation*}
R = \big\{ \wh \theta_n > \theta_0 + \delta_n \big\} \quad \text{where} \quad \delta_n := \sqrt{\frac{\theta_0 (1 - \theta_0)}{n}} \Phi^{-1}(1 - \alpha).
\end{equation*}
Indeed, we know by combining Example~\ref{ex:coupling-binomial} together with~\eqref{eq:portemanteau-bernoulli} that for any $\theta \leq \theta_0$ we have
\begin{equation*}
\P_\theta[ \wh \theta_n > \theta_0 + \delta_n ] \leq \P_{\theta_0}[ \wh \theta_n > \theta_0 + \delta_n ] \goes \alpha
\end{equation*}
as $n \goes +\infty$, so that $\limsup_n \sup_{\theta \leq \theta_0} \P_\theta[ \wh \theta_n > \theta_0 + \delta_n ] \leq \alpha$, which provides an asymptotic control of the Type~I error of this test: we say that it is \emph{asymptotically of level $\alpha$}.
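\begin{example}
As a numerical sketch, take the values $\alpha = 5\%$, $n = 100$ and $\theta_0 = 1/2$ (chosen here purely for illustration).
Using the approximation $\Phi^{-1}(0.95) \approx 1.645$, we get
\begin{equation*}
\delta_n = \sqrt{\frac{0.5 \times 0.5}{100}} \, \Phi^{-1}(0.95) \approx 0.05 \times 1.645 \approx 0.082,
\end{equation*}
so that the test rejects $H_0$ when $\wh \theta_n > 0.582$ approximately, namely when $S_n \geq 59$.
For the same values, the non-asymptotic margin $\sqrt{\log(1 / \alpha) / (2 n)}$ of the previous subsection is approximately $0.122$: the asymptotic test rejects more often, but its level is only guaranteed asymptotically.
\end{example}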
But what can be said about the \emph{power} of the test?
We know that $\wh \theta_n \goas \theta$ under $\P_\theta$ and that $\delta_n \go 0$, so, \emph{under $H_1$}, namely whenever $\theta > \theta_0$, we have
\begin{equation*}
\beta(\theta) = \P_\theta[ \wh \theta_n > \theta_0 + \delta_n] \go 1
\end{equation*}
as $n \go +\infty$, which shows that the power of the test goes to $1$.
In this case, we say that the test is \emph{consistent} or \emph{convergent}.
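The convergence $\beta(\theta) \go 1$ can also be quantified; the following sketch uses only Chebyshev's inequality and $\var[\wh \theta_n] = \theta (1 - \theta) / n$ under $\P_\theta$ (the notation $\eta_n$ is introduced here for convenience).
Fix $\theta > \theta_0$ and take $n$ large enough so that $\eta_n := \theta - \theta_0 - \delta_n > 0$.
Then
\begin{equation*}
1 - \beta(\theta) = \P_\theta[ \wh \theta_n \leq \theta_0 + \delta_n ] = \P_\theta[ \theta - \wh \theta_n \geq \eta_n ] \leq \P_\theta\big[ | \wh \theta_n - \theta | \geq \eta_n \big] \leq \frac{\theta (1 - \theta)}{n \eta_n^2} \go 0
\end{equation*}
as $n \go +\infty$.
Note that this bound degrades when $\theta$ gets close to $\theta_0$, since $\eta_n$ is then small.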
\begin{remark}
The convergence of $\beta(\theta)$ is not uniform in $\theta$ since its limit is discontinuous while $\beta(\theta)$ is continuous (see Equation~\eqref{eq:binomial-power}).
\end{remark}
\subsection{Ancillary statistics} % (fold)
An interesting pattern emerges from what we did for confidence intervals and tests.
In both cases, for the Bernoulli case, we constructed a statistic
$\sqrt n (\wh \theta_n - \theta) / \sqrt{\theta (1 - \theta)}$ whose asymptotic distribution is $\nor(0, 1)$, namely a distribution that \emph{does not} depend on the parameter $\theta$.
This is called an asymptotically \emph{ancillary} statistic.
\begin{definition}
Whenever $X \sim P_\theta$ and the distribution of $f_\theta(X)$ does not depend on $\theta$, we say that $f_\theta(X)$ is an \emph{ancillary} statistic.
\end{definition}
The construction of confidence intervals and tests requires such an ancillary or asymptotically ancillary statistic.
Indeed, we need to remove the dependence on $\theta$ from the distribution in order to compute quantiles that allow us to tune the coverage property of a confidence interval, or the level of a test.
\subsection{Confidence intervals and tests} % (fold)
There is of course a strong connection between confidence intervals and tests, as explained in the following proposition.
\begin{proposition}
\label{prop:ci-and-tests}
If $S(X)$ is a confidence set of level $1 - \alpha$, namely $\P_\theta[ \theta \in S(X)] \geq 1 - \alpha$ for any $\theta \in \Theta$, then the test with rejection set $\{ x : S(x) \cap \Theta_0 = \emptyset\}$ is of level $\alpha$.
\end{proposition}
This proposition easily follows from the fact that $\P_\theta[ S(X) \cap \Theta_0 = \emptyset] \leq \P_\theta[ \theta \notin S(X) ] \leq \alpha$ for any $\theta \in \Theta_0$.
Confidence intervals and tests are therefore deeply intertwined notions, in the sense that when you have built one of the two, you can easily build the other.
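As a sketch of how Proposition~\ref{prop:ci-and-tests} works in the Bernoulli model, consider the one-sided confidence set (the notation $\eps_n$ is introduced here for convenience)
\begin{equation*}
S(X) = \big\{ \theta \in (0, 1) : \theta \geq \wh \theta_n - \eps_n \big\} \quad \text{where} \quad \eps_n := \sqrt{\frac{\log(1 / \alpha)}{2 n}}.
\end{equation*}
Theorem~\ref{thm:hoeffding} gives $\P_\theta[\theta \notin S(X)] = \P_\theta[\wh \theta_n - \theta > \eps_n] \leq e^{-2 n \eps_n^2} = \alpha$ for any $\theta \in (0, 1)$, so that $S(X)$ is a confidence set of level $1 - \alpha$.
For $\Theta_0 = (0, \theta_0]$, we have $S(X) \cap \Theta_0 = \emptyset$ if and only if $\wh \theta_n - \eps_n > \theta_0$, so that the associated test has rejection set
\begin{equation*}
\big\{ \wh \theta_n > \theta_0 + \eps_n \big\},
\end{equation*}
which is, up to the strict inequality, the test of level $\alpha$ obtained above for the hypotheses~\eqref{eq:chap03-tests-hypothesis}.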
\paragraph{Types of hypotheses.} % (fold)
For $\Theta \subset \R$, we often consider one of the null hypotheses listed in Table~\ref{tab:standard-null-hypothesis}, where we provide some vocabulary.
\begin{table}[htbp]
\centering
\small
\begin{tabular}{|l|l|l|}\hline
$\Theta_0 = \{ \theta_0 \}$ & Simple hypothesis & \\ \hline
$\Theta_0 = [\theta_0, +\infty)$ & \multirow{3}{*}{Composite hypothesis} & \multirow{2}{*}{One-sided hypothesis} \\
$\Theta_0 = (-\infty, \theta_0]$ & & \\ \cline{3-3}
$\Theta_0 = [\theta_0 - \delta, \theta_0 + \delta]$ & & Two-sided hypothesis \\ \hline
\end{tabular}
\caption{Some examples of standard null hypotheses.}
\label{tab:standard-null-hypothesis}
\end{table}
A test with a one-sided null hypothesis can be obtained using a one-sided confidence interval in the opposite direction of $\Theta_0$. A test with a two-sided null hypothesis can be obtained using a (two-sided) confidence interval.
For hypotheses $H_0 : \theta = \theta_0$ versus $H_1 : \theta > \theta_0$ we use $R = \{ \wh \theta_n > \theta_0 + c \}$ while for $H_0 : \theta = \theta_0$ versus $H_1 : \theta \neq \theta_0$ we use $R = \{ | \wh \theta_n - \theta_0 | > c \}$, where $\wh \theta_n$ is some estimator of $\theta$ and where $c$ is a constant to be tuned so that the test has level $\alpha$.
Note that this is a generic recipe that holds for any statistical model.
In Chapter~\ref{chap:tests} below, we provide systematic rules to build \emph{optimal} tests%
\sidenote{tests with maximum power, in some sense}%
in a fairly general setting, but this will require some extra concepts that will be developed later.
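As a sketch of how the constant $c$ can be tuned in the two-sided case, consider again the Bernoulli model with $H_0 : \theta = \theta_0$ versus $H_1 : \theta \neq \theta_0$.
Applying Theorem~\ref{thm:hoeffding} to $X_1, \ldots, X_n$ and to $-X_1, \ldots, -X_n$ gives
\begin{equation*}
\P_{\theta_0}\big[ | \wh \theta_n - \theta_0 | > c \big] \leq \P_{\theta_0}[ \wh \theta_n - \theta_0 > c ] + \P_{\theta_0}[ \theta_0 - \wh \theta_n > c ] \leq 2 e^{-2 n c^2},
\end{equation*}
so that the choice $c = \sqrt{\log(2 / \alpha) / (2 n)}$ leads to a test of level $\alpha$.
Compared to the one-sided case, $\log(1 / \alpha)$ is simply replaced by $\log(2 / \alpha)$.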
\subsection{$p$-values} % (fold)
Consider a statistical model and a test at level $\alpha$, and keep everything
fixed but $\alpha$.
If $\alpha$ is very small, the test has no choice but to accept $H_0$, since it has almost no room to be wrong about it.%
\sidenote{Once again, the only way to build a test with $\alpha = 0$ is to never reject (tell all the patients that they have cancer).}
With everything fixed but $\alpha$, we can expect that for some value $\alpha(X)$ (that depends on the data $X$), we have that whenever $\alpha < \alpha(X)$ then the test \emph{accepts} $H_0$ while when $\alpha > \alpha(X)$ the test \emph{rejects} $H_0$.
Such a value $\alpha(X)$ is called the \emph{$p$-value} of the test.
Let $R_\alpha$ be the rejection set of some test at level $\alpha$, so that it satisfies $\sup_{\theta \in \Theta_0} \P_\theta[R_\alpha] \leq \alpha < \alpha'$ for any $\alpha' > \alpha$, which means that $R_\alpha$ is also a rejection set at level $\alpha'$.
Usually, the family $\{ R_\alpha \}_{\alpha \in [0, 1]}$ of rejection sets of a test is \emph{increasing} with respect to $\alpha$, namely $R_\alpha \subset R_{\alpha'}$ for any $\alpha < \alpha'$.
In this case, we can define the $p$-value as follows.
\begin{definition}
Consider a statistical experiment with data $X$ and a statistical test with an increasing family $\{ R_\alpha \}_{\alpha \in [0, 1]}$ of rejection sets.
The $p$-value of such a test is the random variable given by
\begin{equation*}
\alpha(X) = \inf \{ \alpha \in [0, 1] : X \in R_\alpha \}.
\end{equation*}
\end{definition}
Let us compute the $p$-value of one of the tests we built previously for the $\ber(\theta)$ model and the hypotheses~\eqref{eq:chap03-tests-hypothesis}.
The rejection set is given by
\begin{equation*}
R_\alpha = \bigg\{ \wh \theta_n > \theta_0 + \Phi^{-1}(1 - \alpha) \sqrt{\frac{\theta_0
(1 - \theta_0)}{n}} \bigg \}
\end{equation*}
so that the $p$-value can be computed as follows:
\begin{align*}
\alpha(X) &= \inf \bigg\{ \alpha \in [0, 1] : \wh \theta_n > \theta_0 + \Phi^{-1}(1 - \alpha) \sqrt{\frac{\theta_0 (1 - \theta_0)}{n}} \bigg\} \\
% &= \inf \bigg \{ \alpha \in [0, 1] : \alpha > 1 - \Phi\Big( \sqrt{\frac{n}{\theta_0 (1 - \theta_0)}} (\wh \theta_n - \theta_0) \Big) \bigg \} \\
&= 1 - \Phi\Big( \sqrt{\frac{n}{\theta_0 (1 - \theta_0)}} (\wh \theta_n - \theta_0) \Big)
\end{align*}
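\begin{example}
As a numerical sketch, suppose that $n = 100$, $\theta_0 = 1/2$ and that we observe $\wh \theta_n = 0.6$ (values chosen here purely for illustration).
Then
\begin{equation*}
\sqrt{\frac{n}{\theta_0 (1 - \theta_0)}} (\wh \theta_n - \theta_0) = 20 \times 0.1 = 2,
\end{equation*}
so that the $p$-value is $1 - \Phi(2) \approx 0.023$.
The test therefore rejects $H_0$ at level $\alpha = 5\%$ but accepts it at level $\alpha = 1\%$.
\end{example}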
In practice, when performing a statistical testing procedure, we \emph{do not} choose the level $\alpha$, but we compute the $p$-value using the definition of the test and the data.
A statistical library will typically not ask you for $\alpha$, but will rather return the $p$-value.
This value quantifies, somehow, \emph{how much we are willing to believe in $H_0$.}
For instance, if $\alpha(x) \leq 10^{-3}$
\marginnote{$x$ stands for the realization of the random variable $X$, namely $x = X(\omega)$}
then we are strongly rejecting $H_0$, since it would require a level $\alpha < 10^{-3}$ to accept $H_0$, which is very small.
If $\alpha(x) = 3\%$, the result of the test is rather ambiguous, while $\alpha(x) = 30\%$ is a strong acceptance of $H_0$.
In many sciences, in order to publish conclusions based on experimental observations, researchers must exhibit the $p$-values of the considered statistical tests in order to justify that some effect is indeed observed.
However, the reign of the $p$-value in many fields of science is highly criticized, see for instance~\sidecite{wasserstein-p-values}.
\section{Proofs} % (fold)
\label{sec:chap03-proofs}
\paragraph{Proof of Theorem~\ref{thm:hoeffding}.}
We follow the proof from~\sidecite{massart2007concentration}.
First, we can assume without loss of generality that each $X_i$ is centered: it does not change the length $b_i - a_i$ of the interval containing $X_i$ almost surely.
We use the Cram\'er-Chernoff method: by Markov's inequality, we have
\begin{equation*}
\P[S \geq t] = \P[ e^{\lambda S} \geq e^{\lambda t}] \leq e^{-\lambda t} \E [e^{\lambda S}]
\end{equation*}
for any $\lambda > 0$.
Now, denoting by $\psi_S(\lambda) = \log \E[e^{\lambda S}]$ the log of the moment generating function of $S = \sum_{i=1}^n X_i$, we have thanks to the independence of $X_1, \ldots, X_n$ that
\begin{equation*}
\psi_S(\lambda) = \log \E[e^{\lambda \sum_{i=1}^n X_i}] = \sum_{i=1}^n \log \E[e^{\lambda X_i}] = \sum_{i=1}^n \psi_{X_i}(\lambda),
\end{equation*}
so that we need to control $\psi_{X_i}(\lambda)$. Consider a centered random variable $X$ such that $X \in [a, b]$ almost surely and let us prove that
\begin{equation}
\label{eq:hoeffding-lemma}
\psi_{X}(\lambda) \leq \frac{(b - a)^2 \lambda^2}{8},
\end{equation}
which is a result known as the \emph{Hoeffding lemma}.
Note also that if $Y$ is any random variable such that $Y \in [a, b]$ almost surely, then%
\sidenote{Just remark that $|Y - (a + b) / 2| \leq (b - a) / 2$ and that $\var[Y] = \var[Y - (a + b) / 2] \leq (b - a)^2 / 4.$}
\begin{equation}
\label{eq:var-bounded-variable}
\var[Y] \leq \frac{(b - a)^2}{4}.
\end{equation}
Then, denote as $P$ the distribution of $X$ and introduce the distribution
\begin{equation*}
P_\lambda(dx) = e^{-\psi_X(\lambda)} e^{\lambda x} P (dx),
\end{equation*}
so that if $X_\lambda$ is a random variable with distribution $P_\lambda$ we have $\E[\phi(X_\lambda)] = \E[\phi(X) e^{-\psi_X(\lambda)} e^{\lambda X}]$.
An easy computation gives that the second derivative of $\psi_X$ satisfies
\begin{equation*}
\psi_X''(\lambda) = e^{-\psi_X(\lambda)} \E[X^2 e^{\lambda X}] - e^{-2 \psi_X(\lambda)} (\E[X e^{\lambda X}])^2 = \var[X_\lambda].
\end{equation*}
But, since $X_\lambda \in [a, b]$ almost surely, we have using~\eqref{eq:var-bounded-variable} that $\psi_X''(\lambda) \leq (b - a)^2 / 4$, so that integration proves~\eqref{eq:hoeffding-lemma}.%
\sidenote{Integration and the facts that $\psi_X(0) = 0$ and that $\psi_X'(0) = 0$ since $X$ is centered.}
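For the reader who wants the details, the two steps above can be sketched as follows.
Since $X$ is bounded, we can differentiate under the expectation, which gives
\begin{equation*}
\psi_X'(\lambda) = \frac{\E[X e^{\lambda X}]}{\E[e^{\lambda X}]} = e^{-\psi_X(\lambda)} \E[X e^{\lambda X}] = \E[X_\lambda],
\end{equation*}
and differentiating once more leads to the expression of $\psi_X''(\lambda)$ given above.
Then, since $\psi_X(0) = 0$ and $\psi_X'(0) = \E[X] = 0$, we have for any $\lambda > 0$ that
\begin{equation*}
\psi_X(\lambda) = \int_0^\lambda \int_0^s \psi_X''(u) \, du \, ds \leq \frac{(b - a)^2}{4} \int_0^\lambda s \, ds = \frac{(b - a)^2 \lambda^2}{8}.
\end{equation*}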
Most of the work is done now, since wrapping up the inequalities from above gives
\begin{equation*}
\P[S \geq t] \leq \exp \Big(-\lambda t + \frac{\lambda^2}{8} \sum_{i=1}^n (b_i - a_i)^2 \Big)
\end{equation*}
for any $\lambda > 0$: minimizing the right-hand side with respect to $\lambda$ leads to the optimal choice $\lambda = 4 t / \sum_{i=1}^n (b_i - a_i)^2$, for which the exponent equals $-2 t^2 / \sum_{i=1}^n (b_i - a_i)^2$, and this allows us to conclude. $\hfill \square$
\paragraph{Proof of Theorem~\ref{thm:delta-method}.}
Consider the neighborhood $V$ of $z$ and define $r(h) = (g(z + h) - g(z)) / h - g'(z)$ for $h \neq 0$ such that $z + h \in V$ and put $r(0) = 0$. We know that $r(h) \go 0$ as $h \go 0$. By definition of $r$ we have
\begin{equation*}
g(z + h) = g(z) + h g'(z) + h r(h),
\end{equation*}
so putting $h = Z_n - z$ gives
\begin{equation}
\label{eq:slutsky-proof}
a_n (g(Z_n) - g(z)) = a_n g'(z) (Z_n - z) + a_n (Z_n - z) r(Z_n - z).
\end{equation}
Now, we need to use Theorem~\ref{thm:slutsky} (Slutsky) several times. First, we have $Z_n - z = a_n^{-1} a_n (Z_n - z)$, so that $Z_n - z \gopro 0$ since $a_n^{-1} \go 0$ and $a_n (Z_n - z) \gosto Z$, so that $r(Z_n - z) \gopro 0$.
Second, using again Theorem~\ref{thm:slutsky}, we have $a_n (Z_n - z) r(Z_n - z) \gopro 0$ since $a_n (Z_n - z) \gosto Z$ and $r(Z_n - z) \gopro 0$. Finally, this allows us to conclude that $a_n (g(Z_n) - g(z)) \gosto g'(z) Z$ because of~\eqref{eq:slutsky-proof} combined with $a_n (Z_n - z) r(Z_n - z) \gopro 0$ and Theorem~\ref{thm:slutsky}. $\hfill \square$
\paragraph{Proof of Theorem~\ref{thm:slutsky}.}
Let us first prove that since $Y_n \gosto y$ with $y$ deterministic, we actually have that $Y_n \gopro y$.
Indeed, since $Y_n \gosto y$ we have that $\E [\phi(Y_n)] \go \E [\phi(y)]$ for any continuous and bounded function $\phi$, for instance $\phi(x) = \norm{x - y} / (\norm{x - y} + 1)$ so that we know that
\begin{equation*}
\E \Big[ \frac{\norm{Y_n - y}}{\norm{Y_n - y} + 1} \Big] \go 0.
\end{equation*}
Now, we can conclude with Markov's inequality, since $x \mapsto x / (x + 1)$ is increasing on $(0, +\infty)$:
\begin{align*}
\P\big[ \norm{Y_n - y} \geq \eps\big] &= \P\Big[ \frac{\norm{Y_n - y}}{\norm{Y_n - y} + 1}
\geq \frac{\eps}{1 + \eps} \Big] \\
&\leq \frac{1 + \eps}{\eps} \E \Big[ \frac{\norm{Y_n - y}}{\norm{Y_n - y} + 1} \Big] \go 0,
\end{align*}
which proves $Y_n \gopro y$.
Now, let us prove that $(X_n, Y_n) \gosto (X, y)$. Thanks to the portmanteau theorem, we know that it suffices to prove that $\E[\phi(X_n, Y_n)] \go \E[\phi(X, y)]$ for any function $\phi : \R^d \times \R^{d'} \go \R$ which is Lipschitz and bounded.
We have
\begin{align*}
\E\big[ | \phi(X_n, Y_n) - \phi(X, y) | \big] &\leq \E\big[ | \phi(X_n, Y_n) - \phi(X_n, y) | \big] \\
& \quad \quad + \E\big[ | \phi(X_n, y) - \phi(X, y) | \big]
\end{align*}
and we already know that $\E\big[ | \phi(X_n, y) - \phi(X, y) | \big] \go 0$ since $X_n \gosto X$.
Moreover, we have
\begin{align*}
| \phi(X_n, Y_n) - \phi(X_n, y) | \leq 2 b\ind{\norm{Y_n - y} > \eps} + L \eps
\end{align*}
for any $\eps > 0$, where we used the fact that $\phi$ is bounded by $b$ and where $L$ is the Lipschitz constant of $\phi$.
This entails
\begin{equation*}
\E \big[ | \phi(X_n, Y_n) - \phi(X_n, y) | \big] \leq 2 b \P[\norm{Y_n - y} > \eps] + L \eps,
\end{equation*}
which allows us to conclude since $Y_n \gopro y$, so that $\P[\norm{Y_n - y} > \eps] \go 0$.
Finally, $f(X_n, Y_n) \gosto f(X, y)$ for any continuous function $f$, since we know that $\E[\phi(f(X_n, Y_n))] \go \E[\phi(f(X, y))]$ for any continuous and bounded function $\phi$, since $\phi \circ f$ is itself continuous and bounded, and since $(X_n, Y_n) \gosto (X, y)$. $\hfill \square$
\paragraph{Proof of Proposition~\ref{prop:stochastic-ordering}.} % (fold)
We already know that Point~(3) $\Rightarrow$ Point~(1).
We have Point~(2) $\Rightarrow$ Point~(3) since Point~(2) entails that $\{ x \in \R : F_Q(x) \geq p\} \subset \{ x \in \R : F_P(x) \geq p\}$ for any $p \in [0, 1]$, so that $F_P^-(p) \leq F_Q^-(p)$ by definition of the generalized inverse.
We have easily Point~(4) $\Rightarrow$ Point~(2) by choosing $f_x(t) = \ind{(x, +\infty)}(t)$, which is a non-decreasing and bounded function, so that
\begin{equation*}
P[(x, +\infty)] = \int f_x dP \leq \int f_x dQ = Q[(x, +\infty)].
\end{equation*}
Finally, Point~(1) $\Rightarrow$ Point~(4) by taking $X \sim P$ and $Y \sim Q$ such that $X \leq Y$ almost surely, so that
\begin{equation*}
\int f dP = \E[f(X)] = \E[f(X) \ind{X \leq Y}] \leq \E[f(Y)] = \int f dQ
\end{equation*}
for any non-decreasing and bounded function $f$. $\hfill \square$