-
-
Notifications
You must be signed in to change notification settings - Fork 39
/
Copy path03-descriptive-stat.Rmd
2069 lines (1394 loc) · 83.7 KB
/
03-descriptive-stat.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Descriptive Statistics
When you have an area of interest to research, a problem to solve, or a relationship to investigate, theoretical and empirical processes will help you.
**Estimand**: Defined as "a quantity of scientific interest that can be calculated in the population and does not change its value depending on the data collection design used to measure it (i.e., it does not vary with sample size, survey design, the number of non-respondents, or follow-up efforts)." [@Rubin_1996]
Examples of estimands include:
- Population means
- Population variances
- Correlations
- Factor loadings
- Regression coefficients
------------------------------------------------------------------------
## Numerical Measures
There are differences between a population and a sample:
+----------------------+-----------------------------------------------------------------+-----------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| **Measures of** | **Category** | **Population** | **Sample** |
+======================+=================================================================+===============================================================================================+=============================================================+
| | What is it? | Reality | A small fraction of reality (inference) |
+----------------------+-----------------------------------------------------------------+-----------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| | Characteristics described by | Parameters | Statistics |
+----------------------+-----------------------------------------------------------------+-----------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| **Central Tendency** | Mean | $\mu = E(Y)$ | $\hat{\mu} = \overline{y}$ |
+----------------------+-----------------------------------------------------------------+-----------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| **Central Tendency** | Median | 50th percentile | $y_{(\frac{n+1}{2})}$ |
+----------------------+-----------------------------------------------------------------+-----------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| **Dispersion** | Variance | $$\sigma^2 = var(Y) = E[(Y-\mu)^2]$$ | $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \overline{y})^2$ |
+----------------------+-----------------------------------------------------------------+-----------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| **Dispersion** | Coefficient of Variation | $\frac{\sigma}{\mu}$ | $\frac{s}{\overline{y}}$ |
+----------------------+-----------------------------------------------------------------+-----------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| **Dispersion** | Interquartile Range | Difference between 25th and 75th percentiles; robust to outliers | |
+----------------------+-----------------------------------------------------------------+-----------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| **Shape** | Skewness | $g_1 = \frac{\mu_3}{\sigma^3}$ | $\hat{g_1} = \frac{m_3}{m_2^{3/2}}$ |
| | | | |
| | Standardized 3rd central moment (unitless) | | |
+----------------------+-----------------------------------------------------------------+-----------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| **Shape** | Central moments | $\mu=E(Y)$, $\mu_2 = \sigma^2 = E[(Y-\mu)^2]$, $\mu_3 = E[(Y-\mu)^3]$, $\mu_4 = E[(Y-\mu)^4]$ | $m_2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \overline{y})^2$ |
| | | | |
| | | | $m_3 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \overline{y})^3$ |
+----------------------+-----------------------------------------------------------------+-----------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| **Shape** | Kurtosis | $g_2^* = \frac{E[(Y-\mu)^4]}{\sigma^4}$ | $\hat{g_2} = \frac{m_4}{m_2^2} - 3$ |
| | | | |
| | (peakedness and tail thickness) Standardized 4th central moment | | |
+----------------------+-----------------------------------------------------------------+-----------------------------------------------------------------------------------------------+-------------------------------------------------------------+
------------------------------------------------------------------------
**Notes**:
1. **Order Statistics**: $y_{(1)}, y_{(2)}, \ldots, y_{(n)}$, where $y_{(1)} < y_{(2)} < \ldots < y_{(n)}$.
2. **Coefficient of Variation**:
- Defined as the standard deviation divided by the mean.
- A stable, unitless statistic useful for comparison.
3. **Symmetry**:
- **Symmetric distributions**: Mean = Median; Skewness = 0.
- **Skewed Right**: Mean \> Median; Skewness \> 0.
- **Skewed Left**: Mean \< Median; Skewness \< 0.
4. **Central Moments**:
- $\mu = E(Y)$
- $\mu_2 = \sigma^2 = E[(Y-\mu)^2]$
- $\mu_3 = E[(Y-\mu)^3]$
- $\mu_4 = E[(Y-\mu)^4]$
**Skewness (**$\hat{g_1}$**)**
1. **Sampling Distribution**:\
For samples drawn from a normal population:
- $\hat{g_1}$ is approximately distributed as $N(0, \frac{6}{n})$ when $n > 150$.
2. **Inference**:
- **Large Samples**: Inference on skewness can be based on the standard normal distribution.\
The 95% confidence interval for $g_1$ is given by: $$
\hat{g_1} \pm 1.96 \sqrt{\frac{6}{n}}
$$
- **Small Samples**: For small samples, consult special tables such as:
- @snedecor1989statistical, Table A 19(i)
- Monte Carlo test results
**Kurtosis (**$\hat{g_2}$**)**
1. **Definitions and Relationships**:
- A normal distribution has kurtosis $g_2^* = 3$.\
Kurtosis is often redefined as: $$
g_2 = \frac{E[(Y - \mu)^4]}{\sigma^4} - 3
$$ where the 4th central moment is estimated by: $$
m_4 = \frac{\sum_{i=1}^n (y_i - \overline{y})^4}{n}
$$
2. **Sampling Distribution**:\
For large samples ($n > 1000$):
- $\hat{g_2}$ is approximately distributed as $N(0, \frac{24}{n})$.
3. **Inference**:
- **Large Samples**: Inference for kurtosis can use standard normal tables.
- **Small Samples**: Refer to specialized tables such as:
- @snedecor1989statistical, Table A 19(ii)
- @geary1936moments
+-------------------------+-------------------+----------------------------------------------------+
| **Kurtosis Value** | **Tail Behavior** | **Comparison to Normal Distribution** |
+=========================+===================+====================================================+
| $g_2 > 0$ (Leptokurtic) | Heavier Tails | Examples: $t$-distributions |
+-------------------------+-------------------+----------------------------------------------------+
| $g_2 < 0$ (Platykurtic) | Lighter Tails | Examples: Uniform or certain bounded distributions |
+-------------------------+-------------------+----------------------------------------------------+
| $g_2 = 0$ (Mesokurtic) | Normal Tails | Exactly matches the normal distribution |
+-------------------------+-------------------+----------------------------------------------------+
```{r}
# Generate random data from a normal distribution
data <- rnorm(100)
# Load the e1071 package for skewness and kurtosis functions
library(e1071)
# Calculate skewness
skewness_value <- skewness(data)
cat("Skewness:", skewness_value, "\n")
# Calculate kurtosis
kurtosis_value <- kurtosis(data)
cat("Kurtosis:", kurtosis_value, "\n")
```
## Graphical Measures
The following table summarizes key graphical measures along with guidance on when and why to use each. More detailed explanations, visual examples, and sample code will be discussed after this table.
+---------------------------+------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
| **Graph Type** | **When to Use** | **Why It\'s Useful** |
+===========================+==========================================================================================+====================================================================================+
| **Histogram** | \- Exploring the distribution (shape, center, spread) of a single continuous variable | \- Quickly identifies frequency, modes, skewness, and potential outliers \ |
| | | - Provides an overview of data \"shape\" |
+---------------------------+------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
| **Box-and-Whisker Plot** | \- Comparing the same continuous variable across multiple categories \ | \- Shows distribution at a glance (median, quartiles) \ |
| | - Identifying median, IQR, and outliers | - Highlights outliers and potential group differences |
+---------------------------+------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
| **Stem-and-Leaf Plot** | \- Small, single-variable datasets where you want a textual yet visual distribution view | \- Reveals the distribution while preserving actual data values \ |
| | | - Easy to spot clusters and gaps for small datasets |
+---------------------------+------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
| **Notched Boxplot** | \- Similar to a standard boxplot but with confidence intervals around the median | \- If notches don\'t overlap, it suggests the medians differ significantly \ |
| | | - Helps clarify whether differences in medians are likely meaningful |
+---------------------------+------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
| **Bagplot (2D Boxplot)** | \- Bivariate data where you want a 2D \"boxplot\"-style overview \ | \- Depicts both central region (\"bag\") and potential outliers \ |
| | - Identifying outliers in two-dimensional space | - Ideal for discovering clusters or unusual points in two continuous variables |
+---------------------------+------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
| **Boxplot Matrix** | \- Multiple continuous variables that you want to compare side-by-side | \- Quickly compares distributions of many variables simultaneously \ |
| | | - Helpful for spotting differences in median, spread, and outliers |
+---------------------------+------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
| **Violin Plot** | \- Same use case as boxplot but you want *more detail* on the distribution\'s shape | \- Combines boxplot features with a density plot \ |
| | | - Shows where data are concentrated or sparse within each category |
+---------------------------+------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
| **Scatterplot** | \- Two continuous variables to check for relationships, trends, or outliers | \- Visualizes correlation or non-linear patterns \ |
| | | - Aids in identifying clusters or extreme values |
+---------------------------+------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
| **Pairwise Scatterplots** | \- Initial exploration of *several* variables at once | \- Enables a quick scan of relationships between *all* variable pairs \ |
| | | - Useful for identifying multivariate patterns or potential correlation structures |
+---------------------------+------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
**Tips for Selecting the Right Plot:**
- **Focus on Your Question**: Are you comparing groups, investigating correlations, or just exploring the overall shape of the data?
- **Match the Plot to Your Data Type**: Continuous vs. categorical data often dictates your choice of chart.
- **Mind the Data Size**: Some plots become cluttered or lose clarity with very large datasets (e.g., stem-and-leaf), while others may be less informative with very few data points.
### Shape
Properly labeling your graphs is essential to ensure that viewers can easily understand the data presented. Below are several examples of graphical measures used to assess the shape of a dataset.
```{r}
# Generate random data for demonstration purposes
data <- rnorm(100)
# Histogram: A graphical representation of the distribution of a dataset.
hist(
data,
labels = TRUE,
col = "grey",
breaks = 12,
main = "Histogram of Random Data",
xlab = "Value",
ylab = "Frequency"
)
# Interactive Histogram: Using 'highcharter' for a more interactive visualization.
# pacman::p_load("highcharter")
# hchart(data, type = "column", name = "Random Data Distribution")
# Box-and-Whisker Plot: Useful for visualizing the distribution and identifying outliers.
boxplot(
count ~ spray,
data = InsectSprays,
col = "lightgray",
main = "Boxplot of Insect Sprays",
xlab = "Spray Type",
ylab = "Count"
)
# Notched Boxplot: The notches indicate a confidence interval around the median.
boxplot(
len ~ supp * dose,
data = ToothGrowth,
notch = TRUE,
col = c("gold", "darkgreen"),
main = "Tooth Growth by Supplement and Dose",
xlab = "Supplement and Dose",
ylab = "Length"
)
# If the notches of two boxes do not overlap, this suggests that the medians differ significantly.
# Stem-and-Leaf Plot: Provides a quick way to visualize the distribution of data.
stem(data)
# Bagplot - A 2D Boxplot Extension: Visualizes the spread and identifies outliers in two-dimensional data.
pacman::p_load(aplpack)
attach(mtcars)
bagplot(wt,
mpg,
xlab = "Car Weight",
ylab = "Miles Per Gallon",
main = "Bagplot of Car Weight vs. Miles Per Gallon")
detach(mtcars)
```
Below are some advanced plot types that can provide deeper insights into data:
```{r}
# boxplot.matrix(): Creates boxplots for each column in a matrix. Useful for comparing multiple variables.
graphics::boxplot.matrix(
cbind(
Uni05 = (1:100) / 21,
Norm = rnorm(100),
T5 = rt(100, df = 5),
Gam2 = rgamma(100, shape = 2)
),
main = "Boxplot Marix",
notch = TRUE,
col = 1:4
)
# Violin Plot (vioplot()): Combines a boxplot with a density plot, providing more information about the distribution.
library("vioplot")
vioplot(data, col = "lightblue", main = "Violin Plot Example")
```
### Scatterplot
Scatterplots are useful for visualizing relationships between two continuous variables. They help identify patterns, correlations, and outliers.
Pairwise Scatterplots: Visualizes relationships between all pairs of variables in a dataset. This is especially useful for exploring potential correlations.
```{r}
pairs(mtcars,
main = "Pairwise Scatterplots",
pch = 19,
col = "blue")
```
## Normality Assessment
The Normal (Gaussian) distribution plays a critical role in statistical analyses due to its theoretical and practical applications. Many statistical methods assume normality in the data, making it essential to assess whether our variable of interest follows a normal distribution. To achieve this, we utilize both [Numerical Measures] and [Graphical Assessment].
### Graphical Assessment
Graphical methods provide an intuitive way to visually inspect the normality of a dataset. One of the most common methods is the **Q-Q plot** (quantile-quantile plot). The Q-Q plot compares the quantiles of the sample data to the quantiles of a theoretical normal distribution. Deviations from the line indicate departures from normality.
Below is an example of using the `qqnorm` and `qqline` functions in R to assess the normality of the `precip` dataset, which contains precipitation data (in inches per year) for 70 U.S. cities:
```{r}
# Load the required package
pacman::p_load("car")
# Generate a Q-Q plot
qqnorm(precip,
ylab = "Precipitation [in/yr] for 70 US cities",
main = "Q-Q Plot of Precipitation Data")
qqline(precip, col = "red")
```
**Interpretation**
- **Theoretical Line**: The red line represents the expected relationship if the data were perfectly normally distributed.
- **Data Points**: The dots represent the actual empirical data.
If the points closely align with the theoretical line, we can conclude that the data likely follow a normal distribution. However, noticeable deviations from the line, particularly systematic patterns (e.g., curves or s-shaped patterns), indicate potential departures from normality.
Tips
1. **Small Deviations**: Minor deviations from the line in small datasets are not uncommon and may not significantly impact analyses that assume normality.
2. **Systematic Patterns**: Look for clear trends, such as clusters or s-shaped curves, which suggest skewness or heavy tails.
3. **Complementary Tests**: Always pair graphical methods with numerical measures (e.g., Shapiro-Wilk test) to make a robust conclusion.
When interpreting a Q-Q plot, it is helpful to see both ideal and non-ideal scenarios. Below is an illustrative example:
1. **Normal Data**: Points fall closely along the line.
2. **Skewed Data**: Points systematically deviate from the line, curving upward or downward.
3. **Heavy Tails**: Points deviate at the extremes (ends) of the distribution.
By combining visual inspection and numerical measures, we can better understand the nature of our data and its alignment with the assumption of normality.
### Summary Statistics
While graphical assessments, such as Q-Q plots, provide a visual indication of normality, they may not always offer a definitive conclusion. To supplement graphical methods, statistical tests are often employed. These tests provide quantitative evidence to support or refute the assumption of normality. The most common methods can be classified into two categories:
- [Methods Based on Normal Probability Plot]
- [Correlation Coefficient with Normal Probability Plots]
- [Shapiro-Wilk Test]
- [Methods based on empirical cumulative distribution function]
- [Anderson-Darling Test]
- [Kolmogorov-Smirnov Test]
- [Cramer-von Mises Test]
- [Jarque--Bera Test](#jarquebera-test)
#### Methods Based on Normal Probability Plot
##### Correlation Coefficient with Normal Probability Plots
As described by @Looney_1985 and @Shapiro_1972, this method evaluates the linearity of a normal probability plot by calculating the correlation coefficient between the ordered sample values $y_{(i)}$ and their theoretical normal quantiles $m_i^*$. A perfectly linear relationship suggests that the data follow a normal distribution.
The correlation coefficient, denoted $W^*$, is given by:
$$
W^* = \frac{\sum_{i=1}^{n}(y_{(i)}-\bar{y})(m_i^* - 0)}{\sqrt{\sum_{i=1}^{n}(y_{(i)}-\bar{y})^2 \cdot \sum_{i=1}^{n}(m_i^* - 0)^2}}
$$
where:
- $\bar{y}$ is the sample mean,
- $\bar{m^*} = 0$ under the null hypothesis of normality.
The **Pearson product-moment correlation formula** can also be used to evaluate this relationship:
$$
\hat{\rho} = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2 \cdot \sum_{i=1}^{n}(x_i - \bar{x})^2}}
$$
- **Interpretation**:
- When the correlation is 1, the plot is exactly linear, and normality is assumed.
- The closer the correlation is to 0, the stronger the evidence to reject normality.
- Inference on $W^*$ requires reference to special tables [@Looney_1985].
```{r}
library("EnvStats")
# Perform Probability Plot Correlation Coefficient (PPCC) Test
gofTest(data, test = "ppcc")$p.value # Probability Plot Correlation Coefficient
```
##### Shapiro-Wilk Test
The Shapiro-Wilk test [@Shapiro_1965] is one of the most widely used tests for assessing normality, especially for sample sizes $n < 2000$. This test evaluates how well the data's order statistics match a theoretical normal distribution. The test statistic, $W$, is computed as:
$$
W=\frac{\sum_{i=1}^{n}a_i x_{(i)}}{\sum_{i=1}^{n}(x_{(i)}-\bar{x})^2}
$$
where
- $n$: The sample size.
- $x_{(i)}$: The $i$-th smallest value in the sample (the ordered data).
- $\bar{x}$: The sample mean.
- $a_i$: Weights derived from the expected values and variances of the order statistics of a normal distribution, precomputed based on the sample size $n$.
**Sensitive** to:
1. **Symmetry**
- The Shapiro-Wilk test assesses whether a sample is drawn from a normal distribution, which assumes **symmetry** around the mean.
- If the data exhibit skewness (a lack of symmetry), the test is likely to reject the null hypothesis of normality.
2. **Heavy Tails**
- Heavy tails refer to distributions where extreme values (outliers) are more likely compared to a normal distribution.
- The Shapiro-Wilk test is also sensitive to such departures from normality because heavy tails affect the spread and variance, which are central to the calculation of the test statistic $W$.
Hence, the Shapiro-Wilk test's sensitivity to these deviations makes it a powerful tool for detecting non-normality only in small to moderate-sized samples. However:
- It is generally more sensitive to **symmetry** (skewness) than to **tail behavior** (kurtosis).
- In very large samples, even small deviations in symmetry or tail behavior may cause the test to reject the null hypothesis, even if the data is practically "normal" for the intended analysis.
- Small sample sizes may lack power to detect deviations from normality.
- Large sample sizes may detect minor deviations that are not practically significant.
**Key Steps**:
1. **Sort the Data:** Arrange the sample data in ascending order, yielding $x_{(1)}, x_{(2)}, \dots, x_{(n)}$.
2. **Compute Weights:** The weights $a_i$ are determined using a covariance matrix of the normal order statistics. These are optimized to maximize the power of the test.
3. **Calculate** $W$: Use the formula to determine $W$, which ranges from 0 to 1.
**Decision Rule**:
- **Null Hypothesis** ($H_0$): The data follows a normal distribution.
- **Alternative Hypothesis** ($H_1$): The data does not follow a normal distribution.
- A small $W$ value, along with a $p$-value below a chosen significance level (e.g., 0.05), leads to rejection of $H_0$.
- Under normality, $W$ approaches 1.
- Smaller values of $W$ indicate deviations from normality.
```{r}
# Perform Shapiro-Wilk Test (Default for gofTest)
EnvStats::gofTest(mtcars$mpg, test = "sw")
```
#### Methods Based on Empirical Cumulative Distribution Function
The **Empirical Cumulative Distribution Function (ECDF)** is a way to represent the distribution of a sample dataset in cumulative terms. It answers the question:
> "What fraction of the observations in my dataset are less than or equal to a given value $x$?"
The ECDF is defined as:
$$
F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(X_i \leq x)
$$
where:
- $F_(x)$: ECDF at value $x$.
- $n$: Total number of data points.
- $\mathbb{I}(X_i \leq x)$: Indicator function, equal to 1 if $X_i \leq x$, otherwise 0.
This method is especially useful for large sample sizes and can be applied to distributions beyond the normal (Gaussian) distribution.
Properties of the ECDF
1. **Step Function**: The ECDF is a step function that increases by $1/n$ at each data point.
2. **Non-decreasing**: As $x$ increases, $F_n(x)$ never decreases.
3. **Range**: The ECDF starts at 0 and ends at 1:
- $F_n(x) = 0$ for $x < \min(X)$.
- $F_n(x) = 1$ for $x \geq \max(X)$.
4. **Convergence**: As $n \to \infty$, the ECDF approaches the true cumulative distribution function (CDF) of the population.
Let's consider a sample dataset $\{3, 7, 7, 10, 15\}$. The ECDF at different values of $x$ is calculated as:
+------------+-----------------------------------------+----------------+---------------+
| $x$ | $\mathbb{I}(X_i \leq x)$ for each $X_i$ | Count $\leq x$ | ECDF $F_n(x)$ |
+============+=========================================+================+===============+
| $x = 5$ | $\{1, 0, 0, 0, 0\}$ | 1 | $1/5 = 0.2$ |
+------------+-----------------------------------------+----------------+---------------+
| $x = 7$ | $\{1, 1, 1, 0, 0\}$ | 3 | $3/5 = 0.6$ |
+------------+-----------------------------------------+----------------+---------------+
| $x = 12$ | $\{1, 1, 1, 1, 0\}$ | 4 | $4/5 = 0.8$ |
+------------+-----------------------------------------+----------------+---------------+
| $x = 15$ | $\{1, 1, 1, 1, 1\}$ | 5 | $5/5 = 1.0$ |
+------------+-----------------------------------------+----------------+---------------+
Applications of the ECDF
1. **Goodness-of-fit Tests**: Compare the ECDF to a theoretical CDF (e.g., using the Kolmogorov-Smirnov test).
2. **Outlier Detection**: Analyze cumulative trends to spot unusual data points.
3. **Visual Data Exploration**: Use the ECDF to understand the spread, skewness, and distribution of the data.
4. **Comparing Distributions**: Compare the ECDFs of two datasets to assess differences in their distributions.
```{r}
# Load required libraries
library(ggplot2)
# Sample dataset
data <- c(3, 7, 7, 10, 15)
# ECDF calculation
ecdf_function <- ecdf(data)
# Generate a data frame for plotting
ecdf_data <- data.frame(x = sort(unique(data)),
ecdf = sapply(sort(unique(data)), function(x)
mean(data <= x)))
# Display ECDF values
print(ecdf_data)
# Plot the ECDF
ggplot(ecdf_data, aes(x = x, y = ecdf)) +
geom_step() +
labs(
title = "Empirical Cumulative Distribution Function",
x = "Data Values",
y = "Cumulative Proportion"
) +
theme_minimal()
```
```{r}
# Alternatively
plot.ecdf(as.numeric(mtcars[1, ]),
verticals = TRUE,
do.points = FALSE)
```
##### Anderson-Darling Test
The **Anderson-Darling test** statistic [@Anderson_1952] is given by:
$$
A^2 = \int_{-\infty}^{\infty} \frac{\left(F_n(t) - F(t)\right)^2}{F(t)(1 - F(t))} dF(t)
$$
This test calculates a weighted average of squared deviations between the empirical cumulative distribution function (CDF), $F_n(t)$, and the theoretical CDF, $F(t)$. More weight is given to deviations in the tails of the distribution, which makes the test particularly sensitive to these regions.
For a sample of size $n$, with ordered observations $y_{(1)}, y_{(2)}, \dots, y_{(n)}$, the Anderson-Darling test statistic can also be written as:
$$
A^2 = -n - \frac{1}{n} \sum_{i=1}^n \left[ (2i - 1) \ln(F(y_{(i)})) + (2n + 1 - 2i) \ln(1 - F(y_{(i)})) \right]
$$
For the **normal distribution**, the test statistic is further simplified. Using the transformation:
$$
p_i = \Phi\left(\frac{y_{(i)} - \bar{y}}{s}\right),
$$
where:
- $p_i$ is the cumulative probability under the standard normal distribution,
- $y_{(i)}$ are the ordered sample values,
- $\bar{y}$ is the sample mean,
- $s$ is the sample standard deviation,
the formula becomes:
$$
A^2 = -n - \frac{1}{n} \sum_{i=1}^n \left[ (2i - 1) \ln(p_i) + (2n + 1 - 2i) \ln(1 - p_i) \right].
$$
Key Features of the Test
1. **CDF-Based Weighting**: The Anderson-Darling test gives more weight to deviations in the tails, which makes it particularly sensitive to detecting non-normality in these regions.
2. **Sensitivity**: Compared to other goodness-of-fit tests, such as the [Kolmogorov-Smirnov Test], the Anderson-Darling test is better at identifying differences in the tails of the distribution.
3. **Integral Form**: The test statistic can also be expressed as an integral over the theoretical CDF: $$
A^2 = n \int_{-\infty}^\infty \frac{\left[F_n(t) - F(t)\right]^2}{F(t)(1 - F(t))} dF(t),
$$ where $F_n(t)$ is the empirical CDF, and $F(t)$ is the specified theoretical CDF.
4. **Applications**:
- Testing for normality or other distributions (e.g., exponential, Weibull).
- Validating assumptions in statistical models.
- Comparing data to theoretical distributions.
Hypothesis Testing
- **Null Hypothesis (**$H_0$): The data follows the specified distribution (e.g., normal distribution).
- **Alternative Hypothesis (**$H_1$): The data does not follow the specified distribution.
- The null hypothesis is rejected if $A^2$ is too large, indicating a poor fit to the specified distribution.
Critical values for the test statistic are provided by [@Marsaglia_2004] and [@Stephens_1974].
Applications to Other Distributions
The Anderson-Darling test can be applied to various distributions by using specific transformation methods. Examples include:
- **Exponential**
- **Logistic**
- **Gumbel**
- **Extreme-value**
- **Weibull** (after logarithmic transformation: $\log(\text{Weibull}) = \text{Gumbel}$)
- **Gamma**
- **Cauchy**
- **von Mises**
- **Log-normal (two-parameter)**
For more details on transformations and critical values, consult [@Stephens_1974].
```{r}
# Perform Anderson-Darling Test
library(nortest)
ad_test_result <- ad.test(mtcars$mpg)
# Output the test statistic and p-value
ad_test_result
```
Alternatively, for a broader range of distributions, use the `gofTest` function from the `gof` package:
```{r}
# General goodness-of-fit test with Anderson-Darling
library(EnvStats)
gof_test_result <- EnvStats::gofTest(mtcars$mpg, test = "ad")
# Extract the p-value
gof_test_result$p.value
```
##### Kolmogorov-Smirnov Test
The **Kolmogorov-Smirnov (K-S) test** is a nonparametric test that compares the empirical cumulative distribution function (ECDF) of a sample to a theoretical cumulative distribution function (CDF), or compares the ECDFs of two samples. It is used to assess whether a sample comes from a specific distribution (one-sample test) or to compare two samples (two-sample test).
The test statistic $D_n$ for the one-sample test is defined as:
$$
D_n = \sup_x \left| F_n(x) - F(x) \right|,
$$
where:
- $F_n(x)$ is the empirical CDF of the sample,
- $F(x)$ is the theoretical CDF under the null hypothesis,
- $\sup_x$ denotes the supremum (largest value) over all possible values of $x$.
For the two-sample K-S test, the statistic is:
$$
D_{n,m} = \sup_x \left| F_{n,1}(x) - F_{m,2}(x) \right|,
$$
where $F_{n,1}(x)$ and $F_{m,2}(x)$ are the empirical CDFs of the two samples, with sizes $n$ and $m$, respectively.
Hypotheses
- **Null hypothesis (**$H_0$): The sample comes from the specified distribution (one-sample) or the two samples are drawn from the same distribution (two-sample).
- **Alternative hypothesis (**$H_1$): The sample does not come from the specified distribution (one-sample) or the two samples are drawn from different distributions (two-sample).
Properties
1. **Based on the Largest Deviation**: The K-S test is sensitive to the largest absolute difference between the empirical and expected CDFs, making it effective for detecting shifts in location or scale.
2. **Distribution-Free**: The test does not assume a specific distribution for the data under the null hypothesis. Its significance level is determined from the distribution of the test statistic under the null hypothesis.
3. **Limitations**:
- The test is more sensitive near the center of the distribution than in the tails.
- It may not perform well with discrete data or small sample sizes.
4. **Related Tests**:
- **Kuiper's Test**: A variation of the K-S test that is sensitive to deviations in both the center and tails of the distribution. The Kuiper test statistic is: $$
V_n = D^+ + D^-,
$$ where $D^+$ and $D^-$ are the maximum positive and negative deviations of the empirical CDF from the theoretical CDF.
Applications
- Testing for normality or other specified distributions.
- Comparing two datasets to determine if they are drawn from the same distribution.
To perform a one-sample K-S test in R, use the `ks.test()` function. To check the goodness of fit for a specific distribution, the `gofTest()` function from a package like `DescTools` can also be used.
```{r}
# One-sample Kolmogorov-Smirnov test for normality
data <- rnorm(50) # Generate random normal data
ks.test(data, "pnorm", mean(data), sd(data))
# Goodness-of-fit test using gofTest
library(DescTools)
gofTest(data, test = "ks")$p.value # Kolmogorov-Smirnov test p-value
```
- **Advantages**:
- Simple and widely applicable.
- Distribution-free under the null hypothesis.
- **Limitations**:
- Sensitive to sample size: small deviations may lead to significance in large samples.
- Reduced sensitivity to differences in the tails compared to the Anderson-Darling test.
The Kolmogorov-Smirnov test provides a general-purpose method for goodness-of-fit testing and sample comparison, with particular utility in detecting central deviations.
##### Cramer-von Mises Test
The **Cramer-von Mises (CVM) test** is a nonparametric goodness-of-fit test that evaluates the agreement between the empirical cumulative distribution function (ECDF) of a sample and a specified theoretical cumulative distribution function (CDF). Unlike the [Kolmogorov-Smirnov test], which focuses on the largest discrepancy, the Cramer-von Mises test considers the **average squared discrepancy** across the entire distribution. Unlike the [Anderson-Darling test], it weights all parts of the distribution equally.
The test statistic $W^2$ for the one-sample Cramer-von Mises test is defined as:
$$
W^2 = n \int_{-\infty}^\infty \left[ F_n(t) - F(t) \right]^2 dF(t),
$$
where:
- $F_n(t)$ is the empirical CDF,
- $F(t)$ is the specified theoretical CDF under the null hypothesis,
- $n$ is the sample size.
In practice, $W^2$ is computed using the ordered sample values $y_{(1)}, y_{(2)}, \dots, y_{(n)}$ as:
$$
W^2 = \sum_{i=1}^n \left( F(y_{(i)}) - \frac{2i - 1}{2n} \right)^2 + \frac{1}{12n},
$$
where:
- $F(y_{(i)})$ is the theoretical CDF evaluated at the ordered sample values $y_{(i)}$.
Hypotheses
- **Null hypothesis (H0)**: The sample data follow the specified distribution.
- **Alternative hypothesis (H1)**: The sample data do not follow the specified distribution.
Properties
1. **Focus on Average Discrepancy**: The Cramer-von Mises test measures the overall goodness-of-fit by considering the squared deviations across all points in the distribution, ensuring equal weighting of discrepancies.
2. **Comparison to Anderson-Darling**: Unlike the Anderson-Darling test, which gives more weight to deviations in the tails, the CVM test weights all parts of the distribution equally.
3. **Integral Representation**: The statistic is expressed as an integral over the squared differences between the empirical and theoretical CDFs.
4. **Two-Sample Test**: The Cramer-von Mises framework can also be extended to compare two empirical CDFs. The two-sample statistic is based on the pooled empirical CDF.
Applications
- Assessing goodness-of-fit for a theoretical distribution (e.g., normal, exponential, Weibull).
- Comparing two datasets to determine if they are drawn from similar distributions.
- Validating model assumptions.
To perform a Cramer-von Mises test in R, the `gofTest()` function from the `DescTools` package can be used. Below is an example:
```{r}
# Generate random normal data
data <- rnorm(50)
# Perform the Cramer-von Mises test
library(DescTools)
gofTest(data, test = "cvm")$p.value # Cramer-von Mises test p-value
```
- **Advantages**:
- Considers discrepancies across the entire distribution.
- Robust to outliers due to equal weighting.
- Simple to compute and interpret.
- **Limitations**:
- Less sensitive to deviations in the tails compared to the Anderson-Darling test.
- May be less powerful than the Kolmogorov-Smirnov test in detecting central shifts.
##### Jarque-Bera Test {#jarquebera-test}
The **Jarque-Bera (JB) test** [@Bera_1981] is a goodness-of-fit test used to check whether a dataset follows a normal distribution. It is based on the **skewness** and **kurtosis** of the data, which measure the asymmetry and the "tailedness" of the distribution, respectively.
The Jarque-Bera test statistic is defined as:
$$
JB = \frac{n}{6}\left(S^2 + \frac{(K - 3)^2}{4}\right),
$$
where:
- $n$ is the sample size,
- $S$ is the sample skewness,
- $K$ is the sample kurtosis.
Skewness ($S$) is calculated as:
$$
S = \frac{\hat{\mu}_3}{\hat{\sigma}^3} = \frac{\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^3}{\left(\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2\right)^{3/2}},
$$
where:
- $\hat{\mu}_3$ is the third central moment,
- $\hat{\sigma}$ is the standard deviation,
- $\bar{x}$ is the sample mean.
Kurtosis ($K$) is calculated as:
$$
K = \frac{\hat{\mu}_4}{\hat{\sigma}^4} = \frac{\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^4}{\left(\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2\right)^2},
$$
where:
- $\hat{\mu}_4$ is the fourth central moment.
Hypothesis
- **Null hypothesis (**$H_0$): The data follow a normal distribution, implying:
- Skewness $S = 0$,
- Excess kurtosis $K - 3 = 0$.
- **Alternative hypothesis (**$H_1$): The data do not follow a normal distribution.
**Distribution of the JB Statistic**
Under the null hypothesis, the Jarque-Bera statistic asymptotically follows a chi-squared distribution with 2 degrees of freedom:
$$
JB \sim \chi^2_2.
$$
Properties
1. **Sensitivity**:
- Skewness ($S$) captures asymmetry in the data.
- Kurtosis ($K$) measures how heavy-tailed or light-tailed the distribution is compared to a normal distribution.
2. **Limitations**:
- The test is sensitive to large sample sizes; even small deviations from normality may result in rejection of $H_0$.
- Assumes that the data are independently and identically distributed.
Applications
- Testing normality in regression residuals.
- Validating distributional assumptions in econometrics and time series analysis.
The Jarque-Bera test can be performed in R using the `tseries` package:
```{r}
library(tseries)
# Generate a sample dataset
data <- rnorm(100) # Normally distributed data
# Perform the Jarque-Bera test
jarque.bera.test(data)
```
## Bivariate Statistics
Bivariate statistics involve the analysis of relationships between two variables. Understanding these relationships can provide insights into patterns, associations, or (suggestive of) causal connections. Below, we explore the correlation between different types of variables:
- [Two Continuous] **Variables**
- [Two Discrete] **Variables**
- [Categorical and Continuous] **Variables**
Before delving into the analysis, it is critical to consider the following:
1. **Is the relationship linear or non-linear?**
- Linear relationships can be modeled with simpler statistical methods such as Pearson's correlation, while non-linear relationships may require alternative approaches, such as Spearman's rank correlation or regression with transformations.
2. **If the variable is continuous, is it normal and homoskedastic?**
- For parametric methods like Pearson's correlation, assumptions such as normality and homoskedasticity (equal variance) must be met. When these assumptions fail, non-parametric methods like Spearman's correlation or robust alternatives are preferred.
3. **How big is your dataset?**
- Large datasets can reveal subtle patterns but may lead to statistically significant results that are not practically meaningful. For smaller datasets, careful selection of statistical methods is essential to ensure reliability and validity.
+-----------------+------------------------------------------------------------+-----------------------------------------------+
| | Categorical | Continuous |
+=================+============================================================+===============================================+
| **Categorical** | [Chi-squared Test] | |
| | | |
| | [Phi Coefficient] | |
| | | |
| | [Cramer's V](#cramers-v) | |
| | | |
| | [Tschuprow's T](#tschuprows-t) | |
| | | |
| | [Spearman's Rank Correlation](#spearmans-rank-correlation) | |
| | | |
| | [Kendall's Tau](#kendalls-tau) | |
| | | |
| | [Gamma Statistic] | |
| | | |
| | [Freeman's Theta](#freemans-theta) | |
| | | |
| | [Epsilon-squared] | |
| | | |
| | [Goodman Kruskal's Gamma](#goodman-kruskals-gamma) | |
| | | |
| | [Somers' D](#somers-d) | |
| | | |
| | [Kendall's Tau-b](#kendalls-tau-b) | |
| | | |
| | [Yule's Q and Y](#yules-q-and-y) | |
| | | |
| | [Tetrachoric Correlation] | |
| | | |
| | [Polychoric Correlation] | |
+-----------------+------------------------------------------------------------+-----------------------------------------------+
| **Continuous** | [Point-Biserial Correlation] | [Pearson Correlation](#pearson-correlation-1) |
| | | |
| | [Logistic Regression] | [Spearman Correlation] |
+-----------------+------------------------------------------------------------+-----------------------------------------------+
### Two Continuous
```{r}
set.seed(1)
n = 100 # (sample size)
data = data.frame(A = sample(1:20, replace = TRUE, size = n),
B = sample(1:30, replace = TRUE, size = n))
```
#### Pearson Correlation
**Pearson correlation** quantifies the strength and direction of a **linear relationship** between two continuous variables.
Formula:
$$
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \cdot \sum (y_i - \bar{y})^2}}
$$ where
- $x_i, y_i$: Individual data points of variables $X$ and $Y$.
- $\bar{x}, \bar{y}$: Means of $X$ and $Y$.
**Assumptions**:
1. The relationship between variables is **linear**.
2. Variables are **normally distributed**.
3. Data exhibits **homoscedasticity** (equal variance of $Y$ for all values of $X$).
Use Case:
- Use when the relationship is expected to be linear, and assumptions of normality and homoscedasticity are met.
Interpretation:
- $r = +1$: Perfect positive linear relationship.
- $r = -1$: Perfect negative linear relationship.
- $r = 0$: No linear relationship.
```{r}
# Pearson correlation
pearson_corr <- stats::cor(data$A, data$B, method = "pearson")
cat("Pearson Correlation (r):", pearson_corr, "\n")
```
#### Spearman Correlation
**Spearman correlation** measures the strength of a **monotonic relationship** between two variables. It ranks the data and calculates correlation based on ranks.
Formula:
$$
\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 -1)}
$$
where
- $d_i$: Difference between the ranks of $x_i$ and $y_i$.
- $n$: Number of paired observations.
**Assumptions**:
1. Relationship must be **monotonic**, not necessarily linear.
2. No assumptions about the distribution of variables.
Use Case:
- Use when data is ordinal or when normality and linearity assumptions are violated.
Interpretation:
- $\rho = +1$: Perfect positive monotonic relationship.
- $\rho = -1$: Perfect negative monotonic relationship.
- $\rho = 0$: No monotonic relationship.
```{r}
# Spearman correlation
spearman_corr <- stats::cor(data$A, data$B, method = "spearman")
cat("Spearman Correlation (rho):", spearman_corr, "\n")