# Preliminary Analyses {#preliminaries}
[Screencasted Lecture Link](https://youtube.com/playlist?list=PLtz5cFLQl4KOoTRizMIyRU0ioiuSM1tWg)
```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA) #keeps out the hashtags in the knits
options(scipen=999)#eliminates scientific notation
```
The beginning of any data analysis means familiarizing yourself with the data. Among other things, this includes producing and interpreting its distributional characteristics. In this lesson we mix common R operations for formatting, preparing, and analyzing data with foundational statistical concepts.
## Navigating this Lesson
There is just under two hours of lecture. If you work through the lesson with me, I would plan for an additional three hours.
While the majority of R objects and data you will need are created within the R script that sources the lesson, occasionally there are some that cannot be created from within the R framework. Additionally, sometimes links fail. All original materials are provided at the [Github site](https://github.com/lhbikos/ReCenterPsychStats) that hosts the book. More detailed guidelines for ways to access all these materials are provided in the OER's [introduction](#ReCintro).
### Learning Objectives
Learning objectives from this lecture include the following:
* Determine the appropriate scale of measurement for variables and format them properly in R.
* Produce and interpret measures of central tendency.
* Analyze the distributional characteristics of data.
* Describe the steps in calculating a standard deviation.
* Describe the steps in calculating a bivariate correlation coefficient (i.e., Pearson *r*).
* Create an APA Style table and results section that includes means, standard deviations, and correlations and addresses skew and kurtosis.
### Planning for Practice
The practice assignment at the end of the lesson is designed as a "get (or 'get back') into it" assignment. You will essentially work through this very same lecture, using the same dataframe; you will simply use a different set of continuous variables.
### Readings & Resources
In preparing this chapter, I drew heavily from the following resource(s). Other resources are cited (when possible, linked) in the text with complete citations in the reference list.
* Revelle, W. (2021). An introduction to the psych package: Part I: data entry and data description. 60.
- Revelle is the author/creator of the *psych* package. His tutorial provides both technical and interpretive information. Read pages 1-17.
* Lui, P. P. (2020). Racial microaggression, overt discrimination, and distress: (In)Direct associations with psychological adjustment. *The Counseling Psychologist, 32*.
- This is the research vignette from which I simulate data that we can use in the lesson and practice problem.
```{r include=FALSE}
#will install the package if not already installed
#if(!require(tidyverse)){install.packages("tidyverse")}
#if(!require(lsr)){install.packages("lsr")}
#if(!require(psych)){install.packages("psych")}
#if(!require(apaTables)){install.packages("apaTables")}
#if(!require(MASS)){install.packages("MASS")}
#if(!require(knitr)){install.packages("knitr")}
#if(!require(ggpubr)){install.packages("ggpubr")}
#if(!require(dplyr)){install.packages("dplyr")}
#if(!require(pastecs)){install.packages("pastecs")}
```
## Research Vignette
We will use data simulated from Lui [-@lui_racial_2020] as the research vignette. Controlling for overt discrimination and neuroticism, Lui examined the degree to which racial microaggressions contributed to negative affect, alcohol consumption, and drinking problems in African American, Asian American, and Latinx American college students (*N* = 713).
Using the means, standard deviations, correlation matrix, and group sizes (*n*), I simulated the data. Although I provide some narration of what I did, the process of simulation is beyond the learning goals of this lesson, so you are welcome to skip it. Simulating data within each chapter makes the lesson more "portable."
```{r warning=FALSE, message=FALSE, tidy=TRUE, tidy.opts=list(width.cutoff=70)}
set.seed(210807)#sets the random seed so that we consistently get the same results
#for practice, you could change (or remove) the random seed and try to interpret the results (they should be similar)
#There are probably more efficient ways to simulate data. Given the information available in the manuscript, my approach was to first create separate datasets for each of the racial ethnic groups and then bind them together.
#First, the data for the students who identified as Asian American
Asian_mu <- c(1.52, 1.72, 2.69, 1.71, 2.14, 2.35, 2.42) #creating an object containing the means
Asian_stddev <- c(2.52, 2.04, 0.47, 0.70, 0.80, 2.41, 3.36) # creating an object containing the standard deviations
Asian_corMat <- matrix(c(1.00, 0.69, 0.19, 0.28, 0.32, 0.08, 0.23,
0.69, 1.00, 0.20, 0.29, 0.33, 0.13, 0.25,
0.19, 0.20, 1.00, 0.50, 0.50, -0.04, 0.09,
0.28, 0.29, 0.50, 1.00, 0.76, 0.04, 0.18,
0.32, 0.33, 0.50, 0.76, 1.00, 0.10, 0.21,
0.08, 0.13, -0.04, 0.04, 0.10, 1.00, 0.42,
0.23, 0.25, 0.09, 0.18, 0.21, 0.42, 1.00),
ncol=7) # creating an object containing the correlation matrix
Asian_covMat <- Asian_stddev %*% t(Asian_stddev) * Asian_corMat #creating a covariance matrix from the above objects
Asian_dat <- MASS::mvrnorm(n = 398, mu = Asian_mu, Sigma = Asian_covMat, empirical = TRUE) #creating the dataset
Asian_df <- as.data.frame(Asian_dat) #formatting the dataset as a data frame
library(tidyverse)
#renaming the variables
Asian_df <- rename(Asian_df, OvDisc = V1, mAggr = V2, Neuro = V3, nAff = V4, psyDist = V5, Alcohol = V6, drProb = V7)
#set upper and lower bound for each variable
Asian_df$OvDisc[Asian_df$OvDisc > 16] <- 16
Asian_df$OvDisc[Asian_df$OvDisc < 0] <- 0
Asian_df$mAggr[Asian_df$mAggr > 16] <- 16
Asian_df$mAggr[Asian_df$mAggr < 0] <- 0
Asian_df$Neuro[Asian_df$Neuro > 5] <- 5
Asian_df$Neuro[Asian_df$Neuro < 1] <- 1
Asian_df$nAff[Asian_df$nAff > 4] <- 4
Asian_df$nAff[Asian_df$nAff < 1] <- 1
Asian_df$psyDist[Asian_df$psyDist > 5] <- 5
Asian_df$psyDist[Asian_df$psyDist < 1] <- 1
Asian_df$Alcohol[Asian_df$Alcohol > 12] <- 12
Asian_df$Alcohol[Asian_df$Alcohol < 0] <- 0
Asian_df$drProb[Asian_df$drProb > 12] <- 12
Asian_df$drProb[Asian_df$drProb < 0] <- 0
Asian_df$RacEth <- "Asian"
#Second, the data for the students who identified as Black/African American
Black_mu <- c(4.45, 3.84, 2.60, 1.84, 2.10, 2.81, 2.14)
Black_stddev <- c(4.22, 3.08, 0.89, 0.80, 0.81, 2.49, 3.24)
Black_corMat <- matrix(c( 1.00, 0.81, 0.17, 0.15, 0.09, 0.05, -0.16,
0.81, 1.00, 0.17, 0.21, 0.11, 0.09, -0.01,
0.17, 0.17, 1.00, 0.59, 0.54, 0.05, 0.24,
0.15, 0.21, 0.59, 1.00, 0.72, 0.12, 0.22,
0.09, 0.11, 0.54, 0.72, 1.00, 0.21, 0.40,
0.05, 0.09, 0.05, 0.12, 0.21, 1.00, 0.65,
-0.16,-0.01, 0.24, 0.22, 0.40, 0.65, 1.00),
ncol = 7)
Black_covMat <- Black_stddev %*% t(Black_stddev) * Black_corMat
Black_dat <- MASS::mvrnorm(n = 133, mu = Black_mu, Sigma = Black_covMat, empirical = TRUE)
Black_df <- as.data.frame(Black_dat)
Black_df <- rename(Black_df, OvDisc = V1, mAggr = V2, Neuro = V3, nAff = V4, psyDist = V5, Alcohol = V6, drProb = V7)
#set upper and lower bound for each variable
Black_df$OvDisc[Black_df$OvDisc > 16] <- 16
Black_df$OvDisc[Black_df$OvDisc < 0] <- 0
Black_df$mAggr[Black_df$mAggr > 16] <- 16
Black_df$mAggr[Black_df$mAggr < 0] <- 0
Black_df$Neuro[Black_df$Neuro > 5] <- 5
Black_df$Neuro[Black_df$Neuro < 1] <- 1
Black_df$nAff[Black_df$nAff > 4] <- 4
Black_df$nAff[Black_df$nAff < 1] <- 1
Black_df$psyDist[Black_df$psyDist > 5] <- 5
Black_df$psyDist[Black_df$psyDist < 1] <- 1
Black_df$Alcohol[Black_df$Alcohol > 12] <- 12
Black_df$Alcohol[Black_df$Alcohol < 0] <- 0
Black_df$drProb[Black_df$drProb > 12] <- 12
Black_df$drProb[Black_df$drProb < 0] <- 0
Black_df$RacEth <- "Black"
#Third, the data for the students who identified as Latinx American
Latinx_mu <- c(1.56, 2.34, 2.69, 1.81, 2.17, 3.47, 2.69)
Latinx_stddev <- c(2.46, 2.49, 0.86, 0.71, 0.78, 2.59, 3.76)
Latinx_corMat <- matrix(c( 1.00, 0.78, 0.27, 0.36, 0.42, -0.06, 0.08,
0.78, 1.00, 0.33, 0.26, 0.35, -0.11, -0.02,
0.27, 0.33, 1.00, 0.62, 0.64, -0.04, 0.15,
0.36, 0.26, 0.62, 1.00, 0.81, -0.08, 0.17,
0.42, 0.35, 0.64, 0.81, 1.00, -0.06, 0.15,
-0.06,-0.11, -0.04, -0.08, -0.06, 1.00, 0.60,
0.08, -0.02, 0.15, 0.17, 0.15, 0.60, 1.00),
ncol = 7)
Latinx_covMat <- Latinx_stddev %*% t(Latinx_stddev) * Latinx_corMat
Latinx_dat <- MASS::mvrnorm(n = 182, mu = Latinx_mu, Sigma = Latinx_covMat, empirical = TRUE)
Latinx_df <- as.data.frame(Latinx_dat)
Latinx_df <- rename(Latinx_df, OvDisc = V1, mAggr = V2, Neuro = V3, nAff = V4, psyDist = V5, Alcohol = V6, drProb = V7)
Latinx_df$OvDisc[Latinx_df$OvDisc > 16] <- 16
Latinx_df$OvDisc[Latinx_df$OvDisc < 0] <- 0
Latinx_df$mAggr[Latinx_df$mAggr > 16] <- 16
Latinx_df$mAggr[Latinx_df$mAggr < 0] <- 0
Latinx_df$Neuro[Latinx_df$Neuro > 5] <- 5
Latinx_df$Neuro[Latinx_df$Neuro < 1] <- 1
Latinx_df$nAff[Latinx_df$nAff > 4] <- 4
Latinx_df$nAff[Latinx_df$nAff < 1] <- 1
Latinx_df$psyDist[Latinx_df$psyDist > 5] <- 5
Latinx_df$psyDist[Latinx_df$psyDist < 1] <- 1
Latinx_df$Alcohol[Latinx_df$Alcohol > 12] <- 12
Latinx_df$Alcohol[Latinx_df$Alcohol < 0] <- 0
Latinx_df$drProb[Latinx_df$drProb > 12] <- 12
Latinx_df$drProb[Latinx_df$drProb < 0] <- 0
Latinx_df$RacEth <- "Latinx"
#binding the datasets together
Lui_sim_df <- bind_rows(Asian_df, Black_df, Latinx_df)
```
If you have simulated the data, you can continue using the "Lui_sim_df" object that we created. In your own research you will likely work with a datafile stored on your computer. Although I will hashtag the code out (making it inoperable until the hashtags are removed), here is a script to save the simulated data as both .csv (think "Excel lite") and .rds (which retains all the properties we specified in R) files and then import them back into R. For more complete instructions see the [Ready_Set_R](#Ready) lesson.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#write the simulated data as a .csv
#write.table(Lui_sim_df, file="Lui_CSV.csv", sep=",", col.names=TRUE, row.names=FALSE)
#bring back the simulated data from a .csv file
#df <- read.csv ("Lui_CSV.csv", header = TRUE)
```
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#to save the df as an .rds (think "R object") file on your computer;
#it should save in the same file as the .rmd file you are working with
#saveRDS(Lui_sim_df, "Lui_RDS.rds")
#bring back the simulated data from an .rds file
#df <- readRDS("Lui_RDS.rds")
```
You may have noticed a couple of things in each of these operations:
* First, I named the data object to include a "df" (i.e., dataframe).
- It is a common (but not required) practice for researchers to simply use "df" or "dat" as the name of the object that holds their data. This practice has advantages (e.g., making the re-use of code quite easy across datasets) and disadvantages (e.g., it is easy to get confused about what data is being used).
* Second, when you run the code, any updating *replaces* the prior object.
- While this is irrelevant today (we are saving the same data with different names), it points out the importance of creating a sensible and systematic *order of operations* in your .rmd files and then knowing where you are in the process.
Because the data is simulated, I can simply use the data I created in the simulation; however, I will go ahead and use the convention of renaming it "df", which (in this case) stands for *dataframe* and is the common term for a dataset among users of R. *A quick note: in statistics, "df" is also an abbreviation for "degrees of freedom."*
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
df <- Lui_sim_df
```
## Variable Types (Scale of Measurement)
When working with raw data, we begin by inspecting and preparing it for the planned analyses. The *type* of variables we have influences what statistics we will utilize. Further, the data must be formatted as that type in order for the statistic to properly execute. Variable types (or formats) are directly connected to the statistical concept of *measurement scale* (or *scale of measurement*). Researchers often think of the *categorical versus continuous* distinction, but it's even more nuanced than that.
### Measurement Scale
**Categorical** variables name *discrete* or *distinct* entities where the categorization has no inherent value or order. When there are two categories, the variable type is **binary** (e.g., pregnant or not, treatment and control conditions). When there are more than two categories, the variable type is **nominal** (e.g., teacher, student, or parent; Republican, Democrat, or Independent).
**Ordinal** variables are also categorical variables where the score reflects a logical order or relative rank (e.g., the order of finishing in a race). A challenge with the ordinal scale is the inability to determine the distance between rankings. The percentile rank is a (sometimes surprising) example of the ordinal scale. Technically, Likert-type scaling (e.g., providing ratings on a 1-to-5 scale) is ordinal because it is uncertain that the distance between each of the anchors is equal. Practically, though, most researchers treat Likert-type scales as interval. This is facilitated, in part, because most Likert-type scales have multiple items that are averaged into a single score. Navarro [-@navarro_book_2020] uses the term **quasi-interval** to describe Likert-type scaling.
**Continuous** variables can take on any value in the measurement scale that is being used. **Interval** level data has equal distances between each unit on the scale. Two classic examples of interval level data are temperature and year. Whether using Fahrenheit or Celsius, the rating of 0 does not mean there is an absence of temperature, rather, it is simply a number along a continuum of temperature. Another interval example is calendrical time. In longitudinal research, we frequently note the date or year (e.g., 2019) of an event. It is highly unlikely that the value zero will appear in our research and if it did, it would not represent the absence of time. A researcher can feel confident that a variable is on the interval scale if the values can be meaningfully added and subtracted.
**Ratio** level data also has equal distances between each unit on the scale, plus it has a true zero point where the zero indicates absence. Examples are behavioral counts (e.g., cigarettes smoked) and time-on-task (e.g., 90 seconds). Ratio data offers more manipulative power because researchers can add, subtract, multiply, and divide ratio level data.
### Corresponding Variable Structure in R
With these definitions in mind, we will see if R is reading our variables correctly. R will provide the following designations of variables:
|Abbreviation |Unabbreviated | Used for |Scale of Measurement |
|:------------|:-------------|:----------------------------------------|:---------------------------------|
|num |numerical |numbers that allow decimals or fractional values|quasi-interval, interval, or ratio|
|int |integer |whole numbers (no decimals) |quasi-interval, interval, or ratio|
|chr |character |sometimes termed "string" variables, these are interpreted as words|NA|
|Factor |factor |two or more categories; R imposes an alphabetical order; the user can re-specify the order based on the logic of the design|nominal|
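To see these designations in action, here is a brief sketch with made-up values (not from our dataset) showing how each type can be created and checked with base R's *class()* function:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#hypothetical values, for illustration only
class(c(2.5, 3.1)) #numeric; displayed as "num" in str() output
class(c(2L, 3L)) #integer ("int"); the L suffix forces whole numbers
class(c("teacher", "student")) #character ("chr")
class(factor(c("teacher", "student"))) #factor
```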
Looking back at the Lui [-@lui_racial_2020] article we can determine what the scale of measurement is for each variable and what the corresponding R format for that variable should be:
|Name |Variable |How assessed |Scale of measurement |R format|
|:-----|:------------------|:----------------------------|:--------------------|:--------- |
|OvDis |Overt racial discrimination|9 items, 1-to-4 Likert scaling for frequency and stressfulness assessed separately, then multiplied|quasi-interval|numerical|
|mAggr |Racial and ethnic microaggressions|28 items, 1-to-4 Likert scaling for frequency and stressfulness assessed separately, then multiplied|quasi-interval|numerical|
|Neuro |Neuroticism|4 items, 1-to-5 Likert scaling |quasi-interval|numerical|
|nAff |Negative affect|6 items, 1-to-4 Likert scaling |quasi-interval|numerical|
|psyDist |Psychological distress|6 items, 1-to-5 Likert scaling |quasi-interval|numerical|
|Alcohol|Hazardous alcohol use|10 items, 0-to-4 Likert scaling |quasi-interval|numerical|
|drProb|Drinking problems|10 items, 0-to-4 Likert scaling |quasi-interval|numerical|
|RacEth|Race Ethnicity|3 categories |nominal|factor|
We can examine the accuracy with which R interpreted the type of each variable with the *str()* (structure) function.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
str(df)
```
Only Race/Ethnicity needs to be transformed from a character ("chr") variable to a factor. I will use the *mutate()* function in the *dplyr* package to convert the RacEth variable to a factor with three levels.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
library(tidyverse)
df <- df %>%
dplyr::mutate(
RacEth = as.factor(RacEth))
```
Let's check the structure again. Below we see that the RacEth variable is now a factor. R has imposed an alphabetical order: Asian, Black, Latinx.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#checking the structure of the data
str(df)
```
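If the alphabetical ordering did not suit our purposes (for example, if we wanted a different group listed first in tables and plots), we could re-specify the levels ourselves. Below is a minimal, hashtagged-out sketch; the ordering shown is arbitrary and only for illustration:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#re-specifying the order of the factor levels (this changes no values, only their ordering)
#df$RacEth <- factor(df$RacEth, levels = c("Latinx", "Black", "Asian"))
#levels(df$RacEth) #displays the levels in their current order
```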
## Descriptive Statistics
While the majority of this OER (and statistics training in general) concerns the ability to make predictions or inferences (hence *inferential statistics*) from data, we almost always begin data analysis by describing it (hence, *descriptive statistics*).
Our research vignette contains a number of variables. Lui [-@lui_racial_2020] was interested in predicting negative affect, alcohol consumption, and drinking problems from overt discrimination, microaggressions, and neuroticism, through psychological distress. This research model is a *mediation* model (or model of indirect effects) and is beyond the learning objectives of today's instruction. In demonstrating descriptive statistics, we will focus on one of the dependent variables: negative affect.
As we begin to explore the descriptive and distributional characteristics of this variable, it may be helpful to visualize it through a histogram.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70), warning=FALSE}
ggpubr::gghistogram(df$nAff, xlab="Negative Affect", ylab = "Frequency", add = "mean", rug=TRUE, color = "#00AFBB", title = "Frequencies of Negative Affect")
```
### Measures of Central Tendency
Describing data almost always begins with *measures of central tendency*: the mean, median, and mode.
#### Mean
The **mean** is simply a mathematical average of the non-missing data. The mathematical formula is frequently expressed this way:
$$\bar{X} = \frac{X_{1} + X_{2} + X_{3}...+ X_{N}}{N}$$
Because this formula is clumsy to write, there is statistical shorthand to help us convey it more efficiently (though not necessarily more easily).
Placing information below (where to start), above (where to stop), and to the right (what data to use) of the summation operator ($\sum$) provides information about the nature of the data. In the formula below, we learn from the notation to the right that we use the individual data in the vector X. We start with the first piece of data (*i* = 1) and stop with the *N*th (or last) case.
$$\sum_{i=1}^{N}X_{i}$$
The $\frac{1}{N}$ notation to the left tells us to divide that sum by *N*, which yields the mean.
$$\bar{X}=\frac{1}{N}\sum_{i=1}^{N}X_{i}$$
R is an incredible tool in that we can type out mathematical operations, use functions from base R, and use packages to do the work for us. If we had the following toy dataset (2, 3, 2, 1, 5, NA) we could calculate the mean by typing it out:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
(2 + 3 + 2 + 1 + 5)/5
```
Alternatively we could use the built-in functions in base R to do the work for us. Let me add a little complexity by creating a single variable (a vector of data) and introducing a little missingness (i.e., the "NA").
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
toy <- c(2, 3, 2, 1, 5, NA)
toy<-as.data.frame(toy)
```
I can use the base R function *mean()*. Inside the parentheses I point to the data. The function automatically sums the values and divides by the count. When there is missingness, adding *na.rm=TRUE* tells the function to exclude the missing values from the calculation (i.e., the denominator would still be 5, the number of non-missing values).
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
mean(toy$toy, na.rm=TRUE)
```
In my simulation of the research vignette we have no missing values; nonetheless, it is perhaps a good habit to include the *na.rm=TRUE* specification in our code. Because we have an entire dataframe, we just point to the dataframe and the specific variable (i.e., negative affect).
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
mean(df$nAff, na.rm=TRUE)
```
#### Median
The middle value in a set of values is the **median**. The easiest way to calculate the median is to sort the numbers:
|Unsorted |Sorted |
|:-----------------|:-------------:|
|2, 3, 2, 1, 5 |1, 2, 2, 3, 5 |
Then we select the middle value. Because we have an odd number of values (*N* = 5), our median is 2. If we had an even number of values, we would take the average of the middle two numbers.
We can use a base R function to calculate the median for us. Let's do it first with the toy data:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
median(toy$toy, na.rm=TRUE)
```
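If the toy data had contained an even number of values, the *median()* function would average the two middle values. A quick demonstration with a made-up, even-length vector:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#hypothetical even-length vector, for illustration only
median(c(1, 2, 3, 5)) #averages the middle two values (2 and 3) to return 2.5
```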
Let's also calculate it for the negative affect variable from the research vignette.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
median(df$nAff, na.rm=TRUE)
```
#### Mode
The **mode** is the score that occurs most frequently. When a histogram is available, spotting the mode is easy because it will have the tallest bar. Determining the mode can be complicated if there are ties for the most frequently occurring value, as happens in a **bimodal** distribution.
Unfortunately, there is no base R function that returns the mode. In response, Navarro developed and included a function in the *lsr* package that accompanies her [-@navarro_book_2020] textbook. Once the package is installed, you can type the package name, two colons, the function name, and then (in parentheses) the data to retrieve the mode.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
lsr::modeOf(toy$toy)
```
For our toy data, the *modeOf()* function returns a 2.
Let's retrieve the mode from the negative affect variable in our research vignette.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
lsr::modeOf(df$nAff)
```
The value is 1.0 and is likely an artifact of how I simulated the data. Specifically, to ensure that the values fell within the 1-to-4 range, I set any values lower than 1.0 to 1.0 and any values higher than 4.0 to 4.0.
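Returning to the issue of ties mentioned above: my understanding is that *lsr::modeOf()* returns every value that is tied for the highest frequency, which is handy for spotting bimodal patterns. A quick sketch with a made-up vector:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#hypothetical vector in which 1 and 3 each occur twice
lsr::modeOf(c(1, 1, 2, 3, 3))
```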
#### Relationship between mean, median, and mode
Many inferential statistics rely on manipulations of the mean. The mean, though, can be misleading when it is influenced by outliers. Therefore, as we engage in preliminary exploration, it can be quite useful to calculate all three measures of central tendency, as well as exploring other distributional characteristics.
As a bit of an advanced cognitive organizer, it may be helpful to know that in a normal distribution, the mean, median, and mode are the same number (or quite close). In a positively skewed distribution, the mean is higher than the median which is higher than the mode. In a negatively skewed distribution, the mean is lower than the median, which is lower than the mode.
```{r}
mean(df$nAff, na.rm=TRUE)
median(df$nAff, na.rm=TRUE)
lsr::modeOf(df$nAff, na.rm=TRUE)
```
In our research vignette, the mean (1.81) is higher than the median (1.75), which is higher than the mode (1.0). This would suggest a positive skew. Here is a reminder of our histogram:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70), warning=FALSE}
ggpubr::gghistogram(df$nAff, xlab="Negative Affect", ylab = "Frequency", add = "mean", rug=TRUE, color = "#00AFBB", title = "Frequencies of Negative Affect")
```
## Variability
Researchers are critically interested in the spread or dispersion of the scores.
### Range
The **range** is the simplest assessment of variability and is calculated by identifying the highest and lowest scores and subtracting the lowest from the highest. In our toy dataset, arranged from low-to-high (1, 2, 2, 3, 5), the low is 1 and the high is 5; the range is 4. We can retrieve these values with three base R functions that return the minimum score, the maximum score, or both together -- the range:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
min(toy$toy, na.rm=TRUE)
max(toy$toy, na.rm=TRUE)
range(toy$toy, na.rm=TRUE)
```
The negative affect variable from our research vignette has the following range:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
min(df$nAff)
max(df$nAff)
range(df$nAff)
```
With a low of 1 and high of 4, the range of negative affect is 3. This is consistent with the description of the negative affect measure.
One limitation of the range is that it is easily influenced by extreme scores.
### Percentiles, Quantiles, Interquartile Range
The **interquartile range** is the middle 50% of the data, or the scores that fall between the 25th and 75th percentiles. Before calculating that, let's first define **quantiles** and **percentiles**. **Quantiles** are values that split the data into equal portions. **Percentiles** divide the data into 100 equal parts. Percentiles are commonly used in testing and assessment. You may have encountered them in standardized tests such as the SAT and GRE where both the score obtained and its associated percentile are reported. When graduate programs evaluate GRE scores, depending on their criteria and degree of competitiveness, they may set a threshold based on percentiles (e.g., using a cutoff of the 50th, 75th, or higher percentile for the verbal or quantitative GRE scores).
We have already learned the value of the median. The median is also the 50th percentile. We can now use the *quantile()* function and indicate we want the value at the 50th percentile.
Let's first examine the toy dataset:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
median(toy$toy, na.rm=TRUE)
quantile(toy$toy, probs = .50, na.rm=TRUE )
```
As shown by our calculation, the value at the median and the 50th percentile is 2.0. Let's look at those values for the research vignette:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
median (df$nAff, na.rm=TRUE)
quantile(df$nAff, probs = .50, na.rm=TRUE )
```
Again, we see the same result. Half of the values for negative affect are below 1.76; half are above.
The *quantile()* function is extremely useful. We can retrieve the raw score at any percentile, and we could ask for as many as we desired. Here's an example.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
quantile(df$nAff, probs=c(.10, .20, .30, .40, .50, .60, .70, .80, .90))
```
**Quartiles** divide the data into four equal parts. The **interquartile range** is the spread of data between the 25th and 75th percentiles (or quartiles). We calculate the interquartile range by first obtaining those values, and then subtracting the lower from the higher.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
quantile(df$nAff, probs = c(.25,.75) )
```
We see that a score of 1.29 is at the 25th percentile and a score of 2.24 is at the 75th percentile. If we subtract 1.29 from 2.24...
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
2.24 - 1.29
```
...we learn that the interquartile range is 0.95. We could also obtain this value by using the *IQR()* function in base R.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
IQR(df$nAff, na.rm=TRUE)
```
You may be asking, "When would we use the interquartile range?" When data are influenced by **outliers** (i.e., extreme scores), using a more truncated range (the middle 50%, 75%, or 90%) may be an option (if the dataset is large enough). At this point, though, the goal of this lesson is simply to introduce different ways of examining the variability in a dataset. Ultimately, we are working our way to the **standard deviation**. The next logical step is the **mean deviation.**
### Deviations around the Mean
Nearly all statistics include assessments of variability in their calculation and most are based on deviations around the mean. In fact, it might be good to pause for a moment and consider that, as the lessons in this OER (and those that follow) continue, we will be engaged in *mathematical and statistical modeling*. In a featured article in the *American Psychologist*, Rodgers [-@rodgers_epistemology_2010] described models as a representation of reality that has two features:
* the model describes reality in some important ways, and
* the model is simpler than reality.
Albeit one of the simplest, the mean is a statistical model. Rodgers noted this when he wrote, "The mean and variance have done yeoman service to psychology and other behavioral sciences" [-@rodgers_epistemology_2010, p. 4]. These next statistical operations walk through the use of the mean, particularly its role in understanding variance. In later lessons, means and variances are used in understanding relations and differences.
A first step in understanding mean deviation is to ask, "How far does each individual score deviate from the mean of the scores?" We can demonstrate this with our toy dataset. I am taking more steps than necessary to (a) make clear how the mean deviation (abbreviated mdev) is calculated and (b) practice using R.
First, I will create a variable representing the mean:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#Dissecting the script,
#each variable is referenced by df_nameDOLLARSIGNvariable_name
toy$mean <- mean(toy$toy, na.rm=TRUE)
head(toy)#displays the first 6 rows of the data
```
Next, I will subtract the mean from each individual score. The result is each score's mean deviation.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
toy$mdev <- toy$toy - toy$mean
head(toy)#displays the first 6 rows of the data
```
The variable *mdev* (short for "mean deviation") lets us know how far each individual score is from the mean. Unfortunately, it does not provide an overall estimate of variation. Further, summing and averaging these values both result in zero. Take a look:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#Dissecting the script,
#Wrapping the sum and mean script in "round" and following with the desired decimal places provides a rounded result.
round(sum(toy$mdev, na.rm=TRUE),3)
round(mean(toy$mdev, na.rm=TRUE),3)
```
One solution is to create the *mean absolute deviation*. We first transform the mean deviation scores to their absolute values and then average them.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
toy$abslt_m <- abs(toy$mdev)
head(toy)
```
And now to average them:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
round(mean(toy$abslt_m, na.rm=TRUE),3)
```
This value tells how far individual observations are from the mean, "on average." In our toy dataset, the average distance from the mean is 1.12.
So that we can keep statistical notation in our mind, this is the formula for calculating the absolute mean deviation:
$$\frac{1}{n}\sum_{i=1}^{n}\left|X_{i} - \bar{X}\right|$$
Let's quickly repeat the process with the negative affect variable in our research vignette. So that we can more clearly see the relationship of the new variables to negative affect, let me create a df containing only nAff:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
library(tidyverse)
df_nAff <- df%>%dplyr::select(nAff)
```
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
df_nAff$mdevNA <- df_nAff$nAff - mean(df_nAff$nAff, na.rm=TRUE)
df_nAff$abNAmdev <- abs(df_nAff$mdevNA)
head(df_nAff)
```
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
round(mean(df_nAff$abNAmdev, na.rm=TRUE),3)
```
Thus, the absolute mean deviation for the negative affect variable in our research vignette is 0.521.
Although relatively intuitive, the absolute mean deviation is not all that useful. Most statistics texts include it because it is one of the steps toward variance, and ultimately, the standard deviation.
### Variance
Variance is considered to be an *average* dispersion calculated by summing the squared deviations and dividing by the number of observations (minus 1; more on that in later lessons).
Our next step is to square the mean deviations and sum them. The resulting value is called the *sum of squared errors*, *sum of squared deviations around the mean*, or *sums of squares*, and is abbreviated as *SS*. Below is a common statistical representation:
$$SS = \sum_{i=1}^{n}(X_{i} - \bar{X})^{^{2}}$$
Let's do it with our toy data.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
toy$mdev2 <- (toy$mdev)*(toy$mdev)
sum(toy$mdev2, na.rm=TRUE)#sum of squared deviations
head(toy)
```
Thus, our *SS* (sums of squares or sums of squared errors) is 9.2.
To obtain the variance we divide by *N* (or *N* - 1; described in later lessons). Here is the updated formula:
$$s^{2}=\frac{SS}{N-1}=\frac{\sum_{i=1}^{n}(X_{i} - \bar{X})^{^{2}}}{N-1}$$
Let's do this with the toy data:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
9.2/(5-1)#calculated with the previously obtained values
#to obtain the "correct" calculation by using each of these individual R commands, we need to have non-missing data
toy <- na.omit(toy)
sum(toy$mdev2, na.rm=TRUE)/((nrow(toy)-1))#variance
```
Of course R also has a function that will do all the steps for us:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
mean(toy$toy, na.rm=TRUE)
var(toy$toy, na.rm=TRUE)
```
The variance around the mean (2.6) of our toy data is 2.3.
Let's quickly repeat this process with the negative affect variable from the research vignette. In prior steps we had calculated the mean deviations by subtracting the mean from each individual score. Next we square the mean deviations....
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
df_nAff$NAmd2 <- (df_nAff$mdevNA)*(df_nAff$mdevNA)
head(df_nAff)
```
... and sum them.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
sum(df_nAff$NAmd2, na.rm=TRUE)#sum of squared deviations
```
Our sum of squared deviations around the mean is 283.44. When we divide it by *N* - 1, we obtain the variance. We can check our work with (a) the values we calculated at each step, (b) the steps written in separate R code, and (c) the *var()* function.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
283.44/(713-1)# calculating with the individual pre-calculated values
sum(df_nAff$NAmd2, na.rm=TRUE)/((nrow(df_nAff)-1))#calculated with steps from separate R code
var(df_nAff$nAff) #calculated using the base R function
```
Unfortunately, because the mean deviations were squared, the variance is expressed in squared units and does not interpret well. Hence, we move to the *standard deviation*.
### Standard Deviation
The standard deviation is simply the square root of the variance. Stated another way, it is an estimate of the average spread of data, presented in the same metric as the data.
Calculating the standard deviation requires earlier steps:
1. Calculating the mean.
2. Calculating mean deviations by subtracting the mean from each individual score.
3. Squaring the mean deviations.
4. Summing the squared mean deviations to create the *SS*, or sums of squares.
5. Dividing the *SS* by *N* - 1; this results in the *variance* around the mean.
The sixth step is to take the square root of the variance. It is represented in the formula below:
$$s=\sqrt{\frac{SS}{N-1}}=\sqrt{\frac{\sum_{i=1}^{n}(X_{i} - \bar{X})^{^{2}}}{N-1}}$$
Repeated below are the first five of those steps (yielding the variance) for the toy data:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#the first five steps condensed into a few lines
toy$mdev <- toy$toy - mean(toy$toy, na.rm=TRUE)
toy$mdev2 <- (toy$mdev)*(toy$mdev)
#I can save the variance calculation as an object for later use
toy_var <- sum(toy$mdev2)/(nrow(toy)-1)
#checking work with the variance function
var(toy$toy)
```
The sixth step is to take the square root of the variance.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#grabbing the mean for quick reference
mean(toy$toy)
#below the "toy_var" object was created in the prior step
sqrt(toy_var)
#checking work with the R function to calculate standard deviation
sd(toy$toy)
```
It is common to report means and standard deviations for continuous variables in our datasets. For the toy data our mean is 2.6 with a standard deviation of 1.52.
Let's repeat the process for the negative affect variable in the research vignette. First, the five steps to calculate the variance.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#the first five steps condensed into a few lines
df_nAff$mdevNA <- df_nAff$nAff - mean(df_nAff$nAff, na.rm=TRUE)
df_nAff$NAmd2 <- (df_nAff$mdevNA)*(df_nAff$mdevNA)
#I can save the variance calculation as an object for later use
nAff_var <- sum(df_nAff$NAmd2)/(nrow(df_nAff)-1)
#checking work with the variance function
var(df_nAff$nAff)
```
The sixth step is to take the square root of the variance.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#grabbing the mean for quick reference
mean(df_nAff$nAff)
#below the "toy_var" object was created in the prior step
sqrt(nAff_var)
#checking work with the R function to calculate standard deviation
sd(df_nAff$nAff)
```
In APA Style we use *M* and *SD* as abbreviations for the mean and standard deviation, respectively. In APA Style, non-Greek statistical symbols such as these are italicized. Thus we would write *M* = 1.81 (*SD* = 0.63) in a statistical string of results.
We can examine the standard deviation in relation to its mean to understand how narrowly or broadly the data are distributed. Relative to a same-sized mean, a small standard deviation means that the mean represents the data well. A larger standard deviation conveys that there is greater variability and that the mean, alone, is a less adequate representation of the scores.
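As an informal illustration of this idea, we can place the standard deviation alongside the mean for negative affect. The ratio below (sometimes called the coefficient of variation) is only a rough heuristic here, since negative affect is not a true ratio-scaled variable:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#standard deviation expressed relative to the mean; larger values suggest the mean summarizes the data less well
sd(df_nAff$nAff)/mean(df_nAff$nAff)
```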
In later lessons we will explore the standard deviation in more detail -- learning how we can use it in the determination of the significance and magnitude of relations and differences.
## Are the Variables Normally Distributed?
Statistics that we use are accompanied by assumptions about the nature of the variables in the dataset. A common assumption is that the data are *normally distributed*; that is, that the distribution of the data approximates a normal curve.
For a streamlined presentation, let me create a df with three continuously scaled variables of interest.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#I have opened the tidyverse library so that I can use the pipe
library(tidyverse)
df_3vars <- df%>%dplyr::select(nAff, mAggr, drProb)
```
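Before turning to numeric indices of non-normality, a visual check can be useful. Below is a sketch using *ggpubr::ggqqplot()*; points that stray from the reference line suggest departures from normality:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70), warning=FALSE}
#Q-Q plot comparing the observed distribution of negative affect to a normal distribution
ggpubr::ggqqplot(df_3vars$nAff, title = "Q-Q Plot of Negative Affect")
```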
### Skew and Kurtosis
Skew and kurtosis are indicators of non-normality. Skew refers to the degree to which the distribution is asymmetrical. In the figure below, the symmetrical distribution in the center (the black line) shows no apparent evidence of skew. In contrast, the red curve, where the majority of observations sit in the left-most part of the graph (with the tail pulling to the right), is positively skewed; the blue curve, where the majority of observations sit in the right-most part of the graph (with the tail pulling to the left), is negatively skewed.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70) , echo=FALSE}
curve(dbeta(x,10,10), xlim=c(0,1), lwd=3, yaxt="n", ylab="", xlab="", xaxt="n")
curve(dbeta(x,2, 5), add=T, col='red',lwd=3, yaxt="n", ylab="", xlab="", xaxt="n")
curve(dbeta(x,5,2), add=T, col='blue',lwd=3, yaxt="n", ylab="", xlab="", xaxt="n")
legend(par('usr')[2], par('usr')[4], xjust=1,
       c('normal', 'positive skew', 'negative skew'),
       lwd=c(3,3,3), lty=1,
       col=c(par('fg'),'red','blue'))#legend line widths and colors match the plotted curves
title("Positive, Normal, and Negative Skew")
```
Kurtosis refers to the degree to which the distribution of data is flat or peaked. Mesokurtic distributions are considered to be closest to normal. Leptokurtic distributions are peaked and platykurtic distributions are flat. As we will learn as we progress, visual observation of data is a legitimate component in evaluating skew and kurtosis.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70) , echo=FALSE}
curve(dbeta(x,10,10), xlim=c(0,1), lwd=3, yaxt="n", ylab="", xlab="", xaxt="n")
curve(dbeta(x,5, 5), add=T, col='red',lwd=3, yaxt="n", ylab="", xlab="", xaxt="n")
curve(dbeta(x,2,2), add=T, col='blue',lwd=3, yaxt="n", ylab="", xlab="", xaxt="n")
legend(par('usr')[2], par('usr')[4], xjust=1,
       c('leptokurtic', 'mesokurtic', 'platykurtic'),
       lwd=c(3,3,3), lty=1,
       col=c(par('fg'),'red','blue'))#legend line widths and colors match the plotted curves
title("Kurtosis: Platykurtic, Mesokurtic, Leptokurtic")
```
There have been numerous approaches to calculating and interpreting skew and kurtosis. Consequently, different statistics packages calculate skew and kurtosis differently. The *psych* package (a go-to for a variety of tasks) offers three different options for calculating skew and kurtosis. These are specified in the script as "type=#" (i.e., 1, 2, or 3 [the default]). Revelle [-@revelle_introduction_2021] refers readers to Joanes and Gill's [-@joanes_comparing_1998] article for detailed information about each. A very helpful resource for understanding and interpreting skew and kurtosis is chapter four (Data Preparation and Psychometrics Review) of Kline's [-@kline_data_2016] SEM text. Summarizing simulation studies in structural equation modeling (i.e., multivariate statistics that are generally characterized as large-sample studies using maximum likelihood estimation), Kline suggested that *type=1* skew values greater than the absolute value of 3.0 are "severely" skewed. Regarding *type=1* kurtosis, Kline noted the literature has suggested that values from 8.0 to 20.0 have been described as severely kurtotic. As an interpretive framework, Kline suggested that absolute values greater than 10.0 are problematic and values greater than 20.0 are serious. He added that this rule-of-thumb errs on the conservative side.
The *psych::describe* specification of "type=1" results in the *skew index* and *kurtosis index*. For simplicity's sake, I will refer to this specific variation of skew and kurtosis as "type=1." This is a very quick way to obtain initial values.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
psych::describe(df_3vars, type=1)
```
Using Kline's [-@kline_data_2016] guidelines for evaluation, a quick review of the type=1 output indicates that no skew value exceeded the absolute value of 3.0. That is, across the nAff, mAggr, and drProb variables the highest type=1 skew value was 0.93. Regarding kurtosis, no value had a magnitude greater than .59, and all fell below the absolute value of 10. A limitation of the type=1 output and Kline's interpretive guidelines is that the simulation studies that led to the guidelines were based on structural equation modeling. These statistics are multivariate in nature, typically use maximum likelihood estimators, and are based on large samples.
An alternative tool for identifying distributions that are severely skewed or kurtotic is the "skew.2SE" and "kurt.2SE" output from *pastecs::stat.desc*.
These values represent the type=1 skew (or kurtosis) value divided by two times its respective standard error (i.e., the standard error of the skew or kurtosis distribution, not the *se* value associated with the variable). The result is a standardized value that, on its own, indicates statistical significance. In the case of skew.2SE and kurt.2SE, absolute values exceeding 1 (*p* < .05), 1.29 (*p* < .01), and 1.65 (*p* < .001) represent statistically significant departures from symmetry (skew) and normal peakedness (kurtosis). Unfortunately, this tool is not without criticism.
$skew.2SE = \frac{S-0}{2\times SE_{skewness}}$ and $kurt.2SE = \frac{K-0}{2\times SE_{kurtosis}}$, where *S* and *K* are the skew and kurtosis values, respectively.
The skew.2SE and kurt.2SE values can be obtained with *pastecs::stat.desc* by adding the "norm = TRUE" statement.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
pastecs::stat.desc(df_3vars, norm=TRUE)
```
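To demystify the skew.2SE and kurt.2SE values a bit, here is a hand calculation for negative affect using the common large-sample approximations $SE_{skewness}\approx\sqrt{6/N}$ and $SE_{kurtosis}\approx\sqrt{24/N}$. Because *pastecs* uses exact small-sample formulas for the standard errors, these values should be close to, but may not exactly match, the output above:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
N <- nrow(df_3vars)
#type=1 skew and kurtosis from the psych package
skew1 <- psych::skew(df_3vars$nAff, type = 1)
kurt1 <- psych::kurtosi(df_3vars$nAff, type = 1)
#approximate skew.2SE and kurt.2SE (skew or kurtosis divided by twice its approximate standard error)
skew1/(2 * sqrt(6/N))
kurt1/(2 * sqrt(24/N))
```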
Statisticians have noted that these standardized values are quite sensitive to sample size [@field_discovering_2012; @kline_data_2016]. In large samples even minor deviations from normality may appear as statistically significant. In contrast, small samples with lower power may be severely non-normal, yet the skew and kurtosis could go undetected. When sample sizes are smaller, using the 1.96 (or "2") criterion is acceptable for determining significant skew or kurtosis; however, as the sample size increases, so does the probability of rejecting the hypothesis that skew (or kurtosis) is zero. Field [-@field_discovering_2012] noted that in such cases it is appropriate to relax the standard and evaluate skew or kurtosis against the 1.29 (*p* < .01) criterion. Further, when samples are larger than 200, it may be more appropriate to abandon these standardized values altogether and, instead, examine the shape of the distribution.
Comparing the skewness and kurtosis type=1 values to the skew.2SE and kurt.2SE values, we can see the interpretive challenges.
|Variable |skewness |skew.2SE |kurtosis |kurt.2SE |
|:-----------|:--------:|:--------:|:--------:|:--------:|
|nAff |0.575 |3.141*** |-0.176 |-0.481 |
|mAggr |0.926 |5.059*** |0.575 |1.572** |
|drProb |0.783 |4.278*** |-0.168 |-0.460 |
Values in the "skewness" column are concerning when they are exceed the absolute value of 3.0; none are. Values in the "skew.2SE" column are statistically significant at $p < .001$ when they exceed 1.65. Here, the two approaches to interpreting skew both suggest positive skew (i.e., heavy distribution in the left with a long tail to the right), but only the "skew.2SE" results suggest that the degree of skewness is significant/concerning.
Regarding kurtosis, values in the "kurtosis" column become concerning when they exceed the absolute value of 10; none do. Values in the "kurt.2SE" column become statistically significant when they exceed 1.0. The mAggr variable's value of 1.572 is statistically significant at $p < .01$.
So how do we think about skewness and kurtosis in our data? First, I simulated a dataset with more than 700 cases. This far exceeds the "large" sample size of 200. Therefore, interpreting the type=1 skewness and kurtosis values according to Kline's [-@kline_data_2016] criteria of less than the absolute values of 3 and 10, respectively, is probably most appropriate. Further, skewness and kurtosis are only two dimensions of assessing whether a distribution is normal. Thus, I will keep these results in mind as I examine additional metrics (especially when we look at histograms with a superimposed curve).
In this OER, I will predominantly use the type=1 output from the *psych::describe* function and use Kline's [-@kline_data_2016] interpretive criteria. I do think the "skew.2SE" and "kurt.2SE" metrics can be useful when sample sizes are smaller (perhaps $N = 100$ or less) and ordinary least squares statistics (such as those used in ANOVA and regression models) will be utilized. In any case, if I have significant concerns about normality, I always return to more extensive and authoritative sources to make my decisions about preparing my data for analysis.
### Shapiro-Wilk Test of Normality
In addition to skew and kurtosis, there are formal statistical tests that evaluate whether our data are statistically significantly different from a normal distribution. One of those is the Shapiro-Wilk test of normality. The output we obtained from *pastecs::stat.desc* included the Shapiro-Wilk test value and the associated *p* value. When $p < .05$, our data are statistically significantly different from a normal distribution.
In our simulated data, all variables were statistically significantly different from a normal distribution ($nAff: W = 0.948, p < .001$; $mAggr: W = 0.913, p < .001$; $drProb: W = 0.900, p < .001$).
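If you prefer to run the test directly rather than reading it from the *pastecs* output, base R's *shapiro.test()* returns the Shapiro-Wilk statistic and *p* value for a single variable:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#Shapiro-Wilk test of normality for negative affect
shapiro.test(df_3vars$nAff)
```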
Just because data is skewed, kurtotic, or non-normally distributed does not (necessarily) mean that we cannot use it. As we move through the lessons in this OER we will evaluate the quality of the data according to the statistical assumptions associated with the statistic we are using. Often there are tools that we can use (e.g., variations of the statistic that are robust to violations of assumptions, deleting univariate or multivariate outliers) in spite of our data characteristics.
## Relations between Variables
Preliminary investigation of data almost always includes a report of their bivariate relations. Correlation coefficients express the magnitude of relationships on a scale ranging from -1 to +1. A correlation coefficient of
* -1.0 implies a perfect inverse relationship, such that for every unit of increase in variable A, there is a corresponding decrease in variable B,
* 0.0 implies no correspondence between the two variables, and
* +1.0 implies a perfect positive relationship, such that as A increases by one unit, so does B.
Correlation coefficients are commonly represented in two formulas. In a manner that echoes the calculation of *variance*, the first part of the calculation estimates the covariation (i.e., *covariance*) of the two variables of interest.
$$
\mbox{Cov}(X,Y) = \frac{1}{N-1} \sum_{i=1}^N \left( X_i - \bar{X} \right) \left( Y_i - \bar{Y} \right)
$$
The problem is that the result is unstandardized and difficult to interpret. Therefore, the second part of the calculation of the correlation coefficient results in the standardization of the metric in the -1 to +1 scale.
$$
r_{XY} = \frac{\mbox{Cov}(X,Y)}{ \hat{\sigma}_X \ \hat{\sigma}_Y}
$$
Covariation and correlation matrices are central to many of our statistics; therefore, those of us who teach statistics believe it is important to take a look "under the hood." From our research vignette, let's calculate the relationship between negative affect and microaggressions.
Examining the first formula, some parts should look familiar:
* $(X_i - \bar{X})$: We can see that we need to subtract the mean from each value of the first (X) variable involved in the correlation; we saw this when we calculated *mean deviations*.
* $(Y_i - \bar{Y})$: We repeat the *mean deviation* process for the second (Y) variable.
Let's work step-by-step through the calculation of a correlation coefficient. So that we can more easily see what we are doing with the variables, I will create a super tiny dataframe with the two variables of interest (negative affect and microaggressions):
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#just in case it turned off, I'm reopening tidyverse so that I can use the pipe
library(tidyverse)
#using the dplyr package to select the two variables in this tiny df
df4corr <- df%>%dplyr::select(nAff, mAggr)
#displaying the first 6 rows of df4corr ("dataframe for correlations" -- I made this up)
head(df4corr)
```
First we calculate the mean deviations for negative affect and microaggressions.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#calculating the mean deviation for negative affect
df4corr$MDnAff <- df4corr$nAff - mean(df4corr$nAff)
#calculating the mean deviation for microaggressions
df4corr$MDmAggr <- df4corr$mAggr - mean(df4corr$mAggr)
#displaying the first 6 rows of df4corr
head(df4corr)
```
The next part of the formula $\sum_{i=1}^N \left( X_i - \bar{X} \right) \left( Y_i - \bar{Y} \right)$ suggests that we sum the cross-products of these mean deviations. Here we multiply the mean deviations to create the "cross-product."
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#Creating a crossproduct variable by multiplying the negative affect and microaggression mean deviations
df4corr$crossproductXY <- df4corr$MDnAff * df4corr$MDmAggr
#displaying the first 6 rows of df4corr
head(df4corr)
```
Next, we sum the column of cross-products.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
sum(df4corr$crossproductXY)
```
To obtain the covariance, the next part of the formula suggests that we multiply the sum of cross-products by $\frac{1}{N-1}$. I will do this in one step.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#I have created the object "cov" so I can use it in a calculation, later
#The "nrow" function will count the number of rows and use that value
cov <- 1/(nrow(df4corr) - 1)* sum(df4corr$crossproductXY)
#Because I created an object, R markdown won't automatically display it; I have to request it by listing it
cov
```
The covariance between negative affect and microaggressions is 0.373.
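We can verify this hand calculation with base R's covariance function, which should return (approximately) the same value. I use the *stats::* prefix below because we created an object named "cov" in the prior chunk:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#checking our step-by-step covariance against the base R function
stats::cov(df4corr$nAff, df4corr$mAggr)
```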
We now move to the second part of the formula to create the interpretable, standardized, correlation coefficient.
$$
r_{XY} = \frac{\mbox{Cov}(X,Y)}{ \hat{\sigma}_X \ \hat{\sigma}_Y}
$$
We will use our covariance value in the numerator. The denominator involves the multiplication of the standard deviations of X and Y. Because we have already learned how to calculate standard deviation in a step-by-step manner, I will use code to simplify that process:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
cov/(sd(df4corr$nAff)*sd(df4corr$mAggr))
```
Our results suggest that the relationship between negative affect and microaggressions is positive; as one increases, so does the other. Is it strong? This really depends on your field of scholarship. The traditional values of .10, .30, and .50 are interpreted as small, medium, and large [@cohen_applied_2003]. Hence, when *r* = 0.27, we can say that it is (more-or-less) medium.
Is it statistically significant? Because this is an introductory chapter, we will not calculate this in a stepwise manner, but use the *cor.test()* function in base R to check our prior math and retrieve the *p* value associated with the correlation coefficient.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
cor.test(df4corr$nAff, df4corr$mAggr)
```
In a statistical string we would report the result of this Pearson correlation coefficient as: *r* = 0.27 (*p* < .001).
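If you would rather assemble that string in code than retype the numbers, the pieces can be pulled from the object returned by *cor.test()*. Below is a minimal sketch of my own (the object name *ct* is simply a placeholder):
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#saving the cor.test() output so we can extract its pieces
ct <- cor.test(df4corr$nAff, df4corr$mAggr)
#the correlation coefficient, rounded to two decimals
round(ct$estimate, 2)
#the p value, formatted so very small values print as "<0.001"
format.pval(ct$p.value, eps = .001)
```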
## Shortcuts to Preliminary Analyses
Unless you teach statistics (or take another statistics class), you may never need to work through all those individual steps again. Rather, a number of R packages make retrieval of these values relatively simple and efficient.
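For instance, the *describe()* function in the *psych* package returns means, standard deviations, skew, and kurtosis for every column in a single call. Here is a minimal sketch using our three-variable df (the SPLOM and *apaTables* approaches below are the ones featured in this lesson):
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#one-line shortcut for descriptive statistics across all columns of the df
psych::describe(df_3vars)
```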
### SPLOM
The *pairs.panels()* function in the *psych* package produces a SPLOM (i.e., scatterplot matrix) which includes:
* histograms of each individual variable within the dataframe with a curve superimposed (located on the diagonal),
* scatterplots of each bivariate combination of variables (located below the diagonal), and
* correlation coefficients of each bivariate combination of variables (located above the diagonal).
To provide a simple demonstration of this, I will use our df with the three continuously scaled variables of interest:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#in the code below, psych points to the package
#pairs.panels points to the function
#we simply add the name of the df; if you want fewer variables than are in the df, you may wish to create a smaller df
#the optional pch = "." argument plots each case as a small point, which can make dense scatterplots easier to read
psych::pairs.panels(df_3vars, pch = ".")
```
What do we observe?
* There is a more-or-less moderate correlation between negative affect and microaggressions ($r = 0.27$)
* There is a small-to-moderate correlation between negative affect and drinking problems ($r = 0.18$)
* There is a small correlation between microaggressions and drinking problems ($r = 0.09$)
* All variables have a positive skew (with pile-up of scores on the lower end and tail pulling to the right); this is consistent with the values we calculated earlier
* The scatterplots can provide clues to relations that are not necessarily linear.
- Look at the relationship between negative affect and drinking problems. As negative affect reaches approximately 2.75, the relationship appears to change, such that drinking problems increase (a smoothed scatterplot sketch follows this list).
- Taking time to look at plots such as these can inform subsequent analyses.
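To illustrate that last point, here is a minimal sketch of my own (not part of the original demonstration) that uses *ggplot2* (loaded with *tidyverse*) to add a loess-smoothed line to the scatterplot of negative affect and drinking problems; this assumes, as elsewhere in the lesson, that the columns in *df_3vars* are named *nAff* and *drProb*:
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#scatterplot of negative affect (x) and drinking problems (y) with a loess smooth
#a bend in the smoothed line suggests the relation may not be strictly linear
ggplot(df_3vars, aes(x = nAff, y = drProb)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", se = TRUE)
```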
### apaTables
Writing up an APA style results section frequently involves tables. A helpful package for doing this is *apaTables*. An instructional article notes the contributions of tools like this to the *reproducibility* of science by reducing errors made when the author or analyst retypes or copies text from output to the manuscript. When the R script is shared through an open science framework, reproducibility is further enhanced [@stanley_reproducible_2018].
We pass the desired df to the *apaTables::apa.cor.table()* function. Its arguments allow us to specify what is included in the table and whether it should be displayed in the console or saved as a document to the project's folder.
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#the apa.cor.table function removes any categorical variables that might be in the df
Table1_Cor <- apaTables::apa.cor.table(df_3vars, filename = "Table1_Cor.doc", table.number = 1, show.conf.interval = FALSE, landscape = TRUE)
#printing the object displays the table in the console and in the knitted R Markdown file
print(Table1_Cor)
```
Because I added *filename = "Table1_Cor.doc"*, a Word version of the table will appear in the same folder as the .Rmd file and data. It is easily edited with the tools in your word processing package.
## An APA Style Writeup
The statistics used in this lesson are often presented in the preliminary results portion of an empirical manuscript. Some of the results are written in text and some are presented in tables. APA Style recommends that the narration of results not duplicate what is presented in the tables. Rather, the write-up only highlights and clarifies what is presented in the table(s).
At the outset, let me note that a primary purpose of the Lui [-@lui_racial_2020] article was to compare the relations of variables among three racial/ethnic groups in the U.S. identified as Asian American, Black, and Latinx. Because we did not run separate analyses for each of the groups, my write-up does not make these distinctions. I highly recommend that you examine the write-up of results and the accompanying tables in Lui's article. The presentation is clear and efficient (i.e., it conveys maximal information in as little space as possible).
Below is an example of how I might write up these preliminary results:
**Preliminary Results**
>Our sample included 713 participants who self-identified as Asian American, Black/African American, and Latinx American. Inspection of the characteristics of the three variables of interest (negative affect, microaggressions, drinking problems) indicated that all variables were positively skewed; however, the values of skew and kurtosis did not exceed commonly used thresholds of concern [@kline_data_2016]. In contrast, Shapiro-Wilk tests of normality suggested that the distributions of all three variables were statistically significantly different from a normal distribution ($nAff: W = 0.948, p < .001$; $mAggr: W = 0.913, p < .001$; $drProb: W = 0.900, p < .001$). Means, standard deviations, and a correlation matrix are presented in Table 1. We noted that the correlation between negative affect and microaggressions was moderate $(r = 0.27)$; correlations between the remaining variables were smaller.
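If you want to double-check the Shapiro-Wilk values reported in the narrative, a minimal sketch using base R's *shapiro.test()* follows (assuming, as elsewhere in the lesson, that *df_3vars* contains the columns *nAff*, *mAggr*, and *drProb*):
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#Shapiro-Wilk tests of normality for each of the three variables of interest
shapiro.test(df_3vars$nAff)
shapiro.test(df_3vars$mAggr)
shapiro.test(df_3vars$drProb)
```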
## Practice Problems
The three exercises described below are designed to "meet you where you are" and allow you to challenge your skills depending on your goals as well as your comfort with statistics and R.
Regardless of which you choose, work one or more of the problems with R packages:
* Create a smaller df from a larger df selecting a minimum of three continuously scaled variables
* Calculate and interpret descriptive statistics
* Create the SPLOM (pairs.panels)
* Use the *apaTables* package to make an APA style table with means, standard deviations, and correlations
* Write an APA Style results section for these preliminary analyses
Additionally, please complete at least one set of *hand calculations*; that is, use the code demonstrated in the chapter to work through the formulas that compute the descriptive statistics that are the focus of this lesson. At this stage in your learning, you may ignore any missingness in your dataset by excluding all rows with missing data on your variables of interest.
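If your data do include missingness, here is a minimal sketch for dropping incomplete rows on just your variables of interest (the dataframe and variable names are placeholders for your own; the chunk is set to *eval = FALSE* so it will not run as written):
```{r eval=FALSE, tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#placeholder names: swap in your own dataframe and variable names
#tidyverse option: drop rows missing either variable of interest
my_df <- tidyr::drop_na(my_df, var1, var2)
#base R option: keep only rows complete on the two variables of interest
my_df <- my_df[complete.cases(my_df[, c("var1", "var2")]), ]
```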
### Problem #1: Change the Random Seed
If this topic feels a bit overwhelming, simply change the random seed in the data simulation (at the very top), then rework the lesson exactly as written. This should produce minor changes to the data (perhaps in the second or third decimal place), but the results will likely be very similar.
### Problem #2: Swap Variables in the Simulation
Use the simulated data from the Lui [-@lui_racial_2020] study. However, select three continuous variables (2 must be different from mine) and then conduct the analyses. Be sure to select from the variables that are considered to be *continuous* (and not *categorical*).
### Problem #3: Use (or Simulate) Your Own Data
Use data for which you have permission and access. This could be IRB approved data you have collected or from your lab; data you simulate from a published article; data from an open science repository; or data from other chapters in this OER.
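If you choose to simulate data from a published article, one common approach is to draw correlated variables from the reported descriptive statistics with *MASS::mvrnorm()*. The sketch below is my own illustration (the means, standard deviations, and correlation are invented placeholders; this is not the simulation method used at the top of this lesson):
```{r tidy=TRUE, tidy.opts=list(width.cutoff=70)}
#simulating two correlated variables from made-up descriptive statistics
set.seed(1234)
#covariance matrix built from SDs of 1.0 and a correlation of .30 (placeholders)
Sigma <- matrix(c(1, 0.3, 0.3, 1), nrow = 2)
sim <- MASS::mvrnorm(n = 300, mu = c(2.5, 1.8), Sigma = Sigma)
sim_df <- data.frame(varX = sim[, 1], varY = sim[, 2])
head(sim_df)
```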
### Grading Rubrics
Regardless of which option(s) you choose, use the elements in the grading rubrics to guide you through the practice. Worked examples are provided in the [Appendix](#woRked).
|Working the problem with R and R packages | Points Possible | Points Earned |
|:-----------------------------------------|:---------------:|:--------------|
|1. Create a df with 3 continuously scaled variables of interest| 3 |_______|
|2. Produce descriptive statistics | 3 |_______|
|3. Produce SPLOM/pairs.panels | 3 |_______|
|4. Produce an apaTables matrix | 3 |_______|
|5. Produce an APA Style write-up of the preliminary analyses| 5 |_______|
|6. Explanation/discussion with a grader | 5 |_______|
|**Totals** | 22 |_______|
|Hand Calculations | Points Possible | Points Earned |
|:-----------------------------------------|:---------------:|:--------------|
|1. Create a variable that represents the mean.| 2 | _______|
|2. Create a variable that represents the mean deviation. | 2 |_______ |
|3. What is the value of the *sum* of mean deviations? |2 |_______ |
|4. Create a variable that represents the absolute mean deviation. What is the *sum* of the absolute mean deviation? What is the value of the *mean* of the absolute mean deviation? What does this value tell you?| 4 |_______|
|5. Create a variable that represents the mean deviation squared. | 2 | _______|
|6. What are the values of the sum of squared deviations around the mean ($SS$), variance ($s^2$), and standard deviation ($s$)? |3 |_______ |
|7. Using the same general approach, calculate the mean deviation and standard deviation for a second, continuously scaled variable. | 5 |_______ |
|8. Create a variable that represents the *cross-product* (of the mean deviations). What is the sum of these cross-products? |2 |_______ |
|9. Calculate the value of their covariance. |2 |_______ |
|10. Calculate the value of the correlation coefficient. |2 |_______ |
|**Totals** | 26 |_______ |
```{r, child= 'Worked_Examples/15-2-woRked_Preliminaries.Rmd'}
```
```{r include=FALSE}
sessionInfo()
```