- Can we accept or reject the null hypothesis?
- usually choose a threshold
- 0.05 is popular
- 0.01 is safer
- People get uncomfortable when a p-value lands anywhere near the chosen threshold.
- P-values are not really comparable across tests, except perhaps in order of magnitude.
- Large datasets generally produce very small, clear-cut p-values.
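- A quick sketch of that last point (our own illustration; exact p-values vary per run):
# the same 0.1 shift in means, tested at two sample sizes
t.test(rnorm(20, mean=0.1),    rnorm(20))$p.value     # usually well above 0.05
t.test(rnorm(20000, mean=0.1), rnorm(20000))$p.value  # usually tiny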
- Student’s T Test
- Originally used for ensuring the quality of beer
- Not just any beer: Guinness
- process comparison
- Compare distributions
- Student was the pen name of William Sealy Gosset.
- Why do we care?
- Compares whether two approximately normal distributions have equal means.
- What if we got lucky and found the means to be the same?
- What is the chance of that?
- Can we trust it?
- When not to use it:
- when the data is not normal.
- in R:
t.test(rnorm(10),rnorm(1000))
t.test(runif(100),rnorm(100))
t.test(rnorm(100,1),rnorm(100))
> t.test(rnorm(100),rnorm(100))

        Welch Two Sample t-test

data:  rnorm(100) and rnorm(100)
t = 0.3034, df = 193.367, p-value = 0.7619
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2330280  0.3177391
sample estimates:
 mean of x  mean of y
0.07018111 0.02782554

> t.test(runif(100),rnorm(100))

        Welch Two Sample t-test

data:  runif(100) and rnorm(100)
t = 2.7514, df = 115.295, p-value = 0.006893
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.08531389 0.52385119
sample estimates:
mean of x mean of y
0.4778445 0.1732620

> t.test(rnorm(100,1),rnorm(100))

        Welch Two Sample t-test

data:  rnorm(100, 1) and rnorm(100)
t = 6.1553, df = 197.077, p-value = 4.116e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.5696437 1.1067288
sample estimates:
mean of x mean of y
0.9432221 0.1050358

> t.test(rnorm(100,1),rnorm(100))

        Welch Two Sample t-test

data:  rnorm(100, 1) and rnorm(100)
t = 5.6662, df = 195.334, p-value = 5.176e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.5409887 1.1186414
sample estimates:
 mean of x  mean of y
0.85294237 0.02312732
- Wilcoxon Test
- non-parametric comparison of distributions
- It measures rank-based agreement
- if we compare the two samples by rank, how similar are they?
- Pairwise comparison
- Good for non-normal distributions
- A little stricter than the t-test
wilcox.test(rnorm(101),rnorm(99))
wilcox.test(runif(104),rnorm(102))
wilcox.test(rnorm(102,1),rnorm(103))
wilcox.test(rnorm(102,1)+10000,rnorm(103))
> wilcox.test(rnorm(100),rnorm(100))

        Wilcoxon rank sum test with continuity correction

data:  rnorm(100) and rnorm(100)
W = 5490, p-value = 0.2317
alternative hypothesis: true location shift is not equal to 0

> wilcox.test(runif(100),rnorm(100))

        Wilcoxon rank sum test with continuity correction

data:  runif(100) and rnorm(100)
W = 6348, p-value = 0.0009931
alternative hypothesis: true location shift is not equal to 0

> wilcox.test(rnorm(100,1),rnorm(100))

        Wilcoxon rank sum test with continuity correction

data:  rnorm(100, 1) and rnorm(100)
W = 7418, p-value = 3.486e-09
alternative hypothesis: true location shift is not equal to 0
- Wilcoxon Signed-Rank Test (paired)
- Compares 2 dependent datasets
- before and after
- before treatment and after treatment
- Uses ranks of differences to compare.
- Non-parametric
- Sensitive to shifts in location
- largely insensitive to changes in scale
wilcox.test(rnorm(100),rnorm(100),paired=TRUE)
Location matters:
x = rnorm(100)
wilcox.test(x,x+0.1,paired=TRUE)
wilcox.test(x,x+0.5,paired=TRUE)
wilcox.test(x,x+1.0,paired=TRUE)
y = x - 0.1*rnorm(100)
wilcox.test(x,y,paired=TRUE)
z = x + 0.1*runif(100)
wilcox.test(x,z,paired=TRUE)
w = x*runif(100)
wilcox.test(x,w,paired=TRUE)
But scale might not matter
x = rnorm(100)
wilcox.test(x,x*0.1,paired=TRUE)
wilcox.test(x,x*0.5,paired=TRUE)
wilcox.test(x,x*1.0,paired=TRUE)
wilcox.test(x,x*4.0,paired=TRUE)
wilcox.test(x,x*0.1+0.1,paired=TRUE)
wilcox.test(x,x*0.5+0.1,paired=TRUE)
wilcox.test(x,x*1.0+0.1,paired=TRUE)
wilcox.test(x,x*4.0+0.1,paired=TRUE)
- Kolmogorov-Smirnov Test
- Non-parametric
- good with SE (software engineering) data and data with skew
- compares the maximum distance between CDFs
- Usually used on continuous data but works on ECDFs.
- Very strict
- P-values > 0.05 mean we cannot conclude the distributions differ; it does not prove they are the same
ks.test(rnorm(100),rnorm(100))
ks.test(runif(100),rnorm(100))
ks.test(rnorm(100,1),rnorm(100))
> ks.test(rnorm(100),rnorm(100))

        Two-sample Kolmogorov-Smirnov test

data:  rnorm(100) and rnorm(100)
D = 0.17, p-value = 0.1111
alternative hypothesis: two-sided

> ks.test(runif(100),rnorm(100))

        Two-sample Kolmogorov-Smirnov test

data:  runif(100) and rnorm(100)
D = 0.52, p-value = 3.612e-12
alternative hypothesis: two-sided

> ks.test(rnorm(100,1),rnorm(100))

        Two-sample Kolmogorov-Smirnov test

data:  rnorm(100, 1) and rnorm(100)
D = 0.46, p-value = 1.292e-09
alternative hypothesis: two-sided
- Kruskal-Wallis Test
- used to determine if a factor matters
- kind of strange to use it in a 2-group comparison, but it is as safe as the Wilcoxon / Mann-Whitney
- useful when you have multiple groups and you want to say the group can matter
- Doesn’t tell you which group
- Workflow:
- Does factor X matter?
- Kruskal-Wallis test
- if significant, then run pairwise Wilcoxon tests to find which groups differ
kruskal.test(rnorm(100),g=c(rep(0,50),rep(1,50)))
kruskal.test(runif(100),g=c(rep(0,50),rep(1,50)))
kruskal.test(runif(100),g=c(rep(0,33),rep(1,33),rep(2,34)))
x = c(1+runif(33),rnorm(33),rnorm(34))
g = c(rep(0,33),rep(1,33),rep(2,34))
kruskal.test(x,g)
wilcox.test(x[1:33], x[34:66])   # group 0 vs group 1
wilcox.test(x[1:33], x[67:100])  # group 0 vs group 2
pairwise.wilcox.test(x,g)
- Chi-Squared Test
- Good for non-parametric distributions
- Good for counts
- You need to bin your data first
- watch it: its input is a distribution of counts, not raw values
- Not reliable on continuous values, because you must bin them first
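- A rough sketch of binning continuous values before the test (our own example; the bin edges are arbitrary):
x1 <- rnorm(200)
x2 <- rnorm(200, mean=0.5)
breaks <- c(-Inf, -1, 0, 1, Inf)               # arbitrary bin edges
tab <- rbind(table(cut(x1, breaks)),
             table(cut(x2, breaks)))           # 2 x 4 table of counts
chisq.test(tab)                                # test of independence on binned counts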
> chisq.test(c(10,10,10,30),p=c(20,20,20,30),rescale.p=TRUE)

        Chi-squared test for given probabilities

data:  c(10, 10, 10, 30)
X-squared = 7.5, df = 3, p-value = 0.05756

> chisq.test(c(10,10,10,30),p=c(4,5,6,7),rescale.p=TRUE)

        Chi-squared test for given probabilities

data:  c(10, 10, 10, 30)
X-squared = 9.754, df = 3, p-value = 0.02078

> chisq.test(c(10,10,10,30),p=c(11,11,11,31),rescale.p=TRUE)

        Chi-squared test for given probabilities

data:  c(10, 10, 10, 30)
X-squared = 0.058651, df = 3, p-value = 0.9963

> chisq.test(c(10,10,10,30),p=c(0,11,11,0),rescale.p=TRUE) # zeros are bad

        Chi-squared test for given probabilities

data:  c(10, 10, 10, 30)
X-squared = Inf, df = 3, p-value < 2.2e-16

Warning message:
In chisq.test(c(10, 10, 10, 30), p = c(0, 11, 11, 0), rescale.p = TRUE) :
  Chi-squared approximation may be incorrect

> south <- c(10,20,30,40)
> north <- c(5,30,30,40)
> nstab <- as.table(rbind(south,north))
> chisq.test(nstab)

        Pearson's Chi-squared test

data:  nstab
X-squared = 3.5468, df = 3, p-value = 0.3147

> south <- c(10,20,30,40)
> north <- c(90,30,30,40)
> nstab <- as.table(rbind(south,north))
> chisq.test(nstab)

        Pearson's Chi-squared test

data:  nstab
X-squared = 42.126, df = 3, p-value = 3.772e-09

> stbdtypes <- c("Source","Test","Build","Doc")
> maint <- c("Adaptive","Perfective","Corrective")
> stbds <- stbdtypes[runif(100)*4 + 1]
> maints <- maint[runif(100)*3 + 1]
> head(stbds)
[1] "Build"  "Build"  "Test"   "Source" "Test"   "Doc"
> head(maints)
[1] "Adaptive"   "Corrective" "Perfective" "Adaptive"   "Adaptive"
[6] "Corrective"
> st <- table(stbds,maints)
> st
        maints
stbds    Adaptive Corrective Perfective
  Build        11          8          3
  Doc           5         11         13
  Source        7         11          9
  Test          9          5          8
> chisq.test(st)

        Pearson's Chi-squared test

data:  st
X-squared = 10.148, df = 6, p-value = 0.1186

> st2 <- t(cbind(st[,"Adaptive"],st[,"Corrective"]))
> st2
     Build Doc Source Test
[1,]    11   5      7    9
[2,]     8  11     11    5
> chisq.test(st2)

        Pearson's Chi-squared test

data:  st2
X-squared = 4.6304, df = 3, p-value = 0.201

> # example where we make a table with junk results
> st3 <- t(cbind(st[,"Adaptive"],max(5,round(st[,"Corrective"]+10*rnorm(4)))))
> st3
     Build Doc Source Test
[1,]    11   5      7    9
[2,]    11  11     11   11
> chisq.test(st3)

        Pearson's Chi-squared test

data:  st3
X-squared = 1.4811, df = 3, p-value = 0.6866
- Effect Size
- A change might not be statistically significant, yet its effect may still be sizable.
- A change might be statistically significant, yet its effect may be negligibly small.
- Many tests look for a statistically significant difference but say nothing about its size.
- lots of samples, little difference in size: significant
- few samples, big difference in size: insignificant
- Significant but negligible:
x = rnorm(10000)
y = rnorm(10000,0.1)
wilcox.test(x,y)
library(effsize)
cohen.d(x, y)
- insignificant but different
x = rnorm(10)
y = rnorm(10,0.4)
t.test(x,y)
library(effsize)
cohen.d(x, y)
- Cohen's d
- parametric
- distributions must be normal
- From https://en.wikipedia.org/wiki/Effect_size
- Very small 0.01
- Small 0.20
- Medium 0.50
- Large 0.80
- Very large 1.20
- Huge 2.0
- in R you can use the effsize library
library(effsize)
cohen.d( rnorm(100), rnorm(100))
library(effsize)
cohen.d( rnorm(100,0.1), rnorm(100))
cohen.d( rnorm(100,0.3), rnorm(100))
cohen.d( rnorm(100,0.5), rnorm(100))
cohen.d( rnorm(100,0.7), rnorm(100))
cohen.d( rnorm(100,1.0), rnorm(100))
cohen.d( rnorm(100,2.0), rnorm(100))
The results we get from non-normal distributions are pretty suspect:
library(effsize)
cohen.d( runif(100,0.001), runif(100))
cohen.d( runif(100,0.01), runif(100))
cohen.d( runif(100,0.1), runif(100))
cohen.d( runif(100,0.3), runif(100))
cohen.d( runif(100,0.5), runif(100))
cohen.d( runif(100,0.7), runif(100))
cohen.d( runif(100,1.0), runif(100))
cohen.d( runif(100,2.0), runif(100))
- Cliff's Delta
- Non-parametric
- pairs well with the Mann-Whitney U test (unpaired Wilcoxon)
- 0.147 (small), 0.33 (medium), and 0.474 (large)
library(effsize)
cliff.delta( runif(1000,0.001), runif(100))
cliff.delta( runif(1000,0.01), runif(100))
cliff.delta( runif(1000,0.1), runif(100))
cliff.delta( runif(1000,0.5), runif(100))
- Confidence Intervals
- Confidence intervals tell us where we expect values of a statistic to lie.
- For instance the 95% confidence interval of the mean of our sample is [0.5,1.5].
- That would mean that for 95% of samples drawn from the population, we expect the sample mean to be between 0.5 and 1.5.
- The 99% confidence interval might be wider: [0.3,1.7]
- That means 99% of the mean estimates will be in that range.
- Higher confidence
- Wider range of the statistic.
- It gives us some idea of a range of values from the statistic.
- Given a collection of statistics, if we order them and clip off the bottom alpha/2 and the top alpha/2 of values, the remainder is the confidence interval.
- 95% has an alpha of 5% so clip the top 2.5% and bottom 2.5% off and look at min and max, that’s our confidence interval.
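- A sketch of that sort-and-clip in R (our own example), using quantile() on a pile of resampled means:
x <- rnorm(100)                                        # one observed sample
stats <- replicate(1000, mean(sample(x, 100, replace=TRUE)))
quantile(stats, c(0.025, 0.975))                       # clip 2.5% off each tail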
- Often we can use confidence intervals of the difference of means to determine if something is statistically significantly different or similar.
- Instead of just generating a p-value, we can examine the range itself.
- if the 95% confidence interval of mean(x) - mean(y) does not
cross 0 it suggests that the distributions are significantly different.
- e.g. 95% CI of [-0.5,-0.1] implies that 95% of the time
difference of means between x and y is -0.5 to -0.1.
- statistically significant difference!
- e.g. 95% CI of [0.1,0.5] implies that 95% of the time
difference of means between x and y is 0.1 to 0.5
- statistically significant difference!
- e.g. 95% CI of [-0.5,0.5] implies that 95% of the time
difference of means between x and y is -0.5 to 0.5
- not a statistically significant difference!
- the interval overlaps 0
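- A quick sketch of reading such an interval off a test object (our own example):
x <- rnorm(100); y <- rnorm(100, mean=0.5)
ci <- t.test(x, y)$conf.int        # 95% CI of mean(x) - mean(y)
ci
ci[1] > 0 || ci[2] < 0             # TRUE when the interval does not cross 0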
- Ways to compute a confidence interval:
- Informal – we just estimate
- Direct calculation – for parametric statistics there are parametric methods of calculating a CI (see the sketch after this list)
- Bootstrapping! Use a computer and sampling to abuse stats and
produce a distribution of statistics!
- we deal with non-parametric data so we like this one
- https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
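- For the direct, parametric route, a sketch of the classic t-based CI of a mean (our own example):
x <- rnorm(100)
n <- length(x)
mean(x) + qt(c(0.025, 0.975), df=n-1) * sd(x) / sqrt(n)  # 95% CI of the mean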
- Bootstrapping
- Bootstrapping is sampling a lot.
- massive amounts of random sampling, with replacement, from your sample
- What if the best information we have is the current sample?
- Bootstrapping lets us talk about statistics about statistics
- We can build confidence intervals with bootstrapping.
- we sampled 100 elements
- we calculate 1 mean
- is this good enough?
- what if 1 big value is messing everything up?
- why don’t we sample 100 elements 100 times from the 100 elements.
- some of the outliers won’t appear in all of the samples
- we now can calculate the mean of each of the samples
- we can now see the distribution of means
- its location
- its shape
- fundamentally we are more confident about the expected mean
- We have a distribution now.
- so what?
- take the mean again? Sure whatever.
- Why not the confidence interval?
- R quantile will sort and clip for us
- just return the min and max of the middle X % of the distribution
- How much sampling is enough?
- It is suggested that you have a sample size of 599 or more (Davidson & MacKinnon 2000).
- Sampling to the size of the original sample, if it is larger than 599, is fine too; larger samples typically give more power / better estimates.
- It is recommended to resample more than 1000 times; 10k times seems to saturate (Davidson & MacKinnon 2000).
- Davidson R, MacKinnon JG. Bootstrap tests: How many bootstraps? Econometric Reviews. 2000;19(1):55-68. https://www.econstor.eu/bitstream/10419/67820/1/587473266.pdf
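- As an alternative to the hand-rolled sapply loops below, the standard boot package does the same job (a sketch; the percentile CI matches the quantile clipping used here):
library(boot)                                   # ships with R
x <- rnorm(100)
b <- boot(x, statistic=function(d, i) mean(d[i]), R=10000)
boot.ci(b, type="perc")                         # percentile 95% CI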
N = 100      # sample size
B = 100      # number of bootstrap resamples
alpha = 0.05
data = rnorm(N)
mean(data)                                   # the single mean we would normally report
mean(sample(data, N, replace=TRUE))          # the mean of one bootstrap resample
booted <- sapply(1:B, function(i) { mean(sample(data, N, replace=TRUE)) })
mean(booted)
summary(booted)
quantile(booted, c(alpha/2, 1.0 - alpha/2))  # the bootstrap confidence interval
> N=100
> alpha = 0.05
> data = rnorm(N)
> mean(data)
[1] 0.1086147
> mean(sample(data,N,replace=TRUE))
[1] -0.0537734
> booted <- sapply(c(1:N), function(i) { mean(sample(data,N,replace=TRUE)) })
> mean(booted)
[1] 0.09648877
> summary(booted)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.08934  0.02356  0.08938  0.09649  0.16576  0.35589
> quantile(booted,c(alpha/2,1.0 - alpha/2))
      2.5%      97.5%
-0.0690190  0.2792187
- we sampled 100 elements from each distribution (2)
- we calculate mean(data1) - mean(data2)
- is this good enough?
- what if 1 big value is messing everything up?
- why don't we sample 100 elements 100 times from each distribution of 100 elements?
- we now can calculate the mean difference between each of these samples
- we can now see the distribution of difference means
- its location
- its shape
- fundamentally we are more confident about the expected mean
- We have a distribution now.
- so what?
- take the mean again? Sure whatever.
- Why not the confidence interval?
- R quantile will sort and clip for us
- just return the min and max of the middle X % of the distribution
N = 100      # sample size
B = 100      # number of bootstrap resamples
alpha = 0.05
data1 = rnorm(N)
data2 = rnorm(N, mean=0.5)
mean(data1)
mean(data2)
mean(data1) - mean(data2)                    # the observed difference of means
booted <- sapply(1:B, function(i) {
  mean( sample(data1, N, replace=TRUE) ) -
  mean( sample(data2, N, replace=TRUE) ) })
mean(booted)
summary(booted)
quantile(booted, c(alpha/2, 1.0 - alpha/2))  # CI of the difference of means
> N=100
> alpha = 0.05
> data1 = rnorm(N)
> data2 = rnorm(N,mean=0.5)
> mean(data1)
[1] 0.01280888
> mean(data2)
[1] 0.5015766
> mean(data1) - mean(data2)
[1] -0.4887677
> booted <- sapply(c(1:N), function(i) {
+   mean( sample(data1,N,replace=TRUE) ) -
+   mean( sample(data2,N,replace=TRUE) ) })
> mean(booted)
[1] -0.475228
> summary(booted)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.93557 -0.58166 -0.48149 -0.47523 -0.35196 -0.04391
> quantile(booted,c(alpha/2,1.0 - alpha/2))
      2.5%      97.5%
-0.7658303 -0.1847355
How does the interval tighten as the sample size grows? Try several sizes:
counts <- c(100,100,100,500,500,500,1000,1000,10000)
boots <- sapply(counts, function(N) {
alpha = 0.05
data1 = rnorm(N)
data2 = rnorm(N,mean=0.5)
mean(data1)
mean(data2)
mean(data1) - mean(data2)
booted <- sapply(c(1:N), function(i) { mean( sample(data1,N,replace=TRUE) ) - mean( sample(data2,N,replace=TRUE) ) })
print(mean(booted))
print(summary(booted))
print(N)
print(quantile(booted,c(alpha/2,1.0 - alpha/2)))
booted
})
plot(density(boots[[length(boots)]]),xlim=c(-1,0.25),ylim=c(0,30))
for (i in c(1:length(boots))) {
lines(density(boots[[i]]),col=i)
}
legend(0,30,counts,pch=1,col=c(1:length(boots)))
> counts <- c(100,100,100,500,500,500,1000,1000,10000)
> boots <- sapply(counts, function(N) {
+   alpha = 0.05
+   data1 = rnorm(N)
+   data2 = rnorm(N,mean=0.5)
+   mean(data1)
+   mean(data2)
+   mean(data1) - mean(data2)
+   booted <- sapply(c(1:N), function(i) { mean( sample(data1,N,replace=TRUE) ) - mean( sample(data2,N,replace=TRUE) ) })
+   print(mean(booted))
+   print(summary(booted))
+   print(N)
+   print(quantile(booted,c(alpha/2,1.0 - alpha/2)))
+   booted
+ })
[1] -0.4917793
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-0.9578 -0.5718 -0.4916 -0.4918 -0.3826 -0.2077
[1] 100
      2.5%      97.5%
-0.8034647 -0.2512316
[1] -0.4610482
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-0.7874 -0.5340 -0.4560 -0.4610 -0.3779 -0.1083
[1] 100
      2.5%      97.5%
-0.7157980 -0.2230874
[1] -0.7882121
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-1.2104 -0.8786 -0.7984 -0.7882 -0.6770 -0.3673
[1] 100
      2.5%      97.5%
-1.0942129 -0.4880081
[1] -0.5126469
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-0.6824 -0.5531 -0.5115 -0.5126 -0.4705 -0.3081
[1] 500
      2.5%      97.5%
-0.6373032 -0.3899576
[1] -0.4067484
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-0.5898 -0.4521 -0.4028 -0.4067 -0.3628 -0.1931
[1] 500
      2.5%      97.5%
-0.5408313 -0.2826443
[1] -0.4975045
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-0.6779 -0.5429 -0.4981 -0.4975 -0.4518 -0.3217
[1] 500
      2.5%      97.5%
-0.6297027 -0.3722989
[1] -0.5537908
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-0.7010 -0.5827 -0.5536 -0.5538 -0.5249 -0.4185
[1] 1000
      2.5%      97.5%
-0.6400824 -0.4657624
[1] -0.4787345
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-0.6403 -0.5102 -0.4800 -0.4787 -0.4471 -0.3218
[1] 1000
      2.5%      97.5%
-0.5663714 -0.3973088
[1] -0.4844369
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-0.5446 -0.4942 -0.4845 -0.4844 -0.4746 -0.4322
[1] 10000
      2.5%      97.5%
-0.5119642 -0.4562765
> plot(density(boots[[length(boots)]]),xlim=c(-1,0.25),ylim=c(0,30))
> for (i in c(1:length(boots))) {
+   lines(density(boots[[i]]),col=i)
+ }
> legend(0,30,counts,pch=1,col=c(1:length(boots)))
We don't have to use averages. We can use medians, skewness, or whatever other statistic we like:
# difference of skews
# install.packages("moments")
library(moments)
N=100
alpha = 0.05
data1 = rnorm(N)
data2 = rnorm(N,mean=0.5)
skewness(data1)
skewness(data2)
skewness(data1) - skewness(data2)
booted <- sapply(c(1:N), function(i) { skewness( sample(data1,N,replace=TRUE) ) - skewness( sample(data2,N,replace=TRUE) ) })
mean(booted)
summary(booted)
quantile(booted,c(alpha/2,1.0 - alpha/2))
> # difference of skews
> # install.packages("moments")
> library(moments)
> N=100
> alpha = 0.05
> data1 = rnorm(N)
> data2 = rnorm(N,mean=0.5)
> skewness(data1)
[1] 0.02549939
> skewness(data2)
[1] -0.3078095
> skewness(data1) - skewness(data2)
[1] 0.3333089
> booted <- sapply(c(1:N), function(i) { skewness( sample(data1,N,replace=TRUE) ) - skewness( sample(data2,N,replace=TRUE) ) })
> mean(booted)
[1] 0.3303109
> summary(booted)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-0.3472  0.1504  0.3054  0.3303  0.5078  0.9485
> quantile(booted,c(alpha/2,1.0 - alpha/2))
      2.5%      97.5%
-0.1355174  0.8306083
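The same recipe works for a difference of medians (our own sketch, following the pattern above):
N = 100
alpha = 0.05
data1 = rnorm(N)
data2 = rnorm(N, mean=0.5)
booted <- sapply(1:1000, function(i) {
  median( sample(data1, N, replace=TRUE) ) -
  median( sample(data2, N, replace=TRUE) ) })
quantile(booted, c(alpha/2, 1.0 - alpha/2))   # CI of the difference of medians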