diff --git a/2024/lecture_03.html b/2024/lecture_03.html deleted file mode 100644 index da4ddee..0000000 --- a/2024/lecture_03.html +++ /dev/null @@ -1,3131 +0,0 @@ - - - - - - - - - - - - - quarto-inputaa385e80 - - - - - - - - - - - - - - - -
-
- -
-

Uncertainty,
-standard errors and
-confidence intervals

- -
-
-
-Martina Sladekova -
-
-
- -
-
-

Today…

- -
- -
-
- -
-
    -
  1. Recap of sampling from populations

  2. Uncertainty in research and estimation

  3. Sampling distribution revisited

  4. Standard error of the mean

  5. Confidence intervals: what they are, and what they are not
-
-
- -

LAST WEEK: Jennifer undertook a determined journey to find out just how clever Jessie (left) is compared to all the other dogs in the population.

-

THIS WEEK: What if we don’t know anything about the population?

-
-
-
-
-

Incredibly clever looking fox-terrier

-
-
-
-
-
-

A collage of a small Jack Russell terrier holding various objects, including a cork, a hammer, a stick, a corn cob, and a hanger.

-
-
-
-
-
-
-

The whole game

-
    -
  • We want to estimate some population value (e.g. some average value - a mean)
  • -
  • Confidence intervals help us quantify uncertainty around that estimate
  • -
  • To construct a confidence interval, we use the standard error
  • -
  • We can estimate the SE by dividing the sample standard deviation by the square root of the number of people in our sample:
  • -
-

\[ -SE = \frac{\sigma}{\sqrt N} -\]

-
    -
  • Lower and upper limits of the confidence interval can be estimated as:
  • -
-

\[ -\text{CI limits} = mean \pm (1.96 \times{SE}) \\ -\]

-
-
-
-
-

-
-
-
-
-
-
-

Sampling from populations

-
-
-
    -
  • We want to learn something about the population
  • -
  • But we only have access to a sample (often a small one)
  • -
  • The sample estimate is our best guess
  • -
-


-
-
-
-
-


-
-
-
-
-
- -
-

What is an estimate?

-
-
-

An estimate can take many different forms. For example, we might be interested in comparing groups, in which case the estimate can be the difference in group means on some variable of interest. Or we might want to know whether two variables are associated, where the estimate is some measure of association (e.g. a correlation coefficient, or a b value which will be covered later in the term).

-
-
-
-
-


-
-
-
-
- -

Some average facts

-

The average person (apparently)…

-
    -
  • drinks 730 cups of coffee per year (twice as much for academics, incl. students) ☕

  • -
  • spends 192 minutes a day watching TV 📺

  • -
  • eats 250 cloves of garlic per year 🧄

  • -
  • takes 3500 steps each day 🚶

  • -
  • falls asleep in 7 minutes 😴

  • -
-
-
- -
-
-
-

Doomscrolling

-

“… refers to a unique media habit where social media users persistently attend to negative information in their newsfeeds about crises, disasters, and tragedies.”

-

- Sharma, Lee, and Johnson (2022)

-
-
-


A gif of a person scrolling endlessly on their phone

-
-
-
-
-
-
- -
-

Research question...

-
-
-

How much does an average person doomscroll?

-
-
-
-
-
- - -A group of stick figures representing the population
-
- - -A group of stick figures representing the population. Below is a sample of 4 stick figures, labeled with mean of 101 minutes per day.
-
- - -A group of stick figures representing the population. Below are two samples of stick figures drawn from the population, each showing a different value of the mean
-
- - -A group of stick figures representing the population. Below are three samples of stick figures drawn from the population, each showing a different value of the mean
-
- -

Uncertainty in research and estimation

-
- -
-
- -
-
- -
-
-
-

The problem: Each time we take a sample, we get a different estimate.

-
-
-
-
- -
-

How do we know if our estimate is accurate and close to the real population value?
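
To see the problem in action, here is a minimal R sketch. It assumes a made-up population of daily doomscrolling times (mean 100 minutes, SD 15); these numbers are illustrative and not from any real data.

```r
# Hypothetical population: the mean and SD below are invented for illustration
set.seed(1)
population <- rnorm(100000, mean = 100, sd = 15)

# Draw three independent samples of 4 people and compute the mean of each
sample_means <- replicate(3, mean(sample(population, size = 4)))
sample_means
```

Each run of `sample()` picks different people, so the three means differ even though the population has not changed.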

-
-
- -
-
-
-
-
-
-

-
-
-
-
-
-
- -


-
-
- -
-
-
-
-
-
-

-
-
-
-
-
-
-

Sampling distribution revisited

-
    -
  • We can plot the estimates in a histogram to see how they’re distributed
  • -
  • The x-axis shows the values of the estimates; the y-axis shows how many times each value occurs
  • -
-
- -
- -
-
-

Sampling distribution revisited

-
    -
  • We can repeat the process an infinite number of times - as long as our sample is large enough, the sampling distribution of the estimates will be approximately normal
  • -
  • What is “large enough”? Some textbooks say 30, but more is often needed in real research.
  • -
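
We can't sample an infinite number of times, but a simulation gets close. The sketch below assumes a hypothetical, deliberately skewed population (exponential, mean of roughly 100) and a sample size of 50; both are illustrative choices.

```r
set.seed(2024)

# Hypothetical skewed population (made-up values, not real data)
population <- rexp(100000, rate = 1 / 100)

# Take 10,000 samples of 50 and store the mean of each one
n <- 50
sample_means <- replicate(10000, mean(sample(population, size = n)))

# Histogram of the simulated sampling distribution of the mean
hist(sample_means, breaks = 40,
     main = "Simulated sampling distribution of the mean",
     xlab = "Sample mean")
```

Even though the population itself is skewed, the histogram of sample means should look roughly bell-shaped and be centred close to the population mean.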
-
- -
- -
-
-

Normal distribution is awesome

-
    -
  • Normal distribution is useful, because we know a lot about it - a good way to describe it is by using a mean (central value) and a standard deviation (spread of the scores)
  • -
  • Standard error is the term we use for standard deviation when talking about sampling distributions
  • -
- -
-
-

Normal distribution is awesome

-
    -
  • We know that 95% of scores will fall within approximately 2 standard errors from the mean (or 1.96 if we want to be precise)
  • -
  • We can use this knowledge to construct an interval around the mean showing us a range of plausible values for the population value
  • -
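
The value 1.96 can be checked directly in R with the built-in normal distribution functions. The variable names are just for illustration.

```r
# Proportion of a normal distribution lying within 1.96 SDs of its mean
coverage <- pnorm(1.96) - pnorm(-1.96)
coverage      # approximately 0.95

# The multiplier that cuts off the middle 95% of a normal distribution
z_crit <- qnorm(0.975)
z_crit        # approximately 1.96
```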
- -
-
-

- -
-
-

But wait…

-
-
-
-
- -
-

Warning

-
-
-

Error 404:Sampling distribution not found.

-
-
-
-
    -
  • Sampling distributions don’t exist “in the wild”. They are a hypothetical statistical concept.

  • -
  • Remember: standard error refers to the standard deviation of the sampling distribution (created by re-sampling and computing the mean an infinite number of times), but we only have access to one sample with one mean.

  • -
  • Therefore, if we want to use the standard error to construct an interval, we need to estimate it from our sample.

  • -
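
A simulation can show why estimating from one sample is reasonable. This sketch assumes a hypothetical normal population (mean 100, SD 15) and samples of 25; in real research we would only ever have the single sample.

```r
set.seed(42)

# Hypothetical population (made-up values for illustration)
population <- rnorm(100000, mean = 100, sd = 15)
n <- 25

# SD of many sample means: approximates the SD of the sampling distribution
sample_means <- replicate(10000, mean(sample(population, size = n)))
se_simulated <- sd(sample_means)

# Standard error estimated from just one sample
one_sample <- sample(population, size = n)
se_estimated <- sd(one_sample) / sqrt(n)

c(simulated = se_simulated, estimated = se_estimated)
```

The two values will not match exactly, because the single-sample estimate depends on which people happened to be sampled, but they should be in the same ballpark.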
-
-
-

Estimating the standard error

-

Equation:

-

\[ -SE = \frac{\sigma}{\sqrt N} -\]

-

Translation:

-

\[ -\text{standard error} = \frac{\text{sample standard deviation}}{\text{(the square root of) the sample size}} -\]

-

In R:

-
-
n <- length(data$variable)  # sample size
se <- sd(data$variable) / sqrt(n)
-
-
-
-

Estimating the standard error

-
Example:
-
    -
  • We collect a sample of 4 individuals.

  • -
  • Each person reports their daily doomscrolling time (in minutes): 86, 114, 97, 107

  • -
  • The mean for the sample is 101 minutes

  • -
  • The standard deviation (dividing by N - 1, because we are estimating it from a sample) is:

  • -
-

\[ -\sigma = \sqrt\frac{\sum(x_i - \bar{x})^2}{N - 1} = \sqrt\frac{(86-101)^2 + (114-101)^2 + (97 - 101)^2+(107-101)^2}{4 - 1} = 12.19 -\]

-
    -
  • Which makes the standard error:
  • -
-

\[ -SE = \frac{\sigma}{\sqrt{N}} = \frac{12.19}{\sqrt{4}} = 6.095 -\]
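
The same example can be reproduced in R. The variable names below are illustrative; note that `sd()` divides by N - 1, which is what gives 12.19 here.

```r
# Daily doomscrolling times (minutes) for the sample of 4
minutes <- c(86, 114, 97, 107)

n <- length(minutes)          # sample size
sample_mean <- mean(minutes)  # 101
sample_sd <- sd(minutes)      # about 12.19
se <- sample_sd / sqrt(n)     # about 6.10

c(mean = sample_mean, sd = sample_sd, se = se)
```

R keeps more decimal places than the hand calculation, so the SE comes out a fraction higher than 6.095.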

-
-
-

Confidence intervals

-

Average doomscrolling time for the sample: 101 minutes

-

Standard deviation: 12.19

-

Standard error: 6.095

-

\[ -\text{Lower CI limit} = \text{sample mean} - 1.96 \times\text{SE} \\ -\text{Upper CI limit} = \text{sample mean} + 1.96 \times\text{SE} -\]

-

\[ -\text{Lower CI limit} = 101 - 1.96 \times6.095 = 89.054\\ -\text{Upper CI limit} = 101 + 1.96 \times6.095 = 112.946 -\]
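
As a quick check in R, using the rounded mean and SE from this slide (variable names are illustrative):

```r
sample_mean <- 101
se <- 6.095

# Multiplier for a 95% interval under the normal distribution (about 1.96)
z_crit <- qnorm(0.975)

# Lower and upper limits in one step
ci <- sample_mean + c(-1, 1) * z_crit * se
round(ci, 2)   # 89.05 112.95
```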

-
-
-
-
-

-
-
-
-
- - - - -
-
-

Confidence intervals for small samples

-
    -
  • Remember: the sampling distribution of the mean will have a normal shape as long as the sample size is large enough

  • -
  • Smaller samples don’t approximate the normal sampling distribution very well. Because of this, we can’t rely on the value 1.96 to give us accurate intervals.

  • -
  • Instead, we can use the t-distribution

  • -
-



-
-
-
-
-

-
-
-
-
-
-
-

The t-distribution

-
-
-


-
- - - -

-
-


-
-
-
-
-
    -
  • The “critical t value” - i.e. the value of t which will give us the most accurate estimate of the confidence interval - depends on degrees of freedom (df), which in our case are related to the sample size.

  • -
  • When working with the mean, the degrees of freedom are calculated as N - 1.

  • -
-
-
- -
-
-

The t-distribution

-
-
-


-
-

-
-


-
-
-
-
-
    -
  • Instead of multiplying the standard error by 1.96, we multiply by the critical t value.

  • -
  • For example, in our sample of 4, the df is 4 - 1 = 3. Move the slider to df = 3 to see that the critical t value for df = 3 is 3.182

  • -
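
If the slider isn't to hand, the critical t value can also be looked up in R with `qt()`. We ask for the 0.975 quantile because a 95% interval leaves 2.5% in each tail.

```r
n <- 4
t_crit <- qt(0.975, df = n - 1)
round(t_crit, 3)   # 3.182
```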
-
-
- -
-
-

t-based confidence intervals:

-

Average doomscrolling time for the sample: 101 minutes

-

Standard error: 6.095

-

Critical t value: 3.182

-

\[ -\text{CI Limits} = \text{mean} \pm3.182 \times\text{SE} \\ -\text{CI Limits} = 101 \pm3.182 \times\text{6.095} \\ -\text{CI Limits} = [81.606, 120.394] -\]

-
    -
  • Compare the new confidence interval [81.61, 120.39] with the confidence interval we got using the value 1.96: [89.05, 112.95]. The new CI is wider!
  • -
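
Here is a sketch of the same t-based interval in R, computed by hand and then checked against the built-in `t.test()`. Because R does not round the SD and SE along the way, the limits can differ from the slide values in the second decimal place.

```r
minutes <- c(86, 114, 97, 107)
n <- length(minutes)

se <- sd(minutes) / sqrt(n)
t_crit <- qt(0.975, df = n - 1)

# Interval built by hand: mean plus/minus critical t times SE
ci_manual <- mean(minutes) + c(-1, 1) * t_crit * se

# The same interval as reported by a one-sample t-test
ci_builtin <- t.test(minutes)$conf.int

round(ci_manual, 2)
round(as.numeric(ci_builtin), 2)
```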
-
- -
- -
-
-

t-based confidence intervals:

-
-
-


-
-

-
-


-
-
-
-
-
    -
  • This is to be expected - we have a tiny sample (N = 4), so there is a lot of uncertainty around whether the estimate of 101 minutes is actually representative of the population.

  • -
  • The larger the sample, the tighter the confidence intervals, because the standard error shrinks and the critical t gets smaller and smaller (note how t approaches 1.96 as the sample size (df) increases)

  • -
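
To see the critical t shrink towards 1.96, we can compute it for a few degrees of freedom. The df values below are arbitrary examples.

```r
# A handful of example df values, from tiny to very large samples
df <- c(3, 10, 30, 100, 1000)

# Critical t for a 95% interval at each df
t_crit <- qt(0.975, df = df)

round(data.frame(df, t_crit), 3)
```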
-
-
- -
-
-

-
-
-
    -
  • We take samples over and over again, compute the mean, and construct confidence intervals around that mean - 95% of them will contain the population value, the remaining 5% will not.

  • -
  • This is known as an interval with 95% coverage. 95% is the most common value that we choose, but it can take on other values as well (e.g. 50%, 90%, 99%).

  • -
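
Coverage can be checked with a simulation. This sketch assumes a hypothetical normal population with a known mean of 100 and SD of 15, which is something we never have in real research; it also uses the t-based interval from `t.test()`.

```r
set.seed(123)

# Hypothetical population values (invented for illustration)
pop_mean <- 100
pop_sd <- 15
n <- 4
n_sims <- 10000

# For each simulated sample, does the 95% CI contain the population mean?
covered <- replicate(n_sims, {
  x <- rnorm(n, mean = pop_mean, sd = pop_sd)
  ci <- t.test(x)$conf.int
  ci[1] <= pop_mean && pop_mean <= ci[2]
})

coverage <- mean(covered)
coverage   # should be close to 0.95
```

Any single interval either contains the population mean or it doesn't; the 95% describes the long-run proportion across all the repeated samples.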
-
- -
-
-
-

-
-
-
-
-
-

-
-
-
-
-
-
-

How to interpret confidence intervals

-
-
-

\[ -\text{"The average doomscrolling time in our sample was} \\ -\text{101 minutes (SD = 12.19) 95% CI [81.61, 120.39]."} -\]

-
-
-
-
-


-
-
-
-
-
- -
-

Correct interpretation:

-
-
-

ASSUMING THAT our sample is one of the 95% producing confidence intervals that contain the population value, then the population value for time spent doomscrolling per day falls somewhere between 81.61 and 120.39 minutes.

-
-
-
-
-


-
-
-
-
-


-
-
-
-
-
- -
-

However...

-
-
-

There is no guarantee that the assumption above is correct! And we just have to live our lives not knowing…

-
-
-
-
-


-
-
-
-
-

How *not* to interpret confidence intervals

-
-
-
-
-

Drake meme, being unhappy with the interpretation of confidence intervals presented on this slide.

-
-
-
-
-
-
-
- -
-

Nope:

-
-
-

“We can be 95% confident that the population value falls between 81.61 and 120.39.”

-
-
-
-
-
-
-
-

How *not* to interpret confidence intervals

-
-
-
-
-
-
- -
-

Also nope:

-
-
-

“There is a 95% probability that the population value falls between 81.61 and 120.39.”

-
-
-
-
-
-
-

Drake meme, being unhappy with the interpretation of confidence intervals presented on this slide.

-
-
-
-
-
-
-

How to interpret confidence intervals

-
-
-
-
-

Drake meme, now happy with the interpretation of confidence intervals presented on this slide.

-
-
-
-
-
-
-
- -
-

Correct interpretation

-
-
-

ASSUMING THAT our sample is one of the 95% producing confidence intervals that contain the population value, then the population value for time spent doomscrolling per day falls somewhere between 81.61 and 120.39 minutes.

-
-
-
-
-
-
-
-

The whole game (again)

-
    -
  • We want to estimate some population value (e.g. some average value - a mean)
  • -
  • Confidence intervals help us quantify uncertainty around that estimate
  • -
  • To construct a confidence interval, we use the standard error which we can estimate as:
  • -
-

\[ -SE = \frac{\sigma}{\sqrt N} -\]

-
    -
  • Lower and upper limits of the confidence interval can be estimated as (replacing 1.96 with critical t for small samples):
  • -
-

\[ -\text{CI limits} = mean \pm (1.96 \times{SE}) \\ -\]

-
-
-
-
-

-
-
-
-
-
-
- -

The bigger picture…

-
-
-
    -
  • When interpreting estimates and confidence intervals for your sample - always consider them as just one of many different possible estimates

  • -
  • This is why replication is important in science - our sample could easily be the one that misses the population value

  • -
  • Always be wary of studies placing too much confidence in a single finding

  • -
-
-
-
-

-
-
-
-
-
-
-

Next week’s fun:

-

Putting it all into practice:

-
    -
  • Research questions

  • -
  • Good and bad hypotheses

  • -
  • Testing hypotheses with Null Hypothesis Significance Testing

  • -
  • A disappointing answer to why we’re so obsessed with the value 95%.

  • -
-
-
-

References

- -
-
-Sharma, Bhakti, Susanna S. Lee, and Benjamin K. Johnson. 2022. “The Dark at the End of the Tunnel: Doomscrolling on Social Media Newsfeeds.” Technology, Mind, and Behavior 3 (1). https://doi.org/10.1037/tmb0000059. -
-
-
-
-
- - - - - - - - - - - - - - - - - - - - - - - - - \ No newline at end of file