Lesson16.Rmd

---
title: "Lesson 16: Describing Categorical Data: Proportions; Sampling Distribution of a Sample Proportion"
output:
  html_document:
    theme: cerulean
    toc: true
    toc_float: false
---

<script type="text/javascript">
 function showhide(id) {
    var e = document.getElementById(id);
    e.style.display = (e.style.display == 'block') ? 'none' : 'block';
 }
</script>

<div style="float:right;width=40%;">
<br/>
<div style="padding-left:10%;">**Optional Lesson Video**</div>
<iframe width="90%" align="right" src="https://www.youtube.com/embed/videoseries?list=PLaZryQtbPQC_379i2-P7s9yhiqN2Vt6zD" frameborder="1" allow="autoplay; encrypted-media" allowfullscreen></iframe>
</div>

## Lesson Outcomes

<a href="javascript:showhide('oc')"><span style="font-size:8pt;">Show/Hide Outcomes</span></a>
<div id="oc" style="display:none;">
By the end of this lesson, you should be able to:

- Calculate and interpret a sample proportion.
- Summarize categorical data with a bar or pie chart.
- Determine the mean, standard deviation and shape of a distribution of sample proportions.
- Calculate probabilities using a distribution of sample proportions.
</div>
<br>


<div style="clear:both;"></div>

<br/>

## Case Study: Eating Habits of College Students

<div style="float:right;padding:10px;">
<img src="./Images/byui_crossroads.jpeg" width="300px">
</div>

<img src="./Images/Step1.png">


Chances are that as a college student you have probably thought to yourself, "It takes too much time to prepare healthy food." In fact, the American College Health Association estimates that only 6% of college students across the U.S. are eating the recommended "five or more servings of fruits and vegetables" daily.[^1] In an effort to better understand the barriers that were keeping college students from eating healthy, a team of researchers from the Department of Psychology at BYU-Idaho undertook a study.[^2] 

<br/>

<img src="./Images/Step2.png">

The team of researchers obtained a convenience sample of $n=517$ college students from General Psychology courses over three different semesters. <a href="javascript:showhide('articledetails')"><span style="font-size:8pt;">Click to Show/Hide More Sampling Details</span></a>

<div id="articledetails" style="display:none;padding-left:30px;padding-right:30px;">
As reported in their article, "Participation in this study was a required experiential component of the course, but students’ grades were based on participation rather than performance and alternate assignments were available." Further, use of the data for research purposes was approved by the university’s Institutional Review Board.
</div>

The survey contained many questions for the students to answer and found some fantastic conclusions about what keeps students from eating healthy. (If you are interested, [click here](https://byuiscroll.org/psychology-students-did-the-research-for-you/) to read an news article highlighting their findings.) One important question on the survey asked  each student whether or not they (the student) ate fresh fruits and vegetables as part of their daily diet. They were allowed to answer with either of

* Yes
* No

Unfortunately, not all students out of the 517 that were surveyed answered the question. We will summarize the results of the answers below as we also present details on how to describe categorical data.

<img src="./Images/Step3.png">

## Numerical and Graphical Summaries of Categorical Data

To describe categorical data, we need a different approach than what we used for quantitative data. Instead of discussing means (like $\mu$ or $\bar{x}$) we will work with proportions. A proportion is a number between 0 and 1 that measures the relative size of a part of something to the whole of that thing. In the current eating habits study, the researchers are interested in knowing the proportion of students who experience barriers to healthy eating. This proportion probably isn't 0 (which would imply no students experience barriers to healthy eating) and probably isn't 1 (which would imply all students experience barriers to healthy eating) but is likely somewhere in between.

<div class="myemphasis">
**Proportions**: A number between 0 and 1 that measures the size of a part to the whole.

The **sample proportion** is a sample statistic. It is written as $\hat{p}$. It is computed by taking the number of "successes" in the data, called $x$, and dividing by the total number of individuals in the sample, $n$ (the sample size).

$$
\text{Sample Proportion:} \quad \hat{p} = \frac{x}{n}
$$

The theoretical **true proportion** for a population is written as just $p$. It is a population parameter. To compute the *true proportion* we must know the total size of the population (or total possible outcomes) $N$, as well as the number of possible *successes* $X$. We rarely know $p$. But we can usually hypothesize a reasonable value for $p$ using a null hypothesis.

$$
 \text{True Proportion:} \quad p = \frac{X}{N} \quad \text{(Usually unknown)}
$$
</div>

Let's take a quick glance at the data for one of the questions from the eating habits study.

| Student | Do you regularly eat fresh fruits and vegetables? | 
|:--------|:-------------------------------------------------:|
| 1  | (No Answer) |
| 2  | (No Answer) |
| 3  | (No Answer) |
| 4  |  Yes |
| 5  |  No  |
| 6  |  No  |
| $\vdots$ | $\vdots$ |
| 515 | Yes |
| 516 | No |
| 517 | No |

Notice that in this data, some students chose not to answer this question, i.e., "No Answer". Other students answered "Yes" and other students answered "No". Our first step to computing the **proportion** of students who answered "Yes" (what we will call successes in this case) is to count up two numbers.

* First, we count how many students said "Yes". This turns out to give $x = 245$.
* Second, we count how many students answered the question. This gives our sample size as $n = 399$.
* Third, we divide to get

$$
 \hat{p} = \frac{x}{n} = \frac{245}{399} \approx 0.614
$$

This shows that 61.4% of students in our sample claim to regularly eat fresh fruits and vegetables as part of their daily diet. That implies that 38.6% (1 - 0.614 = 0.386) of students in our sample, which is a little more than 1 out of every 3 students, *are not* regularly eating fresh fruits and vegetables as part of their daily diet. 

### R-Instructions for Computing Sample Proportions

<div class="SoftwareHeading">R Instructions</div>
<div class="Software">

**To compute the sample proportion in R:**

* Use the `table(...)` function to count up how many times certain observations occur in a dataset.
* Use the `/` symbol to divide $x$ by $n$.

To practice, start by reading in the [EatingHealthy.xlsx](./Data/EatingHealthy.xlsx) data file into RStudio. (For a reminder on how to read in data, [click here](RHelp.html#reading-in-data.).

```{r, include=FALSE}
library(readxl)
EatingHealthy <- read_excel("./Data/EatingHealthy.xlsx")
```

Next, note that the "Yes" and "No" answers to the question "Do you regularly eat fresh fruits and vegetables?" is stored in the column of the dataset called `Eathealthy`. So we use the `table(` function, followed by the name of the dataset `EatingHealthy` followed by a `$` sign, followed by the column name `Eathealthy`.

```{r, comment=NA}
#Count up the "Yes" and "No" answers:
table(EatingHealthy$Eathealthy)
```

From this we learn that there are 118 people who did not answer this question (the "NA" means the answer is "Not Available"), 154 people answered "No", and 245 people answered "Yes". Considering only those that answered this question we have a total of $n=399$ people that answered:

```{r, comment=NA}
#Get the sample size by adding up "Yes" and "No" answers:
154 + 245
```

Dividing the number of "Yes" answers ($x=245$) by the total number of answers ($n=399$) gives the proportion of students in our sample claiming to eat a healthy diet, $\hat{p} = 0.614$ or 61.4%.

```{r, comment=NA}
#Compute the sample proportion:
245/399
```

</div>
<br/>


### Bar Charts

Bar charts are a powerful way to quickly display categorical data.  They provide a bar for each category in the data, and the height of the bar shows the number of observations from the sample that landed in that particular category.

We can represent the various answers of the students concerning whether or not they "regularly eat fresh fruits and vegetables" with a bar chart. 

```{r, echo=FALSE}
barplot(table(EatingHealthy$Eathealthy),
        col=c("orange","firebrick","skyblue"),
        main="College Students were Asked \n \"Do you regularly eat fresh fruits and vegetables?\"",
        xlab="Student Answers",
        ylab="Number of Students")
```

The above chart shows that the 118 students that did not respond to this question represent a substantial portion of the total students surveyed. That could be leading to some bias in the results because we don't know if the "NA" responses would land in the "Yes" or "No" group. But, if we consider just those that answered the question, it seems that many students (61.4% of those that answered the question) are doing a great job eating fresh fruits and vegetables as part of their diet.

### R Instructions for Bar Charts

<div class="SoftwareHeading">R Instructions</div>
<div class="Software">

**To make a bar chart in R:**

- First, read in the data: [EatingHealthy](./Data/EatingHealthy.xlsx). (Review of [reading in data](RHelp.html#reading-in-the-data).)
- Second, use the `table(...)` function, the assignment operator `<-`, and a name for the table of counts that you create (like `mytable`) to save a table of counts.

<div style="padding-left:40px;">

```{r, comment=NA}
#Save your table of counts:
mytable <- table(EatingHealthy$Eathealthy)
mytable
```

Sometimes you might not have kept track of the actual dataset, but you might still know the counts already. In that case, manually type the counts into R using the combine function `c(...)`:

```{r, eval=FALSE}
mytable <- c(`NA` = 118, `No` = 154, `Yes` = 245)
```

Note how the above code puts a label in front of each number. That will make it so the bar chart prints those labels underneath of each bar. Although it is not required, placing a back-tick on each side of your labels (as shown in the code above) is good practice.

</div>

- Use the `barplot(...)` function on your table of counts to create a bar chart of the values.

<div style="padding-left:40px;">

```{r, comment=NA}
#Create a plot of the counts in mytable:
barplot(mytable)
```

- Specify colors for the bar chart using either `col="skyblue"` to give all bars the same color, or use `col=c("orange","firebrick","skyblue")` to give each bar a different color. Of course, use any color names you like: [R Color Options](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf)
- Specify axes labeling with `main="A main title"`, `xlab="An x-axis label"`, and `ylab="A y-axis label"`.

```{r, comment=NA}
#Make a fancier graph using the options:
barplot(mytable,
        col=c("orange","firebrick","skyblue"),
        main="College Students were Asked \n \"Do you regularly eat fresh fruits and vegetables?\"",
        xlab="Student Answers",
        ylab="Number of Students")
```

</div>
<br>
</div>

<br/>

Before we move on, it is worth mentioning that bar charts can represent many categories rather than just "Yes" and "No" answers. For example, here is a bar chart showing the student answers to the question asking students about the barriers they faced to eating a healthy diet. (A healthy diet was stated to be a diet that included fresh fruits and vegetables.) Students were allowed to select as many of the following answers as they felt applicable in response to this question.

* It is too expensive to purchase fruits and vegetables.
* It takes too much time to prepare healthy food (e.g., fruits and vegetables).
* There are not enough options for healthy eating here.
* I am uncomfortable with healthy eating because it is something I am not familiar with.
* It is confusing to know what is considered "healthy eating" anymore.
* I prefer tasty, quick options such as fast food.
* I do not see positive results when I eat healthy (e.g., I am not full, I do not lose weight).
* Other, please specify.

<div style="float:right;background-color:gray;border-radius:5px;"><a href="javascript:showhide('coolplot')"><span style="font-size:12pt;color:white;">&nbsp; Show Plot Code &nbsp;</span></a></div>

<div style="clear:both;"></div>

<div id="coolplot" style="display:none;">

```{r, eval=FALSE}
#Change the outside margins of the plot
par(mai=c(0.1,1,.8,.1))

#Create table of counts column by column
#The square bracket [1] or [2] pulls out the count and leaves out the NA's
mytable <- c(table(EatingHealthy$Barrier1)[1],
             table(EatingHealthy$Barrier2)[1],
             table(EatingHealthy$Barrier3)[2],
             table(EatingHealthy$Barrier4)[1],
             table(EatingHealthy$Barrier5)[1],
             table(EatingHealthy$Barrier6)[1])

#Plot the table of counts
barplot(mytable,
        col=c("skyblue","orange","green3","purple","red2","wheat3"), 
        names="", #turn off the names, too long
        ylab="Number of Students", 
        main="College Student Barriers to Eating Healthy", 
        ylim=c(-100,300), #extend the y-axis
        yaxt='n') #turn off the default axis
#Add a legend to the plot
legend("bottom", #position of the legend
       cex=0.8, #shrink words to 80% of normal size
       legend=names(mytable), #give the names to the legend
       col=c("skyblue","orange","green3","purple","red2","wheat3"), 
       pch=15, #add square boxes to the left of each label
       bty='n') #turn off the box drawn around the legend
axis(side=2, at=c(0,50,100,150,200,250,300)) #add back on a y-axis at the places we want tick marks.
```

</div>
&nbsp;

```{r, echo=FALSE, fig.height=6}
par(mai=c(0.1,1,.8,.1))
mytable <- c(table(EatingHealthy$Barrier1)[1],
             table(EatingHealthy$Barrier2)[1],
             table(EatingHealthy$Barrier3)[2],
             table(EatingHealthy$Barrier4)[1],
             table(EatingHealthy$Barrier5)[1],
             table(EatingHealthy$Barrier6)[1])
barplot(mytable, col=c("skyblue","orange","green3","purple","red2","wheat3"), names="", ylab="Number of Students", main="College Student Barriers to Eating Healthy", ylim=c(-100,300), yaxt='n')
legend("bottom", cex=0.8, legend=names(mytable), col=c("skyblue","orange","green3","purple","red2","wheat3"), pch=15, bty='n')
axis(side=2, at=c(0,50,100,150,200,250,300))
```


<br>


<br>

<img src="./Images/Step4.png">

To learn how to *Make Inference* (Step 4 of the Statistical Process) about the true population proportion $p$, we first need to explore the sampling distribution of the sample proportion. You may recall studying the [sampling distribution of the sample mean](Lesson06.html#introduction-to-sampling-distributions) in Lesson 6. Using those ideas along with the z-score equation of Lesson 7, we were able to use the sample mean, $\bar{x}$, to make inference about the population mean, $\mu$. Now we will study how to use the sample proportion, $\hat{p}$, to make inference about the population proportion, $p$, by studying the sampling distribution of the sample proportion.


<br/>


## Sampling Distribution of the Sample Proportion

### Example: Tossing a Coin

<div style="float:right;padding:10px;">
<img src="./Images/coinflip.jpeg" width="200px">
</div>

If you were asked the question, "What is the probability of flipping a head with a fair coin?" you would likely correctly answer by stating that there is a 50% chance of getting a head. You know this because there are $N=2$ possibilities (a head on one side and a tail on the other) and only one side, $X=1$, can give a head. That gives the true proportion of heads to be $p=X/N=1/2 = 0.5$. However, if you have ever flipped a coin several times and recorded your results, you would know that you don't usually get exactly 50% heads. 

To demonstrate, we actually flipped a coin $n=25$ times and recorded our results. We got $x=12$ heads. That comes out to a sample proportion of $\hat{p} = 12/25 = 0.48$. While not exactly 50%, it is fairly close. We can represent the results in a simple bar chart.

```{r, echo=FALSE}
mytable <- c(`Heads`=12, `Tails`=13)
barplot(mytable, 
        col=c("skyblue","gray"), 
        ylab="Number of Flips",
        main="My Coin Flip Results")
```


Now it's your turn.

<div style="clear:both;"></div>

<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
1. Toss a coin $n=25$ times.  Keep track of the number of heads you observe in your $n=25$ flips. What is your sample proportion for the number of heads you flipped? 

<a href="javascript:showhide('Q1')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q1" style="display:none;">
* All answers will vary in this question. When we tossed the coin 25 times, we got 12 heads. 

This gives our proportion of heads as $\hat{p} = 12/25 = 0.48$. Your proportion will likely be different, but it would be *very*, *very* unlikely for you to get 0/25 or 25/25 heads.

</div>
<br>

2. What does a bar chart showing your results of heads and tails look like?

<a href="javascript:showhide('Q2')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q2" style="display:none;">

To create a bar chart of our heads and tails counts, we used the following code.

```{r}
mytable <- c(`Heads`=12, `Tails`=13)
barplot(mytable, 
        col=c("skyblue","gray"), 
        ylab="Number of Flips",
        main="My Coin Flip Results")
```

</div>
&nbsp;
</div>
<br>

If you just completed the two questions above, you likely got a different answer than what we got for the proportion of heads in your $n=25$ flips. (Although it is somewhat likely, 13.3% chance, that you would have also gotten 12/25 heads.) So why is it that if the true proportion is $p=0.5$, we all can get different answers for $\hat{p}$ when we each flip a coin $n=25$ times? As you likely already know, the reason that answers for $\hat{p}$ can vary from $p$ is because the coin flips are random. This is the very idea behind a *random sample* of data. In fact, this means that we should expect the sample proportion $\hat{p}$ to vary randomly from the true proportion $p$ anytime we take a sample of data. The only way to really know $p$ for sure would be to take a census and survey the entire population! That is typically too much work. And fortunately, thanks to statistically theory about the *sampling distribution of the sample proportion*, we can actually quantify how far away from the true proportion our sample proportion will be, say 95% of the time. Let's see how this work.

### Creating a Sampling Distribution

A true sampling distribution of a sample proportion is obtained by plotting the results from *all possible* samples from a population of size $n$. It isn't reasonable to try to create such a distribution in its entirety, but we can look at a quick example using the data file [CoinTossHeads.xlsx](./Data/CoinTossHeads.xlsx). This data file contains data representing a collection of 900 students' results, where each student tossed a coin $n=25$ times and calculated their proportion of heads. These 900 different sample proportions $\hat{p}$ represent 900 different samples (of $n=25$ coin flips in each sample) from the population of all coin flip possibilities. The goal of the sampling distribution of the sample proportion is to consider the set of all possible sample proportions that could be obtained when getting a sample from a population, and where those sample proportions would land, i.e., how they would be distributed.

The following histogram shows the distribution of the sample proportions $\hat{p}$ for each of the 900 students that participated in the coin flip experiment. It turns out that there were 134 students in the sample of $n=900$ that also flipped 12/25 heads like we did. However, there were also 137 students that flipped 13/25 heads. But notice that very few students (only 5) got as few as 5/25 heads and very few students (only 7) got as many as 19/25 heads. And no one (out of all 900 students that participated) got more than 19 heads (19/25 = 0.76) or fewer than 5 heads (5/25 = 0.2).

```{r, echo=FALSE}
CoinTossHeads <- read_excel("./Data/CoinTossHeads.xlsx")

hist(CoinTossHeads$ProportionHeads,
     col="wheat3",
     main="Resulting Sample Proportions from 900 Students \n Each Student Flipped a Coin 25 Times",
     ylab="Number of Students who Got the Given Proportion",
     xlab="Resulting Sample Proportion",
     breaks=seq(0.02,1,0.04),
     xlim=c(0,1))
curve(dnorm(x, 0.5, 0.1)*35, add=TRUE, n=1000)
```

<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">

3. Find the spot on the horizontal axis of the histogram indicating the proportion of heads ($\widehat{p}$) that *you* observed in Question 1.  Based on your visual observation, would you say your proportion of heads is unusual?

<a href="javascript:showhide('Q3')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q3" style="display:none;">
* Answers will vary.  This depends on your proportion of heads.  If the observed value is far to the right or left, then you would say that it was unusual.  Most students will observe values in the middle of the distribution.
</div>
<br>

4. Visually, estimate the mean and standard deviation of the observed sample proportions shown in the histogram. Please write your answer to this question before continuing.

<a href="javascript:showhide('Q4')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q4" style="display:none;">
* We estimate the mean of the sample proportions at 0.5 and the standard deviation of all of these sample proportions at 0.1. Almost all the sample proportions are found between .2 and .8, which represents the mean plus or minus three standard deviations of the mean.  Your answers may vary.
</div>
<br>

5. The proportion of heads that you will observe in $n=25$ tosses of a fair coin, $\widehat p$, is a random variable.  The true theoretical mean for this random variable is $p=0.5$.  Explain why this value would make sense.

<a href="javascript:showhide('Q5')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q5" style="display:none;">
* Since getting a "heads" is just as likely as getting a "tails", you should end up with half your flips landing as "tails". The theoretical proportion of $p=0.5$ is logical.
</div>
<br>

</div>

<br/>

### Mean and Standard Deviation of the Sample Proportion

<div class="myemphasis">
The symbol $\mu_{\hat{p}}$ represents the mean of the distribution of all possible sample proportions, and this mean is equal to $p$, the true proportion.

$$
\underbrace{\mu_\widehat{p}}_{\textrm{Mean of } \widehat{p}} = p
$$
The symbol $\sigma_{\hat{p}}$ represents the standard deviation of all possible sample proportions, and is equal to $\sqrt{p\cdot(1-p)/n}$.

$$
\underbrace{\sigma_\widehat{p}}_{\textrm{Standard Deviation of}~\widehat p} = \sqrt{\frac{p \cdot (1-p)}{n}}
$$
</div>

Both of the above results are visible in the previous histogram that was drawn of the 900 different sample proportions obtained from 900 different students who each flipped a coin 25 times and recorded the proportion of heads they flipped. The value of $\mu_{\hat{p}}$ being equal to $p$ tells us that the many possible sample proportions we could obtain from our sample are centered around the true proportion $p$. The value of $\sigma_{\hat{p}}$ tells us how far away from $p$ our particular sample proportion $\hat{p}$ may have landed. (Remember the idea that 95% of all possible values are within 1.96 standard deviations of the true mean?)

<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">

6. What is the mean and standard deviation of the sample proportion $\hat{p}$ when sampling $n=25$ coin flips?

<a href="javascript:showhide('Q6')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q6" style="display:none;">
The mean of all possible sample proportions is $p$ according to the statements made above. For a coin, the true proportion of heads is $p=0.5$. So, the mean of the sample proportion $\hat{p}$ when sampling $n=25$ coin flips is also 0.5.

$$
 \mu_{\hat{p}} = 0.5
$$


The true theoretical standard deviation of $\widehat p$ in this case is $\sigma_{\widehat{p}} = 0.1$.  This can be obtained using the formula

$$
\displaystyle{\text{Standard deviation of}~\widehat{p} = \sigma_{\widehat{p}} = \sqrt{ \frac{p(1-p)}{n} } }
$$

where $p$ is the true population proportion, which is also the mean of the distribution of $\widehat{p}$, which is 0.5 for this coin flip scenario.


$$
\text{Standard deviation of}~\widehat{p} = \sqrt{ \frac{p(1-p)}{n} } = \sqrt{ \frac{0.5(1-0.5)}{25} } = 0.1
$$

</div>
<br>

7. In [Lesson 5](Lesson05.html) the $z$-score for an observed data value was shown to be computed using the formula:

$$
\displaystyle { z = \frac{\textrm{value} - \textrm{mean}}{\textrm{standard deviation}} }
$$

Use the mean and standard deviation you came up with in your answer to Question 6 (above), to find the $z$-score corresponding to *your* sample proportion.

<a href="javascript:showhide('Q7')"><span style="font-size:8pt;">Show/Hide Solution</span></a>

<div id="Q7" style="display:none;">
<center>
$$
\displaystyle { z = \frac{\text{_____} - 0.5}{0.1} = \text{_____} }
$$
</center>
</div>
<br>

8. Based on the $z$-score you computed in question 7, is your observed proportion considered unusual?

<a href="javascript:showhide('Q8')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q8" style="display:none;">
* Answers will vary.  $z$-scores between -2 and 2 are not unusual.  If your $z$-score is less than -2 or greater than 2, then it is considered unusual.
</div>
&nbsp;
</div>

<br>
<br>
<div class="message Tip">You may want to refresh your memory on our definition of "unusual events" in the [Normal Distributions](Lesson05.html){target="_blank"} lesson</div>
<br>
<br>

### Central Limit Theorem for the Sample Proportion

If the sample size is large, the sample proportion, $\widehat p$, will be approximately normally distributed.  This is a direct consequence of the Central Limit Theorem.

#### How Large is Large Enough?

We can apply the Central Limit Theorem to a sample proportion (and conclude that $\widehat p$ follows a normal distribution) if both of the following conditions are satisfied:

- $np \ge 10$
- $n(1-p) \ge 10$

It is important to check both conditions.  If one of them is not satisfied, we cannot conclude that $\widehat p$ follows a normal distribution. Observe that the effect of these two conditions is that if $p$ is very close to 0 or 1, then $\widehat{p}$ isn't close to normal unless $n$ is very large.

<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">

9. Is the sample size of $n=25$ large enough for the sample proportion of the number of heads obtained when flipping a coin to be considered normal?

<a href="javascript:showhide('Q9')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q9" style="display:none;">
Yes. By using $p=0.5$ and $n=25$ we compute both requirements.

* $np = (25)(0.5) = 12.5 > 10$ (Yes, checks out.)
* $n(1-p) = (25)(1-0.5) = 12.5 > 10$ (Yes, checks out.)

Since both requirements are satisfied, the distribution of $\hat{p}$ will be normal. We saw that this is true when looking at the previous histogram showing the distribution of sample proportions from all 900 students.

</div>

</div>

<br/>

## Probability Calculations for a Sample Proportion

<br>
<div class="message Note">**Remember:**  We can use the Normal Probability Applet to find probabilities associated with any normally distributed random variable with known mean and standard deviation.</div>
<br>
<br>

If the sample size is sufficiently large, we can use the Normal Probability Applet to make probability calculations for proportions, just as we did for means.  First, we need to find the $z$-score.  This can be done with the equation:

$$
z = \frac{\textrm{value} - \textrm{mean}}{\textrm{standard deviation}}
= \frac{\widehat p - p}{\sqrt{\frac{p \cdot (1-p)}{n}}}
$$

Then, we can enter this $z$-score in the Normal Probability Applet to find the area more extreme than the $z$-score.

#### Worked Example

If a coin is tossed $n=25$ times, and heads is observed 17 times, the sample proportion of heads is
$\displaystyle{\widehat p = \frac{x}{n} = \frac{17}{25} = 0.68}$. We will find the probability that a sample proportion will exceed 0.68. Said differently, we will calculate the probability that someone would flip more than 17 heads out of 25 flips of a coin.

First, we compute the $z$-score corresponding to $\hat{p} = 0.68$.  We can use the mean and standard deviation, which were calculated back in Question 6, or just simply substitute the values of $p=0.5$ and $n=25$ in the equation for the $z$-score.

$$
z = \frac{\textrm{value} - \textrm{mean}}{\textrm{standard deviation}}
= \frac{\widehat p - p}{\sqrt{\frac{p \cdot (1-p)}{n}}}
= \frac{0.68 - 0.5}{\sqrt{\frac{0.5 (1-0.5)}{25}}}
= \frac{0.18}{0.1}
= 1.800
$$

Next, we enter the $z$-score (1.800) in the [Normal Probability Applet](https://byuimath.com/apps/normprob.html) and shade the area to the right of this value.

<a href="https://byuimath.com/apps/normprob.html?z=1.8&mid=0&left=0" target="_blank"><img src="./Images/SampDist-25CoinTosses-Applet.png"></a>

The area to the right of $z=1.800$ is $0.0359$.

<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
11. Suppose a student flipped a coin $n=25$ times and got a sample proportion of $\widehat p = 0.44$, or 44% of their coin tosses resulted in heads.  Find the $z$-score corresponding to this value.

<a href="javascript:showhide('Q11')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q11" style="display:none;">

$$
  z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} = \frac{0.44 - 0.5}{\sqrt{0.5(1-0.5)}{25}} = -0.600
$$

* The $z$-score for a $\hat{p}$ of 0.44 is $z=-0.60$.
</div>
<br>

12. Use the $z$-score you computed in Question 11 to find the probability that the proportion of successes, $\widehat p$, will be greater than 0.44 if a coin is tossed $n=25$ times.  

<a href="javascript:showhide('Q12')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q12" style="display:none;">

Using the normal applet we get:

<img src="./Images/normalapp_L16_q12.png" width="500px">


* So the probability of observing a sample proprotion greater than 0.44 is $0.7257$ as shown by the "Area" box in the Normal Probability Applet for a z-score of $z=-0.6$.

</div>
<br>

13. For $n=25$ coin tosses, find the probability that the sample proportion would be between $0.44$ and $0.56$. 

<a href="javascript:showhide('Q13')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q13" style="display:none;">

We already know that the z-score for $\hat{p}=0.44$ is $z=-0.6$ from our work in Question 11. Now we compute the z-score for $\hat{p}=0.56$. 

$$
  z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} = \frac{0.56 - 0.5}{\sqrt{0.5(1-0.5)}{25}} = 0.600
$$

Noticing that this z-score is positive, $z=0.6$, we shade the middle area on the Normal Probability Applet.

<img src="./Images/normalapp_L16_q13.png" width="500px">

* The probability the sample proportion would be between $\hat{p}=0.44$ ($z=-0.6$) and $\hat{p}=0.56$ ($z=0.6$) is $0.4515$. 
</div>
&nbsp;
</div>
<br>

<img src="./Images/Political-symbols-democrat-republican-o.png">

#### Example: Political Elections

During political elections in the United States, residents are inundated with polls.  Many people conduct polls to estimate the proportion of the population that will vote for each candidate.  The pollsters report the number of people who were contacted and the proportion who said they would favor a particular candidate.  The poll results are a prediction of the future election results.

In these polls, individuals are asked the question, "If the election were held today, which candidate would you most likely support?"  In one survey, $n=1024$ people were polled, and $x=565$ of the respondents said that they would vote for the Republican candidate. In this case, the "proportion" of people who favored the Republican candidate was:
$$
\widehat p = \frac{x}{n} = \frac{565}{1024} = 0.552
$$
That suggests that 55.2% of the people polled plan to vote for the Republican.  This does not mean that this candidate will win the election.  However, it looks like they might be in the lead at this point.

Consider the following question:

"If the true proportion of people who support a particular political candidate is $p=0.48$, and if $n=1041$ people are surveyed, what is the probability that the results of the survey will suggest that the candidate will win the election?"

To address this question, we first note that the survey will suggest that the candidate will win if more than 50% of the people surveyed favor the candidate. So, we need to find the following probability: $P(\widehat p > 0.5)$.  First we find the $z$ score:

$$
z = \frac{\widehat p - p}{\sqrt{\frac{p(1-p)}{n}}} = \frac{0.5-0.48}{\sqrt{\frac{0.48(1-0.48)}{1041}}} = 1.292
$$

Now, we look up this value using the Normal Probability Applet and find the area to the right. Using the Normal Probability Applet, we find that $P(\widehat p > 0.5)=0.0982$.  So, even though this candidate is actually behind in the popular vote, there is a chance of 0.0982 that they will appear to be winning!

This calculation was done in the same way we have done normal calculations in the past.  The only difference is that instead of using $\bar x$ and its mean and standard deviation, we used $\widehat p$ and its mean and standard deviation.  Otherwise, they are the same.

<br>

## Summary

<div class="SummaryHeading">Remember...</div>
<div class="Summary">

- **Pie charts** are used when you want to represent the observations as part of a whole, where each slice (sector) of the pie chart represents a proportion or percentage of the whole.

- **Bar charts** present the same information as pie charts and are used when our data represent counts. A **Pareto chart** is a bar chart where the height of the bars is presented in descending order.

- The sample proportion is calculated by $\displaystyle{\widehat p = \frac{x}{n}}$

- The sample proportion $\widehat p$ is a point estimator for the true population proportion $p$. 

- The sampling distribution of $\widehat p$ has a mean of $\mu_{\widehat{p}} = p$ and a standard deviation of $\sigma_{\widehat{p}} = \displaystyle{\sqrt{\frac{p\cdot(1-p)}{n}}}$

- If $np \ge 10$ and $n(1-p) \ge 10$, then the sampling distribution of the sample proportion will be normally distributed and you can conduct **probability calculations** for $\widehat{p}$ using the Normal Probability Applet and the z-score equation:
$$
\displaystyle {z = \frac{\textrm{value} - \textrm{mean}}{\textrm{standard deviation}}
= \frac{\widehat p - p}{\sqrt{\frac{p \cdot (1-p)}{n}}}}
$$

- To compute a sample proportion in R, use the `table(...)` function and the division symbol `/`. ([Click here](Lesson16.html#r-instructions-for-computing-sample-proportions) to review.)

- To create a bar chart of data in R, use the `barplot(...)` function after saving your `table(...)` of counts. ([Click here](Lesson16.html#r-instructions-for-bar-charts) to review.)

<br>
</div>
<br>

## Navigation

<center>
| **Previous Reading** | **This Reading** | **Next Reading** |
| :------------------: | :--------------: | :--------------: |
| [Lesson 15: <br> Review for Exam 2](Lesson15.html) | Lesson 16: <br> Describing Categorical Data: Proportions; <br> Sampling Distribution of a Sample Proportion | [Lesson 17: <br> Inference for One Proportion](Lesson17.html) |
</center>

[^1]: American College Health Association (2014). American College Health Association-National College Health Assessment II: Undergraduate students reference group executive summary spring 2014. Hanover, MD: American College Health Association.

[^2]: Robert R. Wright, Jack Shuai, Yovanny Maldonado, and Caleb Nelson, all originally from the Department of Psychology at BYU-Idaho, published their study results under the title "The CENTS Program" which is currently under review.