From 9c2e0508a25b7d256b17195a346c8830e40d0530 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 17:39:40 +0100 Subject: [PATCH 01/31] Remove top level imports and add warning supression --- source/inference.md | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/source/inference.md b/source/inference.md index bb1d78d4..1f2d2d80 100644 --- a/source/inference.md +++ b/source/inference.md @@ -19,18 +19,16 @@ kernelspec: :tags: [remove-cell] import altair as alt -import numpy as np -import pandas as pd -from sklearn.utils import resample -alt.data_transformers.disable_max_rows() +from myst_nb import glue +import warnings -# alt.data_transformers.enable('data_server') -# alt.renderers.enable('mimetype') -from myst_nb import glue +warnings.filterwarnings("ignore", category=FutureWarning) +alt.data_transformers.disable_max_rows() ``` ## Overview + A typical data analysis task in practice is to draw conclusions about some unknown aspect of a population of interest based on observed data sampled from that population; we typically do not get data on the *entire* population. Data From 86b25ae15a8f65918864fb012d3128a53488d20d Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 17:40:45 +0100 Subject: [PATCH 02/31] Remove commented out sections --- source/inference.md | 45 ++++----------------------------------------- 1 file changed, 4 insertions(+), 41 deletions(-) diff --git a/source/inference.md b/source/inference.md index 1f2d2d80..669a5cc2 100644 --- a/source/inference.md +++ b/source/inference.md @@ -39,6 +39,7 @@ populations and then introduce two common techniques in statistical inference: *point estimation* and *interval estimation*. ## Chapter learning objectives + By the end of the chapter, readers will be able to do the following: * Describe real-world examples of questions that can be answered with statistical inference. @@ -55,6 +56,7 @@ By the end of the chapter, readers will be able to do the following: +++ ## Why do we need sampling? + We often need to understand how quantities we observe in a subset of data relate to the same quantities in the broader population. For example, suppose a retailer is considering selling iPhone accessories, and they want to estimate @@ -153,17 +155,7 @@ includes an ID number, neighborhood, type of room, the number of people the rental accommodates, number of bathrooms, bedrooms, beds, and the price per night. - - ```{code-cell} ipython3 -import altair as alt import pandas as pd airbnb = pd.read_csv("data/listings.csv") @@ -213,15 +205,7 @@ We will use the `sample` method of the `pandas.DataFrame` object to take the sample. The argument `n` of `sample` is the size of the sample to take. ```{code-cell} ipython3 -:tags: [remove-cell] - -# Instead, perhaps we can approximate it with a small subset of data! -# To investigate this idea, let's try randomly selecting 40 listings (*i.e.,* taking a random sample of -# size 40 from our population), and computing the proportion for that sample. -# We will use the `rep_sample_n` function \index{rep\_sample\_n} from the `infer` -# package \index{infer} to take the sample. The arguments of `rep_sample_n` are (1) the data frame to -# sample from, and (2) the size of the sample to take. -``` +import numpy as np ```{code-cell} ipython3 sample_1 = airbnb.sample(n=40, random_state=12) @@ -311,15 +295,7 @@ each listing belongs. Above, `pandas.DataFrame` by default shows the first and l rows, so we can verify that we indeed created 20,000 samples (or replicates). 
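If a reader wants to confirm this programmatically rather than by scanning the printed rows, a small check along the following lines would work. This is only a sketch; it assumes the `samples` data frame built above (20,000 samples of size 40, labeled by `replicate`) is in scope.

```python
# Sketch: confirm the shape of the combined samples data frame.
num_replicates = samples["replicate"].nunique()            # distinct sample labels
rows_per_replicate = samples.groupby("replicate").size().unique()

print(num_replicates)      # expected: 20000
print(rows_per_replicate)  # expected: [40]
```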
-```{code-cell} ipython3 -:tags: [remove-cell] - -# Notice that the column `replicate` indicates the replicate, or sample, to which -# each listing belongs. Above, since by default R only prints the first few rows, -# it looks like all of the listings have `replicate` set to 1. But you can -# check the last few entries using the `tail()` function to verify that -# we indeed created 20,000 samples (or replicates). -``` ++++ Now that we have obtained the samples, we need to compute the proportion of entire home/apartment listings in each sample. @@ -331,19 +307,6 @@ Both the first and last few entries of the resulting data frame are printed below to show that we end up with 20,000 point estimates, one for each of the 20,000 samples. ```{code-cell} ipython3 -:tags: [remove-cell] - -# Now that we have obtained the samples, we need to compute the -# proportion of entire home/apartment listings in each sample. -# We first group the data by the `replicate` variable—to group the -# set of listings in each sample together—and then use `summarize` -# to compute the proportion in each sample. -# We print both the first and last few entries of the resulting data frame -# below to show that we end up with 20,000 point estimates, one for each of the 20,000 samples. -``` - -```{code-cell} ipython3 -# calculate the the number of observations in each sample with room_type == 'Entire home/apt' sample_estimates = ( samples.query("room_type == 'Entire home/apt'") .groupby("replicate")["room_type"] From 9760f170d105a9025d4cecd519eb80cbee3e64d5 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 17:41:51 +0100 Subject: [PATCH 03/31] Remove unnecessary assignment, set the seed properly, and use value_counts instead of manual computation --- source/inference.md | 38 ++++++++++++++------------------------ 1 file changed, 14 insertions(+), 24 deletions(-) diff --git a/source/inference.md b/source/inference.md index 669a5cc2..4a429669 100644 --- a/source/inference.md +++ b/source/inference.md @@ -166,27 +166,25 @@ Suppose the city of Vancouver wants information about Airbnb rentals to help plan city bylaws, and they want to know how many Airbnb places are listed as entire homes and apartments (rather than as private or shared rooms). Therefore they may want to estimate the true proportion of all Airbnb listings where the -"type of place" is listed as "entire home or apartment." Of course, we usually +room type is listed as "entire home or apartment." Of course, we usually do not have access to the true population, but here let's imagine (for learning purposes) that our data set represents the population of all Airbnb rental -listings in Vancouver, Canada. We can find the proportion of listings where -`room_type == "Entire home/apt"`. +listings in Vancouver, Canada. +We can find the proportion of listings for each room type +by using the `value_counts` function with the `normalize` parameter +as we did in previous chapters. 
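As a reminder of what the `normalize` parameter does, here is a minimal sketch on a made-up series; the values are hypothetical and not taken from the Airbnb data.

```python
import pandas as pd

# Toy example: with normalize=True, value_counts returns proportions
# (each count divided by the total) instead of raw counts.
toy = pd.Series(
    ["Entire home/apt", "Private room", "Entire home/apt", "Shared room"]
)
print(toy.value_counts())                # raw counts
print(toy.value_counts(normalize=True))  # proportions, which sum to 1
```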
```{index} pandas.DataFrame; df[], count, len ``` ```{code-cell} ipython3 -population_summary = pd.DataFrame() -population_summary["n"] = [airbnb.query("room_type == 'Entire home/apt'")["id"].count()] -population_summary["proportion"] = population_summary["n"] / len(airbnb) - -population_summary +airbnb['room_type'].value_counts(normalize=True) ``` ```{code-cell} ipython3 :tags: [remove-cell] -glue("population_proportion", round(population_summary["proportion"][0], 3)) +glue("population_proportion", airbnb['room_type'].value_counts(normalize=True)['Entire home/apt'].round(3)) ``` We can see that the proportion of `Entire home/apt` listings in @@ -202,25 +200,23 @@ Instead, perhaps we can approximate it with a small subset of data! To investigate this idea, let's try randomly selecting 40 listings (*i.e.,* taking a random sample of size 40 from our population), and computing the proportion for that sample. We will use the `sample` method of the `pandas.DataFrame` -object to take the sample. The argument `n` of `sample` is the size of the sample to take. +object to take the sample. The argument `n` of `sample` is the size of the sample to take +and since we are starting to use randomness here, +we are also setting the random seed via numpy to make the results reproducible. ```{code-cell} ipython3 import numpy as np -```{code-cell} ipython3 -sample_1 = airbnb.sample(n=40, random_state=12) -airbnb_sample_1 = pd.DataFrame() -airbnb_sample_1["n"] = [sample_1.query("room_type == 'Entire home/apt'")["id"].count()] -airbnb_sample_1["proportion"] = airbnb_sample_1["n"] / len(sample_1) +np.random.seed(155) -airbnb_sample_1 +airbnb.sample(n=40)['room_type'].value_counts(normalize=True) ``` ```{code-cell} ipython3 :tags: [remove-cell] -glue("sample_1_proportion", round(airbnb_sample_1["proportion"][0], 2)) +glue("sample_1_proportion", airbnb.sample(n=40, random_state=155)['room_type'].value_counts(normalize=True)['Entire home/apt'].round(3)) ``` Here we see that the proportion of entire home/apartment listings in this @@ -234,13 +230,7 @@ if we were to take *another* random sample of size 40 and compute the proportion we would not get the same answer: ```{code-cell} ipython3 -sample_2 = airbnb.sample(n=40, random_state=1234) - -airbnb_sample_2 = pd.DataFrame() -airbnb_sample_2["n"] = [sample_2.query("room_type == 'Entire home/apt'")["id"].count()] -airbnb_sample_2["proportion"] = airbnb_sample_2["n"] / len(sample_2) - -airbnb_sample_2 +airbnb.sample(n=40)['room_type'].value_counts(normalize=True) ``` Confirmed! We get a different value for our estimate this time. From 18364de0174a2ceaf3a925a1f003f92556e155e3 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 17:42:21 +0100 Subject: [PATCH 04/31] Replace for loop with list comprehension --- source/inference.md | 34 ++++++++++------------------------ 1 file changed, 10 insertions(+), 24 deletions(-) diff --git a/source/inference.md b/source/inference.md index 4a429669..b757fc1b 100644 --- a/source/inference.md +++ b/source/inference.md @@ -255,35 +255,21 @@ expect our sample proportions from this population to vary for samples of size 4 ``` We again use the `sample` to take samples of size 40 from our -population of Airbnb listings. But this time we use a for loop -to take 20,000 samples of size 40. +population of Airbnb listings. But this time we use a list comprehension +to repeat an operation multiple time (as in the previous chapter). 
+In this case we are taking 20,000 samples of size 40 +and to make it clear which rows in the data frame come +which of the 20,000 samples, +we also add a column called `replicate` with this information. ```{code-cell} ipython3 -:tags: [remove-cell] - -# We again use the `rep_sample_n` to take samples of size 40 from our -# population of Airbnb listings. But this time we set the `reps` argument to 20,000 to specify -# that we want to take 20,000 samples of size 40. \index{rep\_sample\_n!reps argument} -# \index{rep\_sample\_n!size argument} -``` - -```{code-cell} ipython3 -np.random.seed(1) - -samples = [] -for rep in range(20000): - sample = airbnb.sample(40) - sample = sample.assign(replicate=rep) - samples.append(sample) -samples = pd.concat([samples[i] for i in range(len(samples))]) - +samples = pd.concat([airbnb.sample(40).assign(replicate=n) for n in range(20_000)]) samples ``` -Notice that the column `replicate` indicates the replicate, or sample, to which -each listing belongs. Above, `pandas.DataFrame` by default shows the first and last few -rows, so we can verify that -we indeed created 20,000 samples (or replicates). +Since the column `replicate` indicates the replicate/sample number, +we can verify that we indeed seem to have 20,0000 samples +starting at sample 0 and ending at sample 19,999. +++ From 1e26ae598de97e043c218732b81a112c7f370e76 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 17:42:47 +0100 Subject: [PATCH 05/31] Use value_counts instead of manual computation --- source/inference.md | 20 +++++++------------- 1 file changed, 7 insertions(+), 13 deletions(-) diff --git a/source/inference.md b/source/inference.md index b757fc1b..af623a4e 100644 --- a/source/inference.md +++ b/source/inference.md @@ -284,19 +284,13 @@ below to show that we end up with 20,000 point estimates, one for each of the 20 ```{code-cell} ipython3 sample_estimates = ( - samples.query("room_type == 'Entire home/apt'") - .groupby("replicate")["room_type"] - .count() - .reset_index() - .rename(columns={"room_type": "counts"}) + samples + .groupby('replicate') + ['room_type'] + .value_counts(normalize=True) + .reset_index(name='sample_proportion') + .query('room_type=="Entire home/apt"') ) - -# calculate the proportion -sample_estimates["sample_proportion"] = sample_estimates["counts"] / 40 - -# drop the count column -sample_estimates = sample_estimates.drop(columns=["counts"]) - sample_estimates ``` @@ -312,7 +306,7 @@ sampling distribution directly for learning purposes. :tags: [remove-output] sampling_distribution = alt.Chart(sample_estimates).mark_bar().encode( - x=alt.X("sample_proportion", bin=alt.Bin(maxbins=12), title="Sample proportions"), + x=alt.X("sample_proportion", title="Sample proportions", bin=alt.Bin(maxbins=20)), y=alt.Y("count()", title="Count"), ) From dfbf592e8ed50c39ad3f160843ab489b079917ff Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 17:42:54 +0100 Subject: [PATCH 06/31] Fix typo --- source/inference.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/source/inference.md b/source/inference.md index af623a4e..cab4e770 100644 --- a/source/inference.md +++ b/source/inference.md @@ -328,9 +328,9 @@ Sampling distribution of the sample proportion for sample size 40. 
```{code-cell} ipython3 :tags: [remove-cell] -glue("sample_propotion_center", round(sample_estimates["sample_proportion"].mean(), 1)) -glue("sample_propotion_min", round(sample_estimates["sample_proportion"].min(), 1)) -glue("sample_propotion_max", round(sample_estimates["sample_proportion"].max(), 1)) +glue("sample_proportion_center", round(sample_estimates["sample_proportion"].mean(), 1)) +glue("sample_proportion_min", round(sample_estimates["sample_proportion"].min(), 1)) +glue("sample_proportion_max", round(sample_estimates["sample_proportion"].max(), 1)) ``` ```{index} sampling distribution; shape @@ -338,9 +338,9 @@ glue("sample_propotion_max", round(sample_estimates["sample_proportion"].max(), The sampling distribution in {numref}`fig:11-example-proportions7` appears to be bell-shaped, is roughly symmetric, and has one peak. It is centered -around {glue:}`sample_propotion_center` and the sample proportions -range from about {glue:}`sample_propotion_min` to about -{glue:}`sample_propotion_max`. In fact, we can +around {glue:}`sample_proportion_center` and the sample proportions +range from about {glue:}`sample_proportion_min` to about +{glue:}`sample_proportion_max`. In fact, we can calculate the mean of the sample proportions. ```{code-cell} ipython3 From c8508905d49e3a9cfb0cae8de7c142d19bd8b8d4 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 17:43:10 +0100 Subject: [PATCH 07/31] Correct viz syntax --- source/inference.md | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/source/inference.md b/source/inference.md index cab4e770..76860295 100644 --- a/source/inference.md +++ b/source/inference.md @@ -381,15 +381,13 @@ We can visualize the population distribution of the price per night with a histo ```{code-cell} ipython3 :tags: [remove-output] -population_distribution = ( - alt.Chart(airbnb) - .mark_bar() - .encode( - x=alt.X( - "price", bin=alt.Bin(maxbins=30), title="Price per night (Canadian dollars)" - ), - y=alt.Y("count()", title="Count"), - ) +population_distribution = alt.Chart(airbnb).mark_bar().encode( + x=alt.X( + "price", + bin=alt.Bin(maxbins=30), + title="Price per night (Canadian dollars)" + ), + y=alt.Y("count()", title="Count"), ) population_distribution From 6fddb54ca7efa4f7e326b09e08bf05259791a21e Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 17:43:41 +0100 Subject: [PATCH 08/31] Remove unnecessary assignment, and simplify computation --- source/inference.md | 19 ++++++------------- 1 file changed, 6 insertions(+), 13 deletions(-) diff --git a/source/inference.md b/source/inference.md index 76860295..dff1defc 100644 --- a/source/inference.md +++ b/source/inference.md @@ -418,15 +418,13 @@ Along with visualizing the population, we can calculate the population mean, the average price per night for all the Airbnb listings. ```{code-cell} ipython3 -population_parameters = airbnb["price"].mean() - -population_parameters +airbnb["price"].mean() ``` ```{code-cell} ipython3 :tags: [remove-cell] -glue("population_mean", round(population_parameters, 2)) +glue("population_mean", airbnb["price"].mean().round(2)) ``` ```{index} population; parameter @@ -448,7 +446,7 @@ access to the population data and simulate taking one random sample of 40 listings in Python, again using `sample`. 
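Note that this call relies on the NumPy seed set earlier in the chapter for reproducibility rather than a per-call `random_state` argument: when `random_state` is not given, `DataFrame.sample` draws from NumPy's global random state. A minimal sketch of that behavior on a made-up data frame (the seed value 123 is purely illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(100)})

# With no random_state argument, DataFrame.sample uses NumPy's global
# random state, so seeding NumPy first makes the draw reproducible.
np.random.seed(123)            # hypothetical seed, just for illustration
first_draw = df.sample(5)

np.random.seed(123)            # re-seeding reproduces the same rows
second_draw = df.sample(5)

print(first_draw.equals(second_draw))  # expected: True
```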
```{code-cell} ipython3 -one_sample = airbnb.sample(40, random_state=1) +one_sample = airbnb.sample(40) ``` We can create a histogram to visualize the distribution of observations in the @@ -479,19 +477,14 @@ Distribution of price per night (Canadian dollars) for sample of 40 Airbnb listi ::: ```{code-cell} ipython3 -estimates = one_sample["price"].mean() - -estimates +one_sample["price"].mean() ``` ```{code-cell} ipython3 :tags: [remove-cell] -glue("estimate_mean", round(estimates, 2)) -glue( - "diff_perc", - round(100 * abs(estimates - population_parameters) / population_parameters, 1), -) +glue("estimate_mean", one_sample["price"].mean().round(2)) +glue("diff_perc", round(100 * abs(1 - (one_sample['price'].mean() / airbnb['price'].mean())), 1)) ``` The average value of the sample of size 40 From 75576d938c315cf6929cf71e9ab00a87f8aec216 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 17:44:24 +0100 Subject: [PATCH 09/31] Remove unnecessary second loop --- source/inference.md | 23 ++++++++--------------- 1 file changed, 8 insertions(+), 15 deletions(-) diff --git a/source/inference.md b/source/inference.md index dff1defc..6907b60e 100644 --- a/source/inference.md +++ b/source/inference.md @@ -511,26 +511,19 @@ get a sense for this variation. In this case, we'll use 20,000 samples of size 40. ```{code-cell} ipython3 -np.random.seed(2) - -samples = [] -for rep in range(20000): - sample = airbnb.sample(40) - sample = sample.assign(replicate=rep) - samples.append(sample) -samples = pd.concat([samples[i] for i in range(len(samples))]) - samples ``` -Now we can calculate the sample mean for each replicate and plot the sampling -distribution of sample means for samples of size 40. - ```{code-cell} ipython3 -sample_estimates = samples.groupby("replicate")["price"].mean().reset_index().rename( - columns={"price": "sample_mean"} +sample_estimates = ( + samples + .query('room_type == "Entire home/apt"') + .groupby('replicate') + ['price'] + .mean() + .reset_index() + .rename(columns={'price': 'sample_mean'}) ) - sample_estimates ``` From d93648fa056836a413767b94ecd5eff1d680e625 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 17:45:56 +0100 Subject: [PATCH 10/31] Simplify plotting syntax and clarify variable name --- source/inference.md | 91 ++++++++++++++++----------------------------- 1 file changed, 32 insertions(+), 59 deletions(-) diff --git a/source/inference.md b/source/inference.md index 6907b60e..b53da531 100644 --- a/source/inference.md +++ b/source/inference.md @@ -527,29 +527,28 @@ sample_estimates = ( sample_estimates ``` +Now we can calculate the sample mean for each replicate and plot the sampling +distribution of sample means for samples of size 40. 
+ ```{code-cell} ipython3 :tags: [remove-output] -sampling_distribution_40 = ( - alt.Chart(sample_estimates) - .mark_bar() - .encode( - x=alt.X( - "sample_mean", - bin=alt.Bin(maxbins=30), - title="Sample mean price per night (Canadian dollars)", - ), - y=alt.Y("count()", title="Count"), - ) +sampling_distribution = alt.Chart(sample_estimates).mark_bar().encode( + x=alt.X( + "sample_mean", + bin=alt.Bin(maxbins=30), + title="Sample mean price per night (Canadian dollars)", + ), + y=alt.Y("count()", title="Count"), ) -sampling_distribution_40 +sampling_distribution ``` ```{code-cell} ipython3 :tags: [remove-cell] -glue("fig:11-example-means4", sampling_distribution_40) +glue("fig:11-example-means4", sampling_distribution) ``` :::{glue:figure} fig:11-example-means4 @@ -561,8 +560,8 @@ Sampling distribution of the sample means for sample size of 40. ```{code-cell} ipython3 :tags: [remove-cell] -glue("quantile_1", round(int(sample_estimates["sample_mean"].quantile(0.25)), -1)) -glue("quantile_3", round(int(sample_estimates["sample_mean"].quantile(0.75)), -1)) +glue("quantile_1", round(int(sample_estimates["sample_mean"].quantile(0.25)), - 1)) +glue("quantile_3", round(int(sample_estimates["sample_mean"].quantile(0.75)), - 1)) ``` ```{index} sampling distribution; shape @@ -603,54 +602,28 @@ was \$`r round(mean(airbnb$price),2)`. ```{code-cell} ipython3 :tags: [remove-input] -( - ( - alt.Chart(airbnb, title="Population") - .mark_bar(clip=True) - .encode( - x=alt.X( - "price", - bin=alt.Bin(maxbins=30), - title="Price per night (Canadian dollars)", - axis=alt.Axis(values=list(range(50, 601, 50))), - scale=alt.Scale(domain=(min(airbnb["price"]), 600)), - ), - y=alt.Y("count()", title="Count"), - ) - .properties(width=400, height=150) - ) - & ( - alt.Chart(one_sample, title="Sample (n = 40)") - .mark_bar(clip=True) - .encode( +glue( + "fig:11-example-means5", + alt.vconcat( + population_distribution.mark_bar(clip=True).encode( x=alt.X( "price", bin=alt.Bin(maxbins=30), title="Price per night (Canadian dollars)", - axis=alt.Axis(values=list(range(50, 601, 50))), - scale=alt.Scale(domain=(min(airbnb["price"]), 600)), - ), - y=alt.Y("count()", title="Count"), - ) - .properties(width=400, height=150) - ) - & ( - alt.Chart( - sample_estimates, - title="Sampling distribution of the mean for samples of size 40", - ) - .mark_bar(clip=True) - .encode( - x=alt.X( - "sample_mean", - bin=True, - title="Sample mean price per night (Canadian dollars)", - axis=alt.Axis(values=list(range(50, 601, 50))), - scale=alt.Scale(domain=(min(airbnb["price"]), 600)), - ), - y=alt.Y("count()", title="Count"), - ) - .properties(width=400, height=150) + scale=alt.Scale(domainMax=700) + ) + ).properties( + title='Population', height=150 + ), + sample_distribution.properties(title="Sample (n = 40)").properties(height=150), + sampling_distribution.properties( + title=alt.TitleParams( + "Sampling distribution of the mean", + subtitle="For 20,000 samples of size 40" + ) + ).properties(height=150) + ).resolve_scale( + x='shared' ) ) ``` From ad623dc067ada431785e0c45e3bb9780db3cfc99 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 20:26:40 +0100 Subject: [PATCH 11/31] Simplify comparison plot of sample sizes --- source/inference.md | 213 ++++++++++---------------------------------- 1 file changed, 46 insertions(+), 167 deletions(-) diff --git a/source/inference.md b/source/inference.md index b53da531..6e239260 100644 --- a/source/inference.md +++ b/source/inference.md @@ -646,178 +646,57 @@ sampling distribution of 
the sample mean. We indicate the mean of the sampling distribution with a red vertical line. ```{code-cell} ipython3 -:tags: [remove-cell] - -# Initially thought of using a loop, but Jupyter book failed to build because "cell execution -# timed out..." -# ## Sampling n = 20, 50, 100, 500 -# np.random.seed(2) -# sample_dict = {} -# for sample_n in [20, 50, 100, 500]: -# samples = [] -# for rep in range(20000): -# sample = airbnb.sample(sample_n) -# sample = sample.assign(replicate=rep) -# samples.append(sample) -# samples = pd.concat([samples[i] for i in range(len(samples))]) - -# sample_dict[f"sample_estimates_{sample_n}"] = ( -# samples.groupby("replicate")["price"] -# .mean() -# .reset_index() -# .rename(columns={"price": "sample_mean"}) -# ) -``` - -```{code-cell} ipython3 -:tags: [remove-cell] - -sample_dict = {} - -# Sampling n = 20 -samples = [] -for rep in range(20000): - sample = airbnb.sample(20) - sample = sample.assign(replicate=rep) - samples.append(sample) -samples = pd.concat([samples[i] for i in range(len(samples))]) - -sample_dict[f"sample_estimates_20"] = ( - samples.groupby("replicate")["price"] - .mean() - .reset_index() - .rename(columns={"price": "sample_mean"}) -) -``` - -```{code-cell} ipython3 -:tags: [remove-cell] - -# Sampling n = 50 -samples = [] -for rep in range(20000): - sample = airbnb.sample(50) - sample = sample.assign(replicate=rep) - samples.append(sample) -samples = pd.concat([samples[i] for i in range(len(samples))]) - -sample_dict[f"sample_estimates_50"] = ( - samples.groupby("replicate")["price"] - .mean() - .reset_index() - .rename(columns={"price": "sample_mean"}) -) -``` - -```{code-cell} ipython3 -:tags: [remove-cell] - -# Sampling n = 100 -samples = [] -for rep in range(20000): - sample = airbnb.sample(100) - sample = sample.assign(replicate=rep) - samples.append(sample) -samples = pd.concat([samples[i] for i in range(len(samples))]) - -sample_dict[f"sample_estimates_100"] = ( - samples.groupby("replicate")["price"] - .mean() - .reset_index() - .rename(columns={"price": "sample_mean"}) -) -``` - -```{code-cell} ipython3 -:tags: [remove-cell] - -# Sampling n = 500 -samples = [] -for rep in range(20000): - sample = airbnb.sample(500) - sample = sample.assign(replicate=rep) - samples.append(sample) -samples = pd.concat([samples[i] for i in range(len(samples))]) +:tags: [remove-input] -sample_dict[f"sample_estimates_500"] = ( - samples.groupby("replicate")["price"] - .mean() - .reset_index() - .rename(columns={"price": "sample_mean"}) +# Plot sampling distributions for multiple sample sizes +base = alt.Chart( + pd.concat([ + pd.concat([ + airbnb.sample(sample_size).assign(sample_size=sample_size, replicate=replicate) + for sample_size in [20, 50, 100, 500] + ]) + for replicate in range(20_000) + ]).groupby( + ["sample_size", 'replicate'], + as_index=False + )["price"].mean(), + height=150 ) -``` - -```{code-cell} ipython3 -:tags: [remove-cell] - -## Plot sampling distribution n = 20, 50, 100, 500 -sample_plot = {} -plot_min_x = sample_dict["sample_estimates_20"]["sample_mean"].min() -plot_max_x = sample_dict["sample_estimates_20"]["sample_mean"].max() - - -# def max_bins(distribution): -# if int(distribution.split("_")[-1]) >= 100: -# return 10 -# else: -# return 30 - - -def text_y(distribution): - sample_n = int(distribution.split("_")[-1]) - if sample_n == 20: - return 2000 - elif sample_n == 50: - return 3000 - elif sample_n == 100: - return 4500 - else: - return 10000 - -for sample_n, df in sample_dict.items(): - sample_plot[sample_n] = ( - 
alt.Chart(df, title=f"n = {sample_n.split('_')[-1]}") - .mark_bar() - .encode( - x=alt.X( - "sample_mean", - title="Sample mean price per night (Canadian dollars)", - bin=alt.Bin(extent=[80, 300], step=50/7), # maxbins=max_bins(sample_n) - scale=alt.Scale(domain=(plot_min_x, plot_max_x)), - axis=alt.Axis(values=list(range(80, 301, 20))) - ), - y=alt.Y("count()", title="Count"), +glue( + "fig:11-example-means7", + alt.layer( + base.mark_bar().encode( + alt.X('price', bin=alt.Bin(maxbins=30)), + alt.Y('count()') + ), + base.mark_rule(color='coral', size=3).encode( + x='mean(price)' + ), + base.mark_text(align='left', color='coral', size=12, fontWeight='bold', dx=10).transform_aggregate( + mean_price = 'mean(price)', + ).transform_calculate( + label = "'Mean = ' + round(datum.mean_price * 10) / 10" + ).encode( + x=alt.X('mean_price:Q', title="Sample mean price per night (Canadian dollars)"), + y=alt.value(10), + text='label:N' ) - ) - - sample_mean = sample_dict[sample_n]["sample_mean"].mean() - sample_plot[sample_n] = ( - sample_plot[sample_n] - + alt.Chart(pd.DataFrame({"x": [sample_mean]})) - .mark_rule(color="red", size=2) - .encode(x="x") - + ( - alt.Chart( - pd.DataFrame( - { - "x": [plot_max_x - 20], - "y": [text_y(sample_n)], - "text": [f"mean = {round(sample_mean, 1)}"], - } - ) + ).facet( + alt.Facet( + 'sample_size', + header=alt.Header( + title='', + labelFontWeight='bold', + labelFontSize=12, + labelPadding=3, + labelExpr='"Sample size = " + datum.value' ) - .mark_text(dy=-5, size=15) - .encode(x="x", y="y", text="text") - ) - ).properties(width=350, height=250) -``` - -```{code-cell} ipython3 -:tags: [remove-input] - -(sample_plot["sample_estimates_20"] | sample_plot["sample_estimates_50"]) & ( - sample_plot["sample_estimates_100"] | sample_plot["sample_estimates_500"] + ), + columns=1, + ).resolve_scale( + y='independent' + ) ) ``` From 0e39c33b17a3b9f9f0e17f0c7eabbf1274c18d61 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 20:26:57 +0100 Subject: [PATCH 12/31] Use bullets to make points clearer --- source/inference.md | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/source/inference.md b/source/inference.md index 6e239260..f5095c03 100644 --- a/source/inference.md +++ b/source/inference.md @@ -713,13 +713,16 @@ Comparison of sampling distributions, with mean highlighted as a vertical red li ``` Based on the visualization in {numref}`fig:11-example-means7`, three points -about the sample mean become clear. First, the mean of the sample mean (across -samples) is equal to the population mean. In other words, the sampling -distribution is centered at the population mean. Second, increasing the size of -the sample decreases the spread (i.e., the variability) of the sampling -distribution. Therefore, a larger sample size results in a more reliable point -estimate of the population parameter. And third, the distribution of the sample -mean is roughly bell-shaped. +about the sample mean become clear: + +1. The mean of the sample mean (across + samples) is equal to the population mean. In other words, the sampling + distribution is centered at the population mean. +2. Increasing the size of + the sample decreases the spread (i.e., the variability) of the sampling + distribution. Therefore, a larger sample size results in a more reliable point + estimate of the population parameter. +3. The distribution of the sample mean is roughly bell-shaped. 
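Readers who want to check the first two points numerically can adapt a sketch like the one below. It assumes the `airbnb` data frame loaded earlier is available; the two sample sizes and the 1,000 replicates are only illustrative choices to keep the run fast.

```python
import pandas as pd

population_mean = airbnb["price"].mean()

for sample_size in [20, 500]:
    # Re-simulate a small sampling distribution of the mean for this sample size.
    means = pd.Series(
        [airbnb.sample(sample_size)["price"].mean() for _ in range(1_000)]
    )
    # Point 1: the sample means are centered near the population mean.
    # Point 2: their spread shrinks as the sample size grows.
    print(
        f"n = {sample_size}: "
        f"mean of sample means = {means.mean():.2f} "
        f"(population mean = {population_mean:.2f}), "
        f"std of sample means = {means.std():.2f}"
    )
```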
> **Note:** You might notice that in the `n = 20` case in {numref}`fig:11-example-means7`, > the distribution is not *quite* bell-shaped. There is a bit of skew towards the right! @@ -736,6 +739,7 @@ mean is roughly bell-shaped. +++ ### Summary + 1. A point estimate is a single value computed using a sample from a population (e.g., a mean or proportion). 2. The sampling distribution of an estimate is the distribution of the estimate for all possible samples of a fixed size from the same population. 3. The shape of the sampling distribution is usually bell-shaped with one peak and centered at the population mean or proportion. From 0c0a956222e328fa5eea504dc180243839ad8e7b Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 20:27:06 +0100 Subject: [PATCH 13/31] Remove display option --- source/inference.md | 6 ------ 1 file changed, 6 deletions(-) diff --git a/source/inference.md b/source/inference.md index f5095c03..db34f9b8 100644 --- a/source/inference.md +++ b/source/inference.md @@ -892,12 +892,6 @@ listings in Vancouver, Canada, using a single sample size of 40. Recall our point estimate was \${glue:}`estimate_mean`. The histogram of prices in the sample is displayed in {numref}`fig:11-bootstrapping1`. -```{code-cell} ipython3 -:tags: [remove-cell] - -pd.set_option('display.max_rows', 20) -``` - ```{code-cell} ipython3 one_sample ``` From 9315dfede81332b7c2ab617f2ec51c4bb5851ff4 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 22:15:07 +0100 Subject: [PATCH 14/31] Remove incorrect query for means --- source/inference.md | 1 - 1 file changed, 1 deletion(-) diff --git a/source/inference.md b/source/inference.md index db34f9b8..b7e67666 100644 --- a/source/inference.md +++ b/source/inference.md @@ -517,7 +517,6 @@ samples ```{code-cell} ipython3 sample_estimates = ( samples - .query('room_type == "Entire home/apt"') .groupby('replicate') ['price'] .mean() From 60a104d1d00b353c7344fa9cf539682e8acc952c Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 22:15:33 +0100 Subject: [PATCH 15/31] Reduce height to make histograms easier to read --- source/inference.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/source/inference.md b/source/inference.md index b7e67666..de035eaa 100644 --- a/source/inference.md +++ b/source/inference.md @@ -809,7 +809,7 @@ for sample_n in [10, 20, 50, 100, 200]: ), y=alt.Y("count()", title="Count"), ) - ).properties(width=350, height=250) + ).properties(width=350, height=150) # add title and standardize the x axis ticks for population histogram population_distribution.title = "Population distribution" population_distribution.encoding["x"]["bin"] = alt.Bin(extent=[0, 600], step=20) @@ -827,7 +827,7 @@ glue( ) & ( sample_distribution_dict["sample_distribution_200"] - | population_distribution.properties(width=350, height=250) + | population_distribution.properties(width=350, height=150) ) ), ) From 4fd81f1a57f6072a6cc39e6598cd38f22726630a Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 22:16:22 +0100 Subject: [PATCH 16/31] Change from sklearns `resample` and for loops to keep using `.sample` and list comprehensions --- source/inference.md | 30 ++++++++++++------------------ 1 file changed, 12 insertions(+), 18 deletions(-) diff --git a/source/inference.md b/source/inference.md index de035eaa..47fce52f 100644 --- a/source/inference.md +++ b/source/inference.md @@ -929,15 +929,17 @@ this sample and estimate are the only data we can work with. 
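Before walking through the full implementation, a tiny sketch can make "sampling with replacement" concrete: in a bootstrap sample the same observation can appear several times while others are left out entirely. The values below are made up.

```python
import pandas as pd

toy = pd.Series([10, 20, 30, 40, 50])

# frac=1 draws as many values as the series has; replace=True allows
# repeats, so some values typically appear more than once and others
# do not appear at all.
print(toy.sample(frac=1, replace=True).tolist())
```

Running this a few times produces a different mix of repeats each time, which is exactly the sample-to-sample variability that the bootstrap exploits.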
We now perform steps 1–5 listed above to generate a single bootstrap sample in Python and calculate a point estimate from that bootstrap sample. We will -use the `resample` function from the `scikit-learn` package. Critically, note that we now -pass `one_sample`—our single sample of size 40—as the first argument. -And since we need to sample with replacement, -we keep the argument for `replace` to its default value of `True`. +continue using the `sample` function of our dataframe, +Critically, note that we now +set `frac=1` ("fraction") to indicate that we want to draw as many samples as there are rows in the dataframe +(we could also have set `n=40` but then we would need to manually keep track of how many rows there are). +Since we need to sample with replacement when bootstrapping, +we change the `replace` parameter to `True`. ```{code-cell} ipython3 :tags: [] -boot1 = resample(one_sample, replace=True, n_samples=40, random_state=2) +boot1 = one_sample.sample(frac=1, replace=True) boot1_dist = alt.Chart(boot1).mark_bar().encode( x=alt.X( "price", @@ -972,21 +974,13 @@ mimic drawing another sample from the population by drawing one from our origina sample. Let's now take 20,000 bootstrap samples from the original sample (`one_sample`) -using `resample`, and calculate the means for +and calculate the means for each of those replicates. Recall that this assumes that `one_sample` *looks like* our original population; but since we do not have access to the population itself, this is often the best we can do. ```{code-cell} ipython3 -np.random.seed(2) - -boot20000 = [] -for rep in range(20000): - sample = resample(one_sample, replace=True, n_samples=40) - sample = sample.assign(replicate=rep) - boot20000.append(sample) -boot20000 = pd.concat([boot20000[i] for i in range(len(boot20000))]) - +boot20000 = pd.concat([one_sample.sample(frac=1, replace=True).assign(replicate=n) for n in range(1_000)]) boot20000 ``` @@ -1040,7 +1034,7 @@ our point estimate to behave if we took another sample. 
```{code-cell} ipython3 boot20000_means = boot20000.groupby("replicate")["price"].mean().reset_index().rename( - columns={"price": "mean"} + columns={"price": "sample_mean"} ) boot20000_means @@ -1051,8 +1045,8 @@ boot20000_means boot_est_dist = alt.Chart(boot20000_means).mark_bar().encode( x=alt.X( - "mean", - bin=alt.Bin(extent=[95, 245], step=5), + "sample_mean", + bin=alt.Bin(maxbins=20), title="Sample mean price per night (Canadian dollars)", ), y=alt.Y("count()", title="Count"), From a287e8fd20bc926a85dad11dc59287ea90479e42 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 22:16:48 +0100 Subject: [PATCH 17/31] Simplify visualization code --- source/inference.md | 144 +++++++++++--------------------------------- 1 file changed, 36 insertions(+), 108 deletions(-) diff --git a/source/inference.md b/source/inference.md index 47fce52f..98a44ce6 100644 --- a/source/inference.md +++ b/source/inference.md @@ -1070,72 +1070,28 @@ the true sampling distribution—which corresponds to taking many samples fr ```{code-cell} ipython3 :tags: [remove-input] -samples = [] -for rep in range(20000): - sample = airbnb.sample(40) - sample = sample.assign(replicate=rep) - samples.append(sample) -samples = pd.concat([samples[i] for i in range(len(samples))]) - -sample_estimates = samples.groupby("replicate")["price"].mean().reset_index().rename( - columns={"price": "sample_mean"} -) - -plot_min_x = sample_estimates["sample_mean"].min() -plot_max_x = sample_estimates["sample_mean"].max() -sampling_mean = sample_estimates["sample_mean"].mean() -boot_sampling_mean = boot20000_means["mean"].mean() - -sampling_dist = alt.Chart(sample_estimates).mark_bar().encode( - x=alt.X( - "sample_mean", - bin=alt.Bin(extent=[95, 245], step=5), - # scale=alt.Scale(domain=(plot_min_x, plot_max_x)), - title="Sample mean price per night (Canadian dollars)", - ), - y=alt.Y("count()", title="Count"), -) - -annotated_sampling_dist = ( - sampling_dist - + alt.Chart(pd.DataFrame({"x": [sampling_mean]}), title="Sampling distribution") - .mark_rule(color="red", size=2) - .encode(x="x") - + ( - alt.Chart( - pd.DataFrame( - { - "x": [plot_max_x - 20], - "y": [2000], - "text": [f"mean = {round(sampling_mean, 1)}"], - } - ) +alt.vconcat( + alt.layer( + sampling_distribution, + sampling_distribution.mark_rule(color='coral', size=2).encode(x='mean(sample_mean)', y=alt.Y()), + sampling_distribution.mark_text(color='coral', size=12, align='left', dx=16, fontWeight='bold').encode( + x='mean(sample_mean)', + y=alt.value(7), + text=alt.value(f"Mean = {sampling_distribution['data']['sample_mean'].mean().round(1)}") ) - .mark_text(dy=-5, size=15) - .encode(x="x", y="y", text="text") - ) -) - -annotated_boot_est_dist = boot_est_dist + ( - alt.Chart(pd.DataFrame({"x": [boot_sampling_mean]}), title="Bootstrap distribution") - .mark_rule(color="red", size=2) - .encode(x="x") - + ( - alt.Chart( - pd.DataFrame( - { - "x": [plot_max_x - 20], - "y": [1500], - "text": [f"mean = {round(boot_sampling_mean, 1)}"], - } - ) + ).properties(title='Sampling distribution', height=150), + alt.layer( + boot_est_dist, + boot_est_dist.mark_rule(color='coral', size=2).encode(x='mean(sample_mean)', y=alt.Y()), + boot_est_dist.mark_text(color='coral', size=12, align='left', dx=18, fontWeight='bold').encode( + x='mean(sample_mean)', + y=alt.value(6), + text=alt.value(f"Mean = {boot_est_dist['data']['sample_mean'].mean().round(1)}") ) - .mark_text(dy=-5, size=15) - .encode(x="x", y="y", text="text") - ) + ).properties(title='Bootstrap distribution', 
height=150) +).resolve_scale( + x='shared' ) - -annotated_sampling_dist | annotated_boot_est_dist ``` ```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 @@ -1226,17 +1182,15 @@ To do this in Python, we can use the `percentile()` function from the `numpy` pa ``` ```{code-cell} ipython3 -import numpy as np -bounds = np.percentile(boot20000_means["mean"], [2.5, 97.5]) - -bounds +ci_bounds = boot20000_means["sample_mean"].quantile([0.025, 0.975]) +ci_bounds ``` ```{code-cell} ipython3 :tags: [remove-cell] -glue("ci_lower", round(bounds[0], 2)) -glue("ci_upper", round(bounds[1], 2)) +glue("ci_lower", round(ci_bounds[0.025], 2)) +glue("ci_upper", round(ci_bounds[0.975], 2)) ``` Our interval, \${glue:}`ci_lower` to \${glue:}`ci_upper`, captures @@ -1244,45 +1198,19 @@ the middle 95\% of the sample mean prices in the bootstrap distribution. We can visualize the interval on our distribution in {numref}`fig:11-bootstrapping9`. ```{code-cell} ipython3 -:tags: [remove-input] - -boot_est_dist + ( - ( - alt.Chart(pd.DataFrame({"x": [bounds[0]]})) - .mark_rule(color="#E69F00", size=3, strokeDash=[8, 8]) - .encode(x="x") - ) - + ( - alt.Chart( - pd.DataFrame( - { - "x": [bounds[0] - 10], - "y": [1600], - "text": [f"2.5th percentile = {round(bounds[0], 2)}"], - } - ) - ) - .mark_text(dy=-5, size=12) - .encode(x="x", y="y", text="text") - ) -) + ( - ( - alt.Chart(pd.DataFrame({"x": [bounds[1]]})) - .mark_rule(color="#E69F00", size=3, strokeDash=[8, 8]) - .encode(x="x") - ) - + ( - alt.Chart( - pd.DataFrame( - { - "x": [bounds[1]], - "y": [1600], - "text": [f"97.5th percentile = {round(bounds[1], 2)}"], - } - ) - ) - .mark_text(dy=-5, size=12) - .encode(x="x", y="y", text="text") +alt.layer( + boot_est_dist, + alt.Chart().mark_rule(color='coral', size=3, strokeDash=[5]).encode(x=alt.datum(ci_bounds[0.025])), + alt.Chart().mark_text(color='coral', size=12, fontWeight='bold').encode( + x=alt.datum(ci_bounds[0.025]), + y=alt.value(-10), + text=alt.datum(f'2.5th percentile ({ci_bounds[0.025].round(2)})') + ), + alt.Chart().mark_rule(color='coral', size=3, strokeDash=[5]).encode(x=alt.datum(ci_bounds[0.975])), + alt.Chart().mark_text(color='coral', size=12, fontWeight='bold').encode( + x=alt.datum(ci_bounds[0.975]), + y=alt.value(-10), + text=alt.datum(f'97.5th percentile ({ci_bounds[0.975].round(2)})') ) ) ``` From be19704db98200affa4c7026f5c97e0e6fae7007 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 22:17:07 +0100 Subject: [PATCH 18/31] Fix minor formatting and wording issue --- source/inference.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/source/inference.md b/source/inference.md index 98a44ce6..3f453128 100644 --- a/source/inference.md +++ b/source/inference.md @@ -1115,7 +1115,7 @@ There are two essential points that we can take away from distribution and the bootstrap distribution are similar; the bootstrap distribution lets us get a sense of the point estimate's variability. The second important point is that the means of these two distributions are -different. The sampling distribution is centered at +slightly different. The sampling distribution is centered at \${glue:}`population_mean`, the population mean value. However, the bootstrap distribution is centered at the original sample's mean price per night, \${glue:}`one_sample_mean`. Because we are resampling from the @@ -1170,8 +1170,6 @@ confidence level. To calculate a 95\% percentile bootstrap confidence interval, we will do the following: -+++ - 1. 
Arrange the observations in the bootstrap distribution in ascending order. 2. Find the value such that 2.5\% of observations fall below it (the 2.5\% percentile). Use that value as the lower bound of the interval. 3. Find the value such that 97.5\% of observations fall below it (the 97.5\% percentile). Use that value as the upper bound of the interval. From c718eac287d8db2dbdc5f0af69d6f6de69ce815f Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 22:50:27 +0100 Subject: [PATCH 19/31] Clarify language around sampling --- source/inference.md | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/source/inference.md b/source/inference.md index 3f453128..c5c6b838 100644 --- a/source/inference.md +++ b/source/inference.md @@ -506,13 +506,12 @@ took another random sample from the population, our estimate's value might change. So then, did we just get lucky with our point estimate above? How much does our estimate vary across different samples of size 40 in this example? Again, since we have access to the population, we can take many samples and -plot the sampling distribution of sample means for samples of size 40 to -get a sense for this variation. In this case, we'll use 20,000 samples of size -40. - -```{code-cell} ipython3 -samples -``` +plot the sampling distribution of sample means to get a sense for this variation. +In this case, we'll use the 20,000 samples of size +40 that we already stored in the `samples` variable. +First we will calculate the sample mean for each replicate +and then plot the sampling +distribution of sample means for samples of size 40. ```{code-cell} ipython3 sample_estimates = ( @@ -526,9 +525,6 @@ sample_estimates = ( sample_estimates ``` -Now we can calculate the sample mean for each replicate and plot the sampling -distribution of sample means for samples of size 40. - ```{code-cell} ipython3 :tags: [remove-output] From fb3184b24107bffa0e4133a285d28288179a90c2 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 22:51:07 +0100 Subject: [PATCH 20/31] Switch to orange annotations for better contrast --- source/inference.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/source/inference.md b/source/inference.md index c5c6b838..df9e361f 100644 --- a/source/inference.md +++ b/source/inference.md @@ -638,7 +638,7 @@ reliable—is there any way to improve the estimate? One way to improve a point estimate is to take a *larger* sample. To illustrate what effect this has, we will take many samples of size 20, 50, 100, and 500, and plot the sampling distribution of the sample mean. We indicate the mean of the sampling -distribution with a red vertical line. +distribution with a orange vertical line. ```{code-cell} ipython3 :tags: [remove-input] @@ -665,10 +665,10 @@ glue( alt.X('price', bin=alt.Bin(maxbins=30)), alt.Y('count()') ), - base.mark_rule(color='coral', size=3).encode( + base.mark_rule(color='#f58518', size=3).encode( x='mean(price)' ), - base.mark_text(align='left', color='coral', size=12, fontWeight='bold', dx=10).transform_aggregate( + base.mark_text(align='left', color='#f58518', size=12, fontWeight='bold', dx=10).transform_aggregate( mean_price = 'mean(price)', ).transform_calculate( label = "'Mean = ' + round(datum.mean_price * 10) / 10" @@ -699,7 +699,7 @@ glue( :name: fig:11-example-means7 :figclass: caption-hack -Comparison of sampling distributions, with mean highlighted as a vertical red line. 
+Comparison of sampling distributions, with mean highlighted as a vertical orange line. ``` +++ @@ -1069,8 +1069,8 @@ the true sampling distribution—which corresponds to taking many samples fr alt.vconcat( alt.layer( sampling_distribution, - sampling_distribution.mark_rule(color='coral', size=2).encode(x='mean(sample_mean)', y=alt.Y()), - sampling_distribution.mark_text(color='coral', size=12, align='left', dx=16, fontWeight='bold').encode( + sampling_distribution.mark_rule(color='#f58518', size=2).encode(x='mean(sample_mean)', y=alt.Y()), + sampling_distribution.mark_text(color='#f58518', size=12, align='left', dx=16, fontWeight='bold').encode( x='mean(sample_mean)', y=alt.value(7), text=alt.value(f"Mean = {sampling_distribution['data']['sample_mean'].mean().round(1)}") @@ -1078,8 +1078,8 @@ alt.vconcat( ).properties(title='Sampling distribution', height=150), alt.layer( boot_est_dist, - boot_est_dist.mark_rule(color='coral', size=2).encode(x='mean(sample_mean)', y=alt.Y()), - boot_est_dist.mark_text(color='coral', size=12, align='left', dx=18, fontWeight='bold').encode( + boot_est_dist.mark_rule(color='#f58518', size=2).encode(x='mean(sample_mean)', y=alt.Y()), + boot_est_dist.mark_text(color='#f58518', size=12, align='left', dx=18, fontWeight='bold').encode( x='mean(sample_mean)', y=alt.value(6), text=alt.value(f"Mean = {boot_est_dist['data']['sample_mean'].mean().round(1)}") @@ -1194,14 +1194,14 @@ visualize the interval on our distribution in {numref}`fig:11-bootstrapping9`. ```{code-cell} ipython3 alt.layer( boot_est_dist, - alt.Chart().mark_rule(color='coral', size=3, strokeDash=[5]).encode(x=alt.datum(ci_bounds[0.025])), - alt.Chart().mark_text(color='coral', size=12, fontWeight='bold').encode( + alt.Chart().mark_rule(color='#f58518', size=3, strokeDash=[5]).encode(x=alt.datum(ci_bounds[0.025])), + alt.Chart().mark_text(color='#f58518', size=12, fontWeight='bold').encode( x=alt.datum(ci_bounds[0.025]), y=alt.value(-10), text=alt.datum(f'2.5th percentile ({ci_bounds[0.025].round(2)})') ), - alt.Chart().mark_rule(color='coral', size=3, strokeDash=[5]).encode(x=alt.datum(ci_bounds[0.975])), - alt.Chart().mark_text(color='coral', size=12, fontWeight='bold').encode( + alt.Chart().mark_rule(color='#f58518', size=3, strokeDash=[5]).encode(x=alt.datum(ci_bounds[0.975])), + alt.Chart().mark_text(color='#f58518', size=12, fontWeight='bold').encode( x=alt.datum(ci_bounds[0.975]), y=alt.value(-10), text=alt.datum(f'97.5th percentile ({ci_bounds[0.975].round(2)})') From 3393808df20e38de106e0c9d6c71494573c52bc3 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 22:51:29 +0100 Subject: [PATCH 21/31] Improve plotting syntax and remove re-specification of seed --- source/inference.md | 30 ++++++++++++------------------ 1 file changed, 12 insertions(+), 18 deletions(-) diff --git a/source/inference.md b/source/inference.md index df9e361f..c131a30c 100644 --- a/source/inference.md +++ b/source/inference.md @@ -791,13 +791,10 @@ large enough sample. 
# plot sample distributions for n = 10, 20, 50, 100, 200 and population distribution sample_distribution_dict = {} -np.random.seed(12) for sample_n in [10, 20, 50, 100, 200]: sample = airbnb.sample(sample_n) sample_distribution_dict[f"sample_distribution_{sample_n}"] = ( - alt.Chart(sample, title=f"n = {sample_n}") - .mark_bar() - .encode( + alt.Chart(sample, title=f"n = {sample_n}").mark_bar().encode( x=alt.X( "price", bin=alt.Bin(extent=[0, 600], step=20), @@ -805,7 +802,7 @@ for sample_n in [10, 20, 50, 100, 200]: ), y=alt.Y("count()", title="Count"), ) - ).properties(width=350, height=150) + ).properties(height=150) # add title and standardize the x axis ticks for population histogram population_distribution.title = "Population distribution" population_distribution.encoding["x"]["bin"] = alt.Bin(extent=[0, 600], step=20) @@ -986,19 +983,16 @@ Let's take a look at histograms of the first six replicates of our bootstrap sam :tags: [] six_bootstrap_samples = boot20000.query("replicate < 6") - -( - alt.Chart(six_bootstrap_samples) - .mark_bar() - .encode( - x=alt.X( - "price", - bin=alt.Bin(maxbins=20), - title="Price per night (Canadian dollars)", - ), - y=alt.Y("count()", title="Count"), - ).properties(width=250, height=200) - .facet("replicate", columns=3) +alt.Chart(six_bootstrap_samples, height=150).mark_bar().encode( + x=alt.X( + "price", + bin=alt.Bin(maxbins=20), + title="Price per night (Canadian dollars)", + ), + y=alt.Y("count()", title="Count") +).facet( + "replicate", + columns=2 ) ``` From 1cb4cf59cddc55cedbfbdfea1f143f661a5be83d Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 22:51:44 +0100 Subject: [PATCH 22/31] Change to actual number of samples --- source/inference.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/inference.md b/source/inference.md index c131a30c..8b9654ad 100644 --- a/source/inference.md +++ b/source/inference.md @@ -973,7 +973,7 @@ our original population; but since we do not have access to the population itsel this is often the best we can do. 
```{code-cell} ipython3 -boot20000 = pd.concat([one_sample.sample(frac=1, replace=True).assign(replicate=n) for n in range(1_000)]) +boot20000 = pd.concat([one_sample.sample(frac=1, replace=True).assign(replicate=n) for n in range(20_000)]) boot20000 ``` From 5b4bbac5100d36228fdc985a58cd13d2c0a107a6 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Tue, 24 Jan 2023 22:52:02 +0100 Subject: [PATCH 23/31] Round to one decimal to reduce noise --- source/inference.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/source/inference.md b/source/inference.md index 8b9654ad..b11c0547 100644 --- a/source/inference.md +++ b/source/inference.md @@ -1177,8 +1177,8 @@ ci_bounds ```{code-cell} ipython3 :tags: [remove-cell] -glue("ci_lower", round(ci_bounds[0.025], 2)) -glue("ci_upper", round(ci_bounds[0.975], 2)) +glue("ci_lower", round(ci_bounds[0.025], 1)) +glue("ci_upper", round(ci_bounds[0.975], 1)) ``` Our interval, \${glue:}`ci_lower` to \${glue:}`ci_upper`, captures @@ -1192,14 +1192,15 @@ alt.layer( alt.Chart().mark_text(color='#f58518', size=12, fontWeight='bold').encode( x=alt.datum(ci_bounds[0.025]), y=alt.value(-10), - text=alt.datum(f'2.5th percentile ({ci_bounds[0.025].round(2)})') + text=alt.datum(f'2.5th percentile ({ci_bounds[0.025].round(1)})') ), alt.Chart().mark_rule(color='#f58518', size=3, strokeDash=[5]).encode(x=alt.datum(ci_bounds[0.975])), alt.Chart().mark_text(color='#f58518', size=12, fontWeight='bold').encode( x=alt.datum(ci_bounds[0.975]), y=alt.value(-10), - text=alt.datum(f'97.5th percentile ({ci_bounds[0.975].round(2)})') - ) + text=alt.datum(f'97.5th percentile ({ci_bounds[0.975].round(1)})') + ), + width=500 ) ``` From bb548d17f1f683af17acdc8008bbed2a47627745 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Wed, 25 Jan 2023 13:42:15 +0100 Subject: [PATCH 24/31] Round histogram limits more nicely --- source/inference.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/source/inference.md b/source/inference.md index b11c0547..ee08709f 100644 --- a/source/inference.md +++ b/source/inference.md @@ -328,9 +328,9 @@ Sampling distribution of the sample proportion for sample size 40. ```{code-cell} ipython3 :tags: [remove-cell] -glue("sample_proportion_center", round(sample_estimates["sample_proportion"].mean(), 1)) -glue("sample_proportion_min", round(sample_estimates["sample_proportion"].min(), 1)) -glue("sample_proportion_max", round(sample_estimates["sample_proportion"].max(), 1)) +glue("sample_proportion_center", round(sample_estimates["sample_proportion"].mean(), 2)) +glue("sample_proportion_min", round(sample_estimates["sample_proportion"].quantile(0.004), 2)) +glue("sample_proportion_max", round(sample_estimates["sample_proportion"].quantile(0.9997), 2)) ``` ```{index} sampling distribution; shape From f8de446a2473a7f8db2d7ab3703f9d055f61416f Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Wed, 25 Jan 2023 13:54:09 +0100 Subject: [PATCH 25/31] Reformat and improve the explanation of the bootstrapped mean calculation --- source/inference.md | 49 +++++++++++++++++++++++++++++++++------------ 1 file changed, 36 insertions(+), 13 deletions(-) diff --git a/source/inference.md b/source/inference.md index ee08709f..b1f750bc 100644 --- a/source/inference.md +++ b/source/inference.md @@ -971,9 +971,14 @@ and calculate the means for each of those replicates. 
Recall that this assumes that `one_sample` *looks like* our original population; but since we do not have access to the population itself, this is often the best we can do. +Note that here we break the list comprehension over multiple lines +so that it is easier to read. ```{code-cell} ipython3 -boot20000 = pd.concat([one_sample.sample(frac=1, replace=True).assign(replicate=n) for n in range(20_000)]) +boot20000 = pd.concat([ + one_sample.sample(frac=1, replace=True).assign(replicate=n) + for n in range(20_000) +]) boot20000 ``` @@ -1005,26 +1010,44 @@ Histograms of first six replicates of bootstrap samples. +++ -We see in {numref}`fig:11-bootstrapping-six-bootstrap-samples` how the -bootstrap samples differ. We can also calculate the sample mean for each of -these six replicates. +We see in {numref}`fig:11-bootstrapping-six-bootstrap-samples` how the distributions of the +bootstrap samples differ. If we calculate the sample mean for each of +these six samples, we can see that these are also different between samples. +To compute the mean for each sample, +we first group by the "replicate" which is the column containing the sample/replicate number. +Then we compute the mean of the `price` column and rename it to `sample_mean` +for it to be more descriptive. +Finally we use `reset_index` to get the `replicate` values back as a column in the dataframe. ```{code-cell} ipython3 -six_bootstrap_samples.groupby("replicate")["price"].mean().reset_index().rename( - columns={"price": "mean"} +( + six_bootstrap_samples + .groupby("replicate") + ["price"] + .mean() + .rename(columns={"price": "sample_mean"}) + .reset_index() ) ``` -We can see that the bootstrap sample distributions and the sample means are -different. They are different because we are sampling *with replacement*. We -will now calculate point estimates for our 20,000 bootstrap samples and -generate a bootstrap distribution of our point estimates. The bootstrap +The distributions and the means differ between the bootstrapped samples +because we are sampling *with replacement*. +If we instead would have sampled *without replacement*, +we would end up with the exact same values in the sample each time. + +We will now calculate point estimates of the mean for our 20,000 bootstrap samples and +generate a bootstrap distribution of these point estimates. The bootstrap distribution ({numref}`fig:11-bootstrapping5`) suggests how we might expect -our point estimate to behave if we took another sample. +our point estimate to behave if we take multiple samples. ```{code-cell} ipython3 -boot20000_means = boot20000.groupby("replicate")["price"].mean().reset_index().rename( - columns={"price": "sample_mean"} +boot20000_means = ( + boot20000 + .groupby("replicate") + ["price"] + .mean() + .rename(columns={"price": "sample_mean"}) + .reset_index() ) boot20000_means From 5cbc443e6c21aa2f03b3cee718210d7a26948af1 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Wed, 25 Jan 2023 13:58:28 +0100 Subject: [PATCH 26/31] Add description about quantiles --- source/inference.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/source/inference.md b/source/inference.md index b1f750bc..75edcebc 100644 --- a/source/inference.md +++ b/source/inference.md @@ -1187,7 +1187,10 @@ To calculate a 95\% percentile bootstrap confidence interval, we will do the fol 2. Find the value such that 2.5\% of observations fall below it (the 2.5\% percentile). Use that value as the lower bound of the interval. 3. 
Find the value such that 97.5\% of observations fall below it (the 97.5\% percentile). Use that value as the upper bound of the interval. -To do this in Python, we can use the `percentile()` function from the `numpy` package: +To do this in Python, we can use the `quantile` function of our DataFrame. +Quantiles are expressed in proportions rather than percentages, +so the 2.5th and 97.5th percentiles +would be quantiles 0.025 and 0.975, respectively. ```{index} numpy; percentile, pandas.DataFrame; df[] ``` From be9f87208bf3818b3478a1a80c86a9ade289e800 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Wed, 25 Jan 2023 13:58:37 +0100 Subject: [PATCH 27/31] Link to python version of worksheets --- source/inference.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/inference.md b/source/inference.md index 75edcebc..96258d15 100644 --- a/source/inference.md +++ b/source/inference.md @@ -1265,7 +1265,7 @@ statistical techniques you may learn about in the future! Practice exercises for the material covered in this chapter can be found in the accompanying -[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme) +[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-python-worksheets) in the two "Statistical inference" rows. You can launch an interactive version of each worksheet in your browser by clicking the "launch binder" button. You can also preview a non-interactive version of each worksheet by clicking "view worksheet." From c216738ccc35e471eb4aedfadca6bd727839ac62 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Thu, 26 Jan 2023 17:08:04 +0100 Subject: [PATCH 28/31] Explain the name param in reset index --- source/inference.md | 42 +++++++++++++++++++++++++++++++++++++++++- 1 file changed, 41 insertions(+), 1 deletion(-) diff --git a/source/inference.md b/source/inference.md index 96258d15..3e9da09d 100644 --- a/source/inference.md +++ b/source/inference.md @@ -282,6 +282,45 @@ to compute the number of qualified observations in each sample; finally compute Both the first and last few entries of the resulting data frame are printed below to show that we end up with 20,000 point estimates, one for each of the 20,000 samples. +```{code-cell} ipython3 +( + samples + .groupby('replicate') + ['room_type'] + .value_counts(normalize=True) +) +``` + +The returned object is a series, +and as we have previously learned +we can use `reset_index` to change it to a data frame. +However, +there is one caveat here: +when we use the `value_counts` function +on a grouped series and try to `reset_index` +we will end up with two columns with the same name +and therefor get an error +(in this case, `room_type` will occur twice). +Fortunately, +there is a simple solution: +when we call `reset_index`, +we can specify the name of the new column +with the `name` parameter: + +```{code-cell} ipython3 +( + samples + .groupby('replicate') + ['room_type'] + .value_counts(normalize=True) + .reset_index(name='sample_proportion') +) +``` + +Below we put everything together +and also filter the data frame to keep only the room types +that we are interested in. 
+ ```{code-cell} ipython3 sample_estimates = ( samples @@ -289,8 +328,9 @@ sample_estimates = ( ['room_type'] .value_counts(normalize=True) .reset_index(name='sample_proportion') - .query('room_type=="Entire home/apt"') ) + +sample_estimates = sample_estimates[sample_estimates['room_type'] == 'Entire home/apt'] sample_estimates ``` From bbcfeb318531edebe238800abc37e89db542da4a Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Thu, 26 Jan 2023 17:35:07 +0100 Subject: [PATCH 29/31] Fix typo --- source/inference.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/inference.md b/source/inference.md index 3e9da09d..10cda95f 100644 --- a/source/inference.md +++ b/source/inference.md @@ -299,7 +299,7 @@ there is one caveat here: when we use the `value_counts` function on a grouped series and try to `reset_index` we will end up with two columns with the same name -and therefor get an error +and therefore get an error (in this case, `room_type` will occur twice). Fortunately, there is a simple solution: From 8c3b8b6c44b7897352430463da042c703e537173 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Thu, 26 Jan 2023 17:41:55 +0100 Subject: [PATCH 30/31] Clarify concatenation --- source/inference.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/source/inference.md b/source/inference.md index 10cda95f..fbe83ae3 100644 --- a/source/inference.md +++ b/source/inference.md @@ -261,6 +261,8 @@ In this case we are taking 20,000 samples of size 40 and to make it clear which rows in the data frame come which of the 20,000 samples, we also add a column called `replicate` with this information. +The call to `concat` concatenates all the 20,000 data frames +returned from the list comprehension into a single big data frame. ```{code-cell} ipython3 samples = pd.concat([airbnb.sample(40).assign(replicate=n) for n in range(20_000)]) From 325881a76f76c8282066979f62a527559154e899 Mon Sep 17 00:00:00 2001 From: Joel Ostblom Date: Fri, 27 Jan 2023 21:35:43 +0100 Subject: [PATCH 31/31] Move rename last so we can use the same rename syntax students are used to --- source/inference.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/source/inference.md b/source/inference.md index fbe83ae3..6c01b6d8 100644 --- a/source/inference.md +++ b/source/inference.md @@ -1067,8 +1067,8 @@ Finally we use `reset_index` to get the `replicate` values back as a column in t .groupby("replicate") ["price"] .mean() - .rename(columns={"price": "sample_mean"}) .reset_index() + .rename(columns={"price": "sample_mean"}) ) ``` @@ -1088,8 +1088,8 @@ boot20000_means = ( .groupby("replicate") ["price"] .mean() - .rename(columns={"price": "sample_mean"}) .reset_index() + .rename(columns={"price": "sample_mean"}) ) boot20000_means
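A footnote on the last patch: the reason `rename` is applied after `reset_index` is that the grouped mean is a Series, and in current pandas `Series.rename` does not accept a `columns=` argument (it renames index labels or sets the series name). Once `reset_index` turns the result back into a data frame, the familiar `rename(columns=...)` syntax that students have seen before applies. A minimal sketch with made-up numbers:

```python
import pandas as pd

toy = pd.DataFrame(
    {"replicate": [0, 0, 1, 1], "price": [100.0, 120.0, 90.0, 110.0]}
)

grouped_mean = toy.groupby("replicate")["price"].mean()  # a Series

# Renaming happens after reset_index, when the result is a DataFrame
# again and rename(columns=...) is available.
result = (
    grouped_mean
    .reset_index()
    .rename(columns={"price": "sample_mean"})
)
print(result)
```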