diff --git a/_freeze/problem-set-keys/ps-key-12/execute-results/html.json b/_freeze/problem-set-keys/ps-key-12/execute-results/html.json index 2f04edc4..418467de 100644 --- a/_freeze/problem-set-keys/ps-key-12/execute-results/html.json +++ b/_freeze/problem-set-keys/ps-key-12/execute-results/html.json @@ -1,10 +1,9 @@ { - "hash": "010c77c827df10b8f9e339499a5625e0", + "hash": "4b7363e82864ae1ce1f3913b05f66223", "result": { - "markdown": "---\ntitle: \"Problem Set Stats Bootcamp - class 12\"\nsubtitle: \"Hypothesis Testing\"\nauthor: \"Neelanjan Mukherjee\"\neditor: visual\n---\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\n-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --\nv dplyr 1.1.3 v readr 2.1.4\nv forcats 1.0.0 v stringr 1.5.0\nv ggplot2 3.4.3 v tibble 3.2.1\nv lubridate 1.9.2 v tidyr 1.3.0\nv purrr 1.0.2 \n-- Conflicts ------------------------------------------ tidyverse_conflicts() --\nx dplyr::filter() masks stats::filter()\nx dplyr::lag() masks stats::lag()\ni Use the conflicted package () to force all conflicts to become errors\n\nAttaching package: 'rstatix'\n\n\nThe following object is masked from 'package:stats':\n\n filter\n\n\n\nAttaching package: 'janitor'\n\n\nThe following object is masked from 'package:rstatix':\n\n make_clean_names\n\n\nThe following objects are masked from 'package:stats':\n\n chisq.test, fisher.test\n\n\nhere() starts at /Users/jayhesselberth/devel/rnabioco/molb-7950\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nbiochem <- read_tsv(\"http://mtweb.cs.ucl.ac.uk/HSMICE/PHENOTYPES/Biochemistry.txt\", show_col_types = FALSE) |>\n janitor::clean_names()\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n```{.r .cell-code}\n# simplify names a bit more\ncolnames(biochem) <- gsub(pattern = \"biochem_\", replacement = \"\", colnames(biochem))\n\n# we are going to simplify this a bit and only keep some columns\nkeep <- colnames(biochem)[c(1, 6, 9, 14, 15, 24:28)]\nbiochem <- biochem[, keep]\n\n# get weights for each individual mouse\n# careful: did not come with column names\nweight <- read_tsv(\"http://mtweb.cs.ucl.ac.uk/HSMICE/PHENOTYPES/weight\", col_names = F, show_col_types = FALSE)\n\n# add column names\ncolnames(weight) <- c(\"subject_name\", \"weight\")\n\n# add weight to biochem table and get rid of NAs\n# rename gender to sex\nb <- inner_join(biochem, weight, by = \"subject_name\") |>\n na.omit() |>\n rename(sex = gender)\n```\n:::\n\n\n## Problem \\# 1\n\nDoes mouse sex explain mouse total cholesterol levels? Make sure to run chunks above.\n\n### 1. Examine and specify the variable(s)\n\n> The response variable y is $tot\\_cholesterol$\\\n> The explantory variable x is $sex$\n\n### Make a plot:\n\nresponse variable on the y-axis\n\nexplanatory variable on the x-axis\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggviolin(\n data = b,\n y = \"tot_cholesterol\",\n x = \"sex\",\n fill = \"sex\",\n add = \"mean_sd\"\n)\n```\n\n::: {.cell-output-display}\n![](ps-key-12_files/figure-html/unnamed-chunk-3-1.png){width=672}\n:::\n:::\n\n\n### Get n, mean, median, sd\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb |>\n group_by(sex) |>\n get_summary_stats(tot_cholesterol,\n type = \"common\",\n show = c(\"mean\", \"median\", \"sd\")\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 x 6\n sex variable n mean median sd\n \n1 F tot_cholesterol 891 2.80 2.75 0.566\n2 M tot_cholesterol 891 3.34 3.3 0.587\n```\n\n\n:::\n:::\n\n\n### Is it normally distribute?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggqqplot(\n data = b,\n x = \"tot_cholesterol\",\n color = \"sex\"\n)\n```\n\n::: {.cell-output-display}\n![](ps-key-12_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n\n```{.r .cell-code}\nb |>\n group_by(sex) |>\n shapiro_test(tot_cholesterol) |>\n gt()\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n\n\n\n \n\n\n\n \n \n \n
sexvariablestatisticp
Ftot_cholesterol0.99132844.268190e-05
Mtot_cholesterol0.98689033.759361e-07
\n
\n```\n\n:::\n:::\n\n\n> Yes, based on the qq-plot and the high $n$, but i do understand if you want to play it safe due to the shapiro_test p-value.\n\n### Is it variance similar between groups?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb |>\n levene_test(tot_cholesterol ~ sex) |>\n gt()\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in leveneTest.default(y = y, group = group, ...): group coerced to\nfactor.\n```\n\n\n:::\n\n::: {.cell-output-display}\n\n```{=html}\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n\n\n\n \n \n \n
df1df2statisticp
117800.64608770.4216222
\n
\n```\n\n:::\n:::\n\n\n> Yes\n\n### What kind of test are you picking and why?\n\n> t_test since i think it is normally distribute, with equal variance based on levene test\n\n### 2. Declare null hypothesis $\\mathcal{H}_0$\n\n$\\mathcal{H}_0$ is that $sex$ does not explain $tot\\_cholesterol$\n\n### 3. Calculate test-statistic, exact p-value and plot\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb |>\n t_test(tot_cholesterol ~ sex,\n var.equal = T,\n ref.group = \"F\"\n ) |>\n gt()\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n\n\n\n\n\n\n \n \n \n
.y.group1group2n1n2statisticdfp
tot_cholesterolFM891891-20.0193317801.43e-80
\n
\n```\n\n:::\n\n```{.r .cell-code}\nstatres <- b |>\n t_test(tot_cholesterol ~ sex,\n var.equal = T,\n ref.group = \"F\"\n )\n\n\nggviolin(\n data = b,\n y = \"tot_cholesterol\",\n x = \"sex\",\n fill = \"sex\",\n add = \"mean_sd\"\n) +\n stat_pvalue_manual(\n statres,\n label = \"p\",\n y.position = 5.8\n ) +\n ylim(1, 6)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: Removed 1 rows containing missing values (`geom_violin()`).\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](ps-key-12_files/figure-html/unnamed-chunk-7-1.png){width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# i have pre-selected some families to compare\nmyfams <- c(\n \"B1.5:E1.4(4) B1.5:A1.4(5)\",\n \"F1.3:A1.2(3) F1.3:E2.2(3)\",\n \"A1.3:D1.2(3) A1.3:H1.2(3)\",\n \"D5.4:G2.3(4) D5.4:C4.3(4)\"\n)\n\n# only keep the familys in myfams\nbfam <- b |>\n filter(family %in% myfams) |>\n droplevels()\n\n# simplify family names and make factor\nbfam$family <- gsub(pattern = \"\\\\..*\", replacement = \"\", x = bfam$family) |>\n as.factor()\n\n\n# make B1 the reference (most similar to overall mean)\nbfam$family <- relevel(x = bfam$family, ref = \"B1\")\n```\n:::\n\n\n## Problem \\# 2\n\nDoes mouse family explain mouse total cholesterol levels? Make sure to run chunk above.\n\n### 1. Examine and specify the variable(s)\n\n> The response variable y is $tot\\_cholesterol$\\\n> The explantory variable x is $family$\n\n### Make a plot:\n\nresponse variable on the y-axis\n\nexplanatory variable on the x-axis\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggviolin(\n data = bfam,\n y = \"tot_cholesterol\",\n x = \"family\",\n fill = \"family\",\n add = \"mean_sd\"\n)\n```\n\n::: {.cell-output-display}\n![](ps-key-12_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n\n### Get n, mean, median, sd\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbfam |>\n group_by(family) |>\n get_summary_stats(tot_cholesterol,\n type = \"common\",\n show = c(\"mean\", \"median\", \"sd\")\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 4 x 6\n family variable n mean median sd\n \n1 B1 tot_cholesterol 11 2.68 2.6 0.555\n2 A1 tot_cholesterol 12 3.51 3.41 0.722\n3 D5 tot_cholesterol 20 3.58 3.58 0.617\n4 F1 tot_cholesterol 11 3.28 3.43 0.579\n```\n\n\n:::\n:::\n\n\n### Is it normally distribute?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggqqplot(\n data = bfam,\n x = \"tot_cholesterol\",\n color = \"family\"\n)\n```\n\n::: {.cell-output-display}\n![](ps-key-12_files/figure-html/unnamed-chunk-11-1.png){width=672}\n:::\n\n```{.r .cell-code}\nbfam |>\n group_by(family) |>\n shapiro_test(tot_cholesterol) |>\n gt()\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n\n\n\n \n\n\n\n \n\n\n\n \n\n\n\n \n \n \n
familyvariablestatisticp
B1tot_cholesterol0.88640500.1253641
A1tot_cholesterol0.93047220.3851437
D5tot_cholesterol0.94702420.3241655
F1tot_cholesterol0.91731140.2968795
\n
\n```\n\n:::\n:::\n\n\n> Yes\n\n### Is it variance similar between groups?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb |>\n levene_test(tot_cholesterol ~ family) |>\n gt()\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in leveneTest.default(y = y, group = group, ...): group coerced to\nfactor.\n```\n\n\n:::\n\n::: {.cell-output-display}\n\n```{=html}\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n\n\n\n \n \n \n
df1df2statisticp
17016111.1390810.1165763
\n
\n```\n\n:::\n:::\n\n\n> Yes\n\n### What kind of test are you picking and why?\n\n> anova_test since i think it is normally distribute and has equal variance\n\n### 2. Declare null hypothesis $\\mathcal{H}_0$\n\n$\\mathcal{H}_0$ is that $family$ does not explain $tot\\_cholesterol$\n\n### 3. Calculate test-statistic, exact p-value and plot\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbfam |>\n anova_test(tot_cholesterol ~ family) |>\n gt()\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n\n\n\n\n\n \n \n \n
EffectDFnDFdFpp<.05ges
family3505.4010.003*0.245
\n
\n```\n\n:::\n\n```{.r .cell-code}\nggviolin(\n data = bfam,\n y = \"tot_cholesterol\",\n x = \"family\",\n fill = \"family\",\n add = \"mean_sd\"\n) +\n stat_anova_test()\n```\n\n::: {.cell-output-display}\n![](ps-key-12_files/figure-html/unnamed-chunk-13-1.png){width=672}\n:::\n:::\n", - "supporting": [ - "ps-key-12_files" - ], + "engine": "knitr", + "markdown": "---\ntitle: \"Problem Set Stats Bootcamp - class 12\"\nsubtitle: \"Hypothesis Testing\"\nauthor: \"Neelanjan Mukherjee\"\neditor: visual\n---\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.1 ✔ tibble 3.2.1\n✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package () to force all conflicts to become errors\n\nAttaching package: 'rstatix'\n\n\nThe following object is masked from 'package:stats':\n\n filter\n\n\n\nAttaching package: 'janitor'\n\n\nThe following object is masked from 'package:rstatix':\n\n make_clean_names\n\n\nThe following objects are masked from 'package:stats':\n\n chisq.test, fisher.test\n\n\nhere() starts at /Users/mtaliaferro/Documents/GitHub/molb-7950\n```\n\n\n:::\n:::\n\n\n\n## Problem \\# 1\n\nDoes mouse sex explain mouse total cholesterol levels? Make sure to run chunks above.\n\n### 1. Examine and specify the variable(s) (1 pt)\n\n> The response variable y is $tot\\_cholesterol$\\\n> The explantory variable x is $sex$\n\n### Make a violin plot: (2 pt)\n\nresponse variable on the y-axis\n\nexplanatory variable on the x-axis\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggviolin(\n data = b,\n y = \"tot_cholesterol\",\n x = \"sex\",\n fill = \"sex\",\n add = \"mean_sd\"\n)\n```\n:::\n\n\n\n### Get n, mean, median, sd (1 pt)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb |>\n group_by(sex) |>\n get_summary_stats(tot_cholesterol,\n type = \"common\",\n show = c(\"mean\", \"median\", \"sd\")\n )\n```\n:::\n\n\n\n### Is it normally distribute? (1 pt)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggqqplot(\n data = b,\n x = \"tot_cholesterol\",\n color = \"sex\"\n)\n\n\nb |>\n group_by(sex) |>\n shapiro_test(tot_cholesterol) |>\n gt()\n```\n:::\n\n\n\n> Yes, based on the qq-plot and the high $n$, but i do understand if you want to play it safe due to the shapiro_test p-value.\n\n### Is it variance similar between groups? (1 pt)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb |>\n levene_test(tot_cholesterol ~ sex) |>\n gt()\n```\n:::\n\n\n\n> Yes\n\n### What kind of test are you picking and why? (1 pt)\n\n> t_test since i think it is normally distribute, with equal variance based on levene test\n\n### 2. Declare null hypothesis $\\mathcal{H}_0$ (1 pt)\n\n$\\mathcal{H}_0$ is that $sex$ does not explain $tot\\_cholesterol$\n\n### 3. Calculate test-statistic, exact p-value and plot (2 pt)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb |>\n t_test(tot_cholesterol ~ sex,\n var.equal = T,\n ref.group = \"F\"\n ) |>\n gt()\n\nstatres <- b |>\n t_test(tot_cholesterol ~ sex,\n var.equal = T,\n ref.group = \"F\"\n )\n\n\nggviolin(\n data = b,\n y = \"tot_cholesterol\",\n x = \"sex\",\n fill = \"sex\",\n add = \"mean_sd\"\n) +\n stat_pvalue_manual(\n statres,\n label = \"p\",\n y.position = 5.8\n ) +\n ylim(1, 6)\n```\n:::\n\n\n\n> can reject null hypothesis\n\n## Problem \\# 2\n\nDoes mouse family explain mouse total cholesterol levels? Make sure to run chunk above.\n\n### 1. Examine and specify the variable(s) (1 pt)\n\n> The response variable y is $tot\\_cholesterol$\\\n> The explantory variable x is $family$\n\n### Make a plot: (2 pt)\n\nresponse variable on the y-axis\n\nexplanatory variable on the x-axis\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggviolin(\n data = bfam,\n y = \"tot_cholesterol\",\n x = \"family\",\n fill = \"family\",\n add = \"mean_sd\"\n)\n```\n:::\n\n\n\n### Get n, mean, median, sd (1 pt)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbfam |>\n group_by(family) |>\n get_summary_stats(tot_cholesterol,\n type = \"common\",\n show = c(\"mean\", \"median\", \"sd\")\n )\n```\n:::\n\n\n\n### Is it normally distribute? (1 pt)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggqqplot(\n data = bfam,\n x = \"tot_cholesterol\",\n color = \"family\"\n)\n\n\nbfam |>\n group_by(family) |>\n shapiro_test(tot_cholesterol) |>\n gt()\n```\n:::\n\n\n\n> yes\n\n### Is it variance similar between groups? (1 pt)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb |>\n levene_test(tot_cholesterol ~ family) |>\n gt()\n```\n:::\n\n\n\n> yes\n\n### What kind of test are you picking and why? (1 pt)\n\n> anova_test since i think it is normally distribute and has equal variance\n\n### 2. Declare null hypothesis $\\mathcal{H}_0$\n\n\\mathcal{H}\\_0\\$ is that $family$ does not explain $tot\\_cholesterol$ (1 pt)\n\n### 3. Calculate test-statistic, exact p-value and plot (2 pt)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbfam |>\n anova_test(tot_cholesterol ~ family) |>\n gt()\n\n\nggviolin(\n data = bfam,\n y = \"tot_cholesterol\",\n x = \"family\",\n fill = \"family\",\n add = \"mean_sd\"\n) +\n stat_anova_test()\n```\n:::\n\n\n\n> My interpretation of the result\n", + "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/problem-set-keys/ps-key-13/execute-results/html.json b/_freeze/problem-set-keys/ps-key-13/execute-results/html.json index 48820b04..f38a63d1 100644 --- a/_freeze/problem-set-keys/ps-key-13/execute-results/html.json +++ b/_freeze/problem-set-keys/ps-key-13/execute-results/html.json @@ -1,7 +1,8 @@ { - "hash": "8fda253a8f9461778affa81a4de6952a", + "hash": "c03c8d4f7168ba772a9d1a0e552c15b4", "result": { - "markdown": "@@ -1,204 +0,0 @@\n\n---\ntitle: \"Problem Set Stats Bootcamp - class 12\"\nsubtitle: \"Hypothesis Testing\"\nauthor: \"Neelanjan Mukherjee\"\neditor: visual\n---\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.3 ✔ readr 2.1.4\n✔ forcats 1.0.0 ✔ stringr 1.5.0\n✔ ggplot2 3.4.3 ✔ tibble 3.2.1\n✔ lubridate 1.9.2 ✔ tidyr 1.3.0\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package () to force all conflicts to become errors\n\nAttaching package: 'rstatix'\n\n\nThe following object is masked from 'package:stats':\n\n filter\n\n\n\nAttaching package: 'janitor'\n\n\nThe following object is masked from 'package:rstatix':\n\n make_clean_names\n\n\nThe following objects are masked from 'package:stats':\n\n chisq.test, fisher.test\n\n\nhere() starts at /Users/jayhesselberth/devel/rnabioco/molb-7950\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nbiochem <- read_tsv(\"http://mtweb.cs.ucl.ac.uk/HSMICE/PHENOTYPES/Biochemistry.txt\", show_col_types = FALSE) |>\n janitor::clean_names()\n\n# simplify names a bit more\ncolnames(biochem) <- gsub(pattern = \"biochem_\", replacement = \"\", colnames(biochem))\n\n# we are going to simplify this a bit and only keep some columns\nkeep <- colnames(biochem)[c(1, 6, 9, 14, 15, 24:28)]\nbiochem <- biochem[, keep]\n\n# get weights for each individual mouse\n# careful: did not come with column names\nweight <- read_tsv(\"http://mtweb.cs.ucl.ac.uk/HSMICE/PHENOTYPES/weight\", col_names = F, show_col_types = FALSE)\n\n# add column names\ncolnames(weight) <- c(\"subject_name\", \"weight\")\n\n# add weight to biochem table and get rid of NAs\n# rename gender to sex\nb <- inner_join(biochem, weight, by = \"subject_name\") |>\n na.omit() |>\n rename(sex = gender)\n```\n:::\n\n\n## Problem \\# 1\n\nIs there an association between mouse calcium and sodium levels?\n\n### 1. Make a scatterplot to inspect variable\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggscatter(\n data = b,\n y = \"calcium\",\n x = \"sodium\"\n)\n```\n\n::: {.cell-output-display}\n![](ps-key-13_files/figure-html/unnamed-chunk-3-1.png){width=672}\n:::\n:::\n\n\n### 2. Are they normal (enough)?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggqqplot(data = b, x = \"calcium\")\n```\n\n::: {.cell-output-display}\n![](ps-key-13_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n\n```{.r .cell-code}\nb |> shapiro_test(calcium)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 3\n variable statistic p\n \n1 calcium 0.995 0.0000108\n```\n\n\n:::\n\n```{.r .cell-code}\nggqqplot(data = b, x = \"sodium\")\n```\n\n::: {.cell-output-display}\n![](ps-key-13_files/figure-html/unnamed-chunk-4-2.png){width=672}\n:::\n\n```{.r .cell-code}\nb |> shapiro_test(sodium)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 3\n variable statistic p\n \n1 sodium 0.995 0.00000432\n```\n\n\n:::\n:::\n\n\nWhich test will you use and why?\n\n> I will use the Pearson since the qqplots look ok and there are \\~1700 observations. Spearman is fine too - especially since sodium looks a little like an integer and not continuous.\n\n### 3. Declare null hypothesis $\\mathcal{H}_0$\n\n$\\mathcal{H}_0$ is that there is no dependency/association between $calcium$ and $sodium$\n\n### 4. Calculate and plot the correlation on a scatterplot\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggscatter(\n data = b,\n y = \"calcium\",\n x = \"sodium\"\n) +\n stat_cor(\n method = \"pearson\",\n label.x = 110,\n label.y = 2.4\n )\n```\n\n::: {.cell-output-display}\n![](ps-key-13_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n\n\n\n\n## Problem \\# 2\n\n## Do mouse calcium levels explain mouse sodium levels? If so, to what extent? \n\nUse a linear model to do the following: \n\n### 1. Specify the Response and Explanatory variables --- (2 pts) \n\n> The response variable y is sodium\n> The explantory variable x is calcium\n\n\n### 2. Declare the null hypothesis --- (1 pts)\n\n> The null hypothesis is calcium levels do not explain/predict sodium levels.\n\n### 3. Use the `lm` function to create a fit (linear model) \n\n\nalso save the slope and intercept for later\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfit <- lm(formula = sodium ~ 1 + calcium, data = b)\n\n\nint <- tidy(fit)[1, 2]\nslope <- tidy(fit)[2, 2]\n```\n:::\n\n\n### 4. Add residuals to the data and create a plot visualizing the residuals\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb_fit <- augment(fit, data = b)\n\navg_sod <- mean(b$sodium)\n\nggplot(\n data = b_fit,\n aes(x = calcium, y = sodium)\n) +\n geom_point(size = 1, aes(color = .resid)) +\n geom_abline(\n intercept = pull(int),\n slope = pull(slope),\n col = \"red\"\n ) +\n scale_color_gradient2(\n low = \"blue\",\n mid = \"black\",\n high = \"yellow\"\n ) +\n geom_segment(\n aes(\n xend = calcium,\n yend = .fitted\n ),\n alpha = .1\n ) + # plot line representing residuals\n theme_linedraw()\n```\n\n::: {.cell-output-display}\n![](ps-key-13_files/figure-html/augment and viz-1.png){width=672}\n:::\n:::\n\n\n### 5. Calculate the $R^2$ and compare to $R^2$ from fit\n\n$R^2 = 1 - \\displaystyle \\frac {SS_{fit}}{SS_{null}}$\n\n$SS_{fit} = \\sum_{i=1}^{n} (data - line)^2 = \\sum_{i=1}^{n} (y_{i} - (\\beta_0 \\cdot 1+ \\beta_1 \\cdot x)^2$\n\n$SS_{null}$ — sum of squared errors around the mean of $y$\n\n$SS_{null} = \\sum_{i=1}^{n} (data - mean)^2 = \\sum_{i=1}^{n} (y_{i} - \\overline{y})^2$\n\n\n \n\n\n::: {.cell}\n\n```{.r .cell-code}\nss_fit <- sum(b_fit$.resid^2)\nss_null <- sum(\n (b_fit$sodium - avg_sod)^2\n)\nrsq <- 1 - ss_fit / ss_null\nglance(fit) |> select(r.squared)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 1\n r.squared\n \n1 0.684\n```\n\n\n:::\n:::\n\n\n\n### 6. Using $R^2$, describe the extent to which calcium explains sodium levels\n\n> $calcium$ explains ~68% of variation in $sodium$ levels \n\n### 7. Report (do not calculate) the $p-value$ and your decision on the null\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfit |>\n glance() |>\n select(p.value)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 1\n p.value\n \n1 0\n```\n\n\n:::\n:::\n\n\n> The null hypothesis is calcium levels do not explain/predict sodium levels IS NOT SUPPORTED\n\nCalcium levels to predict sodium levels.\n\n## Problem \\# 3\n\nWhat is the association between mouse weight and age levels for different sexes?\n\n\n### 1. Calculate the pearson correlation coefficient between weight and age for females and males\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb |>\n group_by(factor(sex)) |>\n cor_test(weight, age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 9\n `factor(sex)` var1 var2 cor statistic p conf.low conf.high method \n \n1 F weight age 0.21 6.30 4.67e-10 0.143 0.269 Pearson\n2 M weight age 0.37 12.0 1.22e-30 0.314 0.427 Pearson\n```\n\n\n:::\n:::\n\n\n### 2. Describe your observations\n\n> The relationship between weight and age is stronger for males than it is for females.\n", + "engine": "knitr", + "markdown": "---\ntitle: \"Problem Set Stats Bootcamp - class 13\"\nsubtitle: \"Hypothesis Testing\"\nauthor: \"Neelanjan Mukherjee\"\neditor: visual\n---\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.1 ✔ tibble 3.2.1\n✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package () to force all conflicts to become errors\n\nAttaching package: 'rstatix'\n\n\nThe following object is masked from 'package:stats':\n\n filter\n\n\n\nAttaching package: 'janitor'\n\n\nThe following object is masked from 'package:rstatix':\n\n make_clean_names\n\n\nThe following objects are masked from 'package:stats':\n\n chisq.test, fisher.test\n\n\nhere() starts at /Users/mtaliaferro/Documents/GitHub/molb-7950\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nbiochem <- read_tsv(\"http://mtweb.cs.ucl.ac.uk/HSMICE/PHENOTYPES/Biochemistry.txt\", show_col_types = FALSE) |>\n janitor::clean_names()\n\n# simplify names a bit more\ncolnames(biochem) <- gsub(pattern = \"biochem_\", replacement = \"\", colnames(biochem))\n\n# we are going to simplify this a bit and only keep some columns\nkeep <- colnames(biochem)[c(1, 6, 9, 14, 15, 24:28)]\nbiochem <- biochem[, keep]\n\n# get weights for each individual mouse\n# careful: did not come with column names\nweight <- read_tsv(\"http://mtweb.cs.ucl.ac.uk/HSMICE/PHENOTYPES/weight\", col_names = F, show_col_types = FALSE)\n\n# add column names\ncolnames(weight) <- c(\"subject_name\", \"weight\")\n\n# add weight to biochem table and get rid of NAs\n# rename gender to sex\nb <- inner_join(biochem, weight, by = \"subject_name\") |>\n na.omit() |>\n rename(sex = gender)\n```\n:::\n\n\n\n## Problem \\# 1\n\nIs there an association between mouse calcium and sodium levels?\n\n### 1. Make a scatterplot to inspect variable --- (2 pts)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggscatter(\n data = b,\n y = \"calcium\",\n x = \"sodium\"\n)\n```\n\n::: {.cell-output-display}\n![](ps-key-13_files/figure-html/unnamed-chunk-3-1.png){width=672}\n:::\n:::\n\n\n\n### 2. Are they normal (enough)? --- (1 pts)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggqqplot(data = b, x = \"calcium\")\n```\n\n::: {.cell-output-display}\n![](ps-key-13_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n\n```{.r .cell-code}\nb |> shapiro_test(calcium)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 3\n variable statistic p\n \n1 calcium 0.995 0.0000108\n```\n\n\n:::\n\n```{.r .cell-code}\nggqqplot(data = b, x = \"sodium\")\n```\n\n::: {.cell-output-display}\n![](ps-key-13_files/figure-html/unnamed-chunk-4-2.png){width=672}\n:::\n\n```{.r .cell-code}\nb |> shapiro_test(sodium)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 3\n variable statistic p\n \n1 sodium 0.995 0.00000432\n```\n\n\n:::\n:::\n\n\n\nWhich test will you use and why?\n\n> I will use the Pearson since the qqplots look ok and there are \\~1700 observations. Spearman is fine too - especially since sodium looks a little like an integer and not continuous.\n\n### 3. Declare null hypothesis $\\mathcal{H}_0$ --- (1 pts)\n\n$\\mathcal{H}_0$ is that there is no dependency/association between $calcium$ and $sodium$\n\n### 4. Calculate and plot the correlation on a scatterplot --- (2 pts)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggscatter(\n data = b,\n y = \"calcium\",\n x = \"sodium\"\n) +\n stat_cor(\n method = \"pearson\",\n label.x = 110,\n label.y = 2.4\n )\n```\n\n::: {.cell-output-display}\n![](ps-key-13_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n\n\n## Problem \\# 2\n\n## Do mouse calcium levels explain mouse sodium levels? If so, to what extent?\n\nUse a linear model to do the following:\n\n### 1. Specify the Response and Explanatory variables --- (2 pts)\n\n> The response variable y is sodium The explantory variable x is calcium\n\n### 2. Declare the null hypothesis --- (1 pts)\n\n> The null hypothesis is calcium levels do not explain/predict sodium levels.\n\n### 3. Use the `lm` function to create a fit (linear model) --- (1 pts)\n\nalso save the slope and intercept for later\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfit <- lm(formula = sodium ~ 1 + calcium, data = b)\n\n\nint <- tidy(fit)[1, 2]\nslope <- tidy(fit)[2, 2]\n```\n:::\n\n\n\n### 4. Add residuals to the data and create a plot visualizing the residuals --- (1 pts)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb_fit <- augment(fit, data = b)\n\navg_sod <- mean(b$sodium)\n\nggplot(\n data = b_fit,\n aes(x = calcium, y = sodium)\n) +\n geom_point(size = 1, aes(color = .resid)) +\n geom_abline(\n intercept = pull(int),\n slope = pull(slope),\n col = \"red\"\n ) +\n scale_color_gradient2(\n low = \"blue\",\n mid = \"black\",\n high = \"yellow\"\n ) +\n geom_segment(\n aes(\n xend = calcium,\n yend = .fitted\n ),\n alpha = .1\n ) + # plot line representing residuals\n theme_linedraw()\n```\n\n::: {.cell-output-display}\n![](ps-key-13_files/figure-html/augment and viz-1.png){width=672}\n:::\n:::\n\n\n\n### 5. Calculate the $R^2$ and compare to $R^2$ from fit --- (2 pts)\n\n$R^2 = 1 - \\displaystyle \\frac {SS_{fit}}{SS_{null}}$\n\n$SS_{fit} = \\sum_{i=1}^{n} (data - line)^2 = \\sum_{i=1}^{n} (y_{i} - (\\beta_0 \\cdot 1+ \\beta_1 \\cdot x)^2$\n\n$SS_{null}$ --- sum of squared errors around the mean of $y$\n\n$SS_{null} = \\sum_{i=1}^{n} (data - mean)^2 = \\sum_{i=1}^{n} (y_{i} - \\overline{y})^2$\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nss_fit <- sum(b_fit$.resid^2)\nss_null <- sum(\n (b_fit$sodium - avg_sod)^2\n)\nrsq <- 1 - ss_fit / ss_null\nglance(fit) |> select(r.squared)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 1\n r.squared\n \n1 0.684\n```\n\n\n:::\n:::\n\n\n\n### 6. Using $R^2$, describe the extent to which calcium explains sodium levels --- (2 pts)\n\n> $calcium$ explains \\~68% of variation in $sodium$ levels\n\n### 7. Report (do not calculate) the $p-value$ and your decision on the null --- (1 pts)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfit |>\n glance() |>\n select(p.value)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 1\n p.value\n \n1 0\n```\n\n\n:::\n:::\n\n\n\n> The null hypothesis is calcium levels do not explain/predict sodium levels IS NOT SUPPORTED\n\nCalcium levels to predict sodium levels.\n\n## Problem \\# 3\n\nWhat is the association between mouse weight and age levels for different sexes?\n\n### 1. Calculate the pearson correlation coefficient between weight and age for females and males --- (2 pts)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb |>\n group_by(factor(sex)) |>\n cor_test(weight, age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 9\n `factor(sex)` var1 var2 cor statistic p conf.low conf.high method \n \n1 F weight age 0.21 6.30 4.67e-10 0.143 0.269 Pearson\n2 M weight age 0.37 12.0 1.22e-30 0.314 0.427 Pearson\n```\n\n\n:::\n:::\n\n\n\n### 2. Describe your observations --- (2 pts)\n\n> The relationship between weight and age is stronger for males than it is for females.\n", "supporting": [ "ps-key-13_files" ], diff --git a/_freeze/problem-set-keys/ps-key-13/figure-html/augment and viz-1.png b/_freeze/problem-set-keys/ps-key-13/figure-html/augment and viz-1.png index 8df78ce0..a28c69a6 100644 Binary files a/_freeze/problem-set-keys/ps-key-13/figure-html/augment and viz-1.png and b/_freeze/problem-set-keys/ps-key-13/figure-html/augment and viz-1.png differ diff --git a/_freeze/problem-set-keys/ps-key-13/figure-html/unnamed-chunk-4-1.png b/_freeze/problem-set-keys/ps-key-13/figure-html/unnamed-chunk-4-1.png index d71a7e55..ef88423b 100644 Binary files a/_freeze/problem-set-keys/ps-key-13/figure-html/unnamed-chunk-4-1.png and b/_freeze/problem-set-keys/ps-key-13/figure-html/unnamed-chunk-4-1.png differ diff --git a/_freeze/problem-set-keys/ps-key-13/figure-html/unnamed-chunk-4-2.png b/_freeze/problem-set-keys/ps-key-13/figure-html/unnamed-chunk-4-2.png index a6413880..2dad5816 100644 Binary files a/_freeze/problem-set-keys/ps-key-13/figure-html/unnamed-chunk-4-2.png and b/_freeze/problem-set-keys/ps-key-13/figure-html/unnamed-chunk-4-2.png differ diff --git a/_freeze/problem-set-keys/ps-key-14/execute-results/html.json b/_freeze/problem-set-keys/ps-key-14/execute-results/html.json index dd8c4edc..3a9c5450 100644 --- a/_freeze/problem-set-keys/ps-key-14/execute-results/html.json +++ b/_freeze/problem-set-keys/ps-key-14/execute-results/html.json @@ -1,7 +1,8 @@ { - "hash": "1b791afe0ba2e1cc10052a1e42f6c2b1", + "hash": "efd7c47e2ce80b20b2a8da99f95b441d", "result": { - "markdown": "---\ntitle: \"Problem Set Stats Bootcamp - class 12\"\nsubtitle: \"Hypothesis Testing\"\nauthor: \"Neelanjan Mukherjee\"\neditor: visual\n---\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\n-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --\nv dplyr 1.1.3 v readr 2.1.4\nv forcats 1.0.0 v stringr 1.5.0\nv ggplot2 3.4.3 v tibble 3.2.1\nv lubridate 1.9.2 v tidyr 1.3.0\nv purrr 1.0.2 \n-- Conflicts ------------------------------------------ tidyverse_conflicts() --\nx dplyr::filter() masks stats::filter()\nx dplyr::lag() masks stats::lag()\ni Use the conflicted package () to force all conflicts to become errors\n\nAttaching package: 'rstatix'\n\n\nThe following object is masked from 'package:stats':\n\n filter\n\n\n\nAttaching package: 'janitor'\n\n\nThe following object is masked from 'package:rstatix':\n\n make_clean_names\n\n\nThe following objects are masked from 'package:stats':\n\n chisq.test, fisher.test\n\n\nhere() starts at /Users/jayhesselberth/devel/rnabioco/molb-7950\n\n\nAttaching package: 'cowplot'\n\n\nThe following object is masked from 'package:ggpubr':\n\n get_legend\n\n\nThe following object is masked from 'package:lubridate':\n\n stamp\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nbiochem <- read_tsv(\"http://mtweb.cs.ucl.ac.uk/HSMICE/PHENOTYPES/Biochemistry.txt\", show_col_types = FALSE) |>\n janitor::clean_names()\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n```{.r .cell-code}\n# simplify names a bit more\ncolnames(biochem) <- gsub(pattern = \"biochem_\", replacement = \"\", colnames(biochem))\n\n# we are going to simplify this a bit and only keep some columns\nkeep <- colnames(biochem)[c(1, 6, 9, 14, 15, 24:28)]\nbiochem <- biochem[, keep]\n\n# get weights for each individual mouse\n# careful: did not come with column names\nweight <- read_tsv(\"http://mtweb.cs.ucl.ac.uk/HSMICE/PHENOTYPES/weight\", col_names = F, show_col_types = FALSE)\n\n# add column names\ncolnames(weight) <- c(\"subject_name\", \"weight\")\n\n# add weight to biochem table and get rid of NAs\n# rename gender to sex\nb <- inner_join(biochem, weight, by = \"subject_name\") |>\n na.omit() |>\n rename(sex = gender)\n```\n:::\n\n\n## Problem \\# 1\n\nCan mouse sex explain mouse cholesterol? {.smaller}\n\n## STEP 1: Null hypothesis and variable specification\n\n$\\mathcal{H}_0:$ mouse $sex$ does NOT explain $cholesterol$\n\n> $cholesterol$ is the response variable\n\n> $sex$ is the explanatory variable\n\n## STEP 2: Fit linear model and examine results {.smaller}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfit_cs <- lm(formula = tot_cholesterol ~ 1 + sex, data = b)\n```\n:::\n\n\nFit summary:\n\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\nglance(fit_cs) |>\n gt() |>\n fmt_number(columns = r.squared:statistic, decimals = 3)\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n\n\n\n\n\n\n\n\n\n\n \n \n \n
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.1840.1830.576400.7741.425116e-801-1546.043098.0813114.537591.575317801782
\n
\n```\n\n:::\n:::\n\n\nCoefficient summary:\n\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\ntidy(fit_cs) |>\n gt() |>\n fmt_number(columns = estimate:statistic, decimals = 3)\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n\n\n\n\n \n\n\n\n\n \n \n \n
termestimatestd.errorstatisticp.value
(Intercept)2.7970.019144.8120.000000e+00
sexM0.5470.02720.0191.425116e-80
\n
\n```\n\n:::\n:::\n\n\n## Collecting residuals and other information {.smaller}\n\nadd residuals and other information\n\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\n# augment\nb_cs <- augment(fit_cs, data = b)\n\n\navg_c <- mean(b_cs$tot_cholesterol)\n\nc <- b |>\n group_by(sex) |>\n get_summary_stats(tot_cholesterol, type = \"mean\")\n\n\n# mean chol female\navg_cf <- pull(c[1, 4])\n\n\n# mean chol male\navg_cm <- pull(c[2, 4])\n```\n:::\n\n\n## STEP 4: Visualize the error around fit {.smaller}\n\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\n# plot of data with mean and colored by residuals\np_cs <- ggplot(\n b_cs,\n aes(x = sex, y = tot_cholesterol)\n) +\n geom_point(\n position = position_jitter(),\n aes(color = .resid)\n ) +\n scale_color_gradient2(\n low = \"blue\",\n mid = \"black\",\n high = \"yellow\"\n ) +\n geom_hline(\n yintercept = avg_c,\n color = \"darkgrey\"\n ) +\n geom_segment(\n aes(\n x = .5, xend = 1.5,\n y = avg_cf, yend = avg_cf\n ),\n color = \"red\"\n ) +\n geom_segment(\n aes(\n x = 1.5, xend = 2.5,\n y = avg_cm\n ),\n yend = avg_cm,\n color = \"red\"\n ) +\n theme_cowplot()\n\np_cs\n```\n\n::: {.cell-output-display}\n![](ps-key-14_files/figure-html/unnamed-chunk-7-1.png){width=672}\n:::\n:::\n\n\n## STEP 3: Visualize the error around the null (mean weight) {.smaller}\n\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\np_c <- ggplot(\n b_cs,\n aes(x = sex, y = tot_cholesterol)\n) +\n geom_point(\n position = position_jitter(),\n aes(color = tot_cholesterol - avg_c)\n ) +\n scale_color_gradient2(\n low = \"blue\",\n mid = \"black\",\n high = \"yellow\"\n ) +\n geom_hline(\n yintercept = avg_c,\n color = \"darkgrey\"\n ) +\n theme_cowplot()\n\np_c\n```\n\n::: {.cell-output-display}\n![](ps-key-14_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\n## Plot the fit error and the null error as 2 panels {.smaller}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot_grid(p_cs, p_c, ncol = 2, labels = c(\"cholesterol by sex\", \"cholesterol\"))\n```\n\n::: {.cell-output-display}\n![](ps-key-14_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n\n## Calculate $R^2$ {.smaller}\n\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\nss_fit <- sum(b_cs$.resid^2)\n\nss_null <- sum(\n (b_cs$tot_cholesterol - avg_c)^2\n)\n```\n:::\n\n\n$R^2 = 1 - \\displaystyle \\frac {SS_{fit}}{SS_{null}}$\n\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\nrsq <- 1 - ss_fit / ss_null\nrsq\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.1837759\n```\n\n\n:::\n:::\n\n\ncheck agains Rsq in your fit\n\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\nglance(fit_cs) |> select(r.squared)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 x 1\n r.squared\n \n1 0.184\n```\n\n\n:::\n:::\n\n\n## Compare to traditional t-test {.smaller}\n\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\n# run analagous t-test\nb |>\n t_test(tot_cholesterol ~ 1 + sex) |>\n select(-c(n1, n2, df)) |>\n gt()\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n\n\n\n\n \n \n \n
.y.group1group2statisticp
tot_cholesterolFM-20.019331.46e-80
\n
\n```\n\n:::\n:::\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\ntidy(fit_cs) |>\n select(term, estimate, statistic, p.value) |>\n gt()\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n\n\n\n \n\n\n\n \n \n \n
termestimatestatisticp.value
(Intercept)2.7968013144.812300.000000e+00
sexM0.546790120.019331.425116e-80
\n
\n```\n\n:::\n:::\n\n\n## Provide your interpreation of the result\n\nThe null model that mouse $sex$ does NOT explain $cholesterol$ is not well supported. Therefore, i believe that mouse $sex$ is able to explain \\~%18 of the variation in $cholesterol$.\n", + "engine": "knitr", + "markdown": "---\ntitle: \"Problem Set Stats Bootcamp - class 14\"\nsubtitle: \"Hypothesis Testing\"\nauthor: \"Neelanjan Mukherjee\"\neditor: visual\n---\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.1 ✔ tibble 3.2.1\n✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package () to force all conflicts to become errors\n\nAttaching package: 'rstatix'\n\n\nThe following object is masked from 'package:stats':\n\n filter\n\n\n\nAttaching package: 'janitor'\n\n\nThe following object is masked from 'package:rstatix':\n\n make_clean_names\n\n\nThe following objects are masked from 'package:stats':\n\n chisq.test, fisher.test\n\n\nhere() starts at /Users/mtaliaferro/Documents/GitHub/molb-7950\n\n\nAttaching package: 'cowplot'\n\n\nThe following object is masked from 'package:gt':\n\n as_gtable\n\n\nThe following object is masked from 'package:ggpubr':\n\n get_legend\n\n\nThe following object is masked from 'package:lubridate':\n\n stamp\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nbiochem <- read_tsv(\"http://mtweb.cs.ucl.ac.uk/HSMICE/PHENOTYPES/Biochemistry.txt\", show_col_types = FALSE) |>\n janitor::clean_names()\n\n# simplify names a bit more\ncolnames(biochem) <- gsub(pattern = \"biochem_\", replacement = \"\", colnames(biochem))\n\n# we are going to simplify this a bit and only keep some columns\nkeep <- colnames(biochem)[c(1, 6, 9, 14, 15, 24:28)]\nbiochem <- biochem[, keep]\n\n# get weights for each individual mouse\n# careful: did not come with column names\nweight <- read_tsv(\"http://mtweb.cs.ucl.ac.uk/HSMICE/PHENOTYPES/weight\", col_names = F, show_col_types = FALSE)\n\n# add column names\ncolnames(weight) <- c(\"subject_name\", \"weight\")\n\n# add weight to biochem table and get rid of NAs\n# rename gender to sex\nb <- inner_join(biochem, weight, by = \"subject_name\") |>\n na.omit() |>\n rename(sex = gender)\n```\n:::\n\n\n\n## Problem \\# 1\n\nCan mouse sex explain mouse cholesterol? {.smaller}\n\n## STEP 1: Null hypothesis and variable specification --- (2 pts)\n\n$\\mathcal{H}_0:$ mouse $sex$ does NOT explain $cholesterol$\n\n> $cholesterol$ is the response variable\n\n> $sex$ is the explanatory variable\n\n## STEP 2: Fit linear model and examine results --- (3 pts) {.smaller}\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfit_cs <- lm(formula = tot_cholesterol ~ 1 + sex, data = b)\n```\n:::\n\n\n\nFit summary:\n\n\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\nglance(fit_cs) |>\n gt() |>\n fmt_number(columns = r.squared:statistic, decimals = 3)\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n\n\n\n\n\n\n\n\n\n\n \n \n \n
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.1840.1830.576400.7741.425116e-801-1546.043098.0813114.537591.575317801782
\n
\n```\n\n:::\n:::\n\n\n\nCoefficient summary:\n\n\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\ntidy(fit_cs) |>\n gt() |>\n fmt_number(columns = estimate:statistic, decimals = 3)\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n\n\n\n\n \n\n\n\n\n \n \n \n
termestimatestd.errorstatisticp.value
(Intercept)2.7970.019144.8120.000000e+00
sexM0.5470.02720.0191.425116e-80
\n
\n```\n\n:::\n:::\n\n\n\n## Collecting residuals and other information --- (2 pts) {.smaller}\n\nadd residuals and other information\n\n\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\n# augment\nb_cs <- augment(fit_cs, data = b)\n\n\navg_c <- mean(b_cs$tot_cholesterol)\n\nc <- b |>\n group_by(sex) |>\n get_summary_stats(tot_cholesterol, type = \"mean\")\n\n\n# mean chol female\navg_cf <- pull(c[1, 4])\n\n\n# mean chol male\navg_cm <- pull(c[2, 4])\n```\n:::\n\n\n\n## STEP 4: Visualize the error around fit --- (2 pts) {.smaller}\n\n\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\n# plot of data with mean and colored by residuals\np_cs <- ggplot(\n b_cs,\n aes(x = sex, y = tot_cholesterol)\n) +\n geom_point(\n position = position_jitter(),\n aes(color = .resid)\n ) +\n scale_color_gradient2(\n low = \"blue\",\n mid = \"black\",\n high = \"yellow\"\n ) +\n geom_hline(\n yintercept = avg_c,\n color = \"darkgrey\"\n ) +\n geom_segment(\n aes(\n x = .5, xend = 1.5,\n y = avg_cf, yend = avg_cf\n ),\n color = \"red\"\n ) +\n geom_segment(\n aes(\n x = 1.5, xend = 2.5,\n y = avg_cm\n ),\n yend = avg_cm,\n color = \"red\"\n ) +\n theme_cowplot()\n\np_cs\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in geom_segment(aes(x = 0.5, xend = 1.5, y = avg_cf, yend = avg_cf), : All aesthetics have length 1, but the data has 1782 rows.\nℹ Please consider using `annotate()` or provide this layer with data containing\n a single row.\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in geom_segment(aes(x = 1.5, xend = 2.5, y = avg_cm), yend = avg_cm, : All aesthetics have length 1, but the data has 1782 rows.\nℹ Please consider using `annotate()` or provide this layer with data containing\n a single row.\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](ps-key-14_files/figure-html/unnamed-chunk-7-1.png){width=672}\n:::\n:::\n\n\n\n## STEP 5: Visualize the error around the null (mean weight) --- (2 pts) {.smaller}\n\n\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\np_c <- ggplot(\n b_cs,\n aes(x = sex, y = tot_cholesterol)\n) +\n geom_point(\n position = position_jitter(),\n aes(color = tot_cholesterol - avg_c)\n ) +\n scale_color_gradient2(\n low = \"blue\",\n mid = \"black\",\n high = \"yellow\"\n ) +\n geom_hline(\n yintercept = avg_c,\n color = \"darkgrey\"\n ) +\n theme_cowplot()\n\np_c\n```\n\n::: {.cell-output-display}\n![](ps-key-14_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\n\n## Plot the fit error and the null error as 2 panels --- (2 pts) {.smaller}\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot_grid(p_cs, p_c, ncol = 2, labels = c(\"cholesterol by sex\", \"cholesterol\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in geom_segment(aes(x = 0.5, xend = 1.5, y = avg_cf, yend = avg_cf), : All aesthetics have length 1, but the data has 1782 rows.\nℹ Please consider using `annotate()` or provide this layer with data containing\n a single row.\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in geom_segment(aes(x = 1.5, xend = 2.5, y = avg_cm), yend = avg_cm, : All aesthetics have length 1, but the data has 1782 rows.\nℹ Please consider using `annotate()` or provide this layer with data containing\n a single row.\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](ps-key-14_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n\n\n## Calculate $R^2$ --- (3 pts) {.smaller}\n\n\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\nss_fit <- sum(b_cs$.resid^2)\n\nss_null <- sum(\n (b_cs$tot_cholesterol - avg_c)^2\n)\n```\n:::\n\n\n\n$R^2 = 1 - \\displaystyle \\frac {SS_{fit}}{SS_{null}}$\n\n\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\nrsq <- 1 - ss_fit / ss_null\nrsq\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.1837759\n```\n\n\n:::\n:::\n\n\n\ncheck agains Rsq in your fit\n\n\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\nglance(fit_cs) |> select(r.squared)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 1\n r.squared\n \n1 0.184\n```\n\n\n:::\n:::\n\n\n\n## Compare to traditional t-test --- (2 pts) {.smaller}\n\n\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\n# run analagous t-test\nb |>\n t_test(tot_cholesterol ~ 1 + sex) |>\n select(-c(n1, n2, df)) |>\n gt()\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n\n\n\n\n \n \n \n
.y.group1group2statisticp
tot_cholesterolFM-20.019331.46e-80
\n
\n```\n\n:::\n:::\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\ntidy(fit_cs) |>\n select(term, estimate, statistic, p.value) |>\n gt()\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n\n\n \n \n \n \n \n \n \n \n \n \n\n\n\n \n\n\n\n \n \n \n
termestimatestatisticp.value
(Intercept)2.7968013144.812300.000000e+00
sexM0.546790120.019331.425116e-80
\n
\n```\n\n:::\n:::\n\n\n\n## Provide your interpretation of the result --- (2 pts)\n\nThe null model that mouse $sex$ does NOT explain $cholesterol$ is not well supported. Therefore, i believe that mouse $sex$ is able to explain \\~%18 of the variation in $cholesterol$.", "supporting": [ "ps-key-14_files" ], diff --git a/_freeze/problem-set-keys/ps-key-14/figure-html/unnamed-chunk-7-1.png b/_freeze/problem-set-keys/ps-key-14/figure-html/unnamed-chunk-7-1.png index f602229c..e9534848 100644 Binary files a/_freeze/problem-set-keys/ps-key-14/figure-html/unnamed-chunk-7-1.png and b/_freeze/problem-set-keys/ps-key-14/figure-html/unnamed-chunk-7-1.png differ diff --git a/_freeze/problem-set-keys/ps-key-14/figure-html/unnamed-chunk-8-1.png b/_freeze/problem-set-keys/ps-key-14/figure-html/unnamed-chunk-8-1.png index 530f7c1a..3ef72b22 100644 Binary files a/_freeze/problem-set-keys/ps-key-14/figure-html/unnamed-chunk-8-1.png and b/_freeze/problem-set-keys/ps-key-14/figure-html/unnamed-chunk-8-1.png differ diff --git a/_freeze/problem-set-keys/ps-key-14/figure-html/unnamed-chunk-9-1.png b/_freeze/problem-set-keys/ps-key-14/figure-html/unnamed-chunk-9-1.png index 41fa700f..812a812a 100644 Binary files a/_freeze/problem-set-keys/ps-key-14/figure-html/unnamed-chunk-9-1.png and b/_freeze/problem-set-keys/ps-key-14/figure-html/unnamed-chunk-9-1.png differ diff --git a/_freeze/problem-set-keys/ps-key-15/execute-results/html.json b/_freeze/problem-set-keys/ps-key-15/execute-results/html.json index d0343e10..1ad9636f 100644 --- a/_freeze/problem-set-keys/ps-key-15/execute-results/html.json +++ b/_freeze/problem-set-keys/ps-key-15/execute-results/html.json @@ -1,7 +1,8 @@ { - "hash": "494ed1f55e551327bb51aacf87423e3d", + "hash": "d8844a4f3b71052cd5cb4e4df126de3e", "result": { - "markdown": "---\ntitle: \"Problem Set Stats Bootcamp - class 15\"\nsubtitle: \"Dealing with big data\"\nauthor: \"Neelanjan Mukherjee\"\neditor: visual\n---\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\n-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --\nv dplyr 1.1.3 v readr 2.1.4\nv forcats 1.0.0 v stringr 1.5.0\nv ggplot2 3.4.3 v tibble 3.2.1\nv lubridate 1.9.2 v tidyr 1.3.0\nv purrr 1.0.2 \n-- Conflicts ------------------------------------------ tidyverse_conflicts() --\nx dplyr::filter() masks stats::filter()\nx dplyr::lag() masks stats::lag()\ni Use the conflicted package () to force all conflicts to become errors\n\nAttaching package: 'rstatix'\n\n\nThe following object is masked from 'package:stats':\n\n filter\n\n\n\nAttaching package: 'janitor'\n\n\nThe following object is masked from 'package:rstatix':\n\n make_clean_names\n\n\nThe following objects are masked from 'package:stats':\n\n chisq.test, fisher.test\n\n\nhere() starts at /Users/jayhesselberth/devel/rnabioco/molb-7950\n\n\nAttaching package: 'cowplot'\n\n\nThe following object is masked from 'package:lubridate':\n\n stamp\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nang <- read_csv(here(\"data/bootcamp/edger.csv.gz\")) |>\n clean_names() |>\n filter(fdr < 0.05) |>\n select(log_fc_time0_25:log_fc_time8) |>\n as.matrix()\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 17942 Columns: 17\n-- Column specification --------------------------------------------------------\nDelimiter: \",\"\nchr (1): gene\ndbl (16): FDR, maxabsfc, logFC.Time0.25, logFC.Time0.5, logFC.Time0.75, logF...\n\ni Use `spec()` to retrieve the full column specification for this data.\ni Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n```{.r .cell-code}\ncolnames(ang) <- gsub(pattern = \"log_fc_\", \"\", colnames(ang))\n```\n:::\n\n\n## Problem \\# 1\n\nMake sure to run the chunk above. The data represent the avg fold change in gene expression for an angiotensin II time course (.25, .5, .75, 1, 1.5, 2, 3, 4, 6, 8, 24 hrs) compared to unstimulated.\n\n## correlation\n\nCreate hierarchical clustering heatmap of pairwise pearson correlation coefficients. And provide 1-2 observations.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# scale\nang <- t(scale(t(ang)))\n\n\n# pairwise pearson correlation\np_ang <- cor(ang, method = \"pearson\")\n\n# make heatmap\npheatmap(\n mat = p_ang,\n clustering_distance_rows = \"euclidean\",\n clustering_distance_cols = \"euclidean\",\n clustering_method = \"ward.D2\"\n)\n```\n\n::: {.cell-output-display}\n![](ps-key-15_files/figure-html/unnamed-chunk-3-1.png){width=672}\n:::\n:::\n\n\nTimepoints close to each other tend to correlate strongly with each other. The 4,6, and 8 hr time points are the most different from all others.\n\n## PCA\n\nPerform PCA and visualize PC1 vs PC2.Provide 1-2 observations.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# pca\npc_ang <- prcomp(ang)\n\n# gather info from summary\npca_data_info <- summary(pc_ang)$importance |> as.data.frame()\n\npca_data_info <- round(x = pca_data_info, digits = 3)\n\n# we make a dataframe out of the rotations and will use this to plot\npca_plot_data <- pc_ang$rotation |>\n as.data.frame() |>\n rownames_to_column(var = \"ID\")\n\n# plot\nggplot(data = pca_plot_data, mapping = aes(x = PC1, y = PC2, color = ID)) +\n geom_point() +\n xlab(paste(\"PC1, %\", 100 * pca_data_info[\"Proportion of Variance\", \"PC1\"])) +\n ylab(paste(\"PC2, %\", 100 * pca_data_info[\"Proportion of Variance\", \"PC2\"])) +\n ggtitle(\"PCA for angII timecourse\") +\n theme_cowplot()\n```\n\n::: {.cell-output-display}\n![](ps-key-15_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n:::\n\n\nThere is a a circular patter that seems to correspond to the timepoints. Interestingly, 24 appears to group back with 0.25 indicating the system is resetting w/respect to RNA levels.\n\n## Calculate the empirical p-value of the cluster most enriched for DUX4 targets by sampling {.smaller}\n\nIn order to do this, you will need to:\n\n1. Identify which cluster is the most enriched for DUX4 targets.\n - Determine how many genes are in the cluster. You will need to know this to figure out how many genes to sample from the whole data set.\n - Determine how many of the genes in the cluster are DUX4 targets. This is the metric that you are interested in comparing between the null distribution and your observation.\n2. Generate 1000 random sample of the proper size from all genes and find out how many of them are DUX4 targets.\n3. Visualize the distribution of DUX4 targets in these 1000 random (your null distribution) and overlay the number of DUX4 targets you observed in the cluster that was most enriched for DUX4 targets.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# read in data\ncd <- read_tsv(here(\"data\", \"dux4_clustering_results.csv.gz\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 10566 Columns: 15\n-- Column specification --------------------------------------------------------\nDelimiter: \"\\t\"\nchr (2): gene_symbol, target\ndbl (13): hour00_rep1, hour00_rep2, hour00_rep3, hour04_rep1, hour04_rep2, h...\n\ni Use `spec()` to retrieve the full column specification for this data.\ni Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n\n```{.r .cell-code}\n# how many genes are in cluster 5?\nc5 <- cd |>\n filter(Cluster == \"5\") |>\n nrow()\n\n# how many dux targets are in cluster 5?\nc5t <- cd |>\n filter(Cluster == \"5\" & target == \"target\") |>\n nrow()\n\nsampled_targets <- vector()\n\nfor (i in 1:1000) {\n sampled_targets[i] <- sample_n(tbl = cd, size = c5) |>\n group_by(target) |>\n tally() |>\n # need this so all groups have at least 1.\n mutate(n = n + 1) |>\n filter(target == \"target\") |>\n pull(n)\n}\n\nn <- tibble(\n targets = sampled_targets,\n type = \"null\"\n)\n\nn |>\n arrange(-sampled_targets) |>\n top_n(10)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nSelecting by type\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1,000 x 2\n targets type \n \n 1 18 null \n 2 18 null \n 3 17 null \n 4 17 null \n 5 17 null \n 6 17 null \n 7 16 null \n 8 16 null \n 9 16 null \n10 16 null \n# i 990 more rows\n```\n\n\n:::\n\n```{.r .cell-code}\nggplot(data = n, aes(x = targets)) +\n geom_density() +\n geom_vline(xintercept = c5t, color = \"red\") +\n theme_cowplot()\n```\n\n::: {.cell-output-display}\n![](ps-key-15_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n\n### What is the p-value?\n\np \\< 0.001\n\n### What is your interpretation?\n\nThe null hypothesis that the number of DUX4 targets in c5t can be explained by chance - IS NOT WELL SUPPORTED.\n\nThe number of DUX4 targets in c5t CANNOT be explained by chance.\n", + "engine": "knitr", + "markdown": "---\ntitle: \"Problem Set Stats Bootcamp - class 15\"\nsubtitle: \"Dealing with big data\"\nauthor: \"Neelanjan Mukherjee\"\neditor: visual\n---\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.1 ✔ tibble 3.2.1\n✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package () to force all conflicts to become errors\n\nAttaching package: 'rstatix'\n\n\nThe following object is masked from 'package:stats':\n\n filter\n\n\n\nAttaching package: 'janitor'\n\n\nThe following object is masked from 'package:rstatix':\n\n make_clean_names\n\n\nThe following objects are masked from 'package:stats':\n\n chisq.test, fisher.test\n\n\nhere() starts at /Users/mtaliaferro/Documents/GitHub/molb-7950\n\n\nAttaching package: 'cowplot'\n\n\nThe following object is masked from 'package:lubridate':\n\n stamp\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nang <- read_csv(here(\"data/bootcamp/edger.csv.gz\")) |>\n clean_names() |>\n filter(fdr < 0.05) |>\n select(log_fc_time0_25:log_fc_time8) |>\n as.matrix()\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 17942 Columns: 17\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (1): gene\ndbl (16): FDR, maxabsfc, logFC.Time0.25, logFC.Time0.5, logFC.Time0.75, logF...\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n\n```{.r .cell-code}\ncolnames(ang) <- gsub(pattern = \"log_fc_\", \"\", colnames(ang))\n```\n:::\n\n\n\n## Problem \\# 1\n\nMake sure to run the chunk above. The data represent the avg fold change in gene expression for an angiotensin II time course (.25, .5, .75, 1, 1.5, 2, 3, 4, 6, 8, 24 hrs) compared to unstimulated.\n\n## correlation --- (7 pts)\n\nCreate hierarchical clustering heatmap of pairwise pearson correlation coefficients. And provide 1-2 observations.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# scale\nang <- t(scale(t(ang)))\n\n\n# pairwise pearson correlation\np_ang <- cor(ang, method = \"pearson\")\n\n# make heatmap\npheatmap(\n mat = p_ang,\n clustering_distance_rows = \"euclidean\",\n clustering_distance_cols = \"euclidean\",\n clustering_method = \"ward.D2\"\n)\n```\n\n::: {.cell-output-display}\n![](ps-key-15_files/figure-html/unnamed-chunk-3-1.png){width=672}\n:::\n:::\n\n\n\nTimepoints close to each other tend to correlate strongly with each other. The 4,6, and 8 hr time points are the most different from all others.\n\n## PCA --- (7 pts)\n\nPerform PCA and visualize PC1 vs PC2.Provide 1-2 observations.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# pca\npc_ang <- prcomp(ang)\n\n# gather info from summary\npca_data_info <- summary(pc_ang)$importance |> as.data.frame()\n\npca_data_info <- round(x = pca_data_info, digits = 3)\n\n# we make a dataframe out of the rotations and will use this to plot\npca_plot_data <- pc_ang$rotation |>\n as.data.frame() |>\n rownames_to_column(var = \"ID\")\n\n# plot\nggplot(data = pca_plot_data, mapping = aes(x = PC1, y = PC2, color = ID)) +\n geom_point() +\n xlab(paste(\"PC1, %\", 100 * pca_data_info[\"Proportion of Variance\", \"PC1\"])) +\n ylab(paste(\"PC2, %\", 100 * pca_data_info[\"Proportion of Variance\", \"PC2\"])) +\n ggtitle(\"PCA for angII timecourse\") +\n theme_cowplot()\n```\n\n::: {.cell-output-display}\n![](ps-key-15_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n:::\n\n\n\nThere is a a circular patter that seems to correspond to the timepoints. Interestingly, 24 appears to group back with 0.25 indicating the system is resetting w/respect to RNA levels.\n\n## Calculate the empirical p-value of the cluster most enriched for DUX4 targets by sampling --- (6 pts) {.smaller}\n\nIn order to do this, you will need to:\n\n1. Identify which cluster is the most enriched for DUX4 targets.\n - Determine how many genes are in the cluster. You will need to know this to figure out how many genes to sample from the whole data set.\n - Determine how many of the genes in the cluster are DUX4 targets. This is the metric that you are interested in comparing between the null distribution and your observation.\n2. Generate 1000 random sample of the proper size from all genes and find out how many of them are DUX4 targets.\n3. Visualize the distribution of DUX4 targets in these 1000 random (your null distribution) and overlay the number of DUX4 targets you observed in the cluster that was most enriched for DUX4 targets.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# read in data\ncd <- read_tsv(here(\"data\", \"dux4_clustering_results.csv.gz\"))\n\n# how many genes are in cluster 5?\nc5 <- cd |>\n filter(Cluster == \"5\") |>\n nrow()\n\n# how many dux targets are in cluster 5?\nc5t <- cd |>\n filter(Cluster == \"5\" & target == \"target\") |>\n nrow()\n\nsampled_targets <- vector()\n\nfor (i in 1:1000) {\n sampled_targets[i] <- sample_n(tbl = cd, size = c5) |>\n group_by(target) |>\n tally() |>\n # need this so all groups have at least 1.\n mutate(n = n + 1) |>\n filter(target == \"target\") |>\n pull(n)\n}\n\nn <- tibble(\n targets = sampled_targets,\n type = \"null\"\n)\n\nn |>\n arrange(-sampled_targets) |>\n top_n(10)\n\nggplot(data = n, aes(x = targets)) +\n geom_density() +\n geom_vline(xintercept = c5t, color = \"red\") +\n theme_cowplot()\n```\n:::\n\n\n\n### What is the p-value?\n\np \\< 0.001\n\n### What is your interpretation?\n\nThe null hypothesis that the number of DUX4 targets in c5t can be explained by chance - IS NOT WELL SUPPORTED.\n\nThe number of DUX4 targets in c5t CANNOT be explained by chance.\n", "supporting": [ "ps-key-15_files" ], diff --git a/_freeze/problem-sets/ps-12/execute-results/html.json b/_freeze/problem-sets/ps-12/execute-results/html.json index e4f372e6..7a4070b7 100644 --- a/_freeze/problem-sets/ps-12/execute-results/html.json +++ b/_freeze/problem-sets/ps-12/execute-results/html.json @@ -1,10 +1,9 @@ { - "hash": "13589c6002a2e1c7383b9cab62993195", + "hash": "e319175007d9e20a962a36efe74c5da7", "result": { - "markdown": "---\ntitle: \"Problem Set Stats Bootcamp - class 12\"\nsubtitle: \"Hypothesis Testing\"\nauthor: \"Neelanjan Mukherjee\"\neditor: visual\n---\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\n-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --\nv dplyr 1.1.3 v readr 2.1.4\nv forcats 1.0.0 v stringr 1.5.0\nv ggplot2 3.4.3 v tibble 3.2.1\nv lubridate 1.9.2 v tidyr 1.3.0\nv purrr 1.0.2 \n-- Conflicts ------------------------------------------ tidyverse_conflicts() --\nx dplyr::filter() masks stats::filter()\nx dplyr::lag() masks stats::lag()\ni Use the conflicted package () to force all conflicts to become errors\n\nAttaching package: 'rstatix'\n\n\nThe following object is masked from 'package:stats':\n\n filter\n\n\n\nAttaching package: 'janitor'\n\n\nThe following object is masked from 'package:rstatix':\n\n make_clean_names\n\n\nThe following objects are masked from 'package:stats':\n\n chisq.test, fisher.test\n\n\nhere() starts at /Users/jayhesselberth/devel/rnabioco/molb-7950\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nbiochem <- read_tsv(\"http://mtweb.cs.ucl.ac.uk/HSMICE/PHENOTYPES/Biochemistry.txt\", show_col_types = FALSE) |>\n janitor::clean_names()\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '' to native encoding\n```\n\n\n:::\n\n```{.r .cell-code}\n# simplify names a bit more\ncolnames(biochem) <- gsub(pattern = \"biochem_\", replacement = \"\", colnames(biochem))\n\n# we are going to simplify this a bit and only keep some columns\nkeep <- colnames(biochem)[c(1, 6, 9, 14, 15, 24:28)]\nbiochem <- biochem[, keep]\n\n# get weights for each individual mouse\n# careful: did not come with column names\nweight <- read_tsv(\"http://mtweb.cs.ucl.ac.uk/HSMICE/PHENOTYPES/weight\", col_names = F, show_col_types = FALSE)\n\n# add column names\ncolnames(weight) <- c(\"subject_name\", \"weight\")\n\n# add weight to biochem table and get rid of NAs\n# rename gender to sex\nb <- inner_join(biochem, weight, by = \"subject_name\") |>\n na.omit() |>\n rename(sex = gender)\n```\n:::\n\n\n## Problem \\# 1\n\nDoes mouse sex explain mouse total cholesterol levels? Make sure to run chunks above.\n\n### 1. Examine and specify the variable(s)\n\n> The response variable y is $??$ \n> The explantory variable x is $??$\n\n### Make a violin plot:\n\nresponse variable on the y-axis\n\nexplanatory variable on the x-axis\n\n\n\n::: {.cell}\n\n:::\n\n\n### Get n, mean, median, sd\n\n\n::: {.cell}\n\n:::\n\n\n\n### Is it normally distribute?\n\n\n::: {.cell}\n\n:::\n\n\n> Answer here\n\n\n### Is it variance similar between groups?\n\n\n::: {.cell}\n\n:::\n\n\n> Answer here\n\n### What kind of test are you picking and why?\n\n> Answer here\n\n### 2. Declare null hypothesis $\\mathcal{H}_0$\n\n$\\mathcal{H}_0$ is that $??$ does not explain $??$\n\n### 3. Calculate test-statistic, exact p-value and plot\n\n\n\n::: {.cell}\n\n:::\n\n\n\n> My interpretation of the result\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# i have pre-selected some families to compare\nmyfams <- c(\n \"B1.5:E1.4(4) B1.5:A1.4(5)\",\n \"F1.3:A1.2(3) F1.3:E2.2(3)\",\n \"A1.3:D1.2(3) A1.3:H1.2(3)\",\n \"D5.4:G2.3(4) D5.4:C4.3(4)\"\n)\n\n# only keep the familys in myfams\nbfam <- b |>\n filter(family %in% myfams) |>\n droplevels()\n\n# simplify family names and make factor\nbfam$family <- gsub(pattern = \"\\\\..*\", replacement = \"\", x = bfam$family) |>\n as.factor()\n\n\n# make B1 the reference (most similar to overall mean)\nbfam$family <- relevel(x = bfam$family, ref = \"B1\")\n```\n:::\n\n\n\n## Problem \\# 2\n\n\nDoes mouse family explain mouse total cholesterol levels? Make sure to run chunk above.\n\n### 1. Examine and specify the variable(s)\n\n> The response variable y is $??$ \n> The explantory variable x is $??$\n\n### Make a plot:\n\nresponse variable on the y-axis\n\nexplanatory variable on the x-axis\n\n\n\n::: {.cell}\n\n:::\n\n\n### Get n, mean, median, sd\n\n\n::: {.cell}\n\n:::\n\n\n\n### Is it normally distribute?\n\n\n::: {.cell}\n\n:::\n\n\n> Answer here\n\n\n### Is it variance similar between groups?\n\n::: {.cell}\n\n:::\n\n\n> Answer here\n\n### What kind of test are you picking and why?\n\n> Answer here\n### 2. Declare null hypothesis $\\mathcal{H}_0$\n\n$\\mathcal{H}_0$ is that $??$ does not explain $??$\n\n### 3. Calculate test-statistic, exact p-value and plot\n\n\n\n::: {.cell}\n\n:::\n\n\n\n> My interpretation of the result\n", - "supporting": [ - "ps-12_files" - ], + "engine": "knitr", + "markdown": "---\ntitle: \"Problem Set Stats Bootcamp - class 12\"\nsubtitle: \"Hypothesis Testing\"\nauthor: \"Neelanjan Mukherjee\"\neditor: visual\n---\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.1 ✔ tibble 3.2.1\n✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package () to force all conflicts to become errors\n\nAttaching package: 'rstatix'\n\n\nThe following object is masked from 'package:stats':\n\n filter\n\n\n\nAttaching package: 'janitor'\n\n\nThe following object is masked from 'package:rstatix':\n\n make_clean_names\n\n\nThe following objects are masked from 'package:stats':\n\n chisq.test, fisher.test\n\n\nhere() starts at /Users/mtaliaferro/Documents/GitHub/molb-7950\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nbiochem <- read_tsv(\"http://mtweb.cs.ucl.ac.uk/HSMICE/PHENOTYPES/Biochemistry.txt\", show_col_types = FALSE) |>\n janitor::clean_names()\n\n# simplify names a bit more\ncolnames(biochem) <- gsub(pattern = \"biochem_\", replacement = \"\", colnames(biochem))\n\n# we are going to simplify this a bit and only keep some columns\nkeep <- colnames(biochem)[c(1, 6, 9, 14, 15, 24:28)]\nbiochem <- biochem[, keep]\n\n# get weights for each individual mouse\n# careful: did not come with column names\nweight <- read_tsv(\"http://mtweb.cs.ucl.ac.uk/HSMICE/PHENOTYPES/weight\", col_names = F, show_col_types = FALSE)\n\n# add column names\ncolnames(weight) <- c(\"subject_name\", \"weight\")\n\n# add weight to biochem table and get rid of NAs\n# rename gender to sex\nb <- inner_join(biochem, weight, by = \"subject_name\") |>\n na.omit() |>\n rename(sex = gender)\n```\n:::\n\n\n\n## Problem \\# 1\n\nDoes mouse sex explain mouse total cholesterol levels? Make sure to run chunks above.\n\n### 1. Examine and specify the variable(s) (1 pt)\n\n> The response variable y is $??$\\\n> The explantory variable x is $??$\n\n### Make a violin plot: (2 pt)\n\nresponse variable on the y-axis\n\nexplanatory variable on the x-axis\n\n\n\n::: {.cell}\n\n:::\n\n\n\n### Get n, mean, median, sd (1 pt)\n\n\n\n::: {.cell}\n\n:::\n\n\n\n### Is it normally distribute? (1 pt)\n\n\n\n::: {.cell}\n\n:::\n\n\n\n> Answer here\n\n### Is it variance similar between groups? (1 pt)\n\n\n\n::: {.cell}\n\n:::\n\n\n\n> Answer here\n\n### What kind of test are you picking and why? (1 pt)\n\n> Answer here\n\n### 2. Declare null hypothesis $\\mathcal{H}_0$ (1 pt)\n\n$\\mathcal{H}_0$ is that $??$ does not explain $??$\n\n### 3. Calculate test-statistic, exact p-value and plot (2 pt)\n\n\n\n::: {.cell}\n\n:::\n\n\n\n> My interpretation of the result\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# i have pre-selected some families to compare\nmyfams <- c(\n \"B1.5:E1.4(4) B1.5:A1.4(5)\",\n \"F1.3:A1.2(3) F1.3:E2.2(3)\",\n \"A1.3:D1.2(3) A1.3:H1.2(3)\",\n \"D5.4:G2.3(4) D5.4:C4.3(4)\"\n)\n\n# only keep the familys in myfams\nbfam <- b |>\n filter(family %in% myfams) |>\n droplevels()\n\n# simplify family names and make factor\nbfam$family <- gsub(pattern = \"\\\\..*\", replacement = \"\", x = bfam$family) |>\n as.factor()\n\n\n# make B1 the reference (most similar to overall mean)\nbfam$family <- relevel(x = bfam$family, ref = \"B1\")\n```\n:::\n\n\n\n## Problem \\# 2\n\nDoes mouse family explain mouse total cholesterol levels? Make sure to run chunk above.\n\n### 1. Examine and specify the variable(s) (1 pt)\n\n> The response variable y is $??$\\\n> The explantory variable x is $??$\n\n### Make a plot: (2 pt)\n\nresponse variable on the y-axis\n\nexplanatory variable on the x-axis\n\n\n\n::: {.cell}\n\n:::\n\n\n\n### Get n, mean, median, sd (1 pt)\n\n\n\n::: {.cell}\n\n:::\n\n\n\n### Is it normally distribute? (1 pt)\n\n\n\n::: {.cell}\n\n:::\n\n\n\n> Answer here\n\n### Is it variance similar between groups? (1 pt)\n\n\n\n::: {.cell}\n\n:::\n\n\n\n> Answer here\n\n### What kind of test are you picking and why? (1 pt)\n\n> Answer here \\### 2. Declare null hypothesis $\\mathcal{H}_0$\n\n### $\\mathcal{H}_0$ is that $??$ does not explain $??$ (1 pt)\n\n### 3. Calculate test-statistic, exact p-value and plot (2 pt)\n\n\n\n::: {.cell}\n\n:::\n\n\n\n> My interpretation of the result\n", + "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/problem-set-keys/ps-key-12.qmd b/problem-set-keys/ps-key-12.qmd index 71ebba50..1185ace3 100644 --- a/problem-set-keys/ps-key-12.qmd +++ b/problem-set-keys/ps-key-12.qmd @@ -29,7 +29,7 @@ response variable on the y-axis explanatory variable on the x-axis -```{r} +```{r, eval = FALSE} ggviolin( data = b, y = "tot_cholesterol", @@ -41,7 +41,7 @@ ggviolin( ### Get n, mean, median, sd (1 pt) -```{r} +```{r, eval = FALSE} b |> group_by(sex) |> get_summary_stats(tot_cholesterol, @@ -52,7 +52,7 @@ b |> ### Is it normally distribute? (1 pt) -```{r} +```{r, eval = FALSE} ggqqplot( data = b, x = "tot_cholesterol", @@ -70,7 +70,7 @@ b |> ### Is it variance similar between groups? (1 pt) -```{r} +```{r, eval = FALSE} b |> levene_test(tot_cholesterol ~ sex) |> gt() @@ -88,7 +88,7 @@ $\mathcal{H}_0$ is that $sex$ does not explain $tot\_cholesterol$ ### 3. Calculate test-statistic, exact p-value and plot (2 pt) -```{r} +```{r, eval = FALSE} b |> t_test(tot_cholesterol ~ sex, var.equal = T, @@ -135,7 +135,7 @@ response variable on the y-axis explanatory variable on the x-axis -```{r} +```{r, eval = FALSE} ggviolin( data = bfam, y = "tot_cholesterol", @@ -147,7 +147,7 @@ ggviolin( ### Get n, mean, median, sd (1 pt) -```{r} +```{r, eval = FALSE} bfam |> group_by(family) |> get_summary_stats(tot_cholesterol, @@ -158,7 +158,7 @@ bfam |> ### Is it normally distribute? (1 pt) -```{r} +```{r, eval = FALSE} ggqqplot( data = bfam, x = "tot_cholesterol", @@ -176,7 +176,7 @@ bfam |> ### Is it variance similar between groups? (1 pt) -```{r} +```{r, eval = FALSE} b |> levene_test(tot_cholesterol ~ family) |> gt() @@ -194,7 +194,7 @@ b |> ### 3. Calculate test-statistic, exact p-value and plot (2 pt) -```{r} +```{r, eval = FALSE} bfam |> anova_test(tot_cholesterol ~ family) |> gt() diff --git a/problem-set-keys/ps-key-14.qmd b/problem-set-keys/ps-key-14.qmd index 865fe39f..06781851 100644 --- a/problem-set-keys/ps-key-14.qmd +++ b/problem-set-keys/ps-key-14.qmd @@ -246,4 +246,4 @@ tidy(fit_cs) |> ## Provide your interpretation of the result --- (2 pts) -The null model that mouse $sex$ does NOT explain $cholesterol$ is not well supported. Therefore, i believe that mouse $sex$ is able to explain \~%18 of the variation in $cholesterol$. \ No newline at end of file +The null model that mouse $sex$ does NOT explain $cholesterol$ is not well supported. Therefore, i believe that mouse $sex$ is able to explain \~%18 of the variation in $cholesterol$. diff --git a/problem-set-keys/ps-key-15.qmd b/problem-set-keys/ps-key-15.qmd index af5e6a2b..94a3c7bc 100644 --- a/problem-set-keys/ps-key-15.qmd +++ b/problem-set-keys/ps-key-15.qmd @@ -90,7 +90,7 @@ In order to do this, you will need to: 2. Generate 1000 random sample of the proper size from all genes and find out how many of them are DUX4 targets. 3. Visualize the distribution of DUX4 targets in these 1000 random (your null distribution) and overlay the number of DUX4 targets you observed in the cluster that was most enriched for DUX4 targets. -```{r} +```{r, eval = FALSE} # read in data cd <- read_tsv(here("data", "dux4_clustering_results.csv.gz")) @@ -139,4 +139,4 @@ p \< 0.001 The null hypothesis that the number of DUX4 targets in c5t can be explained by chance - IS NOT WELL SUPPORTED. -The number of DUX4 targets in c5t CANNOT be explained by chance. \ No newline at end of file +The number of DUX4 targets in c5t CANNOT be explained by chance.