fix pskeys 12 through 15

rnabioco · Oct 11, 2024 · 477f39c · 477f39c
1 parent 1aa2d37
commit 477f39c
Show file tree

Hide file tree

Showing 14 changed files with 30 additions and 29 deletions.
diff --git a/_freeze/problem-set-keys/ps-key-12/execute-results/html.json b/_freeze/problem-set-keys/ps-key-12/execute-results/html.json
diff --git a/_freeze/problem-set-keys/ps-key-13/execute-results/html.json b/_freeze/problem-set-keys/ps-key-13/execute-results/html.json
diff --git a/_freeze/problem-set-keys/ps-key-13/figure-html/augment and viz-1.png b/_freeze/problem-set-keys/ps-key-13/figure-html/augment and viz-1.png
diff --git a/_freeze/problem-set-keys/ps-key-13/figure-html/unnamed-chunk-4-1.png b/_freeze/problem-set-keys/ps-key-13/figure-html/unnamed-chunk-4-1.png
diff --git a/_freeze/problem-set-keys/ps-key-13/figure-html/unnamed-chunk-4-2.png b/_freeze/problem-set-keys/ps-key-13/figure-html/unnamed-chunk-4-2.png
diff --git a/_freeze/problem-set-keys/ps-key-14/execute-results/html.json b/_freeze/problem-set-keys/ps-key-14/execute-results/html.json
diff --git a/_freeze/problem-set-keys/ps-key-14/figure-html/unnamed-chunk-7-1.png b/_freeze/problem-set-keys/ps-key-14/figure-html/unnamed-chunk-7-1.png
diff --git a/_freeze/problem-set-keys/ps-key-14/figure-html/unnamed-chunk-8-1.png b/_freeze/problem-set-keys/ps-key-14/figure-html/unnamed-chunk-8-1.png
diff --git a/_freeze/problem-set-keys/ps-key-14/figure-html/unnamed-chunk-9-1.png b/_freeze/problem-set-keys/ps-key-14/figure-html/unnamed-chunk-9-1.png
diff --git a/_freeze/problem-set-keys/ps-key-15/execute-results/html.json b/_freeze/problem-set-keys/ps-key-15/execute-results/html.json
@@ -1,7 +1,8 @@
 {
-  "hash": "494ed1f55e551327bb51aacf87423e3d",
+  "hash": "d8844a4f3b71052cd5cb4e4df126de3e",
   "result": {
-    "markdown": "---\ntitle: \"Problem Set Stats Bootcamp - class 15\"\nsubtitle: \"Dealing with big data\"\nauthor: \"Neelanjan Mukherjee\"\neditor: visual\n---\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\n-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --\nv dplyr     1.1.3     v readr     2.1.4\nv forcats   1.0.0     v stringr   1.5.0\nv ggplot2   3.4.3     v tibble    3.2.1\nv lubridate 1.9.2     v tidyr     1.3.0\nv purrr     1.0.2     \n-- Conflicts ------------------------------------------ tidyverse_conflicts() --\nx dplyr::filter() masks stats::filter()\nx dplyr::lag()    masks stats::lag()\ni Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors\n\nAttaching package: 'rstatix'\n\n\nThe following object is masked from 'package:stats':\n\n    filter\n\n\n\nAttaching package: 'janitor'\n\n\nThe following object is masked from 'package:rstatix':\n\n    make_clean_names\n\n\nThe following objects are masked from 'package:stats':\n\n    chisq.test, fisher.test\n\n\nhere() starts at /Users/jayhesselberth/devel/rnabioco/molb-7950\n\n\nAttaching package: 'cowplot'\n\n\nThe following object is masked from 'package:lubridate':\n\n    stamp\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nang <- read_csv(here(\"data/bootcamp/edger.csv.gz\")) |>\n  clean_names() |>\n  filter(fdr < 0.05) |>\n  select(log_fc_time0_25:log_fc_time8) |>\n  as.matrix()\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 17942 Columns: 17\n-- Column specification --------------------------------------------------------\nDelimiter: \",\"\nchr  (1): gene\ndbl (16): FDR, maxabsfc, logFC.Time0.25, logFC.Time0.5, logFC.Time0.75, logF...\n\ni Use `spec()` to retrieve the full column specification for this data.\ni Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '<U+00C4>' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '<U+00D6>' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '<U+00DC>' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '<U+00E4>' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '<U+00F6>' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '<U+00FC>' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '<U+00DF>' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '<U+00C6>' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '<U+00E6>' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '<U+00D8>' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '<U+00F8>' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '<U+00C5>' to native encoding\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in FUN(X[[i]], ...): unable to translate '<U+00E5>' to native encoding\n```\n\n\n:::\n\n```{.r .cell-code}\ncolnames(ang) <- gsub(pattern = \"log_fc_\", \"\", colnames(ang))\n```\n:::\n\n\n## Problem \\# 1\n\nMake sure to run the chunk above. The data represent the avg fold change in gene expression for an angiotensin II time course (.25, .5, .75, 1, 1.5, 2, 3, 4, 6, 8, 24 hrs) compared to unstimulated.\n\n## correlation\n\nCreate hierarchical clustering heatmap of pairwise pearson correlation coefficients. And provide 1-2 observations.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# scale\nang <- t(scale(t(ang)))\n\n\n# pairwise pearson correlation\np_ang <- cor(ang, method = \"pearson\")\n\n# make heatmap\npheatmap(\n  mat = p_ang,\n  clustering_distance_rows = \"euclidean\",\n  clustering_distance_cols = \"euclidean\",\n  clustering_method = \"ward.D2\"\n)\n```\n\n::: {.cell-output-display}\n![](ps-key-15_files/figure-html/unnamed-chunk-3-1.png){width=672}\n:::\n:::\n\n\nTimepoints close to each other tend to correlate strongly with each other. The 4,6, and 8 hr time points are the most different from all others.\n\n## PCA\n\nPerform PCA and visualize PC1 vs PC2.Provide 1-2 observations.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# pca\npc_ang <- prcomp(ang)\n\n# gather info from summary\npca_data_info <- summary(pc_ang)$importance |> as.data.frame()\n\npca_data_info <- round(x = pca_data_info, digits = 3)\n\n# we make a dataframe out of the rotations and will use this to plot\npca_plot_data <- pc_ang$rotation |>\n  as.data.frame() |>\n  rownames_to_column(var = \"ID\")\n\n# plot\nggplot(data = pca_plot_data, mapping = aes(x = PC1, y = PC2, color = ID)) +\n  geom_point() +\n  xlab(paste(\"PC1, %\", 100 * pca_data_info[\"Proportion of Variance\", \"PC1\"])) +\n  ylab(paste(\"PC2, %\", 100 * pca_data_info[\"Proportion of Variance\", \"PC2\"])) +\n  ggtitle(\"PCA for angII timecourse\") +\n  theme_cowplot()\n```\n\n::: {.cell-output-display}\n![](ps-key-15_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n:::\n\n\nThere is a a circular patter that seems to correspond to the timepoints. Interestingly, 24 appears to group back with 0.25 indicating the system is resetting w/respect to RNA levels.\n\n## Calculate the empirical p-value of the cluster most enriched for DUX4 targets by sampling {.smaller}\n\nIn order to do this, you will need to:\n\n1.  Identify which cluster is the most enriched for DUX4 targets.\n    -   Determine how many genes are in the cluster. You will need to know this to figure out how many genes to sample from the whole data set.\n    -   Determine how many of the genes in the cluster are DUX4 targets. This is the metric that you are interested in comparing between the null distribution and your observation.\n2.  Generate 1000 random sample of the proper size from all genes and find out how many of them are DUX4 targets.\n3.  Visualize the distribution of DUX4 targets in these 1000 random (your null distribution) and overlay the number of DUX4 targets you observed in the cluster that was most enriched for DUX4 targets.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# read in data\ncd <- read_tsv(here(\"data\", \"dux4_clustering_results.csv.gz\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 10566 Columns: 15\n-- Column specification --------------------------------------------------------\nDelimiter: \"\\t\"\nchr  (2): gene_symbol, target\ndbl (13): hour00_rep1, hour00_rep2, hour00_rep3, hour04_rep1, hour04_rep2, h...\n\ni Use `spec()` to retrieve the full column specification for this data.\ni Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n\n```{.r .cell-code}\n# how many genes are in cluster 5?\nc5 <- cd |>\n  filter(Cluster == \"5\") |>\n  nrow()\n\n# how many dux targets are in cluster 5?\nc5t <- cd |>\n  filter(Cluster == \"5\" & target == \"target\") |>\n  nrow()\n\nsampled_targets <- vector()\n\nfor (i in 1:1000) {\n  sampled_targets[i] <- sample_n(tbl = cd, size = c5) |>\n    group_by(target) |>\n    tally() |>\n    # need this so all groups have at least 1.\n    mutate(n = n + 1) |>\n    filter(target == \"target\") |>\n    pull(n)\n}\n\nn <- tibble(\n  targets = sampled_targets,\n  type = \"null\"\n)\n\nn |>\n  arrange(-sampled_targets) |>\n  top_n(10)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nSelecting by type\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1,000 x 2\n   targets type \n     <dbl> <chr>\n 1      18 null \n 2      18 null \n 3      17 null \n 4      17 null \n 5      17 null \n 6      17 null \n 7      16 null \n 8      16 null \n 9      16 null \n10      16 null \n# i 990 more rows\n```\n\n\n:::\n\n```{.r .cell-code}\nggplot(data = n, aes(x = targets)) +\n  geom_density() +\n  geom_vline(xintercept = c5t, color = \"red\") +\n  theme_cowplot()\n```\n\n::: {.cell-output-display}\n![](ps-key-15_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n\n### What is the p-value?\n\np \\< 0.001\n\n### What is your interpretation?\n\nThe null hypothesis that the number of DUX4 targets in c5t can be explained by chance - IS NOT WELL SUPPORTED.\n\nThe number of DUX4 targets in c5t CANNOT be explained by chance.\n",
+    "engine": "knitr",
+    "markdown": "---\ntitle: \"Problem Set Stats Bootcamp - class 15\"\nsubtitle: \"Dealing with big data\"\nauthor: \"Neelanjan Mukherjee\"\neditor: visual\n---\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr     1.1.4     ✔ readr     2.1.5\n✔ forcats   1.0.0     ✔ stringr   1.5.1\n✔ ggplot2   3.5.1     ✔ tibble    3.2.1\n✔ lubridate 1.9.3     ✔ tidyr     1.3.1\n✔ purrr     1.0.2     \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag()    masks stats::lag()\nℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors\n\nAttaching package: 'rstatix'\n\n\nThe following object is masked from 'package:stats':\n\n    filter\n\n\n\nAttaching package: 'janitor'\n\n\nThe following object is masked from 'package:rstatix':\n\n    make_clean_names\n\n\nThe following objects are masked from 'package:stats':\n\n    chisq.test, fisher.test\n\n\nhere() starts at /Users/mtaliaferro/Documents/GitHub/molb-7950\n\n\nAttaching package: 'cowplot'\n\n\nThe following object is masked from 'package:lubridate':\n\n    stamp\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nang <- read_csv(here(\"data/bootcamp/edger.csv.gz\")) |>\n  clean_names() |>\n  filter(fdr < 0.05) |>\n  select(log_fc_time0_25:log_fc_time8) |>\n  as.matrix()\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 17942 Columns: 17\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr  (1): gene\ndbl (16): FDR, maxabsfc, logFC.Time0.25, logFC.Time0.5, logFC.Time0.75, logF...\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n\n```{.r .cell-code}\ncolnames(ang) <- gsub(pattern = \"log_fc_\", \"\", colnames(ang))\n```\n:::\n\n\n\n## Problem \\# 1\n\nMake sure to run the chunk above. The data represent the avg fold change in gene expression for an angiotensin II time course (.25, .5, .75, 1, 1.5, 2, 3, 4, 6, 8, 24 hrs) compared to unstimulated.\n\n## correlation --- (7 pts)\n\nCreate hierarchical clustering heatmap of pairwise pearson correlation coefficients. And provide 1-2 observations.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# scale\nang <- t(scale(t(ang)))\n\n\n# pairwise pearson correlation\np_ang <- cor(ang, method = \"pearson\")\n\n# make heatmap\npheatmap(\n  mat = p_ang,\n  clustering_distance_rows = \"euclidean\",\n  clustering_distance_cols = \"euclidean\",\n  clustering_method = \"ward.D2\"\n)\n```\n\n::: {.cell-output-display}\n![](ps-key-15_files/figure-html/unnamed-chunk-3-1.png){width=672}\n:::\n:::\n\n\n\nTimepoints close to each other tend to correlate strongly with each other. The 4,6, and 8 hr time points are the most different from all others.\n\n## PCA --- (7 pts)\n\nPerform PCA and visualize PC1 vs PC2.Provide 1-2 observations.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# pca\npc_ang <- prcomp(ang)\n\n# gather info from summary\npca_data_info <- summary(pc_ang)$importance |> as.data.frame()\n\npca_data_info <- round(x = pca_data_info, digits = 3)\n\n# we make a dataframe out of the rotations and will use this to plot\npca_plot_data <- pc_ang$rotation |>\n  as.data.frame() |>\n  rownames_to_column(var = \"ID\")\n\n# plot\nggplot(data = pca_plot_data, mapping = aes(x = PC1, y = PC2, color = ID)) +\n  geom_point() +\n  xlab(paste(\"PC1, %\", 100 * pca_data_info[\"Proportion of Variance\", \"PC1\"])) +\n  ylab(paste(\"PC2, %\", 100 * pca_data_info[\"Proportion of Variance\", \"PC2\"])) +\n  ggtitle(\"PCA for angII timecourse\") +\n  theme_cowplot()\n```\n\n::: {.cell-output-display}\n![](ps-key-15_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n:::\n\n\n\nThere is a a circular patter that seems to correspond to the timepoints. Interestingly, 24 appears to group back with 0.25 indicating the system is resetting w/respect to RNA levels.\n\n## Calculate the empirical p-value of the cluster most enriched for DUX4 targets by sampling --- (6 pts) {.smaller}\n\nIn order to do this, you will need to:\n\n1.  Identify which cluster is the most enriched for DUX4 targets.\n    -   Determine how many genes are in the cluster. You will need to know this to figure out how many genes to sample from the whole data set.\n    -   Determine how many of the genes in the cluster are DUX4 targets. This is the metric that you are interested in comparing between the null distribution and your observation.\n2.  Generate 1000 random sample of the proper size from all genes and find out how many of them are DUX4 targets.\n3.  Visualize the distribution of DUX4 targets in these 1000 random (your null distribution) and overlay the number of DUX4 targets you observed in the cluster that was most enriched for DUX4 targets.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# read in data\ncd <- read_tsv(here(\"data\", \"dux4_clustering_results.csv.gz\"))\n\n# how many genes are in cluster 5?\nc5 <- cd |>\n  filter(Cluster == \"5\") |>\n  nrow()\n\n# how many dux targets are in cluster 5?\nc5t <- cd |>\n  filter(Cluster == \"5\" & target == \"target\") |>\n  nrow()\n\nsampled_targets <- vector()\n\nfor (i in 1:1000) {\n  sampled_targets[i] <- sample_n(tbl = cd, size = c5) |>\n    group_by(target) |>\n    tally() |>\n    # need this so all groups have at least 1.\n    mutate(n = n + 1) |>\n    filter(target == \"target\") |>\n    pull(n)\n}\n\nn <- tibble(\n  targets = sampled_targets,\n  type = \"null\"\n)\n\nn |>\n  arrange(-sampled_targets) |>\n  top_n(10)\n\nggplot(data = n, aes(x = targets)) +\n  geom_density() +\n  geom_vline(xintercept = c5t, color = \"red\") +\n  theme_cowplot()\n```\n:::\n\n\n\n### What is the p-value?\n\np \\< 0.001\n\n### What is your interpretation?\n\nThe null hypothesis that the number of DUX4 targets in c5t can be explained by chance - IS NOT WELL SUPPORTED.\n\nThe number of DUX4 targets in c5t CANNOT be explained by chance.\n",
     "supporting": [
       "ps-key-15_files"
     ],