diff --git a/_freeze/slides/slides-20/execute-results/html.json b/_freeze/slides/slides-20/execute-results/html.json index fb7461e8..b391aaf0 100644 --- a/_freeze/slides/slides-20/execute-results/html.json +++ b/_freeze/slides/slides-20/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "4d4cfb389078bae03eeb204885df38c1", + "hash": "5f0340daf8e46ad0b7fa56a9ad60404b", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Factor-centric chromatin analysis\"\nauthor: \"Jay Hesselberth\"\n---\n\n\n\n# Where do transcription factors bind in the genome?\n\nToday we'll look at where two yeast transcription factors bind in the genome using CUT&RUN.\n\nTechniques like CUT&RUN require an affinity reagent (e.g., an antibody) that uniquely recognizes a transcription factor in the cell.\n\nThis antibody is added to permeabilized cells, and the antibody associates with the epitope. A separate reagent, a fusion of Protein A (which binds IgG) and micrococcal nuclease (MNase) then associates with the antibody. Addition of calcium activates MNase, and nearby DNA is digested. These DNA fragments are then isolated and sequenced to identify sites of TF association in the genome.\n\n![Fig 1a, Skene et al.](../img/block-dna/skene-fig-1a.png)\n\n## Data download and pre-processing\n\nCUT&RUN data were downloaded from the [NCBI GEO page](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84474) for Skene et al.\n\nI selected the 16 second time point for *S. cerevisiae* Abf1 and Reb1 (note the paper combined data from the 1-32 second time points).\n\nBED files containing mapped DNA fragments were separated by size and converted to bigWig with:\n\n``` bash\n# separate fragments by size\nawk '($3 - $2 <= 120)' Abf1.bed > CutRun_Abf1_lt120.bed\nawk '($3 - $2 => 150)' Abf1.bed > CutRun_Abf1_gt150.bed\n\n# for each file with the different sizes\nbedtools genomecov -i Abf1.bed -g sacCer3.chrom.sizes -bg > Abf1.bg\nbedGraphToBigWig Abf1.bg sacCer3.chrom.sizes Abf1.bw\n```\n\nThe bigWig files are available here in the `data/` directory.\n\n## Questions\n\n1. How do you ensure your antibody recognizes what you think it recognizes? What are important controls for ensuring it recognizes a specific epitope?\n\n2. What are some good controls for CUT&RUN experiments?\n\n# CUT&RUN analysis\n\n## Set up libraries\n\n\n\n::: {.cell}\n\n:::\n\n\n\n## Examine genome coverage\n\n\n\n::: {.cell}\n\n:::\n\n\n\nNow that we have tracks loaded, we can make a plot.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](slides-20_files/figure-revealjs/plot-tracks-1.png){width=960}\n:::\n:::\n\n\n\n## Questions\n\n1. What features stand out in the above tracks? What is different between Reb1 and Abf1? Between the short and long fragments?\n\n2. Where are the major signals with respect to genes?\n\n## Peak calling\n\nA conceptually simple approach to identification of regions containing \"peaks\" where a transcription factor was bound is available in the MACS software ([paper](), [github]()). There's also a nice [blog post](https://hbctraining.github.io/Intro-to-ChIPseq/lessons/05_peak_calling_macs.html) covering the main ideas.\n\n## Theory\n\nThe Poisson distribution is a discrete probability distribution of the form:\n\n$$ P_\\lambda (X=k) = \\frac{ \\lambda^k }{ k! * e^{-\\lambda} } $$\n\nwhere $\\lambda$ captures both the mean and variance of the distribution.\n\nThe R functions `dpois()`, `ppois()`, and `rpois()` provide access to the density, distribution, and random generation for the Poisson distribution. See `?dpois` for details.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](slides-20_files/figure-revealjs/plot-poisson-1.png){width=960}\n:::\n:::\n\n\n\n## Practice\n\nHere, we model read coverage using the Poisson distribution. Given some genome size $G$ and and a number of reads collected $N$, we can approximate $\\lambda$ from $N/G$.\n\nMACS uses this value (the \"genomewide\" lambda) and also calculates several \"local\" lambda values to account for variation among genomic regions. We'll just use the genomewide lambda, which provides a conservative threshold for peak calling.\n\nUsing the genomewide lambda, we can use the Poisson distribution to address the question: **How surprised should I be to see** $k$ reads at position X?\n\n\n\n::: {.cell}\n\n:::\n\n\n\n## P-values\n\nLet's take a look at a plot of the p-value across a chromosome. What do you notice about this plot, when compared to the coverage of the CUT&RUN coverage above?\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](slides-20_files/figure-revealjs/unnamed-chunk-6-1.png){width=960}\n:::\n:::\n\n\n\n## Peaks\n\nHow many peaks are called in this region?\n\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 3\n chrom start end\n \n1 chrII 100248 100289\n2 chrII 101292 101393\n3 chrII 124916 124949\n4 chrII 136181 136264\n5 chrII 141070 141121\n```\n\n\n:::\n:::\n\n\n\n## Visualize\n\nLet's visualize these peaks in the context of genomic CUT&RUN signal. We need to define an `AnnotationTrack` with the peak intervals, which we can plot against the CUT&RUN coverage we defined above.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](slides-20_files/figure-revealjs/plot-peaks-1.png){width=960}\n:::\n:::\n\n\n\n## Questions\n\n1. How many peaks were called throughout the genome? How wide are the called peaks, on average?\n\n2. How else might we define a significance threshold for identifying peaks?\n\n3. What might the relative heights of the peaks indicate? What types of technical or biological variables might influence peak heights?\n\n\n## References\n\nGADEM: a genetic algorithm guided formation of spaced dyads coupled with an EM algorithm for motif discovery. J Comput Biol 2009 \\[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2756050/)\\] \\[[PubMed](https://pubmed.ncbi.nlm.nih.gov/19193149)\\] \\[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=J+Comput+Biol&title=GADEM:+a+genetic+algorithm+guided+formation+of+spaced+dyads+coupled+with+an+EM+algorithm+for+motif+discovery&author=L.+Li&volume=16&issue=2&publication_year=2009&pages=317-329&pmid=19193149&)\\]\n", + "markdown": "---\ntitle: \"Factor-centric chromatin analysis\"\nauthor: \"Jay Hesselberth\"\n---\n\n\n\n## Where do transcription factors bind in the genome?\n\nToday we'll look at where two yeast transcription factors bind in the genome using CUT&RUN.\n\n## Where do transcription factors bind in the genome?\n\nTechniques like CUT&RUN require an affinity reagent (e.g., an antibody) that uniquely recognizes a transcription factor in the cell.\n\nThis antibody is added to permeabilized cells, and the antibody associates with the epitope. A separate reagent, a fusion of Protein A (which binds IgG) and micrococcal nuclease (MNase) then associates with the antibody. Addition of calcium activates MNase, and nearby DNA is digested. These DNA fragments are then isolated and sequenced to identify sites of TF association in the genome.\n\n## Where do transcription factors bind in the genome?\n\n![Fig 1a, Skene et al.](../img/block-dna/skene-fig-1a.png)\n\n## Data download and pre-processing\n\nCUT&RUN data were downloaded from the [NCBI GEO page](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84474) for Skene et al.\n\nI selected the 16 second time point for *S. cerevisiae* Abf1 and Reb1 (note the paper combined data from the 1-32 second time points).\n\nBED files containing mapped DNA fragments were separated by size and converted to bigWig with:\n\n``` bash\n#| echo: true\n# separate fragments by size\nawk '($3 - $2 <= 120)' Abf1.bed > CutRun_Abf1_lt120.bed\nawk '($3 - $2 => 150)' Abf1.bed > CutRun_Abf1_gt150.bed\n\n# for each file with the different sizes\nbedtools genomecov -i Abf1.bed -g sacCer3.chrom.sizes -bg > Abf1.bg\nbedGraphToBigWig Abf1.bg sacCer3.chrom.sizes Abf1.bw\n```\n\nThe bigWig files are available here in the `data/` directory.\n\n## Questions\n\n1. How do you ensure your antibody recognizes what you think it recognizes? What are important controls for ensuring it recognizes a specific epitope?\n\n2. What are some good controls for CUT&RUN experiments?\n\n# CUT&RUN analysis\n\n## Set up libraries\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(here)\nlibrary(valr)\n\n# genome viz\nlibrary(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene)\nlibrary(Gviz)\nlibrary(rtracklayer)\n\n# motif discovery and viz\nlibrary(BSgenome.Scerevisiae.UCSC.sacCer3)\nlibrary(rGADEM)\nlibrary(seqLogo)\n```\n:::\n\n\n\n## Examine genome coverage\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrack_start <- 90000\ntrack_end <- 150000\n\n# genes track\nsgd_genes_trk <-\n GeneRegionTrack(\n TxDb.Scerevisiae.UCSC.sacCer3.sgdGene,\n chromosome = \"chrII\",\n start = track_start,\n end = track_end,\n background.title = \"white\",\n col.title = \"black\",\n fontsize = 16\n )\n```\n:::\n\n\n\n\n## Examine genome coverage\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# signal tracks\n\ntrack_info <-\n tibble(\n file_name = c(\n \"CutRun_Reb1_lt120.bw\",\n \"CutRun_Abf1_lt120.bw\",\n \"CutRun_Reb1_gt150.bw\",\n \"CutRun_Abf1_gt150.bw\"\n ),\n sample_type = c(\n \"Reb1_Short\", \"Abf1_Short\",\n \"Reb1_Long\", \"Abf1_Long\"\n )\n ) |>\n mutate(\n file_path = here(\"data/block-dna\", file_name),\n big_wig = purrr::map(\n file_path, ~ import.bw(.x, as = \"GRanges\")\n ),\n data_track = purrr::map2(\n big_wig, sample_type,\n ~ DataTrack(\n .x,\n name = .y,\n background.title = \"white\",\n col.title = \"black\",\n col.axis = \"black\",\n fontsize = 16\n )\n )\n ) |>\n dplyr::select(sample_type, big_wig, data_track)\n```\n:::\n\n\n\n## Examine genome coverage\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# x-axis track\nx_axis_trk <- GenomeAxisTrack(\n col = \"black\",\n col.axis = \"black\",\n fontsize = 16\n)\n```\n:::\n\n\n\n## Examine genome coverage\nNow that we have tracks loaded, we can make a plot.\n\n\n\n::: {.cell output-location='column'}\n\n```{.r .cell-code}\nplotTracks(\n c(\n sgd_genes_trk,\n track_info$data_track,\n x_axis_trk\n ),\n from = track_start,\n to = track_end,\n chromosome = \"chrII\",\n transcriptAnnotation = \"gene\",\n shape = \"arrow\",\n type = \"histogram\"\n)\n```\n\n::: {.cell-output-display}\n![](slides-20_files/figure-revealjs/plot-tracks-1.png){width=960}\n:::\n:::\n\n\n\n## Questions\n\n1. What features stand out in the above tracks? What is different between Reb1 and Abf1? Between the short and long fragments?\n\n2. Where are the major signals with respect to genes?\n\n## Peak calling\n\nA conceptually simple approach to identification of regions containing \"peaks\" where a transcription factor was bound is available in the MACS software ([paper](), [github]()). There's also a nice [blog post](https://hbctraining.github.io/Intro-to-ChIPseq/lessons/05_peak_calling_macs.html) covering the main ideas.\n\n## Theory\n\nThe Poisson distribution is a discrete probability distribution of the form:\n\n$$ P_\\lambda (X=k) = \\frac{ \\lambda^k }{ k! * e^{-\\lambda} } $$\n\nwhere $\\lambda$ captures both the mean and variance of the distribution.\n\nThe R functions `dpois()`, `ppois()`, and `rpois()` provide access to the density, distribution, and random generation for the Poisson distribution. See `?dpois` for details.\n\n## Theory\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](slides-20_files/figure-revealjs/plot-poisson-1.png){width=960}\n:::\n:::\n\n\n\n## Practice\n\nHere, we model read coverage using the Poisson distribution. Given some genome size $G$ and and a number of reads collected $N$, we can approximate $\\lambda$ from $N/G$.\n\nMACS uses this value (the \"genomewide\" lambda) and also calculates several \"local\" lambda values to account for variation among genomic regions. We'll just use the genomewide lambda, which provides a conservative threshold for peak calling.\n\nUsing the genomewide lambda, we can use the Poisson distribution to address the question: **How surprised should I be to see** $k$ reads at position X?\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nabf1_tbl <- read_bigwig(here(\"data/block-dna/CutRun_Abf1_lt120.bw\"))\n\n# number of reads in the original Abf1 BED file\ntotal_reads <- 16e6\n\ngenome <- read_genome(here(\"data/block-dna/sacCer3.chrom.sizes\"))\ngenome_size <- sum(genome$size)\n\ngenome_lambda <- total_reads / genome_size\n\npeak_calls <-\n abf1_tbl |>\n # define single-base sites\n mutate(\n midpoint = start + round((end - start) / 2),\n start = midpoint,\n end = start + 1,\n # use the poisson to calculate a p-value with the genome-wide lambda\n pval = dpois(score, genome_lambda),\n # convert p-values to FDR\n fdr = p.adjust(pval, method = \"fdr\")\n )\n```\n:::\n\n\n\n## P-values\n\nLet's take a look at a plot of the p-value across a chromosome. What do you notice about this plot, when compared to the coverage of the CUT&RUN coverage above?\n\n\n\n::: {.cell output-location='column'}\n\n```{.r .cell-code}\nggplot(\n filter(peak_calls, chrom == \"chrII\"),\n aes(start, -log10(pval))\n) +\n geom_line() +\n xlim(track_start, track_end) +\n theme_cowplot()\n```\n\n::: {.cell-output-display}\n![](slides-20_files/figure-revealjs/plot-p-vals-1.png){width=960}\n:::\n:::\n\n\n\n## Peaks\n\nHow many peaks are called in this region?\n\n\n\n::: {.cell output-location='column'}\n\n```{.r .cell-code}\n# most stringent cut-off\npeak_calls_sig <-\n filter(\n peak_calls,\n fdr == 0\n ) |>\n # collapse neighboring, significant sites\n bed_merge(max_dist = 20)\n\nfilter(\n peak_calls_sig,\n chrom == \"chrII\" &\n start >= track_start &\n end <= track_end\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 3\n chrom start end\n \n1 chrII 100248 100289\n2 chrII 101292 101393\n3 chrII 124916 124949\n4 chrII 136181 136264\n5 chrII 141070 141121\n```\n\n\n:::\n:::\n\n\n\n## Visualize\n\nLet's visualize these peaks in the context of genomic CUT&RUN signal. We need to define an `AnnotationTrack` with the peak intervals, which we can plot against the CUT&RUN coverage we defined above.\n\nLet us load the data:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# need a GRanges object to convert to an AnnotationTrack\npeak_calls_gr <-\n GRanges(\n seqnames = peak_calls_sig$chrom,\n ranges = IRanges(peak_calls_sig$start, peak_calls_sig$end)\n )\n\npeak_calls_trk <-\n AnnotationTrack(\n peak_calls_gr,\n name = \"Peak calls\",\n fill = \"red\",\n background.title = \"white\",\n col.title = \"red\",\n fontsize = 16,\n rotation.title = 0\n )\n\nabf1_short_trk <-\n filter(\n track_info,\n sample_type == \"Abf1_Short\"\n ) |>\n pull(data_track)\n```\n:::\n\n\n\n## Visualize\n\nAnd plot:\n\n\n\n::: {.cell output-location='column'}\n\n```{.r .cell-code}\nplotTracks(\n c(\n sgd_genes_trk,\n abf1_short_trk,\n peak_calls_trk,\n x_axis_trk\n ),\n from = track_start,\n to = track_end,\n chromosome = \"chrII\",\n transcriptAnnotation = \"gene\",\n shape = \"arrow\",\n type = \"histogram\"\n)\n```\n\n::: {.cell-output-display}\n![](slides-20_files/figure-revealjs/plot-peaks-2-1.png){width=960}\n:::\n:::\n\n\n\n## Questions\n\n1. How many peaks were called throughout the genome? How wide are the called peaks, on average?\n\n2. How else might we define a significance threshold for identifying peaks?\n\n3. What might the relative heights of the peaks indicate? What types of technical or biological variables might influence peak heights?\n", "supporting": [ "slides-20_files" ], diff --git a/_freeze/slides/slides-20/figure-revealjs/plot-p-vals-1.png b/_freeze/slides/slides-20/figure-revealjs/plot-p-vals-1.png new file mode 100644 index 00000000..4abf2a24 Binary files /dev/null and b/_freeze/slides/slides-20/figure-revealjs/plot-p-vals-1.png differ diff --git a/_freeze/slides/slides-20/figure-revealjs/plot-peaks-2-1.png b/_freeze/slides/slides-20/figure-revealjs/plot-peaks-2-1.png new file mode 100644 index 00000000..9425e967 Binary files /dev/null and b/_freeze/slides/slides-20/figure-revealjs/plot-peaks-2-1.png differ diff --git a/_freeze/slides/slides-21/execute-results/html.json b/_freeze/slides/slides-21/execute-results/html.json index 5904a9e3..261a8145 100644 --- a/_freeze/slides/slides-21/execute-results/html.json +++ b/_freeze/slides/slides-21/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "361730bc1e8a3fdd6debfbabf7a5038a", + "hash": "5db899791ade46261d6475edab4fc4d7", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Factor-centric chromatin analysis\"\nauthor: \"Jay Hesselberth\"\n---\n\n\n\n# Where do transcription factors bind in the genome?\n\nToday we'll look at where two yeast transcription factors bind in the genome using CUT&RUN.\n\nTechniques like CUT&RUN require an affinity reagent (e.g., an antibody) that uniquely recognizes a transcription factor in the cell.\n\nThis antibody is added to permeabilized cells, and the antibody associates with the epitope. A separate reagent, a fusion of Protein A (which binds IgG) and micrococcal nuclease (MNase) then associates with the antibody. Addition of calcium activates MNase, and nearby DNA is digested. These DNA fragments are then isolated and sequenced to identify sites of TF association in the genome.\n\n![Fig 1a, Skene et al.](../img/block-dna/skene-fig-1a.png)\n\n## Data download and pre-processing\n\nCUT&RUN data were downloaded from the [NCBI GEO page](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84474) for Skene et al.\n\nI selected the 16 second time point for *S. cerevisiae* Abf1 and Reb1 (note the paper combined data from the 1-32 second time points).\n\nBED files containing mapped DNA fragments were separated by size and converted to bigWig with:\n\n``` bash\n# separate fragments by size\nawk '($3 - $2 <= 120)' Abf1.bed > CutRun_Abf1_lt120.bed\nawk '($3 - $2 => 150)' Abf1.bed > CutRun_Abf1_gt150.bed\n\n# for each file with the different sizes\nbedtools genomecov -i Abf1.bed -g sacCer3.chrom.sizes -bg > Abf1.bg\nbedGraphToBigWig Abf1.bg sacCer3.chrom.sizes Abf1.bw\n```\n\nThe bigWig files are available here in the `data/` directory.\n\n### Questions\n\n1. How do you ensure your antibody recognizes what you think it recognizes? What are important controls for ensuring it recognizes a specific epitope?\n\n2. What are some good controls for CUT&RUN experiments?\n\n# CUT&RUN analysis\n\n\n\n::: {.cell}\n\n:::\n\n\n\n## Examine genome coverage\n\n\n\n::: {.cell}\n\n:::\n\n\n\nNow that we have tracks loaded, we can make a plot.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](slides-21_files/figure-revealjs/plot-tracks-1.png){width=960}\n:::\n:::\n\n\n\n### Questions\n\n1. What features stand out in the above tracks? What is different between Reb1 and Abf1? Between the short and long fragments?\n\n2. Where are the major signals with respect to genes?\n\n## Peak calling\n\nA conceptually simple approach to identification of regions containing \"peaks\" where a transcription factor was bound is available in the MACS software ([paper](), [github]()). There's also a nice [blog post](https://hbctraining.github.io/Intro-to-ChIPseq/lessons/05_peak_calling_macs.html) covering the main ideas.\n\n### Theory\n\nThe Poisson distribution is a discrete probability distribution of the form:\n\n$$ P_\\lambda (X=k) = \\frac{ \\lambda^k }{ k! * e^{-\\lambda} } $$\n\nwhere $\\lambda$ captures both the mean and variance of the distribution.\n\nThe R functions `dpois()`, `ppois()`, and `rpois()` provide access to the density, distribution, and random generation for the Poisson distribution. See `?dpois` for details.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](slides-21_files/figure-revealjs/plot-poisson-1.png){width=960}\n:::\n:::\n\n\n\n### Practice\n\nHere, we model read coverage using the Poisson distribution. Given some genome size $G$ and and a number of reads collected $N$, we can approximate $\\lambda$ from $N/G$.\n\nMACS uses this value (the \"genomewide\" lambda) and also calculates several \"local\" lambda values to account for variation among genomic regions. We'll just use the genomewide lambda, which provides a conservative threshold for peak calling.\n\nUsing the genomewide lambda, we can use the Poisson distribution to address the question: **How surprised should I be to see** $k$ reads at position X?\n\n\n\n::: {.cell}\n\n:::\n\n\n\nLet's take a look at a plot of the p-value across a chromosome. What do you notice about this plot, when compared to the coverage of the CUT&RUN coverage above?\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](slides-21_files/figure-revealjs/unnamed-chunk-6-1.png){width=960}\n:::\n:::\n\n\n\nHow many peaks are called in this region?\n\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 3\n chrom start end\n \n1 chrII 100248 100289\n2 chrII 101292 101393\n3 chrII 124916 124949\n4 chrII 136181 136264\n5 chrII 141070 141121\n```\n\n\n:::\n:::\n\n\n\nLet's visualize these peaks in the context of genomic CUT&RUN signal. We need to define an `AnnotationTrack` with the peak intervals, which we can plot against the CUT&RUN coverage we defined above.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](slides-21_files/figure-revealjs/plot-peaks-1.png){width=960}\n:::\n:::\n\n\n\n### Questions\n\n1. How many peaks were called throughout the genome? How wide are the called peaks, on average?\n\n2. How else might we define a significance threshold for identifying peaks?\n\n3. What might the relative heights of the peaks indicate? What types of technical or biological variables might influence peak heights?\n\n## Motif discovery\n\n### Theory\n\nThere are two major approaches to defining sequence motifs enriched in a sample: enumerative and probabilistic approaches.\n\nHere we'll apply a probabilistic approach (GADEM) to discover motifs in a collection of DNA sequences. During the RNA block, you'll learn about k-mer analysis, which is a form of enumerative approach.\n\nIn each case, the goal is to define a set of sequence motifs that are encriched in a set of provided sequences (i.e., peaks from CUT&RUN data) relative to a genomic background.\n\nMotifs are expressed in a [Position Weight Matrix](https://en.wikipedia.org/wiki/Position_weight_matrix), which captures the propensities for a position to be a particular nucleotide in a sequence motif.\n\nThese PWMs can be represented as sequence logos, visually represent the amount of information provided by the motif, typically using \"information content\", expressed in bits.\n\n![LexA sequence motif](../img/block-dna/lexa-motif.png)\n\n### Practice\n\nWe'll use the [rGADEM](https://bioconductor.org/packages/release/bioc/html/rGADEM.html) package from Bioconductor to derive sequence motifs from the peaks we called above. This is a straightforward process:\n\n1. Collect the DNA sequences within the peak windows using the BSgenome for *S. cerevisiae*\n2. Provide those sequences and the genomic background to `rGADEM::GADEM()`, which runs uses an Expectation-Maximization (EM) approach to identify and refine motifs.\n3. Examine the discovered motifs, and plot as a logo using `seqLogo::seqLogo()`.\n\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\n*** Start C Programm ***\n==============================================================================================\ninput sequence file: \nnumber of sequences and average length:\t\t\t\t452 114.8\nUse pgf method to approximate llr null distribution\nparameters estimated from sequences in: \n\nnumber of GA generations & population size:\t\t\t5 100\n\nPWM score p-value cutoff for binding site declaration:\t\t2.000000e-04\nln(E-value) cutoff for motif declaration:\t\t\t0.000000\n\nnumber of EM steps:\t\t\t\t\t\t40\nminimal no. sites considered for a motif:\t\t\t22\n\n[a,c,g,t] frequencies in input data:\t\t\t\t0.281618 0.218382 0.218382 0.281618\n==============================================================================================\n*** Running an unseeded analysis ***\nGADEM cycle 1: enumerate and count k-mers... top 3 4, 5-mers: 10 32 58\nDone.\nInitializing GA... Done.\nGADEM cycle[ 1] generation[ 1] number of unique motif: 2\n spacedDyad: tttnnnnnnngacg motifConsensus: nrTCAynwnnnACG 0.80 fitness: -437.99\n spacedDyad: tttcnnnnnnnnttcg motifConsensus: wyTTTTTTyTTTTTyk 0.90 fitness: -134.01\n\nGADEM cycle[ 1] generation[ 2] number of unique motif: 3\n spacedDyad: tcannnnnnnnnacga motifConsensus: wwrTCAyywnnnACGr 1.00 fitness: -486.12\n spacedDyad: tgannnnnnnnnncgtcg motifConsensus: kkwyTTTTTTywTwTTyk 0.60 fitness: -142.22\n spacedDyad: cacgnnnnttcgt motifConsensus: rAsGkTAsTTCsT 0.20 fitness: -18.84\n\nGADEM cycle[ 1] generation[ 3] number of unique motif: 5\n spacedDyad: tcannnnnnnnnacga motifConsensus: wwrTCAyywnnnACGr 1.00 fitness: -486.12\n spacedDyad: tgannnnnnnnnncgtcg motifConsensus: kkwyTTTTTTywTwTTyk 0.60 fitness: -142.22\n spacedDyad: acggnnnnggtg motifConsensus: wCkTCswwGrTG 0.20 fitness: -42.34\n spacedDyad: tcannnnnncacg motifConsensus: AmmTTsmmCsAmr 0.20 fitness: -32.00\n spacedDyad: cacgnnnnttcgt motifConsensus: rAsGkTAsTTCsT 0.20 fitness: -18.84\n\nGADEM cycle[ 1] generation[ 4] number of unique motif: 5\n spacedDyad: tcannnnnnnnnacga motifConsensus: wwrTCAyywnnnACGr 1.00 fitness: -486.12\n spacedDyad: tgannnnnnnnnncgtcg motifConsensus: kkwyTTTTTTywTwTTyk 0.60 fitness: -142.22\n spacedDyad: acggnnnnggtg motifConsensus: wCkTCswwGrTG 0.20 fitness: -42.34\n spacedDyad: tcannnnnncacg motifConsensus: AmmTTsmmCsAmr 0.20 fitness: -32.00\n spacedDyad: cacgnnnnttcgt motifConsensus: rAsGkTAsTTCsT 0.20 fitness: -18.84\n\nGADEM cycle[ 1] generation[ 5] number of unique motif: 6\n spacedDyad: tcannnnnnnnnacga motifConsensus: wwrTCAyywnnnACGr 1.00 fitness: -486.12\n spacedDyad: tgannnnnnnnnncgtcg motifConsensus: kkwyTTTTTTywTwTTyk 0.60 fitness: -142.22\n spacedDyad: acggnnnnggtg motifConsensus: wCkTCswwGrTG 0.20 fitness: -42.34\n spacedDyad: tcannnnnncacg motifConsensus: AmmTTsmmCsAmr 0.20 fitness: -32.00\n spacedDyad: cacgnnnnttcgt motifConsensus: rAsGkTAsTTCsT 0.20 fitness: -18.84\n spacedDyad: tcannnnnnnacga motifConsensus: CyATTywCsyACrA 0.20 fitness: -7.23\n\n*** Running an unseeded analysis ***\nGADEM cycle 2: enumerate and count k-mers... top 3 4, 5-mers: 10 22 22\nDone.\nInitializing GA... Done.\nGADEM cycle[ 2] generation[ 1] number of unique motif: 2\n spacedDyad: ggcnnnnnnnnnaaat motifConsensus: sGCCkwArsrrsAwAT 0.10 fitness: -7.73\n spacedDyad: ccannnnnnaaat motifConsensus: CAAAAyAGGmAAw 0.20 fitness: -6.45\n\nGADEM cycle[ 2] generation[ 2] number of unique motif: 2\n spacedDyad: ggcnnnnnnnnnaaat motifConsensus: sGCCkwArsrrsAwAT 0.10 fitness: -7.73\n spacedDyad: ccannnnnnaaat motifConsensus: CAAAAyAGGmAAw 0.20 fitness: -6.45\n\nGADEM cycle[ 2] generation[ 3] number of unique motif: 2\n spacedDyad: ggcnnnnnnnnnaaat motifConsensus: sGCCkwArsrrsAwAT 0.10 fitness: -7.73\n spacedDyad: ccannnnnnaaat motifConsensus: CAAAAyAGGmAAw 0.20 fitness: -6.45\n\nGADEM cycle[ 2] generation[ 4] number of unique motif: 2\n spacedDyad: ggcnnnnnnnnnaaat motifConsensus: sGCCkwArsrrsAwAT 0.10 fitness: -7.73\n spacedDyad: ccannnnnnaaat motifConsensus: CAAAAyAGGmAAw 0.20 fitness: -6.45\n\nGADEM cycle[ 2] generation[ 5] number of unique motif: 2\n spacedDyad: gccnnnnnnnnatttt motifConsensus: GCAwTwTTTCCTkTTT 0.10 fitness: -11.57\n spacedDyad: ggcnnnnnnnnnaaat motifConsensus: sGCCkwArsrrsAwAT 0.10 fitness: -7.73\n\n*** Running an unseeded analysis ***\nGADEM cycle 3: enumerate and count k-mers... top 3 4, 5-mers: 8 22 18\nDone.\nInitializing GA... Done.\nGADEM cycle[ 3] generation[ 1] number of unique motif: 1\n spacedDyad: gaancca motifConsensus: AAAAAGA 1.00 fitness: 27.80\n\nGADEM cycle[ 3] generation[ 2] number of unique motif: 1\n spacedDyad: cggcnnnnncca motifConsensus: CsvCAnsAAsCA 0.10 fitness: 24.29\n\nGADEM cycle[ 3] generation[ 3] number of unique motif: 1\n spacedDyad: cggcnnnnncca motifConsensus: CsvCAnsAAsCA 0.10 fitness: 24.29\n\nGADEM cycle[ 3] generation[ 4] number of unique motif: 1\n spacedDyad: cggcnnnnncca motifConsensus: CsvCAnsAAsCA 0.10 fitness: 24.29\n\nGADEM cycle[ 3] generation[ 5] number of unique motif: 1\n spacedDyad: cggcnnnnncca motifConsensus: CsvCAnsAAsCA 0.10 fitness: 24.29\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"nrTCAyywnnnACGr\" \n[2] \"wTTTTTTywTwTTyk\" \n[3] \"CArACAGAmvmmATTCmmCCAmrrAyGTwTwAC\" \n[4] \"rwmsnTCCTTGACGCTAGwTCrTTsGrCTAAAA\" \n[5] \"AAATArGCAATwTTTCCTkTTTn\" \n[6] \"wGnndGrCAAGGCCGTAGGrrCAwATAkyAkywrGr\"\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 445 137 56 30 39 28\n```\n\n\n:::\n:::\n\n\n\nNow let's look at the sequence logo for the top hit.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](slides-21_files/figure-revealjs/plot-logo-1.png){width=960}\n:::\n:::\n\n\n\n### Questions\n\n1. Does this motif make sense, based on what you know about the requirements and specificity of DNA binding by transcription factors?\n\n2. How might you confirm that a specific sequence (that conforms to a motif) is bound directly by a transcription factor?\n\n## References\n\nGADEM: a genetic algorithm guided formation of spaced dyads coupled with an EM algorithm for motif discovery. J Comput Biol 2009 \\[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2756050/)\\] \\[[PubMed](https://pubmed.ncbi.nlm.nih.gov/19193149)\\] \\[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=J+Comput+Biol&title=GADEM:+a+genetic+algorithm+guided+formation+of+spaced+dyads+coupled+with+an+EM+algorithm+for+motif+discovery&author=L.+Li&volume=16&issue=2&publication_year=2009&pages=317-329&pmid=19193149&)\\]\n", + "markdown": "---\ntitle: \"Factor-centric chromatin analysis\"\nauthor: \"Jay Hesselberth\"\n---\n\n\n\n## Where do transcription factors bind in the genome?\n\nToday we'll look at where two yeast transcription factors bind in the genome using CUT&RUN.\n\n## Where do transcription factors bind in the genome?\n\nTechniques like CUT&RUN require an affinity reagent (e.g., an antibody) that uniquely recognizes a transcription factor in the cell.\n\nThis antibody is added to permeabilized cells, and the antibody associates with the epitope. A separate reagent, a fusion of Protein A (which binds IgG) and micrococcal nuclease (MNase) then associates with the antibody. Addition of calcium activates MNase, and nearby DNA is digested. These DNA fragments are then isolated and sequenced to identify sites of TF association in the genome.\n\n## Where do transcription factors bind in the genome?\n\n![Fig 1a, Skene et al.](../img/block-dna/skene-fig-1a.png)\n\n## Data download and pre-processing\n\nCUT&RUN data were downloaded from the [NCBI GEO page](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84474) for Skene et al.\n\nI selected the 16 second time point for *S. cerevisiae* Abf1 and Reb1 (note the paper combined data from the 1-32 second time points).\n\nBED files containing mapped DNA fragments were separated by size and converted to bigWig with:\n\n``` bash\n# separate fragments by size\nawk '($3 - $2 <= 120)' Abf1.bed > CutRun_Abf1_lt120.bed\nawk '($3 - $2 => 150)' Abf1.bed > CutRun_Abf1_gt150.bed\n\n# for each file with the different sizes\nbedtools genomecov -i Abf1.bed -g sacCer3.chrom.sizes -bg > Abf1.bg\nbedGraphToBigWig Abf1.bg sacCer3.chrom.sizes Abf1.bw\n```\n\nThe bigWig files are available here in the `data/` directory.\n\n# CUT&RUN analysis\n\n\n\n::: {.cell}\n\n:::\n\n\n\n## Examine genome coverage\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![](slides-21_files/figure-revealjs/plot-tracks-1.png){width=960}\n:::\n:::\n\n\n\n## How do proteins recognize specific locations in the genome to bind?\n\n## Motif discovery\n\n## Theory\n\nThere are two major approaches to defining sequence motifs enriched in a sample: enumerative and probabilistic approaches.\n\n## Theory\n\nHere we'll apply a probabilistic approach (GADEM) to discover motifs in a collection of DNA sequences. During the RNA block, you'll learn about k-mer analysis, which is a form of enumerative approach.\n\nIn each case, the goal is to define a set of sequence motifs that are encriched in a set of provided sequences (i.e., peaks from CUT&RUN data) relative to a genomic background.\n\n## Theory\n\nMotifs are expressed in a [Position Weight Matrix](https://en.wikipedia.org/wiki/Position_weight_matrix), which captures the propensities for a position to be a particular nucleotide in a sequence motif.\n\nThese PWMs can be represented as sequence logos, visually represent the amount of information provided by the motif, typically using \"information content\", expressed in bits.\n\n## Theory\n![LexA sequence motif](../img/block-dna/lexa-motif.png)\n\n## Practice\n\nWe'll use the [rGADEM](https://bioconductor.org/packages/release/bioc/html/rGADEM.html) package from Bioconductor to derive sequence motifs from the peaks we called above. This is a straightforward process:\n\n1. Collect the DNA sequences within the peak windows using the BSgenome for *S. cerevisiae*\n2. Provide those sequences and the genomic background to `rGADEM::GADEM()`, which runs uses an Expectation-Maximization (EM) approach to identify and refine motifs.\n3. Examine the discovered motifs, and plot as a logo using `seqLogo::seqLogo()`.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npeak_seqs <- BSgenome::getSeq(\n # provided by BSgenome.Scerevisiae.UCSC.sacCer3\n Scerevisiae,\n peak_calls_gr\n)\n\n# takes ~2 minutes to run\ngadem <- rGADEM::GADEM(\n peak_seqs,\n genome = Scerevisiae,\n verbose = 1\n)\n\n# look at the consensus motifs\nconsensus(gadem)\n\n# how many consensus motifs are there?\nnOccurrences(gadem)\n```\n:::\n\n\n\nNow let's look at the sequence logo for the top hit.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\npwm <- gadem@motifList[[1]]@pwm\n\nseqLogo::seqLogo(pwm)\n```\n:::\n\n\n\n## Questions\n\n1. Does this motif make sense, based on what you know about the requirements and specificity of DNA binding by transcription factors?\n\n2. How might you confirm that a specific sequence (that conforms to a motif) is bound directly by a transcription factor?\n\n## References\n\nGADEM: a genetic algorithm guided formation of spaced dyads coupled with an EM algorithm for motif discovery. J Comput Biol 2009 \\[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2756050/)\\] \\[[PubMed](https://pubmed.ncbi.nlm.nih.gov/19193149)\\] \\[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=J+Comput+Biol&title=GADEM:+a+genetic+algorithm+guided+formation+of+spaced+dyads+coupled+with+an+EM+algorithm+for+motif+discovery&author=L.+Li&volume=16&issue=2&publication_year=2009&pages=317-329&pmid=19193149&)\\]\n", "supporting": [ "slides-21_files" ], diff --git a/slides/slides-21.qmd b/slides/slides-21.qmd index 1ab5f56d..9841626f 100644 --- a/slides/slides-21.qmd +++ b/slides/slides-21.qmd @@ -166,7 +166,7 @@ We'll use the [rGADEM](https://bioconductor.org/packages/release/bioc/html/rGADE 2. Provide those sequences and the genomic background to `rGADEM::GADEM()`, which runs uses an Expectation-Maximization (EM) approach to identify and refine motifs. 3. Examine the discovered motifs, and plot as a logo using `seqLogo::seqLogo()`. -```{r} +```{r, eval = FALSE} #| label: motif-discovery #| echo: true @@ -192,7 +192,7 @@ nOccurrences(gadem) Now let's look at the sequence logo for the top hit. -```{r} +```{r, eval = FALSE} #| label: plot-logo #| echo: true