forked from boulderrinnlab/CLASS_2023
-
Notifications
You must be signed in to change notification settings - Fork 0
/
final_class_project.Rmd
115 lines (65 loc) · 2.75 KB
/
final_class_project.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
---
title: "Final_class_project"
author: "JR"
date: "4/21/2023"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Your mission should you choose to accept it
There is an enormous amount of data in ENCODE that
needs to be analyzed. This includes 480 unique DBP
ChIPseq datasets. To facilitate your mission we provide you
1,066 peak files that have been run through the NF_CORE chipseq
pipeline.
You will find all the .broadPeak peak files here:
/scratch/Shares/rinnclass/CLASS_2023/data/data/peaks
Use what you have learned in class to uncover
new results from this daunting data set.
# Good luck the genome depends on you!
# first you might want to think about your organization
# 1) create a directory in your class folder for these analyses
# -> folder name
-> util/
-> scripts.R
-> analysis/
-> 01_consensus_peaks
-> .RMD
-> results/
-> figures
-> 02_clustering etc....
# Consensus peaks
# Num_peaks_df
# Peak_occurence_df
# lncRNA and or mRNA promoters / gene annotations
```{r}
# How does peak number and genome coverage compare
# What is the distribution of promoter overlaps versus gene-bodies (hint hist)
# Make a list of genes that are "super binders"
# Is there a type of gene ontology associated with them versus the others?
# Is there a difference in mRNA and lncRNA promoter overlaps?
# Do lncRNAs also have super-binding promoters?
# How many of these proteins are TFs? What is the most represented type of DBD?
```
# Let's see how similar our genes are to each other
# Clustering
```{r}
# What genes make sense to cluster togehter create an image (hint Pol II makes RNA)
# Find a cluster of genes your interested in and can figure out what those genes do -- are there unknown genes i there too? If so maybe one could hypothesize they have a similar function to the known genes (Hint ZNFs)
# if we cluster by lncRNA and mRNA seperately what are some similarities and differences?
```
# Metaplots
```{r}
# Let's look at the metaplot for all DBPs on lncRNA and mRNA promoters seperately (hint facet wrap).
# Which genes seem to have a difference in where they bind on promoters between lncRNA and mRNA promoters
```
# RNAseq expression
```{r}
# Let's make a heatmap of genes that are variable across samples
# Which subfraction of the cell has the highest expression.
# Let's make a heatmap of nuclear versus cytoplasmic expression
# How many lncRNA and mRNA genes are sig in nuclear or cyto
# What is/are the most nuclear mRNA(s) -- is there a type of gene ontology associated with them?
# If we zoom in on high binding promoters (> 200 DBPs) are there any that don't have any expression?
```