-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathplotting-gene-expression.Rmd
721 lines (564 loc) · 27.7 KB
/
plotting-gene-expression.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
---
title: "Plotting gene expression in R"
author: |
| Kelly Sovacool, Audrey Drotos, Stephanie Thiede, and Breanna McBean
| Adapted from [U of Michigan Carpentries](umcarpentries.org/intro-curriculum-r/)
date: "version `r format(Sys.time(), '%B %d, %Y')`"
output:
html_document:
toc: yes
toc_depth: 3
toc_float:
collapsed: no
pdf_document:
toc: yes
toc_depth: '3'
editor_options:
chunk_output_type: console
urlcolor: blue
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
fig.path = 'figures/')
```
# Intro R
## RStudio
In this workshop, we will plot gene expression levels from an RNAseq dataset.
To do this, we'll need two things: data and a platform to analyze the data.
We could try to use a spreadsheet program like Microsoft Excel or Google sheets that have limited access, less flexibility, and don't easily allow for things that are critical to ["reproducible" research](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285), like easily sharing the steps used to explore and make changes to the original data.
Instead, we'll use a programming language to test our hypothesis. Today we will use R, but we could have also used Python for the same reasons we chose R (and we teach workshops for both languages). Both R and Python are freely available, the instructions you use to do the analysis are easily shared, and by using reproducible practices, it's straightforward to add more data or to change settings like colors or the size of a plotting symbol.
To run R, all you really need is the R program, which is available for computers running the Windows, Mac OS X, or Linux operating systems.
To make your life in R easier, there is a great (and free!) program called RStudio. As we work today, we'll use features that are available in RStudio for writing and running code, managing projects, installing packages, getting help, and much more. It is important to remember that R and RStudio are different, but complementary programs. You need R to use RStudio.
To get started, we'll spend a little time getting familiar with the RStudio environment and setting it up to suit your tastes. When you start RStudio, you'll have three panels.
<img src="https://github.com/GWC-DCMB/plot-gene-expr-R/blob/main/images/initial_rstudio.png?raw=true" width="700"/>
On the left you'll have a panel with three tabs - Console, Terminal, and Jobs. The Console tab is what running R from the command line looks like. This is where you can enter R code. Try typing in `2+2` at the prompt (>). In the upper right panel are tabs indicating the Environment, History, and a few other things. If you click on the History tab, you'll see the command you ran at the R prompt.
<img src="https://github.com/GWC-DCMB/plot-gene-expr-R/blob/main/images/history.png?raw=true" width="700"/>
In the lower right panel are tabs for Files, Plots, Packages, Help, and Viewer. You used the Packages tab to install tidyverse.
We'll spend more time in each of these tabs as we go through the workshop, so we won't spend a lot of time discussing them now.
You might want to alter the appearance of your RStudio window. The default appearance has a white background with black text. If you go to the Tools menu at the top of your screen, you'll see a "Global options" menu at the bottom of the drop down; select that.
<img src="https://github.com/GWC-DCMB/plot-gene-expr-R/blob/main/images/global_options.png?raw=true" width="600"/>
From there you will see the ability to alter numerous things about RStudio. Under the Appearances tab you can select the theme you like most. As you can see there's a lot in Global options that you can set to improve your experience in RStudio. Most of these settings are a matter of personal preference.
However, you can update settings to help you to insure the reproducibility of your code. In the General tab, none of the selectors in the R Sessions, Workspace, and History should be selected. In addition, the toggle next to "Save workspace to .RData on exit" should be set to never. These setting will help ensure that things you worked on previously don't carry over between sessions.
<img src="https://github.com/GWC-DCMB/plot-gene-expr-R/blob/main/images/general_options.png?raw=true" width="600"/>
Let's get going on our analysis!
One of the helpful features in RStudio is the ability to create a project. A project is a special directory that contains all of the code and data that you will need to run an analysis.
At the top of your screen you'll see the "File" menu. Select that menu and then the menu for "New Project...".
<img src="https://github.com/GWC-DCMB/plot-gene-expr-R/blob/main/images/new_project_menu.png?raw=true" width="600"/>
When the smaller window opens, select "New Directory" and then the "Browse" button in the next window.
<img src="https://github.com/GWC-DCMB/plot-gene-expr-R/blob/main/images/existing_directory.png?raw=true" width="600"/>
<img src="https://github.com/GWC-DCMB/plot-gene-expr-R/blob/main/images/browse.png?raw=true" width="600"/>
Creat a new project called "heart-gene-expr-2022" and click the "Open" button.
<img src="https://github.com/GWC-DCMB/plot-gene-expr-R/blob/main/images/navigate_to_project.png?raw=true" width="600"/>
Then click the "Create Project" button.
<img src="https://github.com/GWC-DCMB/plot-gene-expr-R/blob/main/images/create_project.png?raw=true" width="600"/>
Did you notice anything change?
In the lower right corner of your RStudio session, you should notice that your
Files tab is now your project directory. You'll also see a file called
heart-gene-expr-2022.Rproj in that directory.
From now on, you should start RStudio by double clicking on that file.
This will make sure you are in the correct directory when you run your analysis.
<img src="https://github.com/GWC-DCMB/plot-gene-expr-R/blob/main/images/files_with_rproj.png?raw=true" width="600"/>
We'd like to create a file where we can keep track of our R code.
Back in the "File" menu, you'll see the first option is "New File". Selecting "New File" opens another menu to the right and the first option is "R Script". Select "R Script".
Now we have a fourth panel in the upper left corner of RStudio that includes an **Editor** tab with an untitled R Script. Let's save this file as `gene_expression.R` in our project directory.
We will be entering R code into the **Editor** tab to run in our **Console** panel.
On line 1 of `gene_expression.R`, type `2+2`.
With your cursor on the line with the `2+2`, click the button that says <kbd>Run</kbd>. You should be able to see that `2+2` was run in the Console.
As you write more code, you can highlight multiple lines and then click <kbd>Run</kbd> to run all of the lines you have selected.
## R as a calculator
You can use R like a calculator to add (`+`), multiply (`*`), divide (`/`), and more!
```{r}
# Using R as a calculator
# addition +
5 + 3
# subtraction -
5 - 3
# multiplication *
5 * 3
# division /
5 / 3
# mod (remainder) %%
5 %% 3
#exponent
5 ** 3
5 ^3
#combining terms together (PEMDAS)
5 - 3^2
(5-3) ^ 2
```
## Variables and Types
- character
- logical
- double
- integer
```{r types}
name <- "Kelly"
favorite_color <- "green"
height_inches <- 64
likes_cats <- TRUE
num_plants <- 14L
```
You can think of variables as labelled boxes where you store your belongings.
The `<-` symbol is the **assignment operator**. It assigns values generated or typed on the right to objects on the left. An alternative symbol that you might see used as an **assignment operator** is the `=` but it is clearer to only use `<-` for assignment. We use this symbol so often that RStudio has a keyboard short cut for it: <kbd>Alt</kbd>+<kbd>-</kbd> on Windows, and <kbd>Option</kbd>+<kbd>-</kbd> on Mac.
If you're not sure of the type of a variable,
you can find out with the `typeof()` function.
```{r}
typeof(name)
typeof(favorite_color)
typeof(height_inches)
typeof(likes_cats)
```
> #### Comments
> Sometimes you may want to write comments in your code to help you remember
> what your code is doing, but you don't want R to think these comments are a part
> of the code you want to evaluate. That's where **comments** come in! Anything
> after a `#` symbol in your code will be ignored by R. For example, let's say we
> wanted to make a note of what each of the functions we just used do:
> ```{r FunctionsComments2}
> Sys.Date() # outputs the current date
> getwd() # outputs our current working directory (folder)
> sum(5, 6) # adds numbers
> ```
> #### Assigning values to objects
> Try to assign values to some objects and observe each object after you have assigned a new value. What do you notice?
>
> ```{r Objects, eval = FALSE}
> name <- "Ben"
> name
> height <- 72
> height
> name <- "Jerry"
> name
> ```
>
> > #### Solution
> > When we assign a value to an object, the object stores that value so we can access it later. However, if we store a new value in an object we have already created, it replaces the old value. The `height` object does not change, because we never assign it a new value.
> #### Guidelines on naming objects
> - You want your object names to be explicit and not too long.
> - They cannot start with a number (2x is not valid, but x2 is).
> - R is case sensitive, so for example, weight_kg is different from Weight_kg.
> - You cannot use spaces in the name.
> - There are some names that cannot be used because they are the names of fundamental functions in R (e.g., if, else, for; see [here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) for a complete list). If in doubt, check the help to see if the name is already in use (`?function_name`).
> - It is recommended to use nouns for object names and verbs for function names.
> - Be consistent in the styling of your code, such as where you put spaces, how you name objects, etc. Using a consistent coding style makes your code clearer to read for your future self and your collaborators. One popular style guide can be found through the [tidyverse](https://style.tidyverse.org/).
The power of variables is that you can re-use them later!
```{r}
cm_per_inch <- 2.54
height_cm <- height_inches * cm_per_inch
height_cm
height_cm / cm_per_inch
```
The above variables hold a single value. What if you need multiple values?
```{r}
dial_numbers <- c(1, 2, 3)
dial_numbers2 <- 1:3
typeof(dial_numbers)
fan_settings <- c('low', 'medium', 'high')
typeof(fan_settings)
fan_settings <- factor(c('low', 'high', 'medium', 'high', 'medium'), levels = c('low', 'medium', 'high'))
```
## Error messages
Sometimes we make a mistake when writing code, and the code isn't able to run.
In that case, the code will give us an error message.
An error tells us, "there's something wrong with your code or your data and R didn't do what you asked".
A warning tells us, "you might want to know about this issue, but R still did what you asked".
You need to fix any errors that arise in order for your code to work.
Warnings, are probably best to resolve or at least understand why they are coming up.
> ### Exercise: error messages
>
> How can you fix the error messages in these next lines of code?
>
> ```{r, error = TRUE}
> heihgt_cm
> 2 + "three"
> ```
>
> #### Solution
>
> ```{r, error = TRUE}
> height_cm
> 2 + 3
> ```
## Functions
Earlier, the `typeof()` function took the variable we provided and told us the
`type` that the variable belonged to.
A function is a way to re-use code, either written by you or by someone else.
Functions make it easier to write code because you don't have to re-invent the
wheel for certain tasks.
> ### Viewing function documentation
>
> Each function has a help page that documents what arguments the function
> expects and what value it will return. You can bring up the help page a few
> different ways. If you have typed the function name in the **Editor** windows,
> you can put your cursor on the function name and press <kbd>F1</kbd> to open
> help page in the **Help** viewer in the lower right corner of RStudio. You can
> also type `?` followed by the function name in the console.
>
> For example, try running `?typeof`. A help page should pop up with
> information about what the function is used for and how to use it, as well as
> useful examples of the function in action. As you can see, the first
> **argument** of `type` is the variable name.
>
Whenever using a funciton, you type the name of the function followed immediately
by parentheses. Any arguments, or inputs, go between the parentheses.
Do all functions need arguments? Let's test some other functions:
```{r FunctionsNoArgs}
Sys.Date()
getwd()
```
While some functions don't need any arguments, in other
functions we may want to use multiple arguments. When we're using multiple
arguments, we separate the arguments with commas. For example, we can use the
`sum()` function to add numbers together:
```{r sumFunctionExample}
sum(5, 6)
```
> #### Learning more about functions
> Look up the function `round`. What does it do? What will you get as output for the following lines of code?
>
> ```{r functionExercise, eval = FALSE}
> round(3.1415)
> round(3.1415,3)
> ```
>
> > #### Solution
> > `round` rounds a number. By default, it rounds it to zero digits (in our example above, to 3). If you give it a second number, it rounds it to that number of digits (in our example above, to 3.142)
> #### Position of the arguments in functions
>
> Which of the following lines of code will give you an output of 3.14? For the one(s) that don't give you 3.14, what do they give you?
>
> ```{r functionPositionExercise, eval = FALSE}
> round(x = 3.1415)
> round(x = 3.1415, digits = 2)
> round(digits = 2, x = 3.1415)
> round(2, 3.1415)
> ```
>
> > #### Solution
> > The 2nd and 3rd lines will give you the right answer because the arguments are named, and when you use names the order doesn't matter. The 1st line will give you 3 because the default number of digits is 0. Then 4th line will give you 2 because, since you didn't name the arguments, x=2 and digits=3.1415.
Sometimes it is helpful - or even necessary - to include the argument name, but often we can skip the argument name, if the argument values are passed in a certain order. If all this function stuff sounds confusing, don't worry! We'll see a bunch of examples as we go that will make things clearer.
## Packages
So far, all of the functions we have used are part of base R.
When you install R, these functions are included by default.
However, anyone can write code and share it with other people by putting them in packages.
A package is a collection of code that is intended to be shared and re-used.
Sometimes you need to do something that isn't included in base R, or in a
different way than the way it is implemented in base R.
Let's load our first package with `library(tidyverse)`.
Go ahead and run that line in the **Console** by clicking the <kbd>Run</kbd> button on the top right of the **Editor** tab and choosing <kbd>Run Selected Lines</kbd>. This loads a set of useful functions and sample data that makes it easier for us to do complex analyses and create professional visualizations in R.
```{r loadTidyverse, message=TRUE}
#install.packages('tidyverse')
library(tidyverse)
```
> #### What's with all those messages???
>
> When you loaded the `tidyverse` package, you probably got a message like the
> one we got above. Don't panic! These messages are just giving you more
> information about what happened when you loaded `tidyverse`. The `tidyverse` is
> actually a collection of several different packages, so the first section of the
> message tells us what packages were installed when we loaded `tidyverse` (these
> include `ggplot2`, `dyplr`, and `tidyr`, which we will use a lot!
>
> The second section of messages gives a list of "conflicts." Sometimes, the
> same function name will be used in two different packages, and R has to decide
> which function to use. For example, our message says that:
>
> ~~~
> dplyr::filter() masks stats::filter()
> ~~~
>
> This means that two different packages (`dplyr` from `tidyverse` and `stats`
> from base R) have a function named `filter()`. By default, R uses the function
> that was most recently loaded, so if we try using the `filter()` function after
> loading `tidyverse`, we will be using the `filter()` function > from `dplyr()`.
> You can use double colons (`::`) to specify which package's version of the
> function you want to use. This syntax follows the pattern `package::function()`.
> #### The tidyverse vs Base R
> If you've used R before, you may have learned commands that are different than the ones we will be using during this workshop. We will be focusing on functions from the [tidyverse](https://www.tidyverse.org/). The "tidyverse" is a collection of R packages that have been designed to work well together and offer many convenient features that do not come with a fresh install of R (aka "base R"). These packages are very popular and have a lot of developer support including many staff members from RStudio. These functions generally help you to write code that is easier to read and maintain. We believe learning these tools will help you become more productive more quickly.
# Data frames
The variable types we discussed above are useful for simple variables, but
what if you want to store attributes about lots of different things, like patient samples?
For that we use data frames. Just about anything that you can store as rows and columns,
like in an Excel spreadsheet, are a great use for data frames.
Let's find out how to read in a dataset as a data frame.
## load
```{r}
# READING DATA INTO R
expression_data <- read_csv("https://tinyurl.com/UMGWCDFB")
```
`read_csv` gave us lots of useful information about the data frames we just loaded. Take a look at the column names and their types.
> ##### The dataset is curated from this paper:
>
> Cui Y _et al._ 2019. Single-Cell Transcriptome Analysis Maps the Developmental Track of the Human Heart.
> Cell Reports 26:1934-1950.e5.
> https://doi.org/10.1016/j.celrep.2019.01.079
## view
If you want to take a peak at just the first 6 rows of the data frame, you can do so with
`head`, or to peak at the last 6 rows using `tail()`,or look at the whole data frame interactively with `View`:
```{r, eval = FALSE}
# LOOKING AT OUR DATA
View(expression_data)
head(expression_data)
tail(expression_data)
```
We can see that the data type of our weeks column is currently a double. We want to split this data to
look at what is going on at 6 weeks and 24 weeks separately, so we need to make this
column into categories, or "factors".
```{r}
expression_data$weeks <- factor(expression_data$weeks)
```
Now, we are going to use `filter()` to split the data into two separate data frames,
one for each time point.
```{r}
six_weeks <- expression_data %>%
filter(weeks == 6)
twenty_four_weeks <- expression_data %>%
filter(weeks == 24)
```
Now, let's more closely at the six week data:
```{r}
ncol(six_weeks)
nrow(six_weeks)
dim(six_weeks)
colnames(six_weeks)
```
As we see from above, there are 864 rows, so let's just look at `head()` of row
names to see the first few row names.
```{r}
head(rownames(six_weeks))
#reference things in the data frame
#data_frame[row,column]
dim(six_weeks)
six_weeks[1,1]
six_weeks[4,]
six_weeks[,4]
six_weeks[,2]
```
The method above automatically shows us the first 10 rows of our column of interest.
Let's compare this to another method of looking at a particular column, also focusing on the
first 10 rows using `head()` again.
```{r}
head(six_weeks$gene, 10)
```
## explore
```{r}
# USING FUNCTIONS
mean(six_weeks$log2_expr)
mean(twenty_four_weeks$log2_expr)
summary(six_weeks$log2_expr)
summary(twenty_four_weeks$log2_expr)
```
# Data visualization
## Your first boxplot
Now we're ready to make our first plot!
First, let's try just plotting the ACTN2 gene. We'll use `filter()` to get rows for this gene only:
```{r filter}
expression_data %>%
filter(gene == 'ACTN2')
```
We will be using the `ggplot2` package today to make our plots. This is a very
powerful package that creates professional looking plots and is one of the
reasons people like using R so much. All plots made using the `ggplot2` package
start by calling the `ggplot()` function.
```{r ggplot-blank}
expression_data %>%
filter(gene == 'ACTN2') %>%
ggplot()
```
Now we need to start telling it what we actually
want to draw in this plot. The elements of a plot have a bunch of properties
like an x and y position, a size, a color, etc. These properties are called
**aesthetics**. When creating a data visualization, we map a variable in our
dataset to an aesthetic in our plot. In ggplot, we can do this by creating an
"aesthetic mapping", which we do with the `aes()` function.
To create our plot, we need to map variables from our data frame to
ggplot aesthetics using the `aes()` function. Since we have already told
`ggplot` that we are using the data in the `dat_joined_long` object, we can
access the columns of the data frame using the object's column names.
Remember, R is case-sensitive, so we have to be careful to match the column
names exactly!
```{r ggplot-aes}
expression_data %>%
filter(gene == 'ACTN2') %>%
ggplot() +
aes(weeks, log2_expr)
```
Note that we've added this new function call to a second line just to make it
easier to read. To do this we make sure that the `+` is at the end of the first
line otherwise R will assume your command ends when it starts the next row. The
`+` sign indicates not only that we are adding information, but to continue on
to the next line of code.
We still haven't told ggplot how we want it to
draw the data. There are many different plot types (bar charts, scatter plots,
histograms, etc). We tell our plot object what to draw by adding a "geometry"
("geom" for short) to our object. We will talk about many different geometries
today, but for our first plot, let's draw our data using a boxplot.
```{r ggplot-boxplot}
expression_data %>%
filter(gene == 'ACTN2') %>%
ggplot() +
aes(weeks, log2_expr) +
geom_boxplot()
```
Finally, we can make it look nicer by adding labels:
```{r boxplot-actn2}
expression_data %>%
filter(gene == 'ACTN2') %>%
ggplot() +
aes(weeks, log2_expr) +
geom_boxplot() +
labs(title = 'ACTN2', x = '# of weeks', y = 'log2 gene expression')
```
## Saving plots
```{r, eval = FALSE}
boxplot_actn2 <- expression_data %>%
filter(gene == 'ACTN2') %>%
ggplot() +
aes(weeks, log2_expr) +
geom_boxplot() +
labs(title = 'ACTN2', x = '# of weeks', y = 'log2 gene expression')
ggsave(filename = 'boxplot_actn2.png', plot = boxplot_actn2)
```
> ### Exercise: make a boxplot
>
> Make a boxplot plot for expression of the gene GATA4 at 6 weeks compared to 24 weeks.
>
> ```{r gata4}
> # CHALLENGE:
> # Make a boxplot plot for expression of the gene GATA4 at 6 weeks compared to 24 weeks.
> expression_data %>%
> filter(gene == 'GATA4') %>%
> ggplot() +
> aes(weeks, log2_expr) +
> geom_boxplot() +
> labs(title = 'GATA4', x = '# of weeks', y = 'log2 gene expression')
> ```
## Color palettes
We can map the color of the boxplot to a variable using `aes(color = colname)`.
Let's see what it looks like using the default options:
```{r color}
# R color palettes
# change the colors with scale_color_
expression_data %>%
filter(gene == 'ACTN2') %>%
ggplot() +
aes(weeks, log2_expr, color = weeks) +
geom_boxplot() +
labs(title = 'ACTN2', x = '# of weeks', y = 'log2 gene expression')
```
That changed the outline, but not the inside. We can instead use fill:
```{r fill}
# R color palettes
# change the colors with scale_color_
expression_data %>%
filter(gene == 'ACTN2') %>%
ggplot() +
aes(weeks, log2_expr, fill = weeks) +
geom_boxplot() +
labs(title = 'ACTN2', x = '# of weeks', y = 'log2 gene expression')
```
Our plot is looking prettier now that we have some color!
But what if we want to use different colors? We can use `scale_fill_manual` to
manually set them and specify colors by name:
```{r scale-fill}
# R color palettes
# change the colors with scale_color_
expression_data %>%
filter(gene == 'ACTN2') %>%
ggplot() +
aes(weeks, log2_expr, fill = weeks) +
geom_boxplot() +
scale_fill_manual(values = c('green', 'blue')) +
labs(title = 'ACTN2', x = '# of weeks', y = 'log2 gene expression')
```
Or you can look up the **hexadecimal code** of your favorite colors and specify them exactly.
There are a lot of websites that will tell you the hex code of a color.
Here's one: https://www.rapidtables.com/web/color/RGB_Color.html
Let's use "plum" (#DDA0DD) and "light steel blue" (#B0C4DE).
```{r scale-fill-hex}
# R color palettes
# change the colors with scale_color_
expression_data %>%
filter(gene == 'ACTN2') %>%
ggplot() +
aes(weeks, log2_expr, fill = weeks) +
geom_boxplot() +
scale_fill_manual(values = c('#DDA0DD', '#B0C4DE')) +
labs(title = 'ACTN2', x = '# of weeks', y = 'log2 gene expression')
```
You can even specify both the fill and the color at the same time:
```{r scale-fill-color}
# R color palettes
# change the colors with scale_color_
expression_data %>%
filter(gene == 'ACTN2') %>%
ggplot() +
aes(weeks, log2_expr, fill = weeks, color = weeks) +
geom_boxplot() +
scale_fill_manual(values = c('#DDA0DD', '#B0C4DE')) +
scale_color_manual(values = c('#B0C4DE', '#DDA0DD'))
labs(title = 'ACTN2', x = '# of weeks', y = 'log2 gene expression')
```
> ### Exercise: custom colors
>
> modify your boxplot for GATA4. Make each box a different color or fill (or both!)
>
> Once you've finished, use `ggsave()` to save the file.
> ```{r color-exercise}
> pretty_boxplot_gata4 = expression_data %>%
> filter(gene == 'GATA4') %>%
> ggplot() +
> aes(weeks, log2_expr, fill = weeks, color = weeks) +
> geom_boxplot() +
> scale_fill_manual(values = c('#DDA0DD', '#B0C4DE')) +
> scale_color_manual(values = c('#B0C4DE', '#DDA0DD'))
> labs(title = 'GATA4', x = '# of weeks', y = 'log2 gene expression')
> #ggsave(filename = 'boxplot_GATA4.png', plot = pretty_boxplot_gata4)
> ```
## Plot all the genes
```{r boxplot-all-genes}
# PLOT ALL THE GENES
# facet wrap
expression_data %>%
ggplot(aes(weeks, log2_expr)) +
geom_boxplot() +
facet_wrap('gene')
```
## Heatmaps
```{r heatmap-6-weeks}
expression_data %>%
filter(weeks == '6', !is.na(log2_expr)) %>%
ggplot(aes(gene, Sample_Num, fill = log2_expr)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90))
```
```{r heatmap-24-weeks}
expression_data %>%
filter(weeks == '24', !is.na(log2_expr)) %>%
ggplot(aes(gene, Sample_Num, fill = log2_expr)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90))
```
<!--
## Gene fun facts
```{r}
gene_facts <- tibble(gene = c("ACTN2", "COL1A1", "COL1A2", "COL3A1", "COL5A1",
"COL5A2", "GATA4", "MYH6", "MYH7", "MYL2", "MYL3", "MYL4", "MYL6",
"NOTCH1", "NOTCH2", "NPPA", "NPPB", "SIRT1"))
```
-->
## Bonus: Why log2-transform
```{r log-trans}
tibble(linear = 0:10,
x = linear) %>%
mutate(log2 = log2(linear),
log10 = log10(linear)) %>%
pivot_longer(c(linear, log2, log10),
names_to = 'scale',
values_to = 'y') %>%
ggplot() +
aes(x, y, color = scale) +
geom_point() +
geom_line() +
theme_bw()
```