forked from MathiasHarrer/Doing-Meta-Analysis-in-R
-
Notifications
You must be signed in to change notification settings - Fork 0
/
02-getting_data_in_R.Rmd
404 lines (262 loc) · 22 KB
/
02-getting_data_in_R.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
# Getting your data into R {#importing}
![](_figs/chappool.jpg)
```{block,type='rmdinfo'}
This chapter will tell you how to **import** your effect size data in RStudio. We will also show you a few commands which make it easier to **manipulate data** directly in *R*. Data preparation can be tedious and exhausting at times, but it is the backbone of all later steps when doing meta-analyses in *R*. We therefore have to pay **close attention** to preparing the data correctly before we can proceed.
```
<br><br>
---
## Data preparation in Excel
To conduct meta-analyses in *R*, you need to have your study data prepared. In the following, we describe the (preferred) way in which you should structure your dataset to facilitate the import into RStudio. We do this for two types of data: **"raw" effect size data** and **pre-calculated effect size data**. Usually, data imported into RStudio is stored in **Excel spreadsheets** first. We recommend to store your data there, because this makes it very easy to do the import.
```{block,type='rmdinfo'}
**What to (not) to do in your Excel spreadsheet**
* It is very important how you **name the columns of your spreadsheet**. If you already named the columns of your sheet adequately in Excel, you can save a lot of time because your data does not have to be transformed in RStudio later on. "Naming" the columns of the spreadsheet simply means to write the name of the variable into the first line of the column; RStudio will automatically detect that this is the name of the column then.
* It **doesn't matter how columns are ordered in your Excel spreadsheet**. They just have to be labeled correctly.
* There's also no need to **format** the columns in any way. If you type the column name in the first line of your spreadsheet, *R* will automatically detect it as a column name.
* It is also important to know that the import **will distort letters like ä,ü,ö,á,é,ê, etc**. So be sure to transform them to "normal" letters before you proceed.
* Make sure that your Excel file **only contains one sheet**. If you want to import Excel files with more than one sheet, you can have a look at the [`xlsx`](http://www.sthda.com/english/wiki/r-xlsx-package-a-quick-start-guide-to-manipulate-excel-files-in-r) package.
* If you have **one or several empty rows or columns which used to contain data**, make sure to delete those columns/rows completely, because RStudio could think that these columns contain data and import them also.
```
### Setting the columns of the Excel spreadsheet (raw effect size data) {#excel_preparation}
#### "Standard" effect size data (*M*, *SD*, *N*)
For a "standard" meta-analysis which uses the mean, standard deviation, and sample size from both groups in a study, the following information is needed for every study.
* The **names** of the individual studies, so that they can be easily identified later on. Usually, the first author and publication year of a study is used for this (e.g. "Ebert et al., 2018"). The names must be unique for each study.
* The **mean** of both the intervention and the control group at the same assessment point.
* The **standard deviation** of both the intervention and the control group at the same assessment point.
* The **number of participants ($N$)** in each group of the trial.
* If you want to have a look at differences between various study subgroups later on, you also need a **subgroup code** for each study which signifies to which subgroup it belongs. For example, if a study was conducted in children, you might give it the subgroup code "children".
**Here is how you should name the data columns in your EXCEL spreadheet containing your Meta-Analysis data**
```{r,echo=FALSE}
library(kableExtra)
Package<-c("Author","Me","Se","Mc","Sc","Ne","Nc","Subgroup")
Description<-c("This signifies the column for the study label (i.e., the first author)","The mean of the experimental/intervention group","The standard deviation of the experimental/intervention group","The Mean of the control group","The Standard Deviation of the control group","The number of participants in the experimental/intervention group","The number of participants in the control group","This is the label for one of your subgroup codes. It is not that important how you name this column, so you can give it a more informative name (e.g. population). In this column, each study should then be given an subgroup code, which should be exactly the same for each subgroup, including upper/lowercase letters. Of course, you can also include more than one subgroup column with different subgroup codings, but the column name has to be unique")
m<-data.frame(Package,Description)
names<-c("Column", "Description")
colnames(m)<-names
kable(m)
```
<br></br>
#### Event rate data
For a meta-analysis of event rates, which uses the number of events and sample size from both groups in a study, the following information is needed for every study.
* The **names** of the individual studies, so that they can be easily identified later on. Usually, the first author and publication year of a study is used for this (e.g. "Ebert et al., 2018"). The names must be unique for each study.
* The **number of events** in both the intervention and the control group at the same assessment point.
* The **number of participants ($N$)** in each group of the trial.
* If you want to have a look at differences between various study subgroups later on, you also need a **subgroup code** for each study which signifies to which subgroup it belongs. For example, if a study was conducted in children, you might give it the subgroup code "children".
```{r,echo=FALSE}
library(kableExtra)
Package<-c("Author","Ee","Ne","Ec","Nc","Subgroup")
Description<-c(
"This signifies the column for the study label (i.e., the first author)",
"Number of events in the experimental treatment arm",
"Number of participants in the experimental treatment arm",
"Number of events in the control arm",
"Number of participants in the control arm",
"This is the label for one of your subgroup codes. It's not that important how you name it, so you can give it a more informative name (e.g. population). In this column, each study should then be given a subgroup code, which should be exactly the same for each subgroup, including upper/lowercase letters. Of course, you can also include more than one subgroup column with different subgroup codings, but the column name has to be unique")
m<-data.frame(Package,Description)
names<-c("Column", "Description")
colnames(m)<-names
kable(m)
```
<br></br>
#### Incidence rate data
For a meta-analysis of incidence rates, which uses the number of events and person-time at risk from both groups in a study, the following information is needed for every study.
* The **names** of the individual studies, so that they can be easily identified later on. Usually, the first author and publication year of a study is used for this (e.g. "Ebert et al., 2018"). The names must be unique for each study.
* The **number of events** of both the intervention and the control group at the same assessment point.
* The [**person-time at risk**](https://sph.unc.edu/files/2015/07/nciph_ERIC4.pdf) in each group of the trial.
* If you want to have a look at differences between various study subgroups later on, you also need a **subgroup code** for each study which signifies to which subgroup it belongs. For example, if a study was conducted in children, you might give it the subgroup code "children".
```{r,echo=FALSE}
library(kableExtra)
Package<-c("Author","event.e","time.e","event.c","time.c","Subgroup")
Description<-c(
"This signifies the column for the study label (i.e., the first author)",
"Number of events in the experimental treatment arm",
"The person-time at risk in the experimental treatment arm",
"Number of events in the control arm",
"The person-time at risk in the control arm",
"This is the label for one of your subgroup codes. It is not that important how you name it, so you can give it a more informative name (e.g. population). In this column, each study should then be given a subgroup code, which should be exactly the same for each subgroup, including upper/lowercase letters. Of course, you can also include more than one subgroup column with different subgroup codings, but the column name has to be unique")
m<-data.frame(Package,Description)
names<-c("Column", "Description")
colnames(m)<-names
kable(m)
```
### Setting the columns of the Excel spreadsheet (pre-calculated effect size data)
If you have **already calculated the effect sizes for each study on your own**, for example using *Comprehensive Meta-Analysis*, *RevMan*, or one of the [**effect size calculators**](#effectsizecalculator) we present in this guide, there is another way to prepare your data which makes things a little easier. In this case, you only have to include the following columns:
```{r,echo=FALSE}
library(kableExtra)
Package.cma<-c("Author","TE","seTE","Subgroup")
Description.cma<-c("This signifies the column for the study label (i.e., the first author)","The calculated effect size of the study (either Cohen's d or Hedges' g, or some other form of effect size)","The standard error (SE) of the calculated effect","This is the label for one of your Subgroup codes. It is not that important how you name it, so you can give it a more informative name (e.g. population). In this column, each study should then be given an subgroup code, which should be exactly the same for each subgroup, including upper/lowercase letters. Of course, you can also include more than one subgroup column with different subgroup codings, but the column name has to be unique")
m.cma<-data.frame(Package.cma,Description.cma)
names.cma<-c("Column", "Description")
colnames(m.cma)<-names.cma
kable(m.cma)
```
<br><br>
---
## Importing the Spreadsheet into Rstudio {#import-excel}
To get our data into *R*, we need to **save our data in a format and at a place where RStudio can open it**.
### Saving the data in the right format
Generelly, finding the right format to import Excel can be tricky.
* If you're using a PC or Mac, it is advised to save your Excel sheet as a **comma-separated-values (.csv) file**. This can be done by clicking on "save as" and then choosing (.csv) as the output format in Excel.
* With some PCs, RStudios might not be able to open such files, or the files might be distorted. In this case, you can also try to save the sheet as a normal **.xslx** Excel-file and try if this works.
### Saving the data in your working directory
To function properly, you have to set a working directory for RStudio first. The working directory is a **folder on your computer from which RStudio can use data, and in which output it saved**.
1. Therefore, create a folder on your computer and give it a meaningful name (e.g. "My Meta-Analysis").
2. Save your spreadsheet in the folder
3. Set the folder as your working directory. This can be done in RStudio on the **bottom-right corner of your screen**. Under the tile **"Files"**, search for the folder on your computer, and open it.
4. Once you've opened your folder, the file you just saved there should be in there.
5. Now that you've opened the folder, click on the **little gear wheel on top of the pane**
```{r, echo=FALSE, fig.width=4,fig.height=3}
library(png)
library(grid)
img <- readPNG("_figs/snippet.PNG")
grid.raster(img)
```
6. Then click on "**Set as working directory**"
**Your file, and the working directory, are now where they should be!**
### Loading the data {#load-the-data}
1. To import the data, simply **click on the file** in the bottom-right pane. Then click on **Import Dataset...**
2. An **import assistant** should now pop up, which is also loading a preview of your data. This can be time-consuming sometimes, so you can skip this step if you want to, and klick straight on **"Import"**
```{r, echo=FALSE, fig.width=7,fig.height=3}
library(png)
library(grid)
img <- readPNG("_figs/snippet_3.PNG")
grid.raster(img)
```
As you can see, the on the top-right pane called **Environment**, your file is now listed as a dataset. This means that your data is now loaded and can be used by *R* and *R* code commands. Tabular datasets like the one we imported here are called **data frames** (`data.frame`) in *R* lingo; so when someone is referring to a data frame, know that what she or he is talking about is dataset with columns and rows just like the Excel spreadsheet we imported.
3. I also want to give my data a shorter name: `madata`. To rename it, I use the following code:
```{r, echo=FALSE}
load("_data/Meta_Analysis_Data.RData")
madata <- Meta_Analysis_Data
colnames(madata)[6:12] = c("Intervention_Duration", "Intervention_Type", "Population",
"Type_of_Students", "Prevention_Type", "Gender", "Mode_of_Delivery")
```
```{r}
madata <- Meta_Analysis_Data
```
This "copies" the data, and gives the copy the name `madata`. The `madata` dataset now also appears in the Environment pane, which means that it is also loaded into R and usable via code commands.
4. Now, let's have a look at the **structure of my data** using the `str()` function.
```{r, echo=FALSE}
madata$`ROB streng`=NULL
madata$`ROB superstreng`=NULL
```
```{r}
str(madata)
```
Although this output looks kind of messy, it's already very informative. It shows the structure of my data. In this case, i used data for which the effect sizes were already calculated. This is why the variables **TE** and **seTE** appear. I also see plenty of other variables, which correspond to the subgroups which were coded for this dataset.
**Here is a (shortened) table created for my data**
```{r, echo=FALSE}
library(kableExtra)
library(magrittr)
madata.s<-madata[,1:7]
madata.s$population=NULL
madata.s %>%
kable("html") %>%
kable_styling(font_size = 10)
```
<br><br>
---
## Data Manipulation {#manipulate}
Now that we have the meta-analysis data in RStudio, let us do a **few manipulations with the data**. These functions might come in handy when we are conducting analyses later on.
Going back to the output of the `str()` function, we see that this also gives us details on the type of data we have stored in each column of our dataset. There are different abbreviations signifying different types of data.
```{r,echo=FALSE}
library(kableExtra)
library(magrittr)
Package<-c("num","chr","log","factor")
type<-c("Numerical","Character","Logical","Factor")
Description<-c("This is all data stored as numbers (e.g. 1.02).",
"This is all data stored as words.",
"These are variables which are binary, meaning that they signify that a condition is either TRUE or FALSE.",
"Factors are stored as numbers, with each number signifying a different level of a variable. A possible factor of a variable might be 1 = low, 2 = medium, 3 = high.")
m<-data.frame(Package,type,Description)
names<-c("Abbreviation", "Type","Description")
colnames(m)<-names
kable(m)
```
### Converting to factors {#convertfactors}
Let's say we have the subgroup **Risk of Bias** (in which the Risk of Bias rating is coded), and want it to be a factor with two different levels: "low" and "high".
To do this, we need the variable `ROB` to be a factor. However, this variable is currently stored as a character (`chr`). We can have a look at this variable by typing the name of our dataset, then adding the selector `$`, and then adding the variable we want to have a look at.
```{r,echo=FALSE}
madata = Meta_Analysis_Data
madata$ROB<-madata$`ROB streng`
```
```{r}
madata$ROB
str(madata$ROB)
```
We can see now that `ROB` is indeed a **character** type variable, which contains only two words: "low" and "high". We want to convert this to a **factor** variable now, which has only two levels, low and high. To do this, we use the `factor()` function.
```{r}
madata$ROB<-factor(madata$ROB)
madata$ROB
str(madata$ROB)
```
We now see that the variable has been **converted to a factor with the levels "high" and "low"**.
### Converting to logicals
Now let us have a look at the **intervention type** subgroup variable. This column is currently stored as a character (`chr`) variable too.
```{r}
madata$`intervention type`
str(madata$`intervention type`)
```
Let us say we want a variable which only contains information if a study is a mindfulness intervention or not. A logical is very well suited for this. To convert the data to logical, we use the `as.logical` function. We will create a new variable containing this information called `intervention.type.logical`. To tell *R* what to count as `TRUE` and what as `FALSE`, we have to define the specific intervention type using the `==` command.
```{r}
intervention.type.logical<-as.logical(madata$`intervention type`=="mindfulness")
intervention.type.logical
```
We see that *R* has converted the character information into trues and falses for us. To check if this was done correctly, let us compare the original and the new variable.
```{r}
n <- data.frame(intervention.type.logical,madata$`intervention type`)
names <- c("New", "Original")
colnames(n) <- names
kable(n)
```
### Selecting specific studies {#select}
It may often come in handy to **select certain studies for further analyses**, or to **exclude some studies in further analyses** (e.g., if they are outliers). To do this, we can use the `filter` function in the `dplyr`package, which is part of the `tidyverse` package we installed before.
So, let us load the package first.
```{r, eval=FALSE,warning=FALSE}
library(dplyr)
```
Let us say we want to do a meta-analysis with three studies in our dataset only. To do this, we need to create a new dataset containing only these studies using the `dplyr::filter()` function. The `dplyr::` part is necessary as there is more than one `filter` function in *R*, and we want to use to use the one of the `dplyr` package. Let us say we want to have the studies by **Cavanagh et al.**, **Frazier et al.** and **Phang et al.** stored in another dataset, so we can conduct analyses only for these studies.
The *R* code to store these three studies in a new dataset called `madata.new` looks like this:
```{r}
madata.new <- dplyr::filter(madata, Author %in% c("Cavanagh et al.",
"Frazier et al.",
"Phang et al."))
```
Note that the `%in%` command tells the `filter` function to search for exactly the three cases we defined in the variable `Author`.
Now, let us have a look at the new data `madata.new` we just created.
```{r,echo=FALSE}
madata.new %>%
kable("html") %>%
kable_styling(font_size = 9)
```
Note that the function can also be used for any other type of data and variable. We can also use it to, for example, only select studies which were coded as being a mindfulness study.
```{r}
madata.new.mf <- dplyr::filter(madata,`intervention type` %in% c("mindfulness"))
```
We can also use the `dplyr::filter()` function to exclude studies from our dataset. To do this, we only have to add `!` in front of the variable we want to use for filtering.
```{r}
madata.new.excl <- dplyr::filter(madata,!Author %in% c("Cavanagh et al.",
"Frazier et al.",
"Phang et al."))
```
### Changing cell values
Sometimes, even when preparing your data in Excel, you might want to **change values in RStudio once you have imported your data**. To do this, we have to select a cell in our data frame in RStudio. This can be done by adding `[x,y]` to our dataset name, where **x** signifies the number of the **row** we want to select, and **y** signifies the number of the **column**.
To see how this works, let us select a variable using this command first:
```{r}
madata[6,1]
```
We now see the **6th study** in our dataframe, and the value of this study for **Column 1 (the author name)** is displayed. Let us say we had a typo in this name and want to have it changed. In this case, we have to give this exact cell a new value.
```{r}
madata[6,1] <- "Frogelli et al."
```
Let us check if the name has changed.
```{r}
madata[6,1]
```
You can also use this function to change any other type of data, including numericals and logicals. Only for characters, you have to put the values you want to insert in `""`.
<br><br>
---
## Exercises
If you feel like it, you can test your previously acquired knowledge about data manipulation in *R* using this **interactive exercise**. To see if your code generates the right output, simply click on the green button.
The exercises are displayed in the code next to the `#` sign. Leave the first chunk of code as is, because this generates the data used for the exercises. Your dataset is named `data`.
<iframe width='100%' height='700' src='https://rdrr.io/snippets/embed/?code=%23%20Generate%20the%20data%0AsuppressPackageStartupMessages(library(dplyr))%0Adata%20%3D%20data.frame(%22Author%22%20%3D%20c(%22Jones%22%2C%20%22Goldman%22%2C%20%22Townsend%22%2C%20%22Martin%22%2C%20%22Rose%22)%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22TE%22%20%3D%20c(0.23%2C%200.56%2C%200.78%2C%200.23%2C%200.33)%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22seTE%22%20%3D%20c(0.324%2C%200.235%2C%200.394%2C%200.275%2C%200.348)%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22subgroup%22%20%3D%20c(%22one%22%2C%20%22one%22%2C%20%22two%22%2C%20%22two%22%2C%20%22three%22))%0A%0A%23%20First%20exercise%3A%20show%20the%20variable%20%22Author%22%0A%0A%0A%23%20Second%20exercise%3A%20convert%20%22subgroup%22%20to%20a%20factor%0A%0A%0A%23%20Third%20exercise%3A%20select%20the%20%22Jones%22%20and%20%22Martin%22%20study%0A%0A%0A%23%20Fourth%20exercise%3A%20change%20the%20name%20of%20%22Rose%22%20to%20%22Bloom%22' frameborder='0'></iframe>
This has only been a quick overview of data manipulation strategies in *R*. Learning *R* from scratch can be exhausting at times, especially when trying to learn something as supposedly easy like manipulating data. However, the best way to get accustomed to the way *R* works is to practice; after some time, common *R* commands will become like second nature to you. To continue learning, you can have a look at this helpful [**tutorial video**](https://www.youtube.com/watch?v=BaFkbNOaof8), which introduces more of the functionality of the `dplyr` package.
<br></br>