# Dealing with messy data {#messychapter}
...or, put differently, _welcome to the real world_. Real datasets are seldom as tidy and clean as those you have seen in the previous examples in this book. On the contrary, real data is messy. Things will be out of place, and formatted in the wrong way. You'll need to filter the rows to remove those that aren't supposed to be used in the analysis. You'll need to remove some columns and merge others. You will need to wrestle, clean, coerce, and coax your data until it finally has the right format. Only then will you be able to actually analyse it.
This chapter contains a number of examples that serve as cookbook recipes for common data wrangling tasks. And as with any cookbook, you'll find yourself returning to some recipes more or less every day, until you know them by heart, while for others you never quite find the right occasion. You definitely don't have to know all of them by heart - you can always come back and look up a recipe when you need it.
After working with the material in this chapter, you will be able to use R to:
* Handle numeric and categorical data,
* Manipulate and find patterns in text strings,
* Work with dates and times,
* Filter, subset, sort, and reshape your data using `data.table`, `dplyr`, and `tidyr`,
* Split and merge datasets,
* Scrape data from the web,
* Import data from different file formats.
## Changing data types {#coercion}
In Exercise \@ref(exr:ch3exc1) you discovered that R implicitly coerces variables into other data types when needed\index{data type!coercion}\index{variable!change type}. For instance, if you add a `numeric` to a `logical`, the result is a `numeric`. And if you place them together in a vector, the vector will contain two `numeric` values:
```{r eval=FALSE}
TRUE + 5
v1 <- c(TRUE, 5)
v1
```
However, if you add a `numeric` to a `character`, the operation fails. If you put them together in a vector, both become `character` strings:
```{r eval=FALSE}
"One" + 5
v2 <- c("One", 5)
v2
```
There is a hierarchy for data types in R: `logical` < `integer` < `numeric` < `character`\index{data type!hierarchy}. When variables of different types are somehow combined (with addition, put in the same vector, and so on), R will coerce both to the higher ranking type. That is why `v1` contained `numeric` variables (`numeric` is higher ranked than `logical`) and `v2` contained `character` values (`character` is higher ranked than `numeric`).
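A quick way to convince yourself of this hierarchy is to put mixed values in vectors and check the class of the result:

```{r eval=FALSE}
class(c(TRUE, 1L))       # Coerced to the higher-ranking type: "integer"
class(c(1L, 2.5))        # "numeric"
class(c(2.5, "three"))   # "character"
```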
Automatic coercion is often useful, but will sometimes cause problems. As an example, a vector of numbers may accidentally be converted to a `character` vector, which will confuse plotting functions. Luckily it is possible to convert objects to other data types. The functions most commonly used for this are `as.logical`, `as.numeric` and `as.character`\index{\texttt{as.logical}}\index{\texttt{as.numeric}}\index{\texttt{as.character}}. Here are some examples of how they can be used:
```{r eval=FALSE}
as.logical(1) # Should be TRUE
as.logical("FALSE") # Should be FALSE
as.numeric(TRUE) # Should be 1
as.numeric("2.718282") # Should be numeric 2.718282
as.character(2.718282) # Should be the string "2.718282"
as.character(TRUE) # Should be the string "TRUE"
```
A word of warning though - conversion only works if R can find a natural conversion between the types. Here are some examples where conversion fails. Note that only some of them cause warning messages:
```{r eval=FALSE}
as.numeric("two") # Should be 2
as.numeric("1+1") # Should be 2
as.numeric("2,718282") # Should be numeric 2.718282
as.logical("Vaccines cause autism") # Should be FALSE
```
$$\sim$$
```{exercise, label="ch5exc1"}
The following tasks are concerned with converting and checking data types:
1. What happens if you apply `as.logical` to the `numeric` values 0 and 1? What happens if you apply it to other numbers?
2. What happens if you apply `as.character` to a vector containing `numeric` values?
3. The functions `is.logical`, `is.numeric` and `is.character`\index{\texttt{is.logical}}\index{\texttt{is.numeric}}\index{\texttt{is.character}} can be used to check if a variable is a `logical`, `numeric` or `character`, respectively. What type of object do they return?
4. Is `NA` a `logical`, `numeric` or `character`?
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch5solutions1)
## Working with lists {#lists2}
A data structure that is very convenient for storing data of different types is `list`\index{\texttt{list}}\index{list}. You can think of a `list` as a data frame where you can put different types of objects in each column: like a `numeric` vector of length 5 in the first, a data frame in the second and a single `character` in the third^[In fact, the opposite is true: under the hood, a data frame is a list of vectors of equal length.]. Here is an example of how to create a `list` using the function of the same name:
```{r eval=FALSE}
my_list <- list(my_numbers = c(86, 42, 57, 61, 22),
my_data = data.frame(a = 1:3, b = 4:6),
my_text = "Lists are the best.")
```
To access the elements in the list, we can use the same `$` notation as for data frames:
```{r eval=FALSE}
my_list$my_numbers
my_list$my_data
my_list$my_text
```
In addition, we can access them using indices, but using _double_ brackets:
```{r eval=FALSE}
my_list[[1]]
my_list[[2]]
my_list[[3]]
```
To access elements within the elements of lists, additional brackets can be added. For instance, if you wish to access the second element of the `my_numbers` vector, you can use either of these:
```{r eval=FALSE}
my_list[[1]][2]
my_list$my_numbers[2]
```
### Splitting vectors into lists {#splitvector}
Consider the `airquality` dataset, which among other things describes the temperature on each day during a five-month period. Suppose that we wish to split the `airquality$Temp` vector into five separate vectors: one for each month. We could do this by repeated filtering, e.g.
```{r eval=FALSE}
temp_may <- airquality$Temp[airquality$Month == 5]
temp_june <- airquality$Temp[airquality$Month == 6]
# ...and so on.
```
Apart from the fact that this isn't a very good-looking solution, this would be infeasible if we needed to split our vector into a larger number of new vectors. Fortunately, there is a function that allows us to split the vector by month, storing the result as a list - `split`\index{\texttt{split}}\index{vector!split}:
```{r eval=FALSE}
temps <- split(airquality$Temp, airquality$Month)
temps
# To access the temperatures for June:
temps$`6`
temps[[2]]
# To give more informative names to the elements in the list:
names(temps) <- c("May", "June", "July", "August", "September")
temps$June
```
Note that, in breach of the rules for variable names in R, the original variable names here were numbers (actually `character` strings that happened to consist of digits). When accessing them using `$` notation, you need to put them between backticks (`` ` ``), e.g. `` temps$`6` ``, to make it clear that `6` is a variable name and not a number.
### Collapsing lists into vectors
Conversely, there are times where you want to collapse a list into a vector. This can be done using `unlist`\index{\texttt{unlist}}\index{list!collapse to vector}:
```{r eval=FALSE}
unlist(temps)
```
$$\sim$$
```{exercise, label="ch5exc1b"}
Load the `vas.csv` data from Exercise \@ref(exr:ch3exc4). Split the `VAS` vector so that you get a list containing one vector for each patient. How can you then access the VAS values for patient 212?
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch5solutions1b)
## Working with numbers
A lot of data analyses involve numbers, which typically are represented as `numeric` values in R. We've already seen in Section \@ref(maths) that there are numerous mathematical operators that can be applied to numbers in R. But there are also other functions that come in handy when working with numbers.
### Rounding numbers
At times you may want to round numbers, either for presentation purposes or for some other reason. There are several functions that can be used for this\index{\texttt{round}}\index{\texttt{signif}}\index{\texttt{ceiling}}\index{\texttt{floor}}\index{\texttt{trunc}}:
```{r eval=FALSE}
a <- c(2.1241, 3.86234, 4.5, -4.5, 10000.1001)
round(a, 3) # Rounds to 3 decimal places
signif(a, 3) # Rounds to 3 significant digits
ceiling(a) # Rounds up to the nearest integer
floor(a) # Rounds down to the nearest integer
trunc(a) # Rounds to the nearest integer, toward 0
# (note the difference in how 4.5
# and -4.5 are treated!)
```
### Sums and means in data frames
When working with numerical data, you'll frequently find yourself wanting to compute sums or means of either columns or rows of data frames.\index{\texttt{colSums}}\index{\texttt{rowSums}}\index{\texttt{colMeans}}\index{\texttt{rowMeans}} The `colSums`, `rowSums`, `colMeans` and `rowMeans` functions can be used to do this. Here is an example with an expanded version of the `bookstore` data, where three purchases have been recorded for each customer:
```{r eval=FALSE}
bookstore2 <- data.frame(purchase1 = c(20, 59, 2, 12, 22, 160,
34, 34, 29),
purchase2 = c(14, 67, 9, 20, 20, 81,
19, 55, 8),
purchase3 = c(4, 62, 11, 18, 33, 57,
24, 49, 29))
colSums(bookstore2) # The total amount for customers' 1st, 2nd and
# 3rd purchases
rowSums(bookstore2) # The total amount for each customer
colMeans(bookstore2) # Mean purchase for 1st, 2nd and 3rd purchases
rowMeans(bookstore2) # Mean purchase for each customer
```
Moving beyond sums and means, in Section \@ref(vectorloops) you'll learn how to apply any function to the rows or columns of a data frame.
### Summaries of series of numbers {#rle}
When a `numeric` vector contains a series of consecutive measurements, as is the case e.g. in a time series, it is often of interest to compute various cumulative summaries. For instance, if the vector contains the daily revenue of a business during a month, it may be of value to know the total revenue up to each day - that is, the _cumulative sum_\index{cumulative functions} for each day.
Let's return to the `a10` data from Section \@ref(tsplots), which described the monthly anti-diabetic drug sales in Australia during 1991-2008.
```{r eval=FALSE}
library(fpp2)
a10
```
Elements 7 to 18 contain the sales for 1992. We can compute the total, highest and smallest monthly sales up to and including each month using `cumsum`, `cummax` and `cummin`\index{\texttt{cumsum}}\index{\texttt{cummax}}\index{\texttt{cummin}}:
```{r eval=FALSE}
a10[7:18]
cumsum(a10[7:18]) # Total sales
cummax(a10[7:18]) # Highest monthly sales
cummin(a10[7:18]) # Lowest monthly sales
# Plot total sales up to and including each month:
plot(1:12, cumsum(a10[7:18]),
xlab = "Month",
ylab = "Total sales",
type = "b")
```
In addition, the `cumprod`\index{\texttt{cumprod}} function can be used to compute cumulative products.
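For example, cumulative products can be used to compound a series of growth factors. Here is a small sketch using made-up monthly growth factors (not taken from the `a10` data):

```{r eval=FALSE}
growth <- c(1.02, 0.99, 1.05, 1.01)  # Hypothetical monthly growth factors
cumprod(growth)  # Accumulated growth after each month
```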
At other times, we are interested in studying _run lengths_\index{run length} in series, that is, the lengths of runs of equal values in a vector. Consider the `upp_temp`\index{data!\texttt{upp\_temp}} vector defined in the code chunk below, which contains the daily temperatures in Uppsala, Sweden, in February 2020^[Courtesy of the Department of Earth Sciences at Uppsala University.].
```{r eval=FALSE}
upp_temp <- c(5.3, 3.2, -1.4, -3.4, -0.6, -0.6, -0.8, 2.7, 4.2, 5.7,
3.1, 2.3, -0.6, -1.3, 2.9, 6.9, 6.2, 6.3, 3.2, 0.6, 5.5,
6.1, 4.4, 1.0, -0.4, -0.5, -1.5, -1.2, 0.6)
```
It could be interesting to look at runs of sub-zero days, i.e. consecutive days with sub-zero temperatures. The `rle`\index{\texttt{rle}} function counts the lengths of runs of equal values in a vector. To find the length of runs of temperatures below or above zero we can use the vector defined by the condition `upp_temp < 0`, the values of which are `TRUE` on sub-zero days and `FALSE` when the temperature is 0 or higher. When we apply `rle` to this vector, it returns the length and value of the runs:
```{r eval=FALSE}
rle(upp_temp < 0)
```
We first have a 2-day run of above zero temperatures (`FALSE`), then a 5-day run of sub-zero temperatures (`TRUE`), then a 5-day run of above zero temperatures, and so on.
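The output of `rle` is a list with two elements, `lengths` and `values`, which can be combined to answer questions like "how long was the longest sub-zero run?" (this assumes the `upp_temp` vector defined above):

```{r eval=FALSE}
runs <- rle(upp_temp < 0)
runs$lengths                    # The lengths of all runs
max(runs$lengths[runs$values])  # The longest sub-zero run: 5 days
```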
### Scientific notation `1e-03`
When printing very large or very small numbers, R uses _scientific notation_\index{scientific notation}, meaning that $7,000,000$ (7 followed by 6 zeroes) is displayed as (the mathematically equivalent) $7\cdot 10^6$ and $0.0000007$ is displayed as $7\cdot 10^{-7}$. Well, almost: the _ten raised to the power of x_ bit isn't really displayed as $10^x$, but as `e+x`\index{\texttt{e+}}, a notation used in many programming languages and calculators. Here are some examples:
```{r eval=FALSE}
7000000
0.0000007
7e+07
exp(30)
```
Scientific notation is a convenient way to display large numbers, but it's not always desirable. If you just want to print the number, the `format`\index{\texttt{format}} function can be used to convert it to a character, suppressing scientific notation:
```{r eval=FALSE}
format(7000000, scientific = FALSE)
```
If you still want your number to be a `numeric` (as you often do), a better choice is to change the option for when R uses scientific notation\index{\texttt{options}!\texttt{scipen}}. This can be done using the `scipen` argument in the `options` function:
```{r eval=FALSE}
options(scipen = 1000)
7000000
0.0000007
7e+07
exp(30)
```
To revert this option back to the default, you can use:
```{r eval=FALSE}
options(scipen = 0)
7000000
0.0000007
7e+07
exp(30)
```
Note that this option only affects how R _prints_ numbers, and not how they are treated in computations.
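A quick check that the stored value is unchanged, regardless of the `scipen` setting:

```{r eval=FALSE}
options(scipen = 1000)
7e+07 == 70000000   # TRUE - only the printing differs, not the value
options(scipen = 0) # Restore the default
```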
### Floating point arithmetics {#floatingpoints}
Some numbers cannot be written in finite decimal forms. Take $1/3$ for example, the decimal form of which is $$0.33333333333333333333333333333333\ldots.$$
Clearly, the computer cannot store this number exactly, as that would require infinite memory^[This is not strictly speaking true; if we use base 3, $1/3$ is written as $0.1$ which can be stored in finite memory. But then other numbers become problematic instead.]. Because of this, numbers in computers are stored as _floating point numbers_\index{floating point numbers}, which aim to strike a balance between _range_ (being able to store both very small and very large numbers) and _precision_ (being able to represent numbers accurately). Most of the time, calculations with floating points yield exactly the results that we'd expect, but sometimes these non-exact representations of numbers will cause unexpected problems. If we wish to compute $1.5-0.2$ and $1.1-0.2$, say, we could of course use R for that. Let's see if it gets the answers right:
```{r eval=FALSE}
1.5 - 0.2
1.5 - 0.2 == 1.3 # Check if 1.5-0.2=1.3
1.1 - 0.2
1.1 - 0.2 == 0.9 # Check if 1.1-0.2=0.9
```
The limitations of floating point arithmetics cause the second calculation to fail. To see what has happened, we can use `sprintf`\index{\texttt{sprintf}} to print numbers with 30 decimals (by default, R prints a rounded version with fewer decimals):
```{r eval=FALSE}
sprintf("%.30f", 1.1 - 0.2)
sprintf("%.30f", 0.9)
```
The first 12 decimals are identical, but after that the two numbers `1.1 - 0.2` and `0.9` diverge. In our other example, $1.5 - 0.2$, we don't encounter this problem - both `1.5 - 0.2` and `1.3` have the same floating point representation:
```{r eval=FALSE}
sprintf("%.30f", 1.5 - 0.2)
sprintf("%.30f", 1.3)
```
The order of the operations also matters in this case. The following three calculations would all yield identical results if performed with real numbers, but in floating point arithmetics the results differ:
```{r eval=FALSE}
1.1 - 0.2 - 0.9
1.1 - 0.9 - 0.2
1.1 - (0.9 + 0.2)
```
In most cases, it won't make a difference whether a variable is represented as $0.90000000000000013\ldots$ or $0.90000000000000002\ldots$, but in some cases tiny differences like that can propagate and cause massive problems. A famous example of this involves the US Patriot surface-to-air defence system, which at the end of the first Gulf war missed an incoming missile due to an error in floating point arithmetics^[Not in R though.]. It is important to be aware of the fact that floating point arithmetics occasionally will yield incorrect results. This can happen for numbers of any size, but is more likely to occur when very large and very small numbers appear in the same computation.
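A small sketch of the large-and-small phenomenon: when a number is large enough, adding 1 to it doesn't change its floating point representation at all:

```{r eval=FALSE}
1e15 + 1 - 1e15  # 1, as expected
1e16 + 1 - 1e16  # 0 - the 1 is lost, since 1e16 + 1 rounds to 1e16
```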
So, `1.1 - 0.2` and `0.9` may not be the same thing in floating point arithmetics, but at least they are _nearly_ the same thing. The `==` operator checks if two numbers are exactly equal, but there is an alternative that can be used to check if two numbers are nearly equal: `all.equal`\index{\texttt{all.equal}}. If the two numbers are (nearly) equal, it returns `TRUE`, and if they are not, it returns a description of how they differ. In order to avoid the latter, we can use the `isTRUE`\index{\texttt{isTRUE}} function to return `FALSE` instead:
```{r eval=FALSE}
1.1 - 0.2 == 0.9
all.equal(1.1 - 0.2, 0.9)
all.equal(1, 2)
isTRUE(all.equal(1, 2))
```
$$\sim$$
```{exercise, label="ch5exc2"}
These tasks showcase some problems that are commonly faced when working with numeric data:
1. The vector `props <- c(0.1010, 0.2546, 0.6009, 0.0400, 0.0035)` contains proportions (which, by definition, are between 0 and 1). Convert the proportions to percentages with one decimal place.
2. Compute the highest and lowest temperatures up to and including each day in the `airquality` dataset.
3. What is the longest run of days with temperatures above 80 in the `airquality` dataset?
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch5solutions2)
<br>
```{exercise, label="ch5exc3"}
These tasks are concerned with floating point arithmetics:
1. Very large numbers, like `10e500`, are represented by `Inf` (infinity) in R. Try to find out what the largest number that can be represented as a floating point number in R is.
2. Due to an error in floating point arithmetics, `sqrt(2)^2 - 2` is not equal to `0`. Change the order of the operations so that the result is `0`.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch5solutions3)
## Working with factors {#factors}
In Sections \@ref(catdata1) and \@ref(catdata2) we looked at how to analyse and visualise categorical data, i.e. data where the variables can take a fixed number of possible values that somehow correspond to groups or categories. But so far we haven't really gone into how to handle categorical variables in R.
Categorical data is stored in R as `factor`\index{\texttt{factor}} variables. You may ask why a special data structure is needed for categorical data, when we could just use `character` variables to represent the categories. Indeed, the latter is what R does by default, e.g. when creating a `data.frame` object or reading data from `.csv` and `.xlsx` files.
Let's say that you've conducted a survey on students' smoking habits. The possible responses are _Never_, _Occasionally_, _Regularly_ and _Heavy_. From 10 students, you get the following responses\index{data!\texttt{smoke}}:
```{r eval=FALSE}
smoke <- c("Never", "Never", "Heavy", "Never", "Occasionally",
"Never", "Never", "Regularly", "Regularly", "No")
```
Note that the last answer is invalid - `No` was not one of the four answers that were allowed for the question.
You could use `table` to get a summary of how many answers of each type that you got:
```{r eval=FALSE}
table(smoke)
```
But the categories are not presented in the correct order! There is a clear order between the different categories, _Never_ < _Occasionally_ < _Regularly_ < _Heavy_, but `table` doesn't present the results in that way. Moreover, R didn't recognise that `No` was an invalid answer, and treats it just the same as the other categories.
This is where `factor` variables come in. They allow you to specify which values your variable can take, and the ordering between them (if any).
### Creating factors
When creating a `factor` variable, you typically start with a `character`, `numeric` or `logical` variable, the values of which are turned into categories. To turn the `smoke` vector that you created in the previous section into a `factor`, you can use the `factor`\index{\texttt{factor}} function:
```{r eval=FALSE}
smoke2 <- factor(smoke)
```
You can inspect the elements, and _levels_\index{\texttt{levels}}, i.e. the values that the categorical variable takes, as follows:
```{r eval=FALSE}
smoke2
levels(smoke2)
```
So far, we have solved neither the problem of the categories being in the wrong order nor that of the invalid `No` value. To fix both problems, we can use the `levels` argument in `factor`:
```{r eval=FALSE}
smoke2 <- factor(smoke, levels = c("Never", "Occasionally",
"Regularly", "Heavy"),
ordered = TRUE)
# Check the results:
smoke2
levels(smoke2)
table(smoke2)
```
You control the order in which the levels are presented by the order in which you write them in the `levels` argument. The `ordered = TRUE` argument specifies that the ordering of the levels is _meaningful_. It can be excluded in cases where you specify the order of the categories purely for presentation purposes (e.g. when specifying whether to use the order `Male/Female/Other` or `Female/Male/Other`). Also note that the `No` answer now became an `NA`, which in the case of `factor` variables represents both missing observations and invalid observations. To find the values of `smoke` that became `NA` in `smoke2` you can use `which` and `is.na`:
```{r eval=FALSE}
smoke[which(is.na(smoke2))]
```
By checking the original values of the `NA` elements, you can see if they should be excluded from the analysis or recoded into a proper category (`No` could for instance be recoded into `Never`). In Section \@ref(regexp) you'll learn how to replace values in larger datasets automatically using regular expressions.
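For a small dataset like this, the recoding can be done directly, before converting to a `factor`. A minimal sketch (the names `smoke_recoded` and `smoke3` are just illustrative), assuming the `smoke` vector defined earlier:

```{r eval=FALSE}
smoke_recoded <- smoke
smoke_recoded[smoke_recoded == "No"] <- "Never"
smoke3 <- factor(smoke_recoded, levels = c("Never", "Occasionally",
                                           "Regularly", "Heavy"),
                 ordered = TRUE)
table(smoke3)  # No NA values this time
```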
### Changing factor levels {#factorlevels}
When we created `smoke2`, one of the elements became an `NA`. `NA` was however not included as a level of the `factor`. Sometimes it is desirable to include `NA` as a level, for instance when you want to analyse rows with missing data. This is easily done using the `addNA`\index{\texttt{addNA}}\index{\texttt{factor}!add `NA` level} function:
```{r eval=FALSE}
smoke2 <- addNA(smoke2)
```
If you wish to change the name of one or more of the `factor` levels, you can do it directly via the `levels` function. For instance, we can change the name of the `NA` category, which is the 5th level of `smoke2`, as follows\index{\texttt{factor}!rename levels}:
```{r eval=FALSE}
levels(smoke2)[5] <- "Invalid answer"
```
The above solution is a little brittle in that it relies on specifying the index of the level name, which can change if we're not careful. More robust solutions using the `data.table` and `dplyr` packages are presented in Section \@ref(recodedplyr).
Finally, if you've added more levels than what are actually used, these can be dropped using the `droplevels`\index{\texttt{droplevels}}\index{\texttt{factor}!drop levels} function:
```{r eval=FALSE}
smoke2 <- factor(smoke, levels = c("Never", "Occasionally",
"Regularly", "Heavy",
"Constantly"),
ordered = TRUE)
levels(smoke2)
smoke2 <- droplevels(smoke2)
levels(smoke2)
```
### Changing the order of levels
Now suppose that we'd like the levels of the `smoke2` variable to be presented in the reverse order: _Heavy_, _Regularly_, _Occasionally_, and _Never_.\index{\texttt{factor}!change order of levels} This can be done by a new call to `factor`, where the new level order is specified in the `levels` argument:
```{r eval=FALSE}
smoke2 <- factor(smoke2, levels = c("Heavy", "Regularly",
"Occasionally", "Never"))
# Check the results:
levels(smoke2)
```
### Combining levels
Finally, `levels` can be used to merge categories by replacing their separate names with a single name.\index{\texttt{factor}!combine levels} For instance, we can combine the smoking categories _Occasionally_, _Regularly_, and _Heavy_ into a single category named _Yes_. Assuming that these are the first, second, and third in the list of names (as will be the case if you've run the last code chunk above), here's how to do it:
```{r eval=FALSE}
levels(smoke2)[1:3] <- "Yes"
# Check the results:
levels(smoke2)
```
Alternative ways to do this are presented in Section \@ref(recodedplyr).
$$\sim$$
```{exercise, label="ch5exc4"}
In Exercise \@ref(exr:ch3exc3b) you learned how to create a `factor` variable from a `numeric` variable using `cut`. Return to your solution [(or the solution at the back of the book)](#ch3solutions3b) and do the following:
1. Change the category names to `Mild`, `Moderate` and `Hot`.
2. Combine `Moderate` and `Hot` into a single level named `Hot`.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch5solutions4)
<br>
```{exercise, label="ch5exc5"}
Load the `msleep` data from the `ggplot2` package. Note that the categorical variable `vore` is stored as a `character`. Convert it to a `factor` by running `msleep$vore <- factor(msleep$vore)`.
1. How are the resulting factor levels ordered? Why are they ordered in that way?
2. Compute the mean value of `sleep_total` for each `vore` group.
3. Sort the factor levels according to their `sleep_total` means. Hint: this can be done manually, or more elegantly using e.g. a combination of the functions `rank`\index{\texttt{rank}} and `match`\index{\texttt{match}} in an intermediate step.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch5solutions5)
## Working with strings {#strings}
Text in R is represented by `character` strings. These are created using double or single quotes. I recommend double quotes for three reasons. First, double quotes are the default in R, and the recommended style (see e.g. `?Quotes`). Second, they improve readability - double quotes are easier to spot in code than single quotes. Third, they let you use apostrophes inside your strings, which single-quoted strings don't allow (because the apostrophe would be interpreted as the end of the string). Single quotes can however be used if you need to include double quotes inside your string:
```{r eval=FALSE}
# This works:
text1 <- "An example of a string. Isn't this great?"
text2 <- 'Another example of a so-called "string".'
# This doesn't work:
text1_fail <- 'An example of a string. Isn't this great?'
text2_fail <- "Another example of a so-called "string"."
```
If you check what these two strings look like, you'll notice something funny about `text2`:
```{r eval=FALSE}
text1
text2
```
R has put backslash characters, `\`, before the double quotes. The backslash is called an _escape character_\index{escape character}, which invokes a different interpretation of the character that follows it. In fact, you can use this to put double quotes inside a string that you define using double quotes:
```{r eval=FALSE}
text2_success <- "Another example of a so-called \"string\"."
```
There are a number of other special characters that can be included using a backslash: `\n` for a line break (a new line) and `\t` for a tab (a long whitespace) being the most important^[See `?Quotes` for a complete list.]:
```{r eval=FALSE}
text3 <- "Text...\n\tWith indented text on a new line!"
```
To print your string in the Console in a way that shows special characters instead of their escape character-versions, use the function `cat`\index{\texttt{cat}}:
```{r eval=FALSE}
cat(text3)
```
You can also use `cat` to print the string to a text file...
```{r eval=FALSE}
cat(text3, file = "new_findings.txt")
```
...and to append text at the end of a text file:
```{r eval=FALSE}
cat("Let's add even more text!", file = "new_findings.txt",
append = TRUE)
```
(Check the output by opening `new_findings.txt`!)
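If you'd rather check the file's contents from within R, you can use `readLines`, which reads a text file into a `character` vector with one element per line:

```{r eval=FALSE}
# Each line of the file becomes one element of the vector:
readLines("new_findings.txt")
```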
### Concatenating strings
If you wish to concatenate multiple strings, `cat` will do that for you:
```{r eval=FALSE}
first <- "This is the beginning of a sentence"
second <- "and this is the end."
cat(first, second)
```
By default, `cat` places a single white space between the two strings, so that `"This is the beginning of a sentence"` and
`"and this is the end."` are concatenated to `"This is the beginning of a sentence and this is the end."`. You can change that using the `sep` argument in `cat`. You can also add as many strings as you like as input:
```{r eval=FALSE}
cat(first, second, sep = "; ")
cat(first, second, sep = "\n")
cat(first, second, sep = "")
cat(first, second, "\n", "And this is another sentence.")
```
At other times, you want to concatenate two or more strings without printing them. You can then use `paste` in exactly the same way as you'd use `cat`, the exception being that `paste`\index{\texttt{paste}} returns a string instead of printing it.
```{r eval=FALSE}
my_sentence <- paste(first, second, sep = "; ")
my_novel <- paste(first, second, "\n",
"And this is another sentence.")
# View results:
my_sentence
my_novel
cat(my_novel)
```
Finally, if you wish to create a number of similar strings based on information from other variables, you can use `sprintf`\index{\texttt{sprintf}}, which allows you to write a string using `%s` as a placeholder for the values that should be pulled from other variables:
```{r eval=FALSE}
names <- c("Irma", "Bea", "Lisa")
ages <- c(5, 59, 36)
sprintf("%s is %s years old.", names, ages)
```
There are many more uses of `sprintf` (we've already seen some in Section \@ref(floatingpoints)), but this is enough for us for now.
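Should you need more control over the output, `sprintf` also has placeholders for specific types, for instance `%d` for integers and `%.3f` for decimal numbers rounded to three decimals:

```{r eval=FALSE}
sprintf("%d out of %d participants answered.", 7, 10)
sprintf("Pi is approximately %.3f.", pi)
```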
### Changing case
If you need to translate characters from lowercase to uppercase or vice versa, that can be done using `toupper` and `tolower`\index{\texttt{toupper}}\index{\texttt{tolower}}:
```{r eval=FALSE}
my_string <- "SOMETIMES I SCREAM (and sometimes I whisper)."
toupper(my_string)
tolower(my_string)
```
If you only wish to change the case of some particular element in your string, you can use `substr`\index{\texttt{substr}}, which allows you to access substrings:
```{r eval=FALSE}
months <- c("january", "february", "march", "aripl")
# Replacing characters 2-4 of months[4] with "pri":
substr(months[4], 2, 4) <- "pri"
months
# Replacing characters 1-1 (i.e. character 1) of each element of month
# with its uppercase version:
substr(months, 1, 1) <- toupper(substr(months, 1, 1))
months
```
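`substr` also works without assignment, in which case it simply extracts the substring:

```{r eval=FALSE}
# Extract characters 1-3 of each month name:
substr(months, 1, 3)
```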
### Finding patterns using regular expressions {#regexp}
_Regular expressions_, or regexps for short, are special strings that describe patterns\index{regular expression}. They are extremely useful if you need to find, replace or otherwise manipulate a number of strings depending on whether or not a certain pattern exists in each one of them. For instance, you may want to find all strings containing only numbers and convert them to `numeric`, or find all strings that contain an email address and remove said addresses (for censoring purposes, say). Regular expressions are incredibly useful, but can be daunting. Not everyone will need them, and if this all seems a bit too much to you, you can safely skip this section, or just skim through it, and return to it at a later point.
To illustrate the use of regular expressions we will use a sheet from the `projects-email.xlsx` file from the book's web page. In Exercise \@ref(exr:ch3exc5), you explored the second sheet in this file, but here we'll use the third instead. Set `file_path` to the path to the file, and then run the following code to import the data\index{data!\texttt{contacts}}:
```{r eval=FALSE}
library(openxlsx)
contacts <- read.xlsx(file_path, sheet = 3)
str(contacts)
```
There are now three variables in `contacts`. We'll primarily be concerned with the third one: `Address`. Some people have email addresses attached to them, others have postal addresses and some have no address at all:
```{r eval=FALSE}
contacts$Address
```
You can find loads of guides on regular expressions online, but few of them are easy to use with R, the reason being that regular expressions in R sometimes require escape characters that aren't needed in some other programming languages. In this section we'll take a look at regular expressions, _as they are written in R_.
The basic building blocks of regular expressions are patterns consisting of one or more characters. If, for instance, we wish to find all occurrences of the letter `y` in a vector of strings, the regular expression describing that "pattern" is simply `"y"`. The functions used to find occurrences of patterns are called `grep` and `grepl`. They differ only in the output they return: `grep` returns the indices of the strings containing the pattern, and `grepl` returns a `logical` vector with `TRUE` at indices matching the patterns and `FALSE` at other indices.
To find all addresses containing a lowercase `y`, we use `grep` and `grepl` as follows:
```{r eval=FALSE}
grep("y", contacts$Address)
grepl("y", contacts$Address)
```
Note how both outputs contain the same information presented in different ways.
In the same way, we can look for words or substrings. For instance, we can find all addresses containing the string `"Edin"`:
```{r eval=FALSE}
grep("Edin", contacts$Address)
grepl("Edin", contacts$Address)
```
Similarly, we can also look for special characters. Perhaps we can find all email addresses by looking for strings containing the `@` symbol:
```{r eval=FALSE}
grep("@", contacts$Address)
grepl("@", contacts$Address)
# To display the addresses matching the pattern:
contacts$Address[grep("@", contacts$Address)]
```
Interestingly, this includes two rows that aren't email addresses. To separate the email addresses from the other rows, we'll need a more complicated regular expression, describing the pattern of an email address in more general terms. Here are four examples of regular expressions that'll do the trick:
```{r eval=FALSE}
grep(".+@.+[.].+", contacts$Address)
grep(".+@.+\\..+", contacts$Address)
grep("[[:graph:]]+@[[:graph:]]+[.][[:alpha:]]+", contacts$Address)
grep("[[:alnum:]._-]+@[[:alnum:]._-]+[.][[:alpha:]]+",
contacts$Address)
```
To wrap our heads around what these mean, we'll have a look at the building blocks of regular expressions. These are:
* Patterns describing a single character.
* Patterns describing a class of characters, e.g. letters or numbers.
* Repetition quantifiers describing how many repetitions of a pattern to look for.
* Other operators.
We've already looked at single character expressions, as well as the multi-character expression `"Edin"` which simply is a combination of four single-character expressions. Patterns describing classes of characters, e.g. characters with certain properties, are denoted by brackets `[]` (for manually defined classes) or double brackets `[[]]` (for predefined classes). One example of the latter is `"[[:digit:]]"`, which is a pattern that matches all digits: `0 1 2 3 4 5 6 7 8 9`. Let's use it to find all addresses containing a number:
```{r eval=FALSE}
grep("[[:digit:]]", contacts$Address)
contacts$Address[grep("[[:digit:]]", contacts$Address)]
```
Some important predefined classes are:
* `[[:lower:]]` matches lowercase letters,
* `[[:upper:]]` matches UPPERCASE letters,
* `[[:alpha:]]` matches both lowercase and UPPERCASE letters,
* `[[:digit:]]` matches digits: `0 1 2 3 4 5 6 7 8 9`,
* `[[:alnum:]]` matches alphanumeric characters (alphabetic characters and digits),
* `[[:punct:]]` matches punctuation characters: `` ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ ``,
* `[[:space:]]` matches space characters: space, tab, newline, and so on,
* `[[:graph:]]` matches letters, digits, and punctuation characters,
* `[[:print:]]` matches letters, digits, punctuation characters, and space characters,
* `.` matches _any_ character.
Examples of manually defined classes are:
* `[abcd]` matches `a`, `b`, `c`, and `d`,
* `[a-d]` matches `a`, `b`, `c`, and `d`,
* `[aA12]` matches `a`, `A`, `1` and `2`,
* `[.]` matches `.`,
* `[.,]` matches `.` and `,`,
* `[^abcd]` matches anything except `a`, `b`, `c`, or `d`.
So for instance, we can find all addresses that contain at least one character other than the letters `y` and `z` using:
```{r eval=FALSE}
grep("[^yz]", contacts$Address)
contacts$Address[grep("[^yz]", contacts$Address)]
```
All of these patterns can be combined with patterns describing a single character:
* `gr[ea]y` matches `grey` and `gray` (but not `greay`!),
* `b[^o]g` matches `bag`, `beg`, and similar strings, but not `bog`,
* `[.]com` matches `.com`.
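A quick way to convince yourself of how these combinations behave is to try them on a small toy vector:

```{r eval=FALSE}
words <- c("grey", "gray", "greay", "bog", "bag")
grepl("gr[ea]y", words)   # TRUE for "grey" and "gray" only
grepl("b[^o]g", words)    # TRUE for "bag" only
```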
When using the patterns above, you only look for a single occurrence of the pattern. Sometimes you may want a pattern like _a word of 2-4 letters_ or _any number of digits in a row_. To create these, you add repetition patterns to your regular expression:
* `?` means that the preceding pattern is matched _at most once_, i.e. 0 or 1 time,
* `*` means that the preceding pattern is matched _0 or more_ times,
* `+` means that the preceding pattern is matched _at least once_, i.e. 1 time or more,
* `{n}` means that the preceding pattern is matched _exactly_ `n` times,
* `{n,}` means that the preceding pattern is matched _at least_ `n` times, i.e. `n` times or more,
* `{n,m}` means that the preceding pattern is matched _at least_ `n` times _but not more than_ `m` times.
Here are some examples of how repetition patterns can be used:
```{r eval=FALSE}
# There are multiple ways of finding strings containing two n's
# in a row:
contacts$Address[grep("nn", contacts$Address)]
contacts$Address[grep("n{2}", contacts$Address)]
# Find strings with words beginning with an uppercase letter, followed
# by at least one lowercase letter:
contacts$Address[grep("[[:upper:]][[:lower:]]+", contacts$Address)]
# Find strings with words beginning with an uppercase letter, followed
# by at least six lowercase letters:
contacts$Address[grep("[[:upper:]][[:lower:]]{6,}", contacts$Address)]
# Find strings containing any number of letters, followed by any
# number of digits, followed by a space:
contacts$Address[grep("[[:alpha:]]+[[:digit:]]+[[:space:]]",
contacts$Address)]
```
Finally, there are some other operators that you can use to create even more complex patterns:
* `|` alternation, picks one of multiple possible patterns. For example, `ab|bc` matches `ab` or `bc`.
* `()` parentheses are used to denote a subset of an expression that should be evaluated separately. For example, `colo|our` matches `colo` or `our` while `col(o|ou)r` matches `color` or `colour`.
* `^`, when used outside of brackets `[]`, means that the match should be found at the start of the string. For example, `^a` matches strings beginning with `a`, but not `"dad"`.
* `$` means that the match should be found at the end of the string. For example, `a$` matches strings ending with `a`, but not `"dad"`.
* `\\` escape character that can be used to match special characters like `.`, `^` and `$` (`\\.`, `\\^`, `\\$`).
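As a small illustration of the last few operators, here is how `^`, `$`, `|`, and `\\` behave on a toy vector:

```{r eval=FALSE}
fruits <- c("apple", "banana", "pear")
grepl("^a", fruits)           # Starts with a?
grepl("a$", fruits)           # Ends with a?
grepl("pp|rr", fruits)        # Contains pp or rr?
grepl("\\.", c("a.b", "ab"))  # Contains a literal period?
```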
This may seem like a lot (and it is!), but there are in fact many more possibilities when working with regular expressions. For the sake of brevity, we'll leave it at this for now though.
Let's return to those email addresses. We saw four regular expressions that could be used to find them:
```{r eval=FALSE}
grep(".+@.+[.].+", contacts$Address)
grep(".+@.+\\..+", contacts$Address)
grep("[[:graph:]]+@[[:graph:]]+[.][[:alpha:]]+", contacts$Address)
grep("[[:alnum:]._-]+@[[:alnum:]._-]+[.][[:alpha:]]+",
contacts$Address)
```
The first two of these both specify the same pattern: _any number of any characters, followed by an `@`, followed by any number of any characters, followed by a period `.`, followed by any number of characters_. This will match email addresses, but would also match strings like `"?=)(/x@!.a??"`, which isn't a valid email address. In this case, that's not a big issue, as our goal was to find addresses that looked like email addresses, and not to verify that the addresses were valid.
The third alternative has a slightly different pattern: _any number of letters, digits, and punctuation characters, followed by an `@`, followed by any number of letters, digits, and punctuation characters, followed by a period `.`, followed by any number of letters_. This too would match `"?=)(/x@!.a??"` as it allows punctuation characters that don't usually occur in email addresses. The fourth alternative, however, won't match `"?=)(/x@!.a??"` as it only allows letters, digits and the symbols `.`, `_` and `-` in the name and domain name of the address.
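`grep` only tells you _which_ strings contain a match. If you also want to pull out the matched text itself, you can combine `regexpr`, which locates the first match within each string, with `regmatches`, which extracts it. A small sketch on a toy vector (the same approach works on `contacts$Address`):

```{r eval=FALSE}
strings <- c("Contact: anna@example.com", "No address here",
             "mail bob.smith@mail.org today")
matches <- regexpr("[[:alnum:]._-]+@[[:alnum:]._-]+[.][[:alpha:]]+",
                   strings)
# Returns only the matched parts, skipping strings without a match:
regmatches(strings, matches)
```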
### Substitution {#sub}
An important use of regular expressions is in substitutions, where the parts of strings that match the pattern in the expression are replaced by another string. There are two email addresses in our data that contain `(a)` instead of `@`:
```{r eval=FALSE}
contacts$Address[grep("[(]a[])]", contacts$Address)]
```
If we wish to replace the `(a)` by `@`, we can do so using `sub` and `gsub`\index{\texttt{sub}}\index{\texttt{gsub}}. The former replaces only the _first_ occurrence of the pattern in each string, whereas the latter replaces _all_ occurrences.
```{r eval=FALSE}
contacts$Address[grep("[(]a[])]", contacts$Address)]
sub("[(]a[])]", "@", contacts$Address) # Replace first occurrence
gsub("[(]a[])]", "@", contacts$Address) # Replace all occurrences
```
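`sub` and `gsub` can also reuse parts of the match in the replacement: wrap part of the pattern in parentheses to create a _group_, and refer back to it with `\\1`, `\\2`, and so on. For instance, to turn a name written as "Surname, Givenname" around:

```{r eval=FALSE}
# \\1 and \\2 refer to the first and second parenthesised groups:
gsub("([[:alpha:]]+), ([[:alpha:]]+)", "\\2 \\1", "Smith, Anna")
```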
### Splitting strings
At times you want to extract only a part of a string, for example if measurements recorded in a column contain units, e.g. `66.8 kg` instead of `66.8`. To split a string into different parts, we can use `strsplit`\index{\texttt{strsplit}}.
As an example, consider the email addresses in our `contacts` data. Suppose that we want to extract the user names from all email addresses, i.e. remove the `@domain.topdomain` part. First, we store all email addresses from the data in a new vector, and then we split them at the `@` sign:
```{r eval=FALSE}
emails <- contacts$Address[grepl(
"[[:alnum:]._-]+@[[:alnum:]._-]+[.][[:alpha:]]+",
contacts$Address)]
emails_split <- strsplit(emails, "@")
emails_split
```
`emails_split` is a _list_. In this case, it seems convenient to convert the split strings into a matrix using `unlist` and `matrix`\index{\texttt{unlist}}\index{\texttt{matrix}} (you may want to have a quick look at Exercise \@ref(exr:ch3exc1b) to re-familiarise yourself with `matrix`):
```{r eval=FALSE}
emails_split <- unlist(emails_split)
# Store in a matrix with length(emails_split)/2 rows and 2 columns:
emails_matrix <- matrix(emails_split,
nrow = length(emails_split)/2,
ncol = 2,
byrow = TRUE)
# Extract usernames:
emails_matrix[,1]
```
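An alternative to the matrix route is to pick out the first part of each element of the list directly, for instance with `sapply` (a sketch on a toy vector):

```{r eval=FALSE}
toy_emails <- c("anna@example.com", "bob@mail.org")
toy_split <- strsplit(toy_emails, "@")
# Pick the first part (the username) of each list element:
sapply(toy_split, function(x) x[1])
```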
Similarly, when working with data stored in data frames, it is sometimes desirable to split a column containing strings into two columns. Some convenience functions for this are discussed in Section \@ref(splittingcolumns).
### Variable names
Variable names can be very messy, particularly when they are imported from files. You can access and manipulate the variable names of a data frame using `names`\index{\texttt{names}}\index{data frame!change variable names}\index{variable!name}:
```{r eval=FALSE}
names(contacts)
names(contacts)[1] <- "ID number"
grep("[aA]", names(contacts))
```
$$\sim$$
```{exercise, label="ch5exc6"}
[Download the file `handkerchief.csv`\index{data!\texttt{handkerchiefs.csv}} from the book's web page](http://www.modernstatisticswithr.com/data.zip). It contains a short list of prices of Italian handkerchiefs from the [1769 publication](https://books.google.se/books?id=rUxiAAAAcAAJ) _Prices in those branches of the weaving manufactory, called, the black branch, and, the fancy branch_. Load the data in a data frame in R and then do the following:
1. Read the documentation for the function `nchar`\index{\texttt{nchar}}. What does it do? Apply it to the `Italian.handkerchief` column of your data frame.
2. Use `grep` to find out how many rows of the `Italian.handkerchief` column contain numbers.
3. Find a way to extract the prices in shillings (S) and pence (D) from the `Price` column, storing these in two new `numeric` variables in your data frame.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch5solutions6)
<br>
```{exercise, label="ch5exc65"}
[Download the `oslo-biomarkers.xlsx`\index{data!\texttt{oslo-biomarkers.xlsx}} data from the book's web page](http://www.modernstatisticswithr.com/data.zip). It contains data from a medical study about patients with disc herniations, performed at the Oslo University Hospital, Ullevål (this is a modified^[For patient confidentiality purposes.] version of the data analysed by Moen et al. (2016)). Blood samples were collected from a number of patients with disc herniations at three time points: 0 weeks (first visit at the hospital), 6 weeks and 12 months. The levels of some biomarkers related to inflammation were measured in each blood sample. The first column in the spreadsheet contains information about the patient ID and the time point of sampling. Load the data and check its structure. Each patient is uniquely identified by their ID number. How many patients were included in the study?
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch5solutions65)
<br>
```{exercise, label="ch5exc7"}
What patterns do the following regular expressions describe? Apply them to the `Address` vector of the `contacts` data to check that you interpreted them correctly.
1. `"$g"`
2. `"^[^[[:digit:]]"`
3. `"a(s|l)"`
4. `"[[:lower:]]+[.][[:lower:]]+"`
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch5solutions7)
<br>
```{exercise, label="ch5exc7b"}
Write code that, given a string, creates a vector containing all words from the string, with one word in each element and no punctuation marks. Apply it to the following string to check that it works:
```{r eval=FALSE}
x <- "This is an example of a sentence, with 10 words. Here are 4 more!"
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch5solutions7b)
## Working with dates and times {#datetime}
Data describing dates and times can be complex, not least because they can be written in so many different formats\index{date format}. 1 April 2020 can be written as `2020-04-01`, `20/04/01`, `200401`, `1/4 2020`, `4/1/20`, `1 Apr 20`, and a myriad of other ways. 5 past 6 in the evening can be written as `18:05` or `6.05 pm`. In addition to this ambiguity, time zones, daylight saving time, leap years and even leap seconds make working with dates and times even more complicated.
The default in R is to use the ISO8601 standards, meaning that dates are written as YYYY-MM-DD and that times are written using the 24-hour hh:mm:ss format. In order to avoid confusion, you should always use these, unless you have _very_ strong reasons not to.
Dates in R are represented as `Date`\index{\texttt{Date}} objects, and dates with times as `POSIXct`\index{\texttt{POSIXct}} objects. The examples below are concerned with `Date` objects, but you will explore `POSIXct` too, in Exercise \@ref(exr:ch5exc8).
### Date formats
The `as.Date`\index{\texttt{as.Date}} function tries to coerce a `character` string to a date. For some formats, it will automatically succeed, whereas for others, you have to provide the format of the date manually. To complicate things further, which formats work automatically will depend on your system settings. Consequently, the safest option is always to specify the format of your dates, to make sure that the code will still run if at some point you have to execute it on a different machine. To help describe date formats, R has a number of tokens to describe days, months and years:
* `%d` - day of the month as a number (01-31).
* `%m` - month of the year as a number (01-12).
* `%y` - year without century (00-99).
* `%Y` - year with century (e.g. 2020).
Here are some examples of date formats, all describing 1 April 2020 - try them both with and without specifying the format to see what happens:
```{r eval=FALSE}
as.Date("2020-04-01")
as.Date("2020-04-01", format = "%Y-%m-%d")
as.Date("4/1/20")
as.Date("4/1/20", format = "%m/%d/%y")
# Sometimes dates are expressed as the number of days since a
# certain date. For instance, 1 April 2020 is 43,920 days after
# 1 January 1900:
as.Date(43920, origin = "1900-01-01")
```
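One reason to convert strings to `Date` objects in the first place is that dates, unlike strings, support arithmetic and comparisons:

```{r eval=FALSE}
date1 <- as.Date("2020-04-01", format = "%Y-%m-%d")
date2 <- as.Date("2020-03-01", format = "%Y-%m-%d")
date1 + 7       # One week later
date1 - date2   # Number of days between the dates
date1 > date2   # Comparison
```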
If the date includes month or weekday names, you can use tokens to describe that as well:
* `%b` - abbreviated month name, e.g. `Jan`, `Feb`.
* `%B` - full month name, e.g. `January`, `February`.
* `%a` - abbreviated weekday, e.g. `Mon`, `Tue`.
* `%A` - full weekday, e.g. `Monday`, `Tuesday`.
Things become a little more complicated now though, because R will interpret the names as if they were written in the language set in your _locale_,\index{\texttt{Sys.getlocale}}\index{locale} which contains a number of settings related to your language and region. To find out what language is in your locale, you can use:
```{r eval=FALSE}
Sys.getlocale("LC_TIME")
```
I'm writing this on a machine with Swedish locale settings (my output from the above code chunk is `"sv_SE.UTF-8"`). The Swedish word for _Wednesday_ is _onsdag_^[The Swedish _onsdag_ and English _Wednesday_ both derive from the proto-Germanic _Wodensdag_, Odin's day, in honour of the old Germanic god of that name.], and therefore the following code doesn't work on my machine:
```{r eval=FALSE}
as.Date("Wednesday 1 April 2020", format = "%A %d %B %Y")
```
However, if I translate it to Swedish, it runs just fine:
```{r eval=FALSE}
as.Date("Onsdag 1 april 2020", format = "%A %d %B %Y")
```
You may at times need to make similar translations of dates. One option is to use `gsub` to translate the names of months and weekdays into the correct language (see Section \@ref(sub)). Alternatively, you can change the locale settings. On most systems, the following setting\index{\texttt{Sys.setlocale}} will allow you to read English months and days properly:
```{r eval=FALSE}
Sys.setlocale("LC_TIME", "C")
```
The locale settings will revert to the defaults the next time you start R.
Conversely, you may want to extract part of a `Date` object as a string, for instance the day of the month. This can be done using `strftime`\index{\texttt{strftime}}, using the same tokens as above. Here are some examples, including one with the token `%j`, which can be used to extract the day of the year:
```{r eval=FALSE}
dates <- as.Date(c("2020-04-01", "2021-01-29", "2021-02-22"),
format = "%Y-%m-%d")
# Extract the day of the month:
strftime(dates, format = "%d")
# Extract the month:
strftime(dates, format = "%m")
# Extract the year:
strftime(dates, format = "%Y")
# Extract the day of the year:
strftime(dates, format = "%j")
```
Should you need to, you can of course convert these objects from `character` to `numeric` using `as.numeric`.
For a complete list of tokens that can be used to describe date patterns, see `?strftime`.
$$\sim$$
```{exercise, label="ch5exc75"}
Consider the following `Date` vector:
```{r eval=FALSE}
dates <- as.Date(c("2015-01-01", "1984-03-12", "2012-09-08"),
format = "%Y-%m-%d")
```
1. Apply the functions `weekdays`, `months` and `quarters`\index{\texttt{weekdays}}\index{\texttt{months}}\index{\texttt{quarters}} to the vector. What do they do?
2. Use the `julian`\index{\texttt{julian}} function to find out how many days passed between 1970-01-01 and the dates in `dates`.
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch5solutions75)
<br>
```{exercise, label="ch5exc8"}
Consider the three `character` objects created below:
```{r eval=FALSE}
time1 <- "2020-04-01 13:20"
time2 <- "2020-04-01 14:30"
time3 <- "2020-04-03 18:58"
```
1. What happens if you convert the three variables to `Date` objects using `as.Date` without specifying the date format?
2. Convert `time1` to a `Date` object and add `1` to it. What is the result?
3. Convert `time3` and `time1` to `Date` objects and subtract them. What is the result?
4. Convert `time2` and `time1` to `Date` objects and subtract them. What is the result?
5. What happens if you convert the three variables to `POSIXct` date and time objects using `as.POSIXct` without specifying the date format?
6. Convert `time3` and `time1` to `POSIXct` objects and subtract them. What is the result?
7. Convert `time2` and `time1` to `POSIXct` objects and subtract them. What is the result?
8. Use the `difftime`\index{\texttt{difftime}} function to repeat the calculation in task 6, but with the result presented in hours.
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch5solutions8)
<br>
```{exercise, label="ch5exc8b"}
In some fields, e.g. economics, data is often aggregated on a quarter-year level, as in these examples:
```{r eval=FALSE}
qvec1 <- c("2020 Q4", "2021 Q1", "2021 Q2")