```{todo-tables-general, echo=FALSE, eval=FALSE}
prolly use fewer distinct example data sets, especially in joins. this can be done after everything's in and organized throughout the full tutorial
then again, maybe it's not a real problem
```
# Tables {#tables}
This chapter covers navigating tabular data with R. The core syntax is (\@ref(dt-syntax)):
    DT[where, select|update|do, by] # SQL verbs
    DT[i, j, by]                    # R function arguments
It reads as:
1. Subset using `i`, then
1. Group using `by`, then
1. Do `j`
## Essential packages
This section covers packages that help me work more efficiently, to the point where I regard them as essential. They'll be used throughout the rest of this document. They don't have any dependencies on other packages, and I expect they'll be available as long as R is. It will be assumed in subsequent sections that these libraries have been attached:
```{r loadem, warning=FALSE}
library(magrittr)
library(data.table)
```
### The magrittr package {#magrittr}
[Magrittr](https://cran.r-project.org/package=magrittr) introduces syntax with "pipes," improving readability by unnesting function calls:
```{r magrittr-calls, eval=FALSE}
# we can do
x %>% f %>% g %>% h
# instead of
h(g(f(x)))
```
In addition, it allows more compact function definitions (\@ref(function-writing)):
```{r magrittr-chain}
fun = . %>% .^2 %>% sum %>% sqrt
fun(1:3)
```
Despite these advantages, magrittr adds overhead to every call and can slow down code, so avoid it anywhere speed might be an issue.
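As a rough illustration (not a rigorous benchmark), the per-call overhead only shows up across many repeated calls:

```{r pipe-overhead, eval=FALSE}
x = 1:10
system.time(for (i in 1:1e5) sum(x))      # plain call
system.time(for (i in 1:1e5) x %>% sum)   # piped call, noticeably slower in a tight loop
```

For one-off interactive work, the difference is imperceptible.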
To install, just use CRAN:
```{r inst-mgr, eval=FALSE}
install.packages('magrittr')
```
The name is a pun on [René Magritte's pipe painting](https://en.wikipedia.org/wiki/The_Treachery_of_Images). To get a handle on the package, I would start with the package [vignette](https://CRAN.R-project.org/package=magrittr/vignettes/magrittr.html).
```{block2 mispipery, type='rmd-caution'}
**Gotchas with pipes.** Typing `3/c(3,3) %>% sum`, you might expect 3/3 + 3/3 = 2, but instead get 3/(3+3) = 1/2. This is a confusing quirk of operator precedence (\@ref(order-ops)) and something to watch out for with pipes. As usual, using parentheses to clarify fixes it: `(3/c(3,3)) %>% sum`.
```
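The gotcha above can be confirmed directly:

```{r pipe-precedence, eval=FALSE}
3/c(3, 3) %>% sum    # 0.5: parsed as 3/(c(3, 3) %>% sum)
(3/c(3, 3)) %>% sum  # 2:   parentheses give the intended result
```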
Beyond the `%>%` operator, magrittr includes many other convenience tools:
- If `x = "boo"`, then `x %<>% paste("ya!")` overwrites it as "boo ya!".
- If `L = list(a = 1, b = 2)`, then `L %$% { a + b }` acts like `L$a + L$b`.
- `"boo" %>% paste("ya!") %T>% print %>% toupper` takes a "tee" detour to print before continuing with later pipes.
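The three operators above can be seen together in a short sketch:

```{r magrittr-ops, eval=FALSE}
x = "boo"
x %<>% paste("ya!")      # x is overwritten as "boo ya!"
L = list(a = 1, b = 2)
L %$% { a + b }          # exposes L's elements: acts like L$a + L$b
"boo" %>% paste("ya!") %T>% print %>% toupper  # prints "boo ya!", returns "BOO YA!"
```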
It also supports function definitions, like
```{r magrittr-funs, eval=FALSE}
f = . %>% paste("ya!") %T>% print %>% toupper
# usage
f("boo")
"baz" %>% f
```
Finally, it has some helper functions, to make common chains more readable:
```{r magrittr-extras, eval=FALSE}
12 %>% divide_by(3) %>% is_greater_than(5)
```
See `help(package = "magrittr")` for a full listing. I almost never use anything beyond `%>%`.
### The data.table package {#data-table}
[Data.table](http://r-datatable.com) offers a variant of the `data.frame` class for data frames (seen in \@ref(data-frames)), optimized for fast sorted and grouped operations and enhanced with cleaner syntax. Section \@ref(dt-class) introduces this `data.table` class.
The package also bundles in a variety of other functionality:
- ITime and IDate date and time classes
- `fread` and `fwrite` for fast reading and writing of delimited files
- `dcast` and `melt` for reshaping tables
Because these disparate features are bundled together in data.table, we don't have to load more packages and worry about their complicated dependencies or namespace conflicts (\@ref(namespaces)).
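A quick taste of those bundled tools (each is covered in more detail later; `fwrite` requires a reasonably recent version of the package):

```{r dt-extras-taste, eval=FALSE}
fwrite(data.table(a = 1:2, b = c("x", "y")), "tmp.csv")  # fast delimited write
fread("tmp.csv")                                         # fast read, returns a data.table
as.IDate("2017-03-01")                                   # integer-backed date class
```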
#### Installation
Again, CRAN can be used:
```{r inst-dt, eval=FALSE}
install.packages('data.table')
```
#### Development version
The CRAN version is usually all you need. However, the package is under active development, with new features available only in the development version. To install it, I usually run:
```{r inst-dt-dev, eval=FALSE}
remove.packages("data.table")
install.packages("data.table", type = "source",
repos = "http://Rdatatable.github.io/data.table")
```
[The instructions on the package wiki](https://github.com/Rdatatable/data.table/wiki/Installation) offer details and other options.
```{block2 grr-rtools, type='rmd-caution'}
**Windows shenanigans.** If using Windows, you'll need to first install [Rtools](https://cran.r-project.org/bin/windows/Rtools/) (requiring administrator privileges), as explained in the last wiki link. This is also required for many other packages. After installing Rtools, it may also be necessary to tell R where it is. Instructions are in section "6.3.1 Windows" of the "R Installation and Administration" manual (see \@ref(docs)).
The manual covers it, but here's my synopsis. Navigate to R's Makeconf files:
    $(R_HOME)/etc/$(R_ARCH)/Makeconf
For `R_HOME`, paste the result of `R.home() %>% normalizePath %>% cat("\n")` into Windows Explorer. For `R_ARCH`, select one of the folders seen there (usually `i386` and `x64`). Edit the Makeconf file, setting the `BINPREF` or `BINPREF64` line to something like
    BINPREF ?= c:/Rtools/mingw_64/bin/
It will depend on where Rtools was installed. This might have to be done again whenever a new version of Rtools is installed.
```
#### Getting started
The official vignettes are a great way to get started with the package. See them on the [wiki](http://r-datatable.com/Getting-started) or with
```{r, eval=FALSE}
browseVignettes(package="data.table")
```
(Beware that as of March 2017, the website vignettes are somewhat out-of-date.) The website also includes links to other useful materials (an online course, presentation materials, blog posts). Before using the package, I started by reading some slides and the FAQ in full.
```{block2 watch-dev, type='rmd-details'}
**Keeping track of changes in R and packages.**
To follow changes in R:
> Go to https://cran.r-project.org > "What's New" > "NEWS"
To follow changes in the CRAN version of a package:
> Go to, e.g., https://cran.r-project.org/package=data.table > "NEWS"
To follow changes to an in-development version of a package, instead navigate from its CRAN page to the "URL" field. The magrittr package is not in active development, so it has no "NEWS" file and little happening on its website, but data.table has been changing consistently for years.
Usually, these changes don't matter much, but it's generally a good idea to check the "Potentially Breaking Changes" list, if there is any (e.g., [in data.table 1.9.8](https://github.com/Rdatatable/data.table/blob/master/NEWS.md#changes-in-v198--on-cran-25-nov-2016)) before upgrading.
```
## The data.table class {#dt-class}
Data.tables extend the data frame class introduced in \@ref(data-frames).
This section reviews basic operations seen in the last chapter (inspecting, slicing, extracting), before discussing how `data.table` extends `data.frame`.
### Inspecting {#dt-inspect}
Consider the `quakes` data set (with more info in `?quakes`):
```{r dt-view-pre, echo=1}
quakeDT = data.table(quakes)
quakeDT
```
By default, data.tables with over 100 rows will print in the compressed form seen above, showing just the first and last five rows. To change that row-count threshold globally, use `options(datatable.print.nrows = n)`. To override it for a single call, use `print` with `nrows=`; I usually set `nrows=Inf` to print all rows:
```{r dt-print, eval=FALSE}
quakeDT %>% print(nrows = Inf)
```
To browse a table in a new window, `View` for data frames (\@ref(view)) again works:
```{r dt-view, eval=FALSE}
quakeDT %>% View
# or if using RStudio
quakeDT %>% utils::View()
```
To inspect the structure of a data.table, we can again use `str` (\@ref(str)):
```{r dt-str}
quakeDT %>% str
```
Here, we see the class of each column along with the first few values. In addition, there is an obscure `".internal.selfref"` attribute, which we can ignore except when saving or loading the table from disk (\@ref(dt-saveload)). Other attributes will sometimes also show up here, related to optimizing the performance of ordered or grouped queries on the table (\@ref(keys-indices)).
In the code above, we could use syntax like `View(quakeDT)` instead of `quakeDT %>% View`, but I often find the latter handier, since it's easy to insert intermediate steps, like `quakeDT %>% head(10) %>% View`. Also, while `%>%` is slow, it's not going to matter for tasks like browsing data.
### Slicing {#dt-slicing}
A slice of a data.table is a smaller data.table formed by subsetting rows, columns, or both, analogous to vector and matrix slices seen in \@ref(slicing) and \@ref(slicing-matrix). The remaining columns will retain attributes from the full table (class, levels for factors, etc.).
The syntax is familiar from vectors and matrices:
```{r dt-slicing-pre}
# example data
DT = data.table(
x = letters[c(1, 2, 3, 4, 5)],
y = c(1, 2, 3, 4, 5),
z = c(1, 2, 3, 4, 5) > 3
)
```
```{r dt-slicing}
DT[1:2, ]
DT[, "z"]
DT[-3, c("y", "z")]
```
One difference with matrices is that when only slicing rows, we can skip the comma:
```{r dt-slicing-rows}
DT[1:2]
```
Other differences come into play when subsetting columns programmatically, where we need either a `..` prefix...
```{r dt-slicing-cols-vec, error=TRUE}
keep_cols = c("y", "z")
DT[, keep_cols] # error
DT[, ..keep_cols] # use this instead
```
... or we need `with=FALSE` if programming inline:
```{r dt-slicing-cols-program}
DT[, letters[25:26]] # no error, but doesn't print what we want
DT[, letters[25:26], with=FALSE] # use this instead
```
These requirements are a quirk of the more flexible `DT[...]` syntax that supports far more than taking slices, as discussed in \@ref(dt-syntax).
```{block2 dt-colnumbers, type='rmd-caution'}
**Use column names, not numbers.** Subsetting by hard-coded column number, like `DT[, 2:3]` or `cols = 2:3; DT[, ..cols]` works but is discouraged. Column numbers can easily change in the course of writing or updating a script, invalidating column number references in a way that will be annoying to debug. See the first answer in `vignette("datatable-faq")` for a deeper discussion.
```
```{block2 dt-dfbrackafrak, type='rmd-details'}
**How data frame slicing works.** This is really getting in the weeds, since I suggest not using data frames at all, but you'll see two major differences if you do. First, `DF[, "z"]` will extract the `z` column instead of taking a slice thanks to data frames' `drop=TRUE` default (which also came up regarding matrices in the last chapter). Second, `DF[1:2]` will slice the first two *columns* instead of the first two rows, thanks to the fact that no-comma usage of `[` on a data.frame triggers list slicing.
```
`head`, `tail` and the special empty and missing-data slices (all seen in the last chapter) work by row:
```{r dt-slice-headtailspecial}
head(DT, -3) # top, removing last 3
tail(DT, 2) # bottom 2
DT[0L] # empty
DT[NA_integer_] # missing
```
The package also offers `first` and `last`, which are simply special cases of `head` and `tail`.
Fancier slicing methods, like `DT[x > "b", .(y, z)]`, will be introduced with the rest of the `DT[...]` syntax in \@ref(dt-syntax).
There are some syntactical shortcuts for slicing to a set of columns in `j`; see \@ref(program-cols).
### Extracting columns
Since data.tables (and data frames) are lists, we can extract columns like `DT$z` or `DT[["z"]]`.
With the full `DT[...]` syntax (\@ref(dt-syntax)), it is easy to extract a column for a limited set of rows, like `DT[x > "b", z]`.
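Putting these together (using the `DT` example table from above):

```{r dt-extract-ex, eval=FALSE}
DT$z            # extract a column by name
DT[["z"]]       # same, friendlier for programmatic use
DT[x > "b", z]  # extract z only for rows where x > "b"
```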
### Extensions to the data.frame class {#dt-extends}
The first thing to note is that data.tables *are* data frames, so any function that works on a data frame will work fine with data.tables, too (even if the data.table package is not installed or attached).
#### Modification in-place
In contrast with the rest of R, data.tables are primarily modified "in-place" (or "by reference"), which can be much more efficient. As a first example, we can switch the class of a data frame to data.table:
```{r setDT-pre}
# example data
DF = data.frame(
x = letters[c(1, 2, 3, 4, 5)],
y = c(1, 2, 3, 4, 5),
z = c(1, 2, 3, 4, 5) > 3
)
```
```{r setDT}
str(DF)
setDT(DF)
str(DF)
```
We did not use `=` or `<-` to assign the result of `setDT`; the function simply altered `DF` in-place. All of data.table's functions named like `set*` do this.
```{block2 dt-byref, type='rmd-details'}
**R's copy-on-modify.** Understanding the contrast between modification in-place and base R's modification rules (often called "copy-on-modify") probably requires some familiarity with C (see threads on [stackoverflow](http://stackoverflow.com/questions/15759117/what-exactly-is-copy-on-modify-semantics-in-r-and-where-is-the-canonical-source) or [the mailing list](http://r.789695.n4.nabble.com/Confused-about-NAMED-td4103326.html) if interested). I wouldn't worry about it except to note that many base R operations make copies, which is costly in terms of RAM and computing time; while those operations' data.table counterparts do not have this problem.
```
#### Factor vs character columns {#dt-factorvchar}
Another difference is that `data.frame(...)` reads string input as a `factor` categorical variable (\@ref(factors)), as seen in the `str` output above; while `data.table(...)` reads it as character:
```{r DT}
DT = data.table(
x = letters[c(1, 2, 3, 4, 5)],
y = c(1, 2, 3, 4, 5),
z = c(1, 2, 3, 4, 5) > 3
)
str(DT)
```
Data frames' preference for factors may still bite when reading files in with `read.table` from base R instead of `fread` from data.table (\@ref(fread)).
#### Extended syntax for `DT[...]`\ {#dt-syntax}
The third major difference is the extension of the `DT[...]` syntax to support more than simple slices (like `DF[i,j]`, with `i` and `j` as indices, covered in \@ref(dt-slicing)). It offers SQL-style syntax that is cleaner, particularly for by-group operations:
    # (pseudocode)
    DT[where, select|update|do, by] # SQL verbs
    DT[i, j, by]                    # R function arguments
This should be read as a command to take a sequence of steps:
1. Subset using `i`
1. Group using `by`
1. Do `j`
We can use column names as barewords, and even form expressions in terms of columns:
```{r DT-ex}
DT[x > "b", sum(y), by=z]
```
Typically, the `i` and `j` arguments are called by position, as seen here; while `by=` is called by name.
Whenever `j` evaluates to a list, the output will be a new data.table:
```{r DT-exdot}
DT[x > "b", .(n_bigy = sum(y > 2), n_smally = sum(y <= 2)), by=z]
```
```{block2 dt-dot, type='rmd-caution'}
**The `.()` convenience function.** Inside many arguments of `DT[...]`, we can use the shorthand `.()`, which stands for `list()`.
```
The use of `by=` is very similar to a `for` loop (\@ref(for-loops)) over subsets, but it is better in a few important ways:
- We don't have to manually construct and keep track of some "split-up data" list to iterate over.
- Fast by-group functions are used when available (see `?GForce`)
- We don't have to define the task in `j` as a function of prespecified variables -- we can just use any columns of the data.table.
- We don't have to worry about intermediate variables (like `case` in the next example) contaminating the global environment.
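As a sketch of the first point, a single `by=` query replaces manual splitting and looping (the loop here uses `split`'s `by=` argument for data.tables, available in recent versions of the package):

```{r by-vs-loop, eval=FALSE}
# the by= query...
DT[x > "b", sum(y), by = z]
# ...roughly replaces this manual split-and-loop
res = list()
for (s in split(DT[x > "b"], by = "z")) res[[length(res) + 1L]] = sum(s$y)
```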
There are some syntactical shortcuts for writing a list of columns in `by=`; see \@ref(program-cols).
#### An example {#dt-syntax-ex}
The task in `j` can really be anything. Just as a demonstration of its power, here's how it can be used for saving per-group plots to a PDF:
```{r dt-bydemo, cache=TRUE}
# note: this code will save to your current working directory
# type getwd() and read ?setwd for details
bwDT = data.table(MASS::birthwt)
pdf(file="birthweight_graphs.pdf")
bwDT[, {
case = sprintf("high #visits? = % 5s, smoking? = % 5s", .BY$high_vis, .BY$smoke)
cat("Handling case:", case, "...\n")
plot(
age ~ lwt,
main = case
)
}, by=.(high_vis = ftv >= 1, smoke = as.logical(smoke))]
dev.off()
```
```{r dt-bydemo-cleanup, echo=FALSE, results=FALSE}
for (fn in intersect("birthweight_graphs.pdf", list.files()))
file.remove(fn)
```
The example above should make intuitive sense after reading the docs for each object, `?MASS::birthwt`, `?pdf`, et al. For each group, we're formatting a string with `sprintf`; printing it with `cat`; and saving a plot. The special symbol `.BY` is a list containing the per-group values of the `by=` variables. Formatting and printing strings will be covered more in \@ref(strings). Plotting will be covered more under "Exploring data" (\@ref(exploring-data)) and exporting graphs in \@ref(pub-graphs).
```{block2 dt-logicolsub, type='rmd-caution'}
**Filtering on logical columns.** `DT[z]` and `DT[!z]` will give errors, for design reasons related to the syntax for joins (\@ref(dt-joins)). To get around this, always wrap the column in parentheses: `DT[(z)]` and `DT[!(z)]`.
```
```{block2 dt-byfloats, type='rmd-caution'}
**Grouping on floats.** Much of the computational and syntactical magic of the package comes from grouping rows together with the `by=` argument. To group on a floating-point variable, however, is just asking for trouble, [for the usual numerical computing reasons](http://floating-point-gui.de/). Instead, always use characters, integers or factors. To discretize a continuous variable into bins, use `cut`.
```
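For example (bin edges chosen arbitrarily here), the float `depth` column of `quakeDT` can be binned with `cut` before grouping:

```{r by-cut-bins, eval=FALSE}
# count quakes by binned depth rather than grouping on the raw float
quakeDT[, .N, by = .(depth_bin = cut(depth, c(0, 70, 300, 700)))]
```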
```{block2 dt-verbose, type='rmd-details'}
**Verbose data.table messages.** To learn how data.table queries work, I recommend toggling the setting `options(datatable.verbose = TRUE)`. This option is similar to verbose output from an optimization call (reporting the value of the objective at each iteration, etc.). To only see verbose output for a single call, add `verbose = TRUE`. For example,`DT[x > "b", sum(y), by=z, verbose=TRUE]`.
```
## Aggregation {#dt-agg}
The mtcars data set (see `?mtcars`) has two categorical variables:
- `am` for automatic (0) or manual (1) transmission; and
- `vs` for v (0) or straight (1) engine shape.
```{r agg-pre}
# example data
carsDT = data.table(mtcars, keep.rownames = TRUE)
# quick inspection
first(carsDT)
```
Suppose we want to compare the mean horsepower, `hp`, across these categories:
```{r agg}
carsDT[, .(mean_hp = mean(hp)), by=.(am, vs)]
```
So, we just write `j` of `DT[i,j,by]` as an expression to compute the summary statistic, optionally giving it a name by wrapping in `.(name = expression)`.
Sections \@ref(explore-onevar) and \@ref(explore-vars) cover more options for exploring data with summary statistics; and \@ref(dcast-browse) shows how to put this result in wide format (with, e.g., `am` on rows and `vs` on columns).
### Iterating over columns {#dt-lapply}
Now suppose we want to compare mean horsepower, weight and displacement:
```{r agg-sd, eval=-(1:2)}
carsDT[, .(mean_hp = mean(hp), mean_wt = mean(wt), mean_disp = mean(disp)), by=.(am, vs)]
# can be simplified to...
carsDT[, lapply(.SD, mean), by=.(am, vs), .SDcols = c("hp", "wt", "disp")]
```
So, we just write the relevant columns in `.SDcols` and refer to `.SD`, the ***S**ubset of **D**ata*. Within each `by=` group, the query has access to `.SD` with rows for that group and columns as specified in `.SDcols`. `.SD`, like `.BY` seen earlier, is a special symbol available in some arguments of `DT[...]`, documented at `?.SD`.
We can use `lapply` (a function designed for iterating over lists) here since `.SD` is a data.table, which is a list of column vectors (see \@ref(lapply-df)). The column names carry over to the result because `lapply` always carries over names.
While dot notation `.SDcols=.(hp, wt, disp)` is not yet supported, there are a variety of convenience features for specifying `.SDcols`, covered in \@ref(program-cols).
```{block2 dt-avoidbyrow, type='rmd-caution'}
**"Aggregating" across columns.** One major red flag to look out for is the desire to "aggregate" columns by row, setting `by=1:nrow(DT)` and possibly using `unlist(.SD)` somewhere. Not only will this be incredibly slow, but it also suggests that the data is poorly organized, costing a lot of extra mental energy at every step. Section \@ref(structuring-data) explains some ways to format data better to avoid the need for this problematic approach.
```
### Exercises
1. Within each `.(am, vs)` group, return the last row (or equivalently, the last element of each column). There are many ways to do this.
## Modifying data {#dt-subassign}
Creating, editing and removing columns are all done using `:=` in `j` of `DT[i, j, by]`. This functionality operates in-place, in the sense that the underlying data stays in the same place, which is more efficient in terms of how much RAM and time is taken. See `vignette("datatable-reference-semantics")` for details.
```{block2, type='rmd-details'}
**What can be modified in-place?** The scope of in-place operations is currently limited to altering columns. Adding and removing rows in-place is not yet supported; it is harder to do in R, due to its column-oriented storage of tables (contrasting with database systems that store data rowwise for easy INSERT and DELETE queries).
```
### Creating columns {#dt-col-create}
```{r dt-add-pre, echo=1:2}
# example data
DT = data.table(
x = letters[c(1, 2, 3, 4, 5)],
y = c(1, 2, 3, 4, 5),
z = c(1, 2, 3, 4, 5) > 3
)
DT
```
```{r dt-add}
# creating a column
DT[, u := 5:1][]
# creating multiple
DT[, `:=`(v = 2, w = 3L)][]
# creating with dynamic names
nms = c("a", "b", "c")
DT[, (nms) := .(1, 2, 3)][]
```
All of these tasks are performed in-place -- altering `DT` without making a new object. Usually, the results are not printed in the console; but they appear here because `[]` is "chained" onto the end of each task.
```{block2 modify-op, type='rmd-caution'}
**`:=` is the function for *creation, modification and deletion* of columns.** This will be covered in more detail in subsequent sections, but it is worth emphasising how parsimonious this syntax is. In particular, it contrasts with Stata (which uses distinct verbs `gen`, `replace` and `drop`).
```
```{block2 modify-iter, type='rmd-details'}
**Iterative column creation.** The `` `:=`(...)`` syntax does *not* support iterative definitions like ``DT[, `:=`(W1 = u + y, W2 = W1^2)]``. One common workaround is ``DT[, `:=`(W1 = W1 <- u + y, W2 = W1^2)]``. This solution may not be intuitive for new R users, but the gist is: `W1 <- u + y` creates `W1` as an object in `DT[...]` and then returns its value, let's call it `v`. Now, `W2 = W1^2` can find `W1`, since it was created by `<-`; and `W1 = W1 <- u + y` simplifies to `W1 <- v`, where `v` is the return value of `W1 <- u + y`.
```
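To make the workaround concrete (using the `u` and `y` columns created above; the chained form is often clearer):

```{r iter-cols-ex, eval=FALSE}
DT[, `:=`(W1 = W1 <- u + y, W2 = W1^2)]
# or, more transparently, chain two := calls:
DT[, W1 := u + y][, W2 := W1^2]
```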
### Removing columns
```{r dt-remove}
nms = c("u", "v", "w", "a", "b", "c")
DT[, (nms) := NULL][]
```
A warning will print if we try to remove columns that don't currently exist.
### Replacing entire columns {#dt-replace-all}
Data.table is careful about column types:
```{r dt-reptot, warning=TRUE}
DT[, a := 10L ][] # Create a new column of integers
DT[, a := 21L ][] # Replace it with another column of integers
DT[, a := 32.5 ][] # Replace it with a float
```
The warning in the last call is related to coercion of vector classes (\@ref(classes)).
```{block2 modify-typesafe, type='rmd-details'}
**Safeguards against accidental coercion.** The verbose warning printed above is typical for the package and an excellent feature. Running the final command, data.table knows that the `a` column is of integer type and sees that 32.5 is conspicuously *not an integer*. So it gives a warning when coercing 32.5 to an integer (to match `a`).
Elsewhere in R, `a` would be coerced to match the float `32.5` -- probably not the behavior we want -- with no warning. That is, if we have `x <- c(21L, 22L)`, we can freely assign `x[1] <- 32.5`; and the same freedom (by which I mean "danger") is present even if `x` is a data frame column. For more on the issue, search online for "type safety."
```
If we want this assignment to work, we need to change `a`'s type by passing a full vector, as described in the warning:
```{r dt-reptot-class, warning=TRUE}
DT[, a := rep(32.5, .N) ][]
```
The coercion is done silently, but it can be made more visible by turning on `verbose`, which notes the "plonk" of a full vector replacement. `.N` is a special symbol for the number of rows (or the number of rows in a subset when `i` or `by=` is present).
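A few quick illustrations of `.N`:

```{r dot-N-ex, eval=FALSE}
DT[, .N]          # total row count
DT[, .N, by = z]  # row count within each z group
DT[y > 2, .N]     # rows matching a condition
```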
### Replacing columns conditionally {#dt-ifelse}
Conditional replacement here is analogous to a SQL UPDATE query or a replace if command in Stata. To illustrate, we will look again at a vectorized `if`/`else` assignment, mentioned in \@ref(ifelse):
```{r dt-ifelse-1, results="hide"}
DT[ , b := "Aardvark"] # initialize to baseline value
```
```{r dt-ifelse-2}
DT[y > 1, b := "Zebra"][] # replace based on a condition
```
This can also be done with chaining:
```{r dt-ifelse-chain, results="hide"}
DT[, b := "Aardvark"][y > 1, b := "Zebra"]
```
This chaining works because `DT[...][...]` is evaluated like `(DT[...])[...]` and the return value of the piece in parentheses is `DT` -- provided `j` has a `:=` statement.
```{block2 modify-chainwarn, type='rmd-caution'}
**Broken chains.** If `j` is not a `:=` statement, the return value is not the original data.table but rather a new one. Subsequent steps in the chain will not affect the starting table. So, after `DT[y < Inf, d := 1]` and `DT[y < Inf][, d := 2]`, what does the `d` column look like in `DT`? See the Exercise section of `vignette("datatable-reference-semantics")`; and the Note section of `` ?`:=` ``.
```
As we saw in the last section, partial replacement of a column will trigger a warning if the classes don't match:
```{r dt-ifelse-conflict, warning=TRUE, results="hide"}
DT[, b := "Aardvark"][y > 1, b := 111]
```
Another nice feature, similar to Stata, is reporting of the number of rows modified. This can be seen by turning `verbose` on:
```{r dt-ifelse-count}
DT[, b := "Aardvark"][y > 1, b := "Zebra", verbose = TRUE]
```
The number reported is the number of rows assigned to (irrespective of whether their values were changed by the assignment), in contrast with Stata, which reports the number of changes.
```{block2 dt-subassign, type='rmd-details'}
**Modifying subsets.** Conditional modifications `DT[cond, v := expr]` are a special case of more general subset modifications, `DT[i, v := expr]`, where `i` can be any row-slicer covered in \@ref(dt-slicing) (e.g., a vector of row numbers) or even another data.table (in an "update join"; see \@ref(joins-update)).
```
### Other in-place modifications
The data.table package has a few other tools for modifying table attributes in-place:
- The `set` function is another way of making assignments like `:=`.
- `setDT` and `setDF`, seen earlier, alter the class.
- `setorder` will sort the table by some or all of its columns.
- `setcolorder` changes the order in which columns are displayed.
- Indices and the key (explained in \@ref(keys-indices)) can be set with `setindex` and `setkey`, respectively.
- `setnames` will alter column names. For example, in \@ref(dt-lapply) we saw...
```{r modify-namesexpost-pre}
carsDT[, lapply(.SD, mean), by=.(am, vs), .SDcols = c("hp", "wt", "disp")]
```
... and if we want to add the prefix `"mean_"` to the results, we can do
```{r modify-namesexpost}
cols = c("hp", "wt", "disp")
carsDT[, lapply(.SD, mean), by=.(am, vs), .SDcols = cols] %>%
setnames(cols, sprintf("mean_%s", cols)) %>% print
```
The trailing `%>% print` command is used because `setnames`, like `:=` and the other `set*` operators, does not print the table on its own. The string-formatter `sprintf` will be explained in \@ref(strings).
- `setattr` is a general function for altering attributes of a vector or other object (see `?attributes`). For example, if we have a factor column (encoding categorical data), we can change its "levels":
```{r dt-setlevels-pre}
# example data
fDT = data.table(fac = factor(c("A", "B"), levels = c("A", "B", "C")))
```
```{r dt-setlevels}
levels(fDT$fac)
fDT$fac %>% setattr("levels", c("A", "B", "Q"))
levels(fDT$fac)
```
With `setattr`, we avoid the copy usually incurred by `levels(x) <- lvls` (base R syntax), but it also skips some validity checks, so take care to assign a sensible vector of levels.
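As a minimal sketch of the `set` function mentioned above (using a hypothetical toy table `exDT`), `set(DT, i, j, value)` makes the same kind of in-place assignment as `:=`, with less overhead when called repeatedly in a loop:

```{r set-sketch}
exDT = data.table(x = 1:3)
set(exDT, j = "y", value = 0)  # same effect as exDT[, y := 0]
exDT
```

Like the other `set*` operators, `set` returns its input invisibly, so the final `exDT` line is there just to print the result.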
### Avoiding in-place modification
To create a new data.table starting from an existing table, use
```{r dt-copy, eval=FALSE}
DT2 = copy(DT1)
```
We have to use this instead of `DT2 = DT1` since with the latter we have only created a new "pointer" to the first table. I routinely make copies like this after reading in data (\@ref(fread)) so that I can back out where I tripped over something in the process of data cleaning.
Besides `DT2 = DT1`, `names1 = names(DT1)` is also unsafe: `names` does not return a snapshot of the column names at a given time, but rather a pointer to the names attribute, which changes as columns are modified or rearranged.
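A minimal sketch of this pitfall (with a hypothetical table `DT1`):

```{r names-pitfall, eval=FALSE}
DT1 = data.table(old = 1:3)
nms = names(DT1)        # points at DT1's names attribute
setnames(DT1, "old", "new")
nms                     # now shows "new", not "old"
nms = copy(names(DT1))  # a safe, independent snapshot
```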
### Using in-place modification in functions
A user-written function (\@ref(function-writing)) like
```{r modify-fun}
f <- function(DT) DT[, newcol := 1111 ][]
```
will act like `:=` and the `set*` functions, altering its input in-place. This can be useful, but requires extra caution. See the "`:=` for its side effect" section of `vignette("datatable-reference-semantics")` for discussion.
The `[]` inside the function is necessary for printing, due to restrictions of R syntax and how data.table gets around them.
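For instance (a minimal sketch, assuming `f` as defined above):

```{r modify-fun-demo, eval=FALSE}
DT = data.table(x = 1:2)
f(DT)  # prints DT with newcol added
DT     # the caller's DT was modified in-place too
```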
### Exercises
1. With `carsDT = data.table(mtcars, keep.rownames = TRUE)`, overwrite the `am` column with a factor so 0 becomes "automatic" and 1 becomes "manual", so `head(carsDT)` looks like...
# rn mpg cyl disp hp drat wt qsec vs am gear carb
# 1: Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 manual 4 4
# 2: Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 manual 4 4
# 3: Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 manual 4 1
# 4: Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 automatic 3 1
# 5: Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 automatic 3 2
# 6: Valiant 18.1 6 225 105 2.76 3.460 20.22 1 automatic 3 1
2. In `DT = data.table(x = letters[1:5], y = 1:5, z = 1:5 > 3)`, edit `x` to equal "yolo" on rows where `z` is true, so `DT` looks like...
# x y z
# 1: a 1 FALSE
# 2: b 2 FALSE
# 3: c 3 FALSE
# 4: yolo 4 TRUE
# 5: yolo 5 TRUE
3. In `DT = data.table(x = letters[1:5], matrix(1:15,5) > 7)`, add a column `id` that contains the row number; and use `setcolorder`, `names(DT)` and `setdiff` to move `id` to be the first column.
4. In `DT = data.table(id = 1:5, x = letters[1:5], matrix(1:15,5) > 7)`, use `setnames` to rename `id` to `ID`.
## Joins {#dt-joins}
Almost any operation between related tables can be called a "join" or "merge." One notable exception is "stacking" tables atop each other, covered in \@ref(rbindlist-read).
### Equi joins
```{r dt-join-prea, echo=c(1:2,4L)}
# example data
a = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), t = c(1L, 2L, 1L, 2L, NA_integer_), x = 11:15)
a
b = data.table(id = 1:2, y = c(11L, 15L))
b
```
The idiom for a simple equi join is `x[i, on=.(...)]` or `x[i]` for short:
```{r dt-join}
a[b, on=.(id)]
```
It is called an equi join since we are only getting matches where equality holds between the `on=` columns in the two tables.
Think of `x[i]` as using index table `i` to look up rows of `x`, in the same way an "index matrix" can look up elements of a matrix (\@ref(matrix-extract)). By default, we see results for every row of `i`, even those that are unmatched.
Here are some more complicated examples:
```{r dt-join-more}
a[b, on=.(x = y)]
a[b, on=.(id, x = y)]
```
When merging on columns with different names, write them in `on=` as `x = y`, where `x` comes from the "left" table and `y` from the "right" one. Because we are using `i` to look up rows in `x`, the displayed column takes its name from `x` and its values from `i`.
A character vector also works, e.g., `on=c("id", x = "y")`, which makes programmatic merging easier.
```{block2 dt-efficientlookup, type='rmd-details'}
**Binary search.** Data.table joins find matching rows using a binary search algorithm, which is usually faster than a naive "vector scan" (which tests every element for a match). See the "binary search vs vector scans" section of `vignette("datatable-keys-fast-subset")` for details.
```
### Subset lookup
For browsing dynamic data (that grows over time), it is convenient and quick to use joins:
```{r dt-lookup}
a[.(1L), on=.(id)]
# or, as a function
ra <- function(my_id) a[.(my_id), on=.(id)]
ra(2L)
```
Note that the index `i` is a list here, not a data.table. The link between the two is close, since data.tables are just lists of columns (\@ref(dt-lapply)). Any list passed in `i` will be treated the same as the analogous data.table with appropriate column names.
Subset browsing becomes even easier when keys are set so that `on=` can be skipped (\@ref(keys-indices)).
### Aggregating in a join
Looking again at the first join above, suppose we want to use `b` to find rows in `a` and then add up `a$x`. We'll do this by using `by=.EACHI`:
```{r join-byeachi}
a[b, on=.(id)]
a[b, on=.(id), sum(x), by=.EACHI]
```
It is called "each `i`" since the syntax is `x[i, ...]` and we are computing per row of the index `i`. I prefer to write the `on=` before the `j` (unlike the core syntax of `DT[i,j,by]`) since the `on=` merging columns are closely related to the `i` index.
If we tried summing `b$y` here, we would not get `11+11` for `id` 1:
```{r join-byeachi-confusing}
a[b, on=.(id)] # y shows up twice
a[b, on=.(id), .(sum(x), sum(y)), by=.EACHI] # we only get one y
```
This is because we are working by each row of `b`. For `id` 1 in `b`, `y` is a single value, `11`. So really there is no point in summing `y` or otherwise aggregating columns from `i` when using `by=.EACHI`.
```{block2 dt-join-beware, type='rmd-caution'}
**Beware `DT[i,on=,j,by=bycols]`.** Just to repeat: only `by=.EACHI` works in a join. Typing other `by=` values there will cause `i`'s columns to become unavailable. This [may eventually change](https://github.com/Rdatatable/data.table/issues/733).
```
### Updating in a join {#joins-update}
Continuing from the last example, we are computing `sum(x)` per row of `b`, so maybe we want to save the result as a column in `b`:
```{r join-byeachi-assign, echo=-4}
b[, sumx :=
a[b, on=.(id), sum(x), by=.EACHI]$V1
]
# or
b[, sumx :=
a[.SD, on=.(id), sum(x), by=.EACHI]$V1
]
b
```
```{block2 v1v2v3, type='rmd-caution'}
**Default names for `j` computations.** When the task in `j` evaluates to an unnamed list, default names `V1, V2, ...` are assigned. We can use these to extract a single computed column with `$V1`, `$V2`, etc. Since we can only extract one column at a time, we usually only create and extract a single column, auto-named `V1`.
```
On the other hand, we may want to take values from `b` and assign them to the larger table `a`:
```{r join-assign, echo=-2}
a[b, on=.(id), y := i.y ]
a
```
This sort of operation is very common when doing analysis with well-organized relational data, and will come up again in \@ref(structuring-data). It is essentially the same as a SQL UPDATE JOIN. It might also be called a "merge assign" (though I seem to be the only one using that term).
The `i.*` prefix in `i.y` indicates that we are taking the column from the `i` table in `x[i]`. We can similarly use an `x.*` prefix for columns from `x`. This helps to disambiguate if the same column names appear in both tables, and is particularly helpful with non-equi joins (\@ref(joins-nonequi)). I recommend always using `i.*` prefixes when copying columns in an update join.
```{block2 dt-updatejoin-beware, type='rmd-caution'}
**Beware multiple matches in an update join.** When there are multiple matches (\@ref(join-multimatch)), an update join will apparently only use the last one. [Unfortunately](https://github.com/Rdatatable/data.table/issues/2022), this is done silently. Try `b[a, on=.(id), x := i.x, verbose = TRUE ][]`. With `verbose` on, we see a helpful message about assignment "to 3 row subset of 2 rows."
```
### Self join to fill in missing levels
Consider the `a` data set from above:
```{r dt-join-CJ-pre, echo=-3}
# example data
a = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), t = c(1L, 2L, 1L, 2L, NA_integer_), x = 11:15)
a
```
Now we want to "complete" the data set so that every `id` has a row for every time `t`:
```{r dt-join-CJ}
a[CJ(id = id, t = t, unique=TRUE), on=.(id, t)]
```
This is called a self join because we are using the table's own columns in `i` of `x[i, on=]`. The `CJ` function is a helper that takes all combinations of vectors, alternatively called the Cartesian product or a "cross join." See \@ref(combos) for details. By applying `unique=TRUE`, we treat each vector *as a set* in the mathematical sense, only considering distinct values.
Notice that missing values (`NA`) in `i` are treated in the same way as other values.
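To see what `CJ` produces on its own (a minimal sketch):

```{r cj-demo}
CJ(id = 1:2, t = c(1L, 2L))
```

`CJ` returns every combination of its arguments, sorted and keyed by those columns.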
### Handling matches {#join-matches}
For this section, we'll reset the tables:
```{r dt-mult-pre, echo=-c(3,5)}
# example data
a = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), t = c(1L, 2L, 1L, 2L, NA_integer_), x = 11:15)
a
b = data.table(id = 1:2, y = c(11L, 15L))
b
```
#### Handling multiply-matched rows {#join-multimatch}
In `a[b, on=.(id)]`, we are indexing by rows of `b` and so get matches for every row of `b`. By default, we get *all* matches in `a`, but this can be tweaked:
```{r dt-join-mult}
a[b, on=.(id), mult="first"]
```
Now each row of `b` only returns the first matching row (from the top) in `a`. Similarly, we could select `mult="last"`.
#### Handling unmatched rows {#join-unmatched}
Flipping it around, if we use `a` to index `b`, we have some index rows from `a` that don't have any matches in `b`:
```{r dt-join-nomatch}
b[a, on=.(id)]
```
These unmatched rows still show up in the result, which is usually nice. However, this behavior can also be tweaked:
```{r dt-join-nomatch0}
b[a, on=.(id), nomatch=0]
```
Dropping unmatched elements of `i` is similar to filtering `_m == 3` in Stata.
```{block2 dt-merge-res, type='rmd-caution'}
**Diagnostics for merges.** In Stata, joins report on how well they went -- did everything match? how many didn't match? The analogous question for an `x[i]` join is -- for each row of `i`, how many matches did we find in `x`? To see the answer, use `b[a, on=.(id), .N, by=.EACHI]`.
```
```{block2 dt-merge-stata, type='rmd-details'}
**Comparison with Stata.** Because `x[i]` uses `i` to look up rows in `x`, we never see rows that correspond to Stata's `_m == 1` -- those that belong to `x` but are not matched by `i`.
There is another way of merging, `merge(b, a)`, that allows for `all.x` and `all.y`, resembling Stata's options, but I have never found any reason to use it.
```
#### Handling imperfect matches with rolling joins {#im-rolling}
Sometimes we want unmatched rows paired with the closest match occurring earlier or later in the table:
```{r im-rolling-im-rolling}
# target x
myxDT = list(myx = c(5L, 10L, 15L, 20L))
# exact match (equi-join)
a[myxDT, on=.(x = myx), .(i.myx, x.x)]
# nearest match
a[myxDT, on=.(x = myx), roll="nearest", .(i.myx, x.x)]
# upward match within 3
a[myxDT, on=.(x = myx), roll=-3, .(i.myx, x.x)]
```
Recall from \@ref(joins-update) that the `i.*` and `x.*` prefixes refer to where columns come from in `x[i]`.
When joining on multiple columns, the roll is taken on the last column listed in `on=`.
The value of `roll=` refers to how much higher or lower `x` may be and still qualify as a match: the last `on=` column is "rolled" by up to `roll` if necessary to find one. So `roll = -3` means we accept an `x` up to 3 above the target, i.e., as far away as `x - 3 = myx`.
### Non-equi joins {#joins-nonequi}
It is sometimes useful to match on a range of values. To do this, we explicitly name all columns in `i` and define inequalities in `on=`. Suppose that for each `id` we have a range of acceptable `x` values and want all rows falling within it:
```{r join-interval, echo=-2}
mDT = data.table(id = 1:3, x_dn = 10L, x_up = 13L)
mDT
a[mDT, on=.(id, x >= x_dn, x <= x_up), .(id, i.x_dn, i.x_up, x.x)]
```
So we are defining a range `x_dn` to `x_up` for each `id` and finding all matches of `x` within the range.
These could alternatively be called "interval joins." For more on interval joins and subsets, see `?inrange`, `?foverlaps` and the [IRanges package](http://www.bioconductor.org/packages/IRanges/) that inspired these data.table tools.
### Shortcuts and tricks
When joining on a single character or factor column, the `.()` in `i` can be skipped:
```{r join-char, eval=1:2}
DT = data.table(id = c("A", "A", "B"), u = c(0, 1, 2), v = c(1, 4, 7))
DT["A", on=.(id)]
# instead of
DT[.("A"), on=.(id)]
```
#### Setting keys and indices {#keys-indices}
If a table is always joined on the same column(s), these can be set as its "key." Setting the key of `x` sorts the table and allows for skipping `on=` during `x[i,on=]` joins. It can also have some performance benefits. See `?setkey`.
```{r join-setkey}
setkey(DT, id)
DT["A"]
```
This can be dangerous, however, since even if `i` has names, they are ignored:
```{r join-setkey-warn}
setkey(DT, u, v)
DT[.(v = 7, u = 2)]
```
No match is found here since `i=.(v = 7, u = 2)` is mapped to the key `.(u,v)` by position and not by name.
It can also be dangerous in another way. If the key consists of all columns, we might not notice when joining and finding no match:
```{r join-nomatchandyetmatch}
setkey(DT, id, u, v)
DT[.("B", 1, 1)]
```
It looks like we found one match, but in fact we found zero. To avoid this, a custom manual browsing function can be written:
```{r browsetab}
r0 <- function(..., d = DT) {
x = d[.(...), nomatch=0]
print(x = x, nrows=Inf, row.names=FALSE)
invisible(x)
}
r0("B", 1, 1)
r0("B", 2, 7)
```
Most of the performance benefits of a key can also be achieved with an "index." Unlike a key, which sorts the table, an index simply notes the order of the table with respect to some columns. A table can have many indices, but only one key:
```{r join-indices, eval=1:2}
setindex(DT, id, u)
setindex(DT, id, v)
key(DT)
indices(DT, vectors=TRUE)
str(DT)
```
The performance benefits show up in most joins and some subsetting operations, too. Turn on `verbose` data.table messages to see when it is and is not kicking in. Also see `vignette("datatable-secondary-indices-and-auto-indexing")`; and regarding whether setting a key is important (as some old tutorials might say), see the package developer's [post](https://github.com/Rdatatable/data.table/issues/1232#issuecomment-131190268).
Keys and indices are destroyed whenever any of their columns are edited in a way that won't obviously preserve order.
```{block2 dt-wrongkeys, type='rmd-details'}
**Attribute verification.** Keys and indices are simply stored as attributes and may be invalid. For example, with `DT = data.table(id = 2:1, v = 2:1); setattr(DT, "sorted", "id")`, we've told the table that it is sorted by `id` even though it isn't. Some functions may check the validity of the key, but others won't, like `DT[.(1L)]`.
```
#### Anti joins
We also have the option of selecting *unmatched* rows with `!`:
```{r dt-join-notjoin}
a[!b, on=.(id)]
```
This "not join" or "anti join" returns all rows of `a` that are not matched by rows in `b`.
### Exercises
1. In `DT = data.table(id = rep(1:2, 2:3), x = 1:5, y = FALSE)`, use an update self join with `mult=` to edit `y` to `TRUE` in the first row of each `id` group.
2. In `DT = data.table(id = 1:4, v = 10:13)`, use an anti join to set `v` to missing if `id` is not in `my_ids = 2:3`.
3. In `DT = data.table(id = rep(1:2, c(10,5)), t = cumsum(1:15))`, use a non-equi self join to add a column counting how many observations of the current `id` are in the interval $(t-20, t)$. The result should be...
# id t N
# 1: 1 1 0
# 2: 1 3 1
# 3: 1 6 2
# 4: 1 10 3
# 5: 1 15 4
# 6: 1 21 4
# 7: 1 28 3
# 8: 1 36 2
# 9: 1 45 2
# 10: 1 55 2
# 11: 2 66 0
# 12: 2 78 1
# 13: 2 91 1
# 14: 2 105 1
# 15: 2 120 1
4. The "Grouping on floats" note (\@ref(dt-extends)) says to use `cut` to bin continuous variables, but this task can also be accomplished with a rolling join. For `irisDT = data.table(iris)`, create `mDT` so that
```{r, eval = FALSE}
irisDT[, v := mDT[irisDT, on=.(Species, Petal.Width), roll = TRUE, x.lab ]]
```
creates a column the same as...
```{r, eval = FALSE}
irisDT[, vcut := cut(
Petal.Width,
breaks = c(-Inf, median(Petal.Width), Inf),
right = FALSE,
lab = c("low", "high")
), by=Species]
```
```{r, echo=FALSE, eval=FALSE}
# answer
mDT = irisDT[, .(Petal.Width = c(-Inf, median(Petal.Width)), lab = factor(c("low", "high"), levels=c("low", "high"))), by=.(Species)]
```
## Input and output {#input-output}
This section covers reading tables from disk; basic cleaning of column formats; and writing to disk.
### File paths
Don't provide paths with backslashes, like `"C:\data\input.csv"`, since `\` is a special character (the escape character) in R strings. My workaround is to use forward slashes or construct the path using `file.path`.
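For example (a minimal sketch with a hypothetical path):

```{r file-path-demo, eval=FALSE}
# forward slashes work on Windows as well
dat = fread("C:/data/input.csv")
# or build the path programmatically
file.path("C:", "data", "input.csv")  # "C:/data/input.csv"
```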
- For a list of files and folders, use `dir` or `list.files`.