```{todo-tables-general, echo=FALSE, eval=FALSE}
prolly use fewer distinct example data sets, especially in joins. this can be done after everything's in and organized throughout the full tutorial
then again, maybe it's not a real problem
```
# Tables {#tables}
This chapter covers navigating tabular data with R. The core syntax is (\@ref(dt-syntax)):
    DT[where, select|update|do, by] # SQL verbs
    DT[i, j, by]                    # R function arguments
It reads as:
1. Subset using `i`, then
1. Group using `by`, then
1. Do `j`
## Essential packages
This section covers packages that help me work more efficiently, to the point where I regard them as essential. They'll be used throughout the rest of this document. They don't have any dependencies on other packages, and I expect they'll be available as long as R is. It will be assumed in subsequent sections that these libraries have been attached:
```{r loadem, warning=FALSE}
library(magrittr)
library(data.table)
```
### The magrittr package {#magrittr}
[Magrittr](https://cran.r-project.org/package=magrittr) introduces syntax with "pipes," improving readability by unnesting function calls:
```{r magrittr-calls, eval=FALSE}
# we can do
x %>% f %>% g %>% h
# instead of
h(g(f(x)))
```
In addition, it allows more compact function definitions (\@ref(function-writing)):
```{r magrittr-chain}
fun = . %>% .^2 %>% sum %>% sqrt
fun(1:3)
```
Despite these advantages, magrittr adds overhead to every call and can slow down code, so avoid it anywhere speed might be an issue.
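As a rough illustration (not a rigorous benchmark), the per-call overhead only shows up across many repeated calls:

```{r pipe-overhead, eval=FALSE}
x = 1:10
system.time(for (i in 1:1e5) sum(x))      # plain call
system.time(for (i in 1:1e5) x %>% sum)   # piped call, noticeably slower in a tight loop
```

For one-off interactive work, the difference is imperceptible.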
To install, just use CRAN:
```{r inst-mgr, eval=FALSE}
install.packages('magrittr')
```
The name is a pun on [René Magritte's pipe painting](https://en.wikipedia.org/wiki/The_Treachery_of_Images). To get a handle on the package, I would start with the package [vignette](https://CRAN.R-project.org/package=magrittr/vignettes/magrittr.html).
```{block2 mispipery, type='rmd-caution'}
**Gotchas with pipes.** Typing `3/c(3,3) %>% sum`, you might expect 3/3 + 3/3 = 2, but instead get 3/(3+3) = 1/2. This is a confusing quirk of operator precedence (\@ref(order-ops)) and something to watch out for with pipes. As usual, using parentheses to clarify fixes it: `(3/c(3,3)) %>% sum`.
```
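The gotcha above can be confirmed directly:

```{r pipe-precedence, eval=FALSE}
3/c(3, 3) %>% sum    # 0.5: parsed as 3/(c(3, 3) %>% sum)
(3/c(3, 3)) %>% sum  # 2:   parentheses give the intended result
```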
Beyond the `%>%` operator, magrittr includes many other convenience tools:
- If `x = "boo"`, then `x %<>% paste("ya!")` overwrites it as "boo ya!".
- If `L = list(a = 1, b = 2)`, then `L %$% { a + b }` acts like `L$a + L$b`.
- `"boo" %>% paste("ya!") %T>% print %>% toupper` takes a "tee" detour to print before continuing with later pipes.
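The three operators above can be seen together in a short sketch:

```{r magrittr-ops, eval=FALSE}
x = "boo"
x %<>% paste("ya!")      # x is overwritten as "boo ya!"
L = list(a = 1, b = 2)
L %$% { a + b }          # exposes L's elements: acts like L$a + L$b
"boo" %>% paste("ya!") %T>% print %>% toupper  # prints "boo ya!", returns "BOO YA!"
```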
It also supports function definitions, like
```{r magrittr-funs, eval=FALSE}
f = . %>% paste("ya!") %T>% print %>% toupper
# usage
f("boo")
"baz" %>% f
```
Finally, it has some helper functions, to make common chains more readable:
```{r magrittr-extras, eval=FALSE}
12 %>% divide_by(3) %>% is_greater_than(5)
```
See `help(package = "magrittr")` for a full listing. I almost never use anything beyond `%>%`.
### The data.table package {#data-table}
[Data.table](http://r-datatable.com) offers a variant of the `data.frame` class for data frames (seen in \@ref(data-frames)), optimized for fast sorted and grouped operations and enhanced with cleaner syntax. Section \@ref(dt-class) introduces this `data.table` class.
The package also bundles in a variety of other functionality:
- ITime and IDate date and time classes
- `fread` and `fwrite` for fast reading and writing of delimited files
- `dcast` and `melt` for reshaping tables
Because these disparate features are bundled together in data.table, we don't have to load more packages and worry about their complicated dependencies or namespace conflicts (\@ref(namespaces)).
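A quick taste of those bundled tools (each is covered in more detail later; `fwrite` requires a reasonably recent version of the package):

```{r dt-extras-taste, eval=FALSE}
fwrite(data.table(a = 1:2, b = c("x", "y")), "tmp.csv")  # fast delimited write
fread("tmp.csv")                                         # fast read, returns a data.table
as.IDate("2017-03-01")                                   # integer-backed date class
```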
#### Installation
Again, CRAN can be used:
```{r inst-dt, eval=FALSE}
install.packages('data.table')
```
#### Development version
The CRAN version is usually all you need. However, the package is under active development, with new features available only in the development version. To install it, I usually run:
```{r inst-dt-dev, eval=FALSE}
remove.packages("data.table")
install.packages("data.table", type = "source",
repos = "http://Rdatatable.github.io/data.table")
```
[The instructions on the package wiki](https://github.com/Rdatatable/data.table/wiki/Installation) offer details and other options.
```{block2 grr-rtools, type='rmd-caution'}
**Windows shenanigans.** If using Windows, you'll need to first install [Rtools](https://cran.r-project.org/bin/windows/Rtools/) (requiring administrator privileges), as explained in the last wiki link. This is also required for many other packages. After installing Rtools, it may also be necessary to tell R where it is. Instructions are in section "6.3.1 Windows" of the "R Installation and Administration" manual (see \@ref(docs)).
The manual covers it, but here's my synopsis. Navigate to R's Makeconf files:
    $(R_HOME)/etc/$(R_ARCH)/Makeconf
For `R_HOME`, paste the result of `R.home() %>% normalizePath %>% cat("\n")` into Windows Explorer. For `R_ARCH`, select one of the folders seen there (usually `i386` and `x64`). Edit the Makeconf file, setting the `BINPREF` or `BINPREF64` line to something like
    BINPREF ?= c:/Rtools/mingw_64/bin/
It will depend on where Rtools was installed. This might have to be done again whenever a new version of Rtools is installed.
```
#### Getting started
The official vignettes are a great way to get started with the package. See them on the [wiki](http://r-datatable.com/Getting-started) or with
```{r, eval=FALSE}
browseVignettes(package="data.table")
```
(Beware that as of March 2017, the website vignettes are somewhat out-of-date.) The website also includes links to other useful materials (an online course, presentation materials, blog posts). Before using the package, I started by reading some slides and the FAQ in full.
```{block2 watch-dev, type='rmd-details'}
**Keeping track of changes in R and packages.**
To follow changes in R:
> Go to https://cran.r-project.org > "What's New" > "NEWS"
To follow changes in the CRAN version of a package:
> Go to, e.g., https://cran.r-project.org/package=data.table > "NEWS"
To follow changes to an in-development version of a package, instead navigate from its CRAN page to the "URL" field. The magrittr package is not in active development, so it has no "NEWS" file and little happening on its website, but data.table has been changing consistently for years.
Usually, these changes don't matter much, but it's generally a good idea to check the "Potentially Breaking Changes" list, if there is any (e.g., [in data.table 1.9.8](https://github.com/Rdatatable/data.table/blob/master/NEWS.md#changes-in-v198--on-cran-25-nov-2016)) before upgrading.
```
## The data.table class {#dt-class}
Data.tables extend the data frame class introduced in \@ref(data-frames).
This section reviews basic operations seen in the last chapter (inspecting, slicing, extracting), before discussing how `data.table` extends `data.frame`.
### Inspecting {#dt-inspect}
Consider the `quakes` data set (with more info in `?quakes`):
```{r dt-view-pre, echo=1}
quakeDT = data.table(quakes)
quakeDT
```
By default, data.tables with over 100 rows will print in the compressed form seen above, showing just the first and last five rows. To change that row-count threshold globally, use `options(datatable.print.nrows = n)`. To override it for a single call, use `print` with `nrows=`; I usually set `nrows=Inf` to print all rows:
```{r dt-print, eval=FALSE}
quakeDT %>% print(nrows = Inf)
```
To browse a table in a new window, `View` for data frames (\@ref(view)) again works:
```{r dt-view, eval=FALSE}
quakeDT %>% View
# or if using RStudio
quakeDT %>% utils::View()
```
To inspect the structure of a data.table, we can again use `str` (\@ref(str)):
```{r dt-str}
quakeDT %>% str
```
Here, we see the class of each column along with the first few values. In addition, there is an obscure `".internal.selfref"` attribute, which we can ignore except when saving or loading the table from disk (\@ref(dt-saveload)). Other attributes will sometimes also show up here, related to optimizing the performance of ordered or grouped queries on the table (\@ref(keys-indices)).
In the code above, we could use syntax like `View(quakeDT)` instead of `quakeDT %>% View`, but I often find the latter handier, since it's easy to insert intermediate steps, like `quakeDT %>% head(10) %>% View`. Also, while `%>%` is slow, it's not going to matter for tasks like browsing data.
### Slicing {#dt-slicing}
A slice of a data.table is a smaller data.table formed by subsetting rows, columns, or both, analogous to vector and matrix slices seen in \@ref(slicing) and \@ref(slicing-matrix). The remaining columns will retain attributes from the full table (class, levels for factors, etc.).
The syntax is familiar from vectors and matrices:
```{r dt-slicing-pre}
# example data
DT = data.table(
x = letters[c(1, 2, 3, 4, 5)],
y = c(1, 2, 3, 4, 5),
z = c(1, 2, 3, 4, 5) > 3
)
```
```{r dt-slicing}
DT[1:2, ]
DT[, "z"]
DT[-3, c("y", "z")]
```
One difference with matrices is that when only slicing rows, we can skip the comma:
```{r dt-slicing-rows}
DT[1:2]
```
Other differences come into play when subsetting columns programmatically, where we need either a `..` prefix...
```{r dt-slicing-cols-vec, error=TRUE}
keep_cols = c("y", "z")
DT[, keep_cols] # error
DT[, ..keep_cols] # use this instead
```
... or we need `with=FALSE` if programming inline:
```{r dt-slicing-cols-program}
DT[, letters[25:26]] # no error, but doesn't print what we want
DT[, letters[25:26], with=FALSE] # use this instead
```
These requirements are a quirk of the more flexible `DT[...]` syntax that supports far more than taking slices, as discussed in \@ref(dt-syntax).
```{block2 dt-colnumbers, type='rmd-caution'}
**Use column names, not numbers.** Subsetting by hard-coded column number, like `DT[, 2:3]` or `cols = 2:3; DT[, ..cols]` works but is discouraged. Column numbers can easily change in the course of writing or updating a script, invalidating column number references in a way that will be annoying to debug. See the first answer in `vignette("datatable-faq")` for a deeper discussion.
```
```{block2 dt-dfbrackafrak, type='rmd-details'}
**How data frame slicing works.** This is really getting in the weeds, since I suggest not using data frames at all, but you'll see two major differences if you do. First, `DF[, "z"]` will extract the `z` column instead of taking a slice thanks to data frames' `drop=TRUE` default (which also came up regarding matrices in the last chapter). Second, `DF[1:2]` will slice the first two *columns* instead of the first two rows, thanks to the fact that no-comma usage of `[` on a data.frame triggers list slicing.
```
`head`, `tail` and the special empty and missing-data slices (all seen in the last chapter) work by row:
```{r dt-slice-headtailspecial}
head(DT, -3) # top, removing last 3
tail(DT, 2) # bottom 2
DT[0L] # empty
DT[NA_integer_] # missing
```
The package also offers `first` and `last`, which are simply special cases of `head` and `tail`.
Fancier slicing methods, like `DT[x > "b", .(y, z)]`, will be introduced with the rest of the `DT[...]` syntax in \@ref(dt-syntax).
There are some syntactical shortcuts for slicing to a set of columns in `j`; see \@ref(program-cols).
### Extracting columns
Since data.tables (and data frames) are lists, we can extract columns like `DT$z` or `DT[["z"]]`.
With the full `DT[...]` syntax (\@ref(dt-syntax)), it is easy to extract a column for a limited set of rows, like `DT[x > "b", z]`.
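Putting these together (using the `DT` example table from above):

```{r dt-extract-ex, eval=FALSE}
DT$z            # extract a column by name
DT[["z"]]       # same, friendlier for programmatic use
DT[x > "b", z]  # extract z only for rows where x > "b"
```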
### Extensions to the data.frame class {#dt-extends}
The first thing to note is that data.tables *are* data frames, so any function that works on a data frame will work fine with data.tables, too (even if the data.table package is not installed or attached).
#### Modification in-place
In contrast with the rest of R, data.tables are primarily modified "in-place" (or "by reference"), which can be much more efficient. As a first example, we can switch the class of a data frame to data.table:
```{r setDT-pre}
# example data
DF = data.frame(
x = letters[c(1, 2, 3, 4, 5)],
y = c(1, 2, 3, 4, 5),
z = c(1, 2, 3, 4, 5) > 3
)
```
```{r setDT}
str(DF)
setDT(DF)
str(DF)
```
We did not use `=` or `<-` to assign the result of `setDT`; the function simply altered `DF` in-place. All of data.table's functions named like `set*` do this.
```{block2 dt-byref, type='rmd-details'}
**R's copy-on-modify.** Understanding the contrast between modification in-place and base R's modification rules (often called "copy-on-modify") probably requires some familiarity with C (see threads on [stackoverflow](http://stackoverflow.com/questions/15759117/what-exactly-is-copy-on-modify-semantics-in-r-and-where-is-the-canonical-source) or [the mailing list](http://r.789695.n4.nabble.com/Confused-about-NAMED-td4103326.html) if interested). I wouldn't worry about it except to note that many base R operations make copies, which is costly in terms of RAM and computing time; while those operations' data.table counterparts do not have this problem.
```
#### Factor vs character columns {#dt-factorvchar}
Another difference is that `data.frame(...)` reads string input as a `factor` categorical variable (\@ref(factors)), as seen in the `str` output above; while `data.table(...)` reads it as character:
```{r DT}
DT = data.table(
x = letters[c(1, 2, 3, 4, 5)],
y = c(1, 2, 3, 4, 5),
z = c(1, 2, 3, 4, 5) > 3
)
str(DT)
```
Data frames' preference for factors may still bite when reading files in with `read.table` from base R instead of `fread` from data.table (\@ref(fread)).
#### Extended syntax for `DT[...]`\ {#dt-syntax}
The third major difference is the extension of the `DT[...]` syntax to support more than simple slices (like `DF[i,j]`, with `i` and `j` as indices, covered in \@ref(dt-slicing)). It offers SQL-style syntax that is cleaner, particularly for by-group operations:
    # (pseudocode)
    DT[where, select|update|do, by] # SQL verbs
    DT[i, j, by]                    # R function arguments
This should be read as a command to take a sequence of steps:
1. Subset using `i`
1. Group using `by`
1. Do `j`
We can use column names as barewords, and even form expressions in terms of columns:
```{r DT-ex}
DT[x > "b", sum(y), by=z]
```
Typically, the `i` and `j` arguments are called by position, as seen here; while `by=` is called by name.
Whenever `j` evaluates to a list, the output will be a new data.table:
```{r DT-exdot}
DT[x > "b", .(n_bigy = sum(y > 2), n_smally = sum(y <= 2)), by=z]
```
```{block2 dt-dot, type='rmd-caution'}
**The `.()` convenience function.** Inside many arguments of `DT[...]`, we can use the shorthand `.()`, which stands for `list()`.
```
The use of `by=` is very similar to a `for` loop (\@ref(for-loops)) over subsets, but it is better in a few important ways:
- We don't have to manually construct and keep track of some "split-up data" list to iterate over.
- Fast by-group functions are used when available (see `?GForce`)
- We don't have to define the task in `j` as a function of prespecified variables -- we can just use any columns of the data.table.
- We don't have to worry about intermediate variables (like `case` in the next example) contaminating the global environment.
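As a sketch of the first point, a single `by=` query replaces manual splitting and looping (the loop here uses `split`'s `by=` argument for data.tables, available in recent versions of the package):

```{r by-vs-loop, eval=FALSE}
# the by= query...
DT[x > "b", sum(y), by = z]
# ...roughly replaces this manual split-and-loop
res = list()
for (s in split(DT[x > "b"], by = "z")) res[[length(res) + 1L]] = sum(s$y)
```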
There are some syntactical shortcuts for writing a list of columns in `by=`; see \@ref(program-cols).
#### An example {#dt-syntax-ex}
The task in `j` can really be anything. Just as a demonstration of its power, here's how it can be used for saving per-group plots to a PDF:
```{r dt-bydemo, cache=TRUE}
# note: this code will save to your current working directory
# type getwd() and read ?setwd for details
bwDT = data.table(MASS::birthwt)
pdf(file="birthweight_graphs.pdf")
bwDT[, {
case = sprintf("high #visits? = % 5s, smoking? = % 5s", .BY$high_vis, .BY$smoke)
cat("Handling case:", case, "...\n")
plot(
age ~ lwt,
main = case
)
}, by=.(high_vis = ftv >= 1, smoke = as.logical(smoke))]
dev.off()
```
```{r dt-bydemo-cleanup, echo=FALSE, results=FALSE}
for (fn in intersect("birthweight_graphs.pdf", list.files()))
file.remove(fn)
```
The example above should make intuitive sense after reading the docs for each object, `?MASS::birthwt`, `?pdf`, et al. For each group, we're formatting a string with `sprintf`; printing it with `cat`; and saving a plot. The special symbol `.BY` is a list containing the per-group values of the `by=` variables. Formatting and printing strings will be covered more in \@ref(strings). Plotting will be covered more under "Exploring data" (\@ref(exploring-data)) and exporting graphs in \@ref(pub-graphs).
```{block2 dt-logicolsub, type='rmd-caution'}
**Filtering on logical columns.** `DT[z]` and `DT[!z]` will give errors, for design reasons related to the syntax for joins (\@ref(dt-joins)). To get around this, always wrap the column in parentheses: `DT[(z)]` and `DT[!(z)]`.
```
```{block2 dt-byfloats, type='rmd-caution'}
**Grouping on floats.** Much of the computational and syntactical magic of the package comes from grouping rows together with the `by=` argument. To group on a floating-point variable, however, is just asking for trouble, [for the usual numerical computing reasons](http://floating-point-gui.de/). Instead, always use characters, integers or factors. To discretize a continuous variable into bins, use `cut`.
```
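For example (bin edges chosen arbitrarily here), the float `depth` column of `quakeDT` can be binned with `cut` before grouping:

```{r by-cut-bins, eval=FALSE}
# count quakes by binned depth rather than grouping on the raw float
quakeDT[, .N, by = .(depth_bin = cut(depth, c(0, 70, 300, 700)))]
```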
```{block2 dt-verbose, type='rmd-details'}
**Verbose data.table messages.** To learn how data.table queries work, I recommend toggling the setting `options(datatable.verbose = TRUE)`. This option is similar to verbose output from an optimization call (reporting the value of the objective at each iteration, etc.). To only see verbose output for a single call, add `verbose = TRUE`. For example,`DT[x > "b", sum(y), by=z, verbose=TRUE]`.
```
## Aggregation {#dt-agg}
The mtcars data set (see `?mtcars`) has two categorical variables:
- `am` for automatic (0) or manual (1) transmission; and
- `vs` for v (0) or straight (1) engine shape.
```{r agg-pre}
# example data
carsDT = data.table(mtcars, keep.rownames = TRUE)
# quick inspection
first(carsDT)
```
Suppose we want to compare the mean horsepower, `hp`, across these categories:
```{r agg}
carsDT[, .(mean_hp = mean(hp)), by=.(am, vs)]
```
So, we just write `j` of `DT[i,j,by]` as an expression to compute the summary statistic, optionally giving it a name by wrapping in `.(name = expression)`.
Sections \@ref(explore-onevar) and \@ref(explore-vars) cover more options for exploring data with summary statistics; and \@ref(dcast-browse) shows how to put this result in wide format (with, e.g., `am` on rows and `vs` on columns).
### Iterating over columns {#dt-lapply}
Now suppose we want to compare mean horsepower, weight and displacement:
```{r agg-sd, eval=-(1:2)}
carsDT[, .(mean_hp = mean(hp), mean_wt = mean(wt), mean_disp = mean(disp)), by=.(am, vs)]
# can be simplified to...
carsDT[, lapply(.SD, mean), by=.(am, vs), .SDcols = c("hp", "wt", "disp")]
```
So, we just write the relevant columns in `.SDcols` and refer to `.SD`, the ***S**ubset of **D**ata*. Within each `by=` group, the query has access to `.SD` with rows for that group and columns as specified in `.SDcols`. `.SD`, like `.BY` seen earlier, is a special symbol available in some arguments of `DT[...]`, documented at `?.SD`.
We can use `lapply` (a function designed for iterating over lists) here since `.SD` is a data.table, which is a list of column vectors (see \@ref(lapply-df)). The column names carry over to the result because `lapply` always carries over names.
While dot notation `.SDcols=.(hp, wt, disp)` is not yet supported, there are a variety of convenience features for specifying `.SDcols`, covered in \@ref(program-cols).
```{block2 dt-avoidbyrow, type='rmd-caution'}
**"Aggregating" across columns.** One major red flag to look out for is the desire to "aggregate" columns by row, setting `by=1:nrow(DT)` and possibly using `unlist(.SD)` somewhere. Not only will this be incredibly slow, but it also suggests that the data is poorly organized, costing a lot of extra mental energy at every step. Section \@ref(structuring-data) explains some ways to format data better to avoid the need for this problematic approach.
```
### Exercises
1. Within each `.(am, vs)` group, return the last row (or equivalently, the last element of each column). There are many ways to do this.
## Modifying data {#dt-subassign}
Creating, editing and removing columns are all done using `:=` in `j` of `DT[i, j, by]`. This functionality operates in-place, in the sense that the underlying data stays in the same place, which is more efficient in terms of how much RAM and time is taken. See `vignette("datatable-reference-semantics")` for details.
```{block2, type='rmd-details'}
**What can be modified in-place?** The scope of in-place operations is currently limited to altering columns. Adding and removing rows in-place is not yet supported; it is harder to do in R, due to its column-oriented storage of tables (contrasting with database systems that store data rowwise for easy INSERT and DELETE queries).
```
### Creating columns {#dt-col-create}
```{r dt-add-pre, echo=1:2}
# example data
DT = data.table(
x = letters[c(1, 2, 3, 4, 5)],
y = c(1, 2, 3, 4, 5),
z = c(1, 2, 3, 4, 5) > 3
)
DT
```
```{r dt-add}
# creating a column
DT[, u := 5:1][]
# creating multiple
DT[, `:=`(v = 2, w = 3L)][]
# creating with dynamic names
nms = c("a", "b", "c")
DT[, (nms) := .(1, 2, 3)][]
```
All of these tasks are performed in-place -- altering `DT` without making a new object. Usually, the results are not printed in the console; but they appear here because `[]` is "chained" onto the end of each task.
```{block2 modify-op, type='rmd-caution'}
**`:=` is the function for *creation, modification and deletion* of columns.** This will be covered in more detail in subsequent sections, but it is worth emphasising how parsimonious this syntax is. In particular, it contrasts with Stata (which uses distinct verbs `gen`, `replace` and `drop`).
```
```{block2 modify-iter, type='rmd-details'}
**Iterative column creation.** The `` `:=`(...)`` syntax does *not* support iterative definitions like ``DT[, `:=`(W1 = u + y, W2 = W1^2)]``. One common workaround is ``DT[, `:=`(W1 = W1 <- u + y, W2 = W1^2)]``. This solution may not be intuitive for new R users, but the gist is: `W1 <- u + y` creates `W1` as an object in `DT[...]` and then returns its value, let's call it `v`. Now, `W2 = W1^2` can find `W1`, since it was created by `<-`; and `W1 = W1 <- u + y` simplifies to `W1 <- v`, where `v` is the return value of `W1 <- u + y`.
```
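To make the workaround concrete (using the `u` and `y` columns created above; the chained form is often clearer):

```{r iter-cols-ex, eval=FALSE}
DT[, `:=`(W1 = W1 <- u + y, W2 = W1^2)]
# or, more transparently, chain two := calls:
DT[, W1 := u + y][, W2 := W1^2]
```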
### Removing columns
```{r dt-remove}
nms = c("u", "v", "w", "a", "b", "c")
DT[, (nms) := NULL][]
```
A warning will print if we try to remove columns that don't currently exist.
### Replacing entire columns {#dt-replace-all}
Data.table is careful about column types:
```{r dt-reptot, warning=TRUE}
DT[, a := 10L ][] # Create a new column of integers
DT[, a := 21L ][] # Replace it with another column of integers
DT[, a := 32.5 ][] # Replace it with a float
```
The warning in the last call is related to coercion of vector classes (\@ref(classes)).
```{block2 modify-typesafe, type='rmd-details'}
**Safeguards against accidental coercion.** The verbose warning printed above is typical for the package and an excellent feature. Running the final command, data.table knows that the `a` column is of integer type and sees that 32.5 is conspicuously *not an integer*. So it gives a warning when coercing 32.5 to an integer (to match `a`).
Elsewhere in R, `a` would be coerced to match the float `32.5` -- probably not the behavior we want -- with no warning. That is, if we have `x <- c(21L, 22L)`, we can freely assign `x[1] <- 32.5`; and the same freedom (by which I mean "danger") is present even if `x` is a data frame column. For more on the issue, search online for "type safety."
```
If we want this assignment to work, we need to change `a`'s type by passing a full vector, as described in the warning:
```{r dt-reptot-class, warning=TRUE}
DT[, a := rep(32.5, .N) ][]
```
The coercion is done silently, but it can be made more visible by turning on `verbose`, which notes the "plonk" of a full vector replacement. `.N` is a special symbol for the number of rows (or the number of rows in a subset when `i` or `by=` is present).
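A few quick illustrations of `.N`:

```{r dot-N-ex, eval=FALSE}
DT[, .N]          # total row count
DT[, .N, by = z]  # row count within each z group
DT[y > 2, .N]     # rows matching a condition
```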
### Replacing columns conditionally {#dt-ifelse}
Conditional replacement here is analogous to a SQL UPDATE query or a replace if command in Stata. To illustrate, we will look again at a vectorized `if`/`else` assignment, mentioned in \@ref(ifelse):
```{r dt-ifelse-1, results="hide"}
DT[ , b := "Aardvark"] # initialize to baseline value
```
```{r dt-ifelse-2}
DT[y > 1, b := "Zebra"][] # replace based on a condition
```
This can also be done with chaining:
```{r dt-ifelse-chain, results="hide"}
DT[, b := "Aardvark"][y > 1, b := "Zebra"]
```
This chaining works because `DT[...][...]` is evaluated like `(DT[...])[...]` and the return value of the piece in parentheses is `DT` -- provided `j` has a `:=` statement.
```{block2 modify-chainwarn, type='rmd-caution'}
**Broken chains.** If `j` is not a `:=` statement, the return value is not the original data.table but rather a new one. Subsequent steps in the chain will not affect the starting table. So, after `DT[y < Inf, d := 1]` and `DT[y < Inf][, d := 2]`, what does the `d` column look like in `DT`? See the Exercise section of `vignette("datatable-reference-semantics")`; and the Note section of `` ?`:=` ``.
```
As we saw in the last section, partial replacement of a column will trigger a warning if the classes don't match:
```{r dt-ifelse-conflict, warning=TRUE, results="hide"}
DT[, b := "Aardvark"][y > 1, b := 111]
```
Another nice feature, similar to Stata, is reporting of the number of rows modified. This can be seen by turning `verbose` on:
```{r dt-ifelse-count}
DT[, b := "Aardvark"][y > 1, b := "Zebra", verbose = TRUE]
```
The number reported is the number of rows assigned to (irrespective of whether their values were changed by the assignment), in contrast with Stata, which reports the number of changes.
```{block2 dt-subassign, type='rmd-details'}
**Modifying subsets.** Conditional modifications `DT[cond, v := expr]` are a special case of more general subset modifications, `DT[i, v := expr]`, where `i` can be any row-slicer covered in \@ref(dt-slicing) (e.g., a vector of row numbers) or even another data.table (in an "update join"; see \@ref(joins-update)).
```
### Other in-place modifications
The data.table package has a few other tools for modifying table attributes in-place:
- The `set` function is another way of making assignments like `:=`.
- `setDT` and `setDF`, seen earlier, alter the class.
- `setorder` will sort the table by some or all of its columns.
- `setcolorder` changes the order in which columns are displayed.
- Indices and the key (explained in \@ref(keys-indices)) can be set with `setindex` and `setkey`, respectively.
- `setnames` will alter column names. For example, in \@ref(dt-lapply) we saw...
```{r modify-namesexpost-pre}
carsDT[, lapply(.SD, mean), by=.(am, vs), .SDcols = c("hp", "wt", "disp")]
```
... and if we want to add the prefix `"mean_"` to the results, we can do
```{r modify-namesexpost}
cols = c("hp", "wt", "disp")
carsDT[, lapply(.SD, mean), by=.(am, vs), .SDcols = cols] %>%
setnames(cols, sprintf("mean_%s", cols)) %>% print
```
The trailing `%>% print` command is used because `setnames`, like `:=` and the other `set*` operators, does not print the table on its own. The string-formatter `sprintf` will be explained in \@ref(strings).
- `setattr` is a general function for altering attributes of a vector or other object (see `?attributes`). For example, if we have a factor column (encoding categorical data), we can change its "levels":
```{r dt-setlevels-pre}
# example data
fDT = data.table(fac = factor(c("A", "B"), levels = c("A", "B", "C")))
```
```{r dt-setlevels}
levels(fDT$fac)
fDT$fac %>% setattr("levels", c("A", "B", "Q"))
levels(fDT$fac)
```
With `setattr`, we avoid the copy usually incurred by `levels(x) <- lvls` (base R syntax), but it also skips some validity checks, so take care to assign a sensible vector of levels.
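As a minimal sketch of the `set` function mentioned above (using a hypothetical toy table `exDT`), `set(DT, i, j, value)` makes the same kind of in-place assignment as `:=`, with less overhead when called repeatedly in a loop:

```{r set-sketch}
exDT = data.table(x = 1:3)
set(exDT, j = "y", value = 0)  # same effect as exDT[, y := 0]
exDT
```

Like the other `set*` operators, `set` returns its input invisibly, so the final `exDT` line is there just to print the result.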
### Avoiding in-place modification
To create a new data.table starting from an existing table, use
```{r dt-copy, eval=FALSE}
DT2 = copy(DT1)
```
We have to use this instead of `DT2 = DT1` since with the latter we have only created a new "pointer" to the first table. I routinely make copies like this after reading in data (\@ref(fread)) so that I can back out where I tripped over something in the process of data cleaning.
Besides `DT2 = DT1`, `names1 = names(DT1)` is also unsafe: `names` does not return a snapshot of the column names at a given time, but rather a pointer to the names attribute, which changes as columns are modified or rearranged.
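A minimal sketch of this pitfall (with a hypothetical table `DT1`):

```{r names-pitfall, eval=FALSE}
DT1 = data.table(old = 1:3)
nms = names(DT1)        # points at DT1's names attribute
setnames(DT1, "old", "new")
nms                     # now shows "new", not "old"
nms = copy(names(DT1))  # a safe, independent snapshot
```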
### Using in-place modification in functions
A user-written function (\@ref(function-writing)) like
```{r modify-fun}
f <- function(DT) DT[, newcol := 1111 ][]
```
will act like `:=` and the `set*` functions, altering its input in-place. This can be useful, but requires extra caution. See the "`:=` for its side effect" section of `vignette("datatable-reference-semantics")` for discussion.
The `[]` inside the function is necessary for printing, due to restrictions of R syntax and how data.table gets around them.
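For instance (a minimal sketch, assuming `f` as defined above):

```{r modify-fun-demo, eval=FALSE}
DT = data.table(x = 1:2)
f(DT)  # prints DT with newcol added
DT     # the caller's DT was modified in-place too
```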
### Exercises
1. With `carsDT = data.table(mtcars, keep.rownames = TRUE)`, overwrite the `am` column with a factor so 0 becomes "automatic" and 1 becomes "manual", so `head(carsDT)` looks like...
# rn mpg cyl disp hp drat wt qsec vs am gear carb
# 1: Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 manual 4 4
# 2: Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 manual 4 4
# 3: Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 manual 4 1
# 4: Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 automatic 3 1
# 5: Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 automatic 3 2
# 6: Valiant 18.1 6 225 105 2.76 3.460 20.22 1 automatic 3 1
2. In `DT = data.table(x = letters[1:5], y = 1:5, z = 1:5 > 3)`, edit `x` to equal "yolo" on rows where `z` is true, so `DT` looks like...
# x y z
# 1: a 1 FALSE
# 2: b 2 FALSE
# 3: c 3 FALSE
# 4: yolo 4 TRUE
# 5: yolo 5 TRUE
3. In `DT = data.table(x = letters[1:5], matrix(1:15,5) > 7)`, add a column `id` that contains the row number; and use `setcolorder`, `names(DT)` and `setdiff` to move `id` to be the first column.
4. In `DT = data.table(id = 1:5, x = letters[1:5], matrix(1:15,5) > 7)`, use `setnames` to rename `id` to `ID`.
## Joins {#dt-joins}
Almost any operation between related tables can be called a "join" or "merge." One notable exception is "stacking" tables atop each other, covered in \@ref(rbindlist-read).
### Equi joins
```{r dt-join-prea, echo=c(1:2,4L)}
# example data
a = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), t = c(1L, 2L, 1L, 2L, NA_integer_), x = 11:15)
a
b = data.table(id = 1:2, y = c(11L, 15L))
b
```
The idiom for a simple equi join is `x[i, on=.(...)]` or `x[i]` for short:
```{r dt-join}
a[b, on=.(id)]
```
It is called an equi join since we are only getting matches where equality holds between the `on=` columns in the two tables.
Think of `x[i]` as using index table `i` to look up rows of `x`, in the same way an "index matrix" can look up elements of a matrix (\@ref(matrix-extract)). By default, we see results for every row of `i`, even those that are unmatched.
Here are some more complicated examples:
```{r dt-join-more}
a[b, on=.(x = y)]
a[b, on=.(id, x = y)]
```
When merging on columns with different names, write them in `on=` as `x = y`, where `x` comes from the "left" table and `y` from the "right" one. Because we are using `i` to look up rows in `x`, the displayed column takes its name from `x` and its values from `i`.
A character vector also works, e.g., `on=c("id", x = "y")`, which makes programmatic merging easier.
```{block2 dt-efficientlookup, type='rmd-details'}
**Binary search.** Data.table joins find matching rows using a binary search algorithm, which is usually faster than a naive "vector scan" (which tests every element for a match). See the "binary search vs vector scans" section of `vignette("datatable-keys-fast-subset")` for details.
```
### Subset lookup
For browsing dynamic data (that grows over time), it is convenient and quick to use joins:
```{r dt-lookup}
a[.(1L), on=.(id)]
# or, as a function
ra <- function(my_id) a[.(my_id), on=.(id)]
ra(2L)
```
Note that the index `i` is a list here, not a data.table. The link between the two is close, since data.tables are just lists of columns (\@ref(dt-lapply)). Any list passed in `i` will be treated the same as the analogous data.table with appropriate column names.
Subset browsing becomes even easier when keys are set so that `on=` can be skipped (\@ref(keys-indices)).
### Aggregating in a join
Looking again at the first join above, suppose we want to use `b` to find rows in `a` and then add up `a$x`. We'll do this by using `by=.EACHI`:
```{r join-byeachi}
a[b, on=.(id)]
a[b, on=.(id), sum(x), by=.EACHI]
```
It is called "each `i`" since the syntax is `x[i, ...]` and we are computing per row of the index `i`. I prefer to write the `on=` before the `j` (unlike the core syntax of `DT[i,j,by]`) since the `on=` merging columns are closely related to the `i` index.
If we tried summing `b$y` here, we would not get `11+11` for `id` 1:
```{r join-byeachi-confusing}
a[b, on=.(id)] # y shows up twice
a[b, on=.(id), .(sum(x), sum(y)), by=.EACHI] # we only get one y
```
This is because we are working by each row of `b`. For `id` 1 in `b`, `y` is a single value, `11`. So really there is no point in summing `y` or otherwise aggregating columns from `i` when using `by=.EACHI`.
```{block2 dt-join-beware, type='rmd-caution'}
**Beware `DT[i,on=,j,by=bycols]`.** Just to repeat: only `by=.EACHI` works in a join. Typing other `by=` values there will cause `i`'s columns to become unavailable. This [may eventually change](https://github.com/Rdatatable/data.table/issues/733).
```
### Updating in a join {#joins-update}
Continuing from the last example, we are computing `sum(x)` per row of `b`, so maybe we want to save the result as a column in `b`:
```{r join-byeachi-assign, echo=-4}
b[, sumx :=
a[b, on=.(id), sum(x), by=.EACHI]$V1
]
# or
b[, sumx :=
a[.SD, on=.(id), sum(x), by=.EACHI]$V1
]
b
```
```{block2 v1v2v3, type='rmd-caution'}
**Default names for `j` computations.** When the task in `j` evaluates to an unnamed list, default names `V1, V2, ...` are assigned. We can use these to extract a single computed column with `$V1`, `$V2`, etc. Since we can only extract one column at a time, we usually only create and extract a single column, auto-named `V1`.
```
On the other hand, we may want to take values from `b` and assign them to the larger table `a`:
```{r join-assign, echo=-2}
a[b, on=.(id), y := i.y ]
a
```
This sort of operation is very common when doing analysis with well-organized relational data, and will come up again in \@ref(structuring-data). It is essentially the same as a SQL UPDATE JOIN. It might also be called a "merge assign" (though I seem to be the only one using that term).
The `i.*` prefix in `i.y` indicates that we are taking the column from the `i` table in `x[i]`. We can similarly use an `x.*` prefix for columns from `x`. This helps to disambiguate if the same column names appear in both tables, and is particularly helpful with non-equi joins (\@ref(joins-nonequi)). I recommend always using `i.*` prefixes when copying columns in an update join.
```{block2 dt-updatejoin-beware, type='rmd-caution'}
**Beware multiple matches in an update join.** When there are multiple matches (\@ref(join-multimatch)), an update join will apparently only use the last one. [Unfortunately](https://github.com/Rdatatable/data.table/issues/2022), this is done silently. Try `b[a, on=.(id), x := i.x, verbose = TRUE ][]`. With `verbose` on, we see a helpful message about assignment "to 3 row subset of 2 rows."
```
### Self join to fill in missing levels
Consider the `a` data set from above:
```{r dt-join-CJ-pre, echo=-3}
# example data
a = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), t = c(1L, 2L, 1L, 2L, NA_integer_), x = 11:15)
a
```
Now we want to "complete" the data set so that every `id` has a row for every time `t`:
```{r dt-join-CJ}
a[CJ(id = id, t = t, unique=TRUE), on=.(id, t)]
```
This is called a self join because we are using the table's own columns in `i` of `x[i, on=]`. The `CJ` function is a helper that takes all combinations of vectors, alternatively called the Cartesian product or a "cross join." See \@ref(combos) for details. By applying `unique=TRUE`, we treat each vector *as a set* in the mathematical sense, only considering distinct values.
Notice that missing values (`NA`) in `i` are treated in the same way as other values.
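To see what `CJ` produces on its own (a minimal sketch):

```{r cj-demo}
CJ(id = 1:2, t = c(1L, 2L))
```

`CJ` returns every combination of its arguments, sorted and keyed by those columns.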
### Handling matches {#join-matches}
For this section, we'll reset the tables:
```{r dt-mult-pre, echo=-c(3,5)}
# example data
a = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), t = c(1L, 2L, 1L, 2L, NA_integer_), x = 11:15)
a
b = data.table(id = 1:2, y = c(11L, 15L))
b
```
#### Handling multiply-matched rows {#join-multimatch}
In `a[b, on=.(id)]`, we are indexing by rows of `b` and so get matches for every row of `b`. By default, we get *all* matches in `a`, but this can be tweaked:
```{r dt-join-mult}
a[b, on=.(id), mult="first"]
```
Now each row of `b` only returns the first matching row (from the top) in `a`. Similarly, we could select `mult="last"`.
#### Handling unmatched rows {#join-unmatched}
Flipping it around, if we use `a` to index `b`, we have some index rows from `a` that don't have any matches in `b`:
```{r dt-join-nomatch}
b[a, on=.(id)]
```
These unmatched rows still show up in the result, which is usually nice. However, this behavior can also be tweaked:
```{r dt-join-nomatch0}
b[a, on=.(id), nomatch=0]
```
Dropping unmatched elements of `i` is similar to filtering `_m == 3` in Stata.
```{block2 dt-merge-res, type='rmd-caution'}
**Diagnostics for merges.** In Stata, joins report on how well they went -- did everything match? how many didn't match? The analogous question for an `x[i]` join is -- for each row of `i`, how many matches did we find in `x`? To see the answer, use `b[a, on=.(id), .N, by=.EACHI]`.
```
```{block2 dt-merge-stata, type='rmd-details'}
**Comparison with Stata.** Because `x[i]` uses `i` to look up rows in `x`, we never see rows that correspond to Stata's `_m == 1` -- those that belong to `x` but are not matched by `i`.
There is another way of merging, `merge(b, a)`, that allows for `all.x` and `all.y`, resembling Stata's options, but I have never found any reason to use it.
```
#### Handling imperfect matches with rolling joins {#im-rolling}
Sometimes we want unmatched rows paired with the closest match occurring earlier or later in the table:
```{r im-rolling-im-rolling}
# target x
myxDT = list(myx = c(5L, 10L, 15L, 20L))
# exact match (equi-join)
a[myxDT, on=.(x = myx), .(i.myx, x.x)]
# nearest match
a[myxDT, on=.(x = myx), roll="nearest", .(i.myx, x.x)]
# upward match within 3
a[myxDT, on=.(x = myx), roll=-3, .(i.myx, x.x)]
```
Recall from \@ref(joins-update) that the `i.*` and `x.*` prefixes refer to where columns come from in `x[i]`.
When joining on multiple columns, the roll is taken on the last column listed in `on=`.
The value of `roll=` refers to how much higher or lower `x` may be and still qualify as a match: the last `on=` column is "rolled" by up to `roll` if necessary to find one. So `roll = -3` means we accept an `x` up to 3 above the target, i.e., as far away as `x - 3 = myx`.
### Non-equi joins {#joins-nonequi}
It is sometimes useful to match on a range of values. To do this, we explicitly name all columns in `i` and define inequalities in `on=`. Suppose that for each `id` we have a range of acceptable `x` values and want all rows falling within it:
```{r join-interval, echo=-2}
mDT = data.table(id = 1:3, x_dn = 10L, x_up = 13L)
mDT
a[mDT, on=.(id, x >= x_dn, x <= x_up), .(id, i.x_dn, i.x_up, x.x)]
```
So we are defining a range `x_dn` to `x_up` for each `id` and finding all matches of `x` within the range.
These could alternatively be called "interval joins." For more on interval joins and subsets, see `?inrange`, `?foverlaps` and the [IRanges package](http://www.bioconductor.org/packages/IRanges/) that inspired these data.table tools.
### Shortcuts and tricks
When joining on a single character or factor column, the `.()` in `i` can be skipped:
```{r join-char, eval=1:2}
DT = data.table(id = c("A", "A", "B"), u = c(0, 1, 2), v = c(1, 4, 7))
DT["A", on=.(id)]
# instead of
DT[.("A"), on=.(id)]
```
#### Setting keys and indices {#keys-indices}
If a table is always joined on the same column(s), these can be set as its "key." Setting the key of `x` sorts the table and allows for skipping `on=` during `x[i,on=]` joins. It can also have some performance benefits. See `?setkey`.
```{r join-setkey}
setkey(DT, id)
DT["A"]
```
This can be dangerous, however, since even if `i` has names, they are ignored:
```{r join-setkey-warn}
setkey(DT, u, v)
DT[.(v = 7, u = 2)]
```
No match is found here since `i=.(v = 7, u = 2)` is mapped to the key `.(u,v)` by position and not by name.
It can also be dangerous in another way. If the key consists of all columns, we might not notice when joining and finding no match:
```{r join-nomatchandyetmatch}
setkey(DT, id, u, v)
DT[.("B", 1, 1)]
```
It looks like we found one match, but in fact we found zero. To avoid this, a custom manual browsing function can be written:
```{r browsetab}
r0 <- function(..., d = DT) {
x = d[.(...), nomatch=0]
print(x = x, nrows=Inf, row.names=FALSE)
invisible(x)
}
r0("B", 1, 1)
r0("B", 2, 7)
```
Most of the performance benefits of a key can also be achieved with an "index." Unlike a key, which sorts the table, an index simply notes the order of the table with respect to some columns. A table can have many indices, but only one key:
```{r join-indices, eval=1:2}
setindex(DT, id, u)
setindex(DT, id, v)
key(DT)
indices(DT, vectors=TRUE)
str(DT)
```
The performance benefits show up in most joins and some subsetting operations, too. Turn on `verbose` data.table messages to see when it is and is not kicking in. Also see `vignette("datatable-secondary-indices-and-auto-indexing")`; and regarding whether setting a key is important (as some old tutorials might say), see the package developer's [post](https://github.com/Rdatatable/data.table/issues/1232#issuecomment-131190268).
Keys and indices are destroyed whenever any of their columns are edited in a way that won't obviously preserve order.
```{block2 dt-wrongkeys, type='rmd-details'}
**Attribute verification.** Keys and indices are simply stored as attributes and may be invalid. For example, with `DT = data.table(id = 2:1, v = 2:1); setattr(DT, "sorted", "id")`, we've told the table that it is sorted by `id` even though it isn't. Some functions may check the validity of the key, but others won't, like `DT[.(1L)]`.
```
#### Anti joins
We also have the option of selecting *unmatched* rows with `!`:
```{r dt-join-notjoin}
a[!b, on=.(id)]
```
This "not join" or "anti join" returns all rows of `a` that are not matched by rows in `b`.
### Exercises
1. In `DT = data.table(id = rep(1:2, 2:3), x = 1:5, y = FALSE)`, use an update self join with `mult=` to edit `y` to `TRUE` in the first row of each `id` group.
2. In `DT = data.table(id = 1:4, v = 10:13)`, use an anti join to set `v` to missing if `id` is not in `my_ids = 2:3`.
3. In `DT = data.table(id = rep(1:2, c(10,5)), t = cumsum(1:15))`, use a non-equi self join to add a column counting how many observations of the current `id` are in the interval $(t-20, t)$. The result should be...
# id t N
# 1: 1 1 0
# 2: 1 3 1
# 3: 1 6 2
# 4: 1 10 3
# 5: 1 15 4
# 6: 1 21 4
# 7: 1 28 3
# 8: 1 36 2
# 9: 1 45 2
# 10: 1 55 2
# 11: 2 66 0
# 12: 2 78 1
# 13: 2 91 1
# 14: 2 105 1
# 15: 2 120 1
4. The "Grouping on floats" note (\@ref(dt-extends)) says to use `cut` to bin continuous variables, but this task can also be accomplished with a rolling join. For `irisDT = data.table(iris)`, create `mDT` so that
```{r, eval = FALSE}
irisDT[, v := mDT[irisDT, on=.(Species, Petal.Width), roll = TRUE, x.lab ]]
```
creates a column the same as...
```{r, eval = FALSE}
irisDT[, vcut := cut(
Petal.Width,
breaks = c(-Inf, median(Petal.Width), Inf),
right = FALSE,
lab = c("low", "high")
), by=Species]
```
```{r, echo=FALSE, eval=FALSE}
# answer
mDT = irisDT[, .(Petal.Width = c(-Inf, median(Petal.Width)), lab = factor(c("low", "high"), levels=c("low", "high"))), by=.(Species)]
```
## Input and output {#input-output}
This section covers reading tables from disk; basic cleaning of column formats; and writing to disk.
### File paths
Don't provide paths with backslashes, like `"C:\data\input.csv"`, since `\` is a special character (the escape character) in R strings. My workaround is to use forward slashes or construct the path using `file.path`.
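For example (a minimal sketch with a hypothetical path):

```{r file-path-demo, eval=FALSE}
# forward slashes work on Windows as well
dat = fread("C:/data/input.csv")
# or build the path programmatically
file.path("C:", "data", "input.csv")  # "C:/data/input.csv"
```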
- For a list of files and folders, use `dir` or `list.files`.