README.Rmd

---
title: "wakefield"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  md_document:
    toc: true
    toc_depth: 4
---

```{r, echo=FALSE}
desc <- suppressWarnings(readLines("DESCRIPTION"))
regex <- "(^Version:\\s+)(\\d+\\.\\d+\\.\\d+)"
loc <- grep(regex, desc)
ver <- gsub(regex, "\\2", desc[loc])
library(pacman)
# verbadge <- sprintf('<a href="https://img.shields.io/badge/Version-%s-orange.svg"><img src="https://img.shields.io/badge/Version-%s-orange.svg" alt="Version"/></a></p>', ver, ver)
verbadge <- ''
p_load(dplyr, wakefield, knitr, tidyr, ggplot2)
````

```{r, echo=FALSE}
knit_hooks$set(htmlcap = function(before, options, envir) {
  if(!before) {
    paste('<p class="caption"><b><em>',options$htmlcap,"</em></b></p>",sep="")
    }
    })
knitr::opts_knit$set(self.contained = TRUE, cache = FALSE)
knitr::opts_chunk$set(fig.path = "tools/figure/")
```

[![Project Status: Active - The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/0.1.0/active.svg)](https://www.repostatus.org/#active)
[![Build Status](https://travis-ci.org/trinker/wakefield.svg?branch=master)](https://travis-ci.org/trinker/wakefield)
[![Coverage Status](https://s3.amazonaws.com/assets.coveralls.io/badges/coveralls_0.svg)](https://coveralls.io/github/trinker/wakefield)
[![DOI](https://zenodo.org/badge/5398/trinker/wakefield.svg)](https://dx.doi.org/10.5281/zenodo.17172)
[![](https://cranlogs.r-pkg.org/badges/wakefield)](https://cran.r-project.org/package=wakefield)
`r verbadge`


**wakefield** is designed to quickly generate random data sets.  The user passes `n` (number of rows) and predefined vectors to the `r_data_frame` function to produce a `dplyr::tbl_df` object.

![](tools/wakefield_logo/r_wakefield.png)  

# Installation

To download the development version of **wakefield**:

Download the [zip ball](https://github.com/trinker/wakefield/zipball/master) or [tar ball](https://github.com/trinker/wakefield/tarball/master), decompress and run `R CMD INSTALL` on it, or use the **pacman** package to install the development version:

```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/wakefield")
pacman::p_load(dplyr, tidyr, ggplot2)
```


# Contact

You are welcome to:
* submit suggestions and bug-reports at: <https://github.com/trinker/wakefield/issues>
* send a pull request on: <https://github.com/trinker/wakefield/>
* compose a friendly e-mail to: <tyler.rinker@gmail.com>

# Demonstration
## Getting Started

The `r_data_frame` function (random data frame) takes `n` (the number of rows) and any number of variables (columns).  These columns are typically produced from a **wakefield** variable function.  Each of these variable functions has a pre-set behavior that produces a named vector of n length, allowing the user to lazily pass unnamed functions (optionally, without call parenthesis).  The column name is hidden as a `varname` attribute.  For example here we see the `race` variable function:

```{r}
race(n=10)
attributes(race(n=10))
```

When this variable is used inside of `r_data_frame` the `varname` is used as a column name.  Additionally, the `n` argument is not set within variable functions but is set once in `r_data_frame`:

```{r}
r_data_frame(
    n = 500,
    race
)
```

The power of `r_data_frame` is apparent when we use many modular variable functions:

```{r}
r_data_frame(
    n = 500,
    id,
    race,
    age,
    sex,
    hour,
    iq,
    height,
    died
)
```


There are `r length(variables())` **wakefield** based variable functions to chose from, spanning **R**'s various data types (see `?variables` for details).  

```{r, results='asis', echo=FALSE, comment=NA, warning=FALSE, htmlcap="Available Variable Functions"}
p_load(pander, xtable)

variables("matrix", ncol=5) %>%
    xtable() %>%
    print(type = 'html', include.colnames = FALSE, include.rownames = FALSE,
        html.table.attributes = '')

#matrix(c(sprintf("`%s`", vect), blanks), ncol=4) %>%
#    pandoc.table(format = "markdown", caption = "Available variable functions.")
```

However, the user may also pass their own vector producing functions or vectors to `r_data_frame`.  Those with an `n` argument can be set by `r_data_frame`:

```{r}
r_data_frame(
    n = 500,
    id,
    Scoring = rnorm,
    Smoker = valid,
    race,
    age,
    sex,
    hour,
    iq,
    height,
    died
)
```


```{r}
r_data_frame(
    n = 500,
    id,
    age, age, age,
    grade, grade, grade
)
```


While passing variable functions to `r_data_frame` without call parenthesis is handy, the user may wish to set arguments.  This can be done through call parenthesis as we do with `data.frame` or `dplyr::data_frame`:

```{r}
r_data_frame(
    n = 500,
    id,
    Scoring = rnorm,
    Smoker = valid,
    `Reading(mins)` = rpois(lambda=20),  
    race,
    age(x = 8:14),
    sex,
    hour,
    iq,
    height(mean=50, sd = 10),
    died
)
```

## Random Missing Observations

Often data contains missing values.  **wakefield** allows the user to add a proportion of missing values per column/vector via the `r_na` (random `NA`).  This works nicely within a **dplyr**/**magrittr** `%>%` *then* pipeline:

```{r}
r_data_frame(
    n = 30,
    id,
    race,
    age,
    sex,
    hour,
    iq,
    height,
    died,
    Scoring = rnorm,
    Smoker = valid
) %>%
    r_na(prob=.4)
```

## Repeated Measures & Time Series

The `r_series` function allows the user to pass a single **wakefield** function and dictate how many columns (`j`) to produce.  

```{r}
set.seed(10)

r_series(likert, j = 3, n=10)
```

Often the user wants a numeric score for Likert type columns and similar variables.  For series with multiple factors the `as_integer` converts all columns to integer values.  Additionally, we may want to specify column name prefixes. This can be accomplished via the variable function's `name` argument.  Both of these features are demonstrated here.

```{r}
set.seed(10)

as_integer(r_series(likert, j = 5, n=10, name = "Item"))
```

`r_series` can be used within a `r_data_frame` as well.  

```{r}
set.seed(10)

r_data_frame(n=100,
    id,
    age,
    sex,
    r_series(likert, 3, name = "Question")
)
```


```{r}
set.seed(10)

r_data_frame(n=100,
    id,
    age,
    sex,
    r_series(likert, 5, name = "Item", integer = TRUE)
)
```

### Related Series

The user can also create related series via the `relate` argument in `r_series`.  It allows the user to specify the relationship between columns.  `relate` may be a named list of \code{c("operation", "mean", "sd")} or a short hand string of the form of `"fM_sd"` where:

- `f` is one of (+, -, *, /)
- `M` is a mean value
- `sd` is a standard deviation of the mean value 

For example you may use `relate = "*4_1"`.  If `relate = NULL` no relationship is generated between columns.  I will use the short hand string form here.

#### Some Examples With Variation

```{r}
r_series(grade, j = 5, n = 100, relate = "+1_6")
r_series(age, 5, 100, relate = "+5_0")
r_series(likert, 5,  100, name ="Item", relate = "-.5_.1")
r_series(grade, j = 5, n = 100, relate = "*1.05_.1")
```

#### Adjust Correlations

Use the `sd` command to adjust correlations.

```{r}
round(cor(r_series(grade, 8, 10, relate = "+1_2")), 2)
round(cor(r_series(grade, 8, 10, relate = "+1_0")), 2)
round(cor(r_series(grade, 8, 10, relate = "+1_20")), 2)
round(cor(r_series(grade, 8, 10, relate = "+15_20")), 2)
```

#### Visualize the Relationship

```{r, fig.height=7, fig.width=11}
dat <- r_data_frame(12,
    name,
    r_series(grade, 100, relate = "+1_6")
) 

dat %>%
    gather(Time, Grade, -c(Name)) %>%
    mutate(Time = as.numeric(gsub("\\D", "", Time))) %>%
    ggplot(aes(x = Time, y = Grade, color = Name, group = Name)) +
        geom_line(size=.8) + 
        theme_bw()
```


## Expanded Dummy Coding

The user may wish to expand a `factor` into `j` dummy coded columns.  The `r_dummy` function expands a factor into `j` columns and works similar to the `r_series` function.  The user may wish to use the original factor name as the prefix to the `j` columns.  Setting `prefix = TRUE` within `r_dummy` accomplishes this.


```{r}
set.seed(10)
r_data_frame(n=100,
    id,
    age,
    r_dummy(sex, prefix = TRUE),
    r_dummy(political)
)
```


## Visualizing Column Types

It is helpful to see the column types and `NA`s as a visualization.  The `table_heat` (also the `plot` method assigned to `tbl_df` as well) can provide visual glimpse of data types and missing cells.

```{r, fig.height=7, fig.width=11}
set.seed(10)

r_data_frame(n=100,
    id,
    dob,
    animal,
    grade, grade,
    death,
    dummy,
    grade_letter,
    gender,
    paragraph,
    sentence
) %>%
   r_na() %>%
   plot(palette = "Set1")
```