---
title: "Scan Your Data"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(pointblank)
library(tidyverse)
library(palmerpenguins)
library(safetyData)
```
## Intro
Sometimes you know nothing about a new dataset. The **pointblank** package is here to help with its `scan_data()` function. It's simple to use and it gives you a lot of information on a data table: the function generates an HTML report that scours the input table data.
In the same spirit, generating validation steps can be laborious and difficult at first. There's a function available to kickstart that process: `draft_validation()`. It generates a new `.R` file with a suggested validation plan that's meant to work as-is and can be tweaked afterward.
### Performing table scans with `scan_data()`
The `scan_data()` function is available for providing an interactive overview of a tabular dataset. The reporting output contains several sections to make everything more digestible, and these are:
- `Overview`: Shows table dimensions, duplicate row counts, column types, and reproducibility information
- `Variables`: Provides a summary for each table variable and further statistics and summaries depending on the variable type
- `Interactions`: Displays a matrix plot that describes the interactions between variables
- `Correlations`: This is a set of correlation matrix plots for numerical variables
- `Missing Values`: A summary figure that shows the degree of missingness across variables
- `Sample`: A table that provides the head and tail rows of the dataset
The output HTML report will appear in the RStudio Viewer and can also be integrated in R Markdown or Quarto HTML output. Here’s an example that uses the `penguins_raw` dataset from the **palmerpenguins** package.
```{r}
scan_data(tbl = palmerpenguins::penguins_raw, navbar = FALSE)
```
As can be seen, the first two sections have a lot of additional information tucked behind detail views (with the `Toggle details` buttons) and within tab sets. Should this amount of information be a little overwhelming, there is the option to disable one or more sections. With `scan_data()`'s `sections` argument, you can specify just the sections that are needed for a specific scan.
The default value for `sections` is the string `"OVICMS"` and each letter of that stands for the following sections in their default order:
- `"O"`: `"overview"`
- `"V"`: `"variables"`
- `"I"`: `"interactions"`
- `"C"`: `"correlations"`
- `"M"`: `"missing"`
- `"S"`: `"sample"`
This string can contain fewer key characters, and the order can be changed to suit the desired layout of the report. For example, if you just need the Overview, a Sample, and the description of Variables in the target table, the string to use for `sections` would be `"OSV"`.
The `tbl` supplied could be a data frame, tibble, a `tbl_dbi` object, or a `tbl_spark` object. Here are a few more datasets that could be scanned, this time using `sections = "OSV"`:
```{r eval=FALSE}
scan_data(tbl = safetyData::adam_adae, sections = "OSV")
```
```{r eval=FALSE}
scan_data(tbl = safetyData::adam_advs, sections = "OSV")
```
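Since `scan_data()` also accepts `tbl_dbi` and `tbl_spark` objects, a database table can be scanned directly. Here's a minimal sketch, assuming the **DBI** and **RSQLite** packages are available, that copies `dplyr::storms` into an in-memory SQLite database and scans it from there:
```{r eval=FALSE}
# Hypothetical example: scan a database table (a `tbl_dbi` object) rather than a local data frame
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "storms", dplyr::storms)

scan_data(tbl = dplyr::tbl(con, "storms"), sections = "OS")

DBI::dbDisconnect(con)
```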
The reporting generated by `scan_data()` can be presented in a number of spoken languages: English (`"en"`, the default), French (`"fr"`), German (`"de"`), Italian (`"it"`), Spanish (`"es"`), Portuguese (`"pt"`), Turkish (`"tr"`), Chinese (`"zh"`), Russian (`"ru"`), Polish (`"pl"`), Danish (`"da"`), Swedish (`"sv"`), and Dutch (`"nl"`). Any of these two-letter language codes can be supplied to the `lang` argument.
Here's an example that scans **dplyr**'s `starwars` dataset and creates the report in Danish.
```{r}
scan_data(tbl = dplyr::starwars, sections = "OVS", lang = "da")
```
It's possible to export this reporting to a self-contained HTML file. To do so, use the `export_report()` function (this also works for every other type of reporting you'll see in the Viewer).
```{r eval=FALSE}
# Use `scan_data()` and assign reporting to `tbl_scan`
tbl_scan <- scan_data(tbl = dplyr::storms, sections = "OVS")

# Write the `ptblank_tbl_scan` object to an HTML file
export_report(
  tbl_scan,
  filename = "tbl_scan-storms.html"
)
```
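If you'd like to verify the exported file right away, it can be opened in the default browser from R. This is just a small convenience sketch, assuming the file was written to the current working directory:
```{r eval=FALSE}
# Open the exported HTML report in the default browser (base R)
browseURL("tbl_scan-storms.html")
```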
### Drafting a nice, new validation plan with `draft_validation()`
We can generate a draft validation plan in a new `.R` or `.Rmd` file using an input data table (just like with `scan_data()`). With `draft_validation()` the data table will be scanned to learn about its column data and a set of starter validation steps (constituting a validation plan) will be written.
Let's draft a validation plan for the `dplyr::storms` dataset. Here's a quick look at that table:
```{r paged.print=FALSE}
dplyr::storms
```
Here's how we generate the new `.R` file:
```{r eval=FALSE}
draft_validation(
  tbl = ~dplyr::storms, # This `~` makes it an expression for getting the data
  tbl_name = "storms",
  filename = "storms-validation"
)
```
Check out the new file called `"storms-validation.R"`! It's ready to run: all of the validation steps pass without failing test units, and the process (thanks to column inference routines) knows what to do with certain types of columns (like the latitude and longitude ones).
Once inside the file, you can tweak the validation steps to better fit your expectations for the particular domain. It's best to draft against a data extract that contains a good number of rows and is relatively free of spurious data.
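To give a sense of what the drafted plan looks like, here's a rough, abbreviated sketch of the kind of code `draft_validation()` writes out (the actual generated file will differ, since its steps and thresholds are inferred from the data; the threshold values below are purely illustrative):
```{r eval=FALSE}
library(pointblank)

agent <-
  create_agent(
    tbl = ~dplyr::storms,
    tbl_name = "storms",
    actions = action_levels(warn_at = 0.05, stop_at = 0.10) # illustrative thresholds
  ) %>%
  # Column-type checks inferred from the table
  col_is_character(columns = vars(name)) %>%
  col_is_numeric(columns = vars(lat, long)) %>%
  # Range checks inferred for the latitude and longitude columns
  col_vals_between(columns = vars(lat), left = -90, right = 90) %>%
  col_vals_between(columns = vars(long), left = -180, right = 180) %>%
  # Completeness checks for key columns
  col_vals_not_null(columns = vars(name, year)) %>%
  interrogate()

agent
```
Running the file produces an agent report in the Viewer, and any of these steps can be edited, removed, or supplemented before re-running `interrogate()`.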
------
### SUMMARY
1. It's a great idea to examine data you're unfamiliar with using `scan_data()`!
2. The `draft_validation()` function can give you a super-quickstart for data validation (it scans your data, but in a different way).