nc: named capture
User-friendly functions for extracting a data table (row for each match, column for each group) from non-tabular text data using regular expressions, and for melting/reshaping columns that match a regular expression. Please read and cite my related R Journal papers, if you use this code!
- Comparing namedCapture with other R packages for regular expressions (2019).
- Wide-to-tall Data Reshaping Using Regular Expressions and the nc Package (2021).
fruit.vec <- c("granny smith apple", "blood orange and yellow banana")
fruit.pattern <- list(type=".*?", " ", fruit="orange|apple|banana")
nc::capture_first_vec(fruit.vec, fruit.pattern)
#> type fruit
#> 1: granny smith apple
#> 2: blood orange
nc::capture_all_str(fruit.vec, fruit.pattern)
#> type fruit
#> 1: granny smith apple
#> 2: blood orange
#> 3: and yellow banana
(one.iris <- iris[1,])
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
nc::capture_melt_single(one.iris, part=".*", "[.]", dim=".*")
#> Species part dim value
#> 1: setosa Sepal Length 5.1
#> 2: setosa Sepal Width 3.5
#> 3: setosa Petal Length 1.4
#> 4: setosa Petal Width 0.2
nc::capture_melt_multiple(one.iris, part=".*", "[.]", column=".*")
#> Species part Length Width
#> 1: setosa Petal 1.4 0.2
#> 2: setosa Sepal 5.1 3.5
nc::capture_melt_multiple(one.iris, column=".*", "[.]", dim=".*")
#> Species dim Petal Sepal
#> 1: setosa Length 1.4 5.1
#> 2: setosa Width 0.2 3.5
install.packages("nc")
## or:
if(!require(devtools))install.packages("devtools")
devtools::install_github("tdhock/nc")
Watch the screencast tutorial videos!
The main functions provided in nc are:
Subject | nc function | Similar to | And |
---|---|---|---|
Single string | capture_all_str | stringr::str_match_all | rex::re_matches |
Character vector | capture_first_vec | stringr::str_match | rex::re_matches |
Data frame chr cols | capture_first_df | tidyr::extract/separate_wider_regex | data.table::tstrsplit |
Data frame col names | capture_melt_single | tidyr::pivot_longer | data.table::melt |
Data frame col names | capture_melt_multiple | tidyr::pivot_longer | data.table::melt |
File paths | capture_first_glob | arrow::open_dataset | |
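As a minimal sketch of the "Data frame chr cols" row, the example below applies a different pattern to each character column with `nc::capture_first_df`. The table `subject.dt`, its values, and the output column names (`job`, `task`, `chrom`, `chromStart`, `chromEnd`) are made up for illustration; a conversion function placed after a named pattern is assumed to convert that group.

```r
library(data.table)

## Hypothetical input: two character columns, each with its own structure.
subject.dt <- data.table(
  JobID=c("13937810_25", "14022192_1"),
  position=c("chr10:213054000-213055000", "chrM:111000-222000"))

## Each argument name is a column to match, each value is a pattern (list).
match.dt <- nc::capture_first_df(
  subject.dt,
  JobID=list(
    job="[0-9]+", as.integer,
    "_",
    task="[0-9]+", as.integer),
  position=list(
    chrom="chr.*?",
    ":",
    chromStart="[0-9]+", as.integer,
    "-",
    chromEnd="[0-9]+", as.integer))
print(match.dt)
```

The result should keep the original columns and add one column per named group.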
- Vignette 0 provides an overview of the various functions.
- Vignette 1 discusses `capture_first_vec` and `capture_first_df`, which capture the first match in each of several subjects (character vector, data frame character columns).
- Vignette 2 discusses `capture_all_str`, which captures all matches in a single string, or a single multi-line text file. The vignette also shows how to use `capture_all_str` on several different strings/files, using data.table `by` syntax.
- Vignette 3 discusses `capture_melt_single` and `capture_melt_multiple`, which match a regex to the column names of a wide data frame, then melt/reshape the matching columns. These functions are especially useful when more than one separate piece of information can be captured from each column name, e.g. the iris column names `Petal.Width`, `Sepal.Width`, etc. each have two pieces of information (flower part and measurement dimension).
- Vignette 4 shows comparisons with related R packages.
- Vignette 5 explains how to use helper functions for creating complex regular expressions.
- Vignette 6 explains how to use different regex engines.
- Vignette 7 explains how to read regularly named files, and use a regex to extract meta-data from the file names, using `nc::capture_first_glob`.
By default, nc uses PCRE. Other options include ICU and RE2.
To tell nc which engine to use by default, set the nc.engine option:
options(nc.engine="RE2")
Every function also has an engine argument, e.g.
nc::capture_first_vec(
"foo a\U0001F60E# bar",
before=".*?",
emoji="\\p{EMOJI_Presentation}",
after=".*",
engine="ICU")
#> before emoji after
#> 1 foo a 😎 # bar
For a detailed comparison of regex C libraries in R (ICU, PCRE, TRE, RE2), see my R Journal (2019) paper about namedCapture.
The nc reshaping functions provide functionality similar to packages
tidyr, stats, data.table, reshape, reshape2, cdata, utils, etc. The
main difference is that the `nc::capture_melt_*` functions support
named capture regular expressions with type conversion, which (1)
makes it easier to create/maintain a complex regex, and (2) results in
less repetition in user code. For a detailed comparison, see my R
Journal (2021) paper about nc.
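To illustrate what "type conversion" means here, the sketch below melts a small made-up table whose column names end in a day number, and converts the captured text to integer during the reshape. The table `wide` and its column names are hypothetical, and the default output column name `value` is taken from the iris example above.

```r
library(data.table)

## Hypothetical wide table: one measurement column per day.
wide <- data.table(id=1:2, day1=c(0.5, 0.7), day2=c(1.5, 1.7))

## "day" is matched but not captured; the digits are captured in a group
## named day, then converted from character to integer by as.integer.
tall <- nc::capture_melt_single(
  wide,
  "day",
  day="[0-9]+", as.integer)
print(tall)
```

Without the conversion function, the `day` column would be character, and a separate conversion step would be needed after the reshape.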
Below I list the main
differences between the functions in nc
and other analogous R functions:
- Main `nc` functions all have the `capture_` prefix for easy auto-completion.
- Output in `nc` is always a data.table (other packages output either a list, character matrix, or data frame).
- For memory efficiency, `nc::capture_first_df` modifies the input if it is a data table, whereas `tidyr` functions always copy the input table.
- By default, `nc::capture_first_vec` stops with an error if any subjects do not match, whereas other functions return NA/missing rows.
- `nc::capture_all_str` only supports capturing multiple matches in a single subject (returning a data table), whereas other functions support multiple subjects (and return a list of character matrices). For handling multiple subjects using `nc`, use `DT[, nc::capture_all_str(subject), by]` (see vignette 2 for more info).
- `nc::capture_melt_single` and `nc::capture_melt_multiple` use regex for wide-to-tall data reshaping; see Vignette 3 and my R Journal (2021) paper for more info. Whereas in nc these are two separate functions, other packages typically provide a single function which does both kinds of reshaping, for example `measure` in `data.table`.
- `nc::capture_first_glob` is for reading any kind of regularly named files into R using regex, whereas `arrow::open_dataset` requires a particular naming scheme (does not support regex).
- Helper function `nc::measure` can be used to create the `measure.vars` argument of `data.table::melt`, and `nc::capture_longer_spec` can be used to create the `spec` argument of `tidyr::pivot_longer`. This can be useful if you want to use nc to define the regex, but you want to use the other package functions to do the reshape.
- Similar to `rex::capture`, helper function `nc::field` is provided for defining patterns that match subjects like variable=value, and create a column/group named variable (useful to avoid repeating variable names in regex code). See vignette 2 for more info.
- Similar to `rex::or`, `nc::alternatives_with_shared_groups` is provided for defining a pattern containing alternatives with shared groups. See vignette 5 for more info.
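A few of the differences listed above can be sketched in code. The example below uses made-up subjects (`subject.vec`, `log.dt`) and shows (1) the `nomatch.error=FALSE` argument, which is assumed to give missing rows instead of an error, and (2) the `by` idiom for applying `nc::capture_all_str` to several subjects, using `nc::field` so that the name `x` is written only once.

```r
library(data.table)

## 1. Non-matching subjects: ask for NA rows instead of the default error.
subject.vec <- c("chr1:100", "no match here")
na.dt <- nc::capture_first_vec(
  subject.vec,
  chrom="chr[0-9XY]+",
  ":",
  pos="[0-9]+", as.integer,
  nomatch.error=FALSE)
print(na.dt)

## 2. Several subjects with capture_all_str: one row per subject in a
## data table, then capture all matches within each by group.
log.dt <- data.table(
  file=c("a.log", "b.log"),
  text=c("x=1 x=2", "x=3"))

## nc::field("x", "=", ...) matches the literal text x= and then
## captures what follows in a group named x.
x.pattern <- nc::field("x", "=", "[0-9]+", as.integer)
by.dt <- log.dt[, nc::capture_all_str(text, x.pattern), by=file]
print(by.dt)
```

The second result should have one row per match, with the `file` column identifying which subject each match came from.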