Named capture regular expressions for text parsing and data reshaping

tdhock/nc

nc: named capture

User-friendly functions for extracting a data table (row for each match, column for each group) from non-tabular text data using regular expressions, and for melting/reshaping columns that match a regular expression. Please read and cite my related R Journal papers, if you use this code!

Quick demo of matching functions

fruit.vec <- c("granny smith apple", "blood orange and yellow banana")
fruit.pattern <- list(type=".*?", " ", fruit="orange|apple|banana")
nc::capture_first_vec(fruit.vec, fruit.pattern)
#>            type  fruit
#> 1: granny smith  apple
#> 2:        blood orange
nc::capture_all_str(fruit.vec, fruit.pattern)
#>            type  fruit
#> 1: granny smith  apple
#> 2:        blood orange
#> 3:   and yellow banana

Quick demo of reshaping functions

(one.iris <- iris[1,])
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
nc::capture_melt_single(one.iris, part=".*", "[.]", dim=".*")
#>    Species   part    dim value
#> 1:  setosa  Sepal Length   5.1
#> 2:  setosa  Sepal  Width   3.5
#> 3:  setosa  Petal Length   1.4
#> 4:  setosa  Petal  Width   0.2
nc::capture_melt_multiple(one.iris, part=".*", "[.]", column=".*")
#>    Species   part Length Width
#> 1:  setosa  Petal    1.4   0.2
#> 2:  setosa  Sepal    5.1   3.5
nc::capture_melt_multiple(one.iris, column=".*", "[.]", dim=".*")
#>    Species    dim Petal Sepal
#> 1:  setosa Length   1.4   5.1
#> 2:  setosa  Width   0.2   3.5

Installation

install.packages("nc")
## or:
if(!require(devtools))install.packages("devtools")
devtools::install_github("tdhock/nc")

Usage overview

Watch the screencast tutorial videos!

The main functions provided in nc are:

| Subject | nc function | Similar to | And |
|---|---|---|---|
| Single string | capture_all_str | stringr::str_match_all | rex::re_matches |
| Character vector | capture_first_vec | stringr::str_match | rex::re_matches |
| Data frame chr cols | capture_first_df | tidyr::extract/separate_wider_regex | data.table::tstrsplit |
| Data frame col names | capture_melt_single | tidyr::pivot_longer | data.table::melt |
| Data frame col names | capture_melt_multiple | tidyr::pivot_longer | data.table::melt |
| File paths | capture_first_glob | arrow::open_dataset | |

  • Vignette 0 provides an overview of the various functions.
  • Vignette 1 discusses capture_first_vec and capture_first_df, which capture the first match in each of several subjects (character vector, data frame character columns).
  • Vignette 2 discusses capture_all_str which captures all matches in a single string, or a single multi-line text file. The vignette also shows how to use capture_all_str on several different strings/files, using data.table by syntax.
  • Vignette 3 discusses capture_melt_single and capture_melt_multiple, which match a regex to the column names of a wide data frame, then melt/reshape the matching columns. These functions are especially useful when more than one piece of information can be captured from each column name, e.g. the iris column names Petal.Width, Sepal.Width, etc. each have two pieces of information (flower part and measurement dimension).
  • Vignette 4 shows comparisons with related R packages.
  • Vignette 5 explains how to use helper functions for creating complex regular expressions.
  • Vignette 6 explains how to use different regex engines.
  • Vignette 7 explains how to read regularly named files, and use a regex to extract meta-data from the file names, using nc::capture_first_glob.
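
As a minimal sketch of capture_first_df (covered in Vignette 1), the code below applies a separate pattern to each of two character columns and converts every captured group to integer. The data frame, its column names, and the group names (job, task, hours, minutes, seconds) are made up for illustration; they are not taken from the package documentation.

```r
## Hypothetical job log with two character columns to parse.
subject.df <- data.frame(
  JobID=c("13937810_25", "14022192_1"),
  Elapsed=c("07:04:42", "07:04:49"),
  stringsAsFactors=FALSE)
## One argument per column to parse; each is a list of pattern pieces.
## Named pieces become new columns, and a function that follows a named
## piece is used to convert that group.
job.dt <- nc::capture_first_df(
  subject.df,
  JobID=list(
    job="[0-9]+", as.integer,
    "_",
    task="[0-9]+", as.integer),
  Elapsed=list(
    hours="[0-9]+", as.integer,
    ":",
    minutes="[0-9]+", as.integer,
    ":",
    seconds="[0-9]+", as.integer))
print(job.dt)
```

The result should contain the two original columns plus one integer column for each named group.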

Choice of regex engine

By default, nc uses PCRE. Other options include ICU and RE2.

To set the default engine used by all nc functions, set the nc.engine option:

options(nc.engine="RE2")

Every function also has an engine argument, e.g.

nc::capture_first_vec(
  "foo a\U0001F60E# bar",
  before=".*?",
  emoji="\\p{EMOJI_Presentation}",
  after=".*",
  engine="ICU")
#>   before emoji after
#> 1  foo a     😎 # bar

Related work

For a detailed comparison of regex C libraries in R (ICU, PCRE, TRE, RE2), see my R Journal (2019) paper about namedCapture.

The nc reshaping functions provide functionality similar to packages tidyr, stats, data.table, reshape, reshape2, cdata, utils, etc. The main difference is that nc::capture_melt_* support named capture regular expressions with type conversion, which (1) makes it easier to create/maintain a complex regex, and (2) results in less repetition in user code. For a detailed comparison, see my R Journal (2021) paper about nc.
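
To illustrate type conversion during reshaping, here is a small sketch. The wide.df data frame and its column names (id, day1, day2) are invented for this example. The function as.integer, placed right after the named group, converts the captured digits so that the resulting day column is integer rather than character.

```r
## Hypothetical wide data: one measurement column per day.
wide.df <- data.frame(
  id=c("a", "b"),
  day1=c(1.5, 2.5),
  day2=c(3.5, 4.5))
## "day" is an unnamed piece (matched but not captured); the named
## group day="[0-9]+" becomes an output column, converted by as.integer.
## Columns whose names do not match (here, id) are kept as id variables.
tall.dt <- nc::capture_melt_single(
  wide.df,
  "day",
  day="[0-9]+", as.integer)
print(tall.dt)
```

Because the regex and the conversion function are defined together, the column names are parsed and typed in one step, with no separate post-processing of the melted table.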

Below I list the main differences between the functions in nc and other analogous R functions:

  • Main nc functions all have the capture_ prefix for easy auto-completion.
  • Output in nc is always a data.table (other packages output either a list, character matrix, or data frame).
  • For memory efficiency, nc::capture_first_df modifies the input if it is a data table, whereas tidyr functions always copy the input table.
  • By default, nc::capture_first_vec stops with an error if any subjects do not match, whereas other functions return NA/missing rows.
  • nc::capture_all_str only supports capturing multiple matches in a single subject (returning a data table), whereas other functions support multiple subjects (and return a list of character matrices). For handling multiple subjects using nc, use DT[, nc::capture_all_str(subject), by] (see vignette 2 for more info).
  • nc::capture_melt_single and nc::capture_melt_multiple use regex for wide-to-tall data reshaping; see vignette 3 and my R Journal (2021) paper for more info. In nc these are two separate functions, whereas other packages typically provide a single function which does both kinds of reshaping, for example measure in data.table.
  • nc::capture_first_glob is for reading any kind of regularly named files into R using regex, whereas arrow::open_dataset requires a particular naming scheme (does not support regex).
  • Helper function nc::measure can be used to create the measure.vars argument of data.table::melt, and nc::capture_longer_spec can be used to create the spec argument of tidyr::pivot_longer. This can be useful if you want to use nc to define the regex, but you want to use the other package functions to do the reshape.
  • Similar to rex::capture, helper function nc::field is provided for defining patterns that match subjects like variable=value, and create a column/group named variable (useful to avoid repeating variable names in regex code). See vignette 2 for more info.
  • Similar to rex::or, nc::alternatives_with_shared_groups is provided for defining a pattern containing alternatives with shared groups. See vignette 5 for more info.
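
Two of the differences listed above can be sketched in a few lines. The subjects, the id/text column names, and the group names below are invented for illustration. The first call passes nomatch.error=FALSE so that a non-matching subject yields a row of missing values instead of an error; the second uses data.table by syntax to run capture_all_str once per subject.

```r
library(data.table)
## "banana" has no space followed by digits, so it does not match.
## With nomatch.error=FALSE its row is filled with missing values.
count.dt <- nc::capture_first_vec(
  c("apple 1", "banana"),
  fruit="[a-z]+",
  " ",
  count="[0-9]+", as.integer,
  nomatch.error=FALSE)
print(count.dt)
## Several subjects: capture_all_str handles one subject at a time,
## so by=id calls it once for each row of this hypothetical table.
subject.dt <- data.table(
  id=c("a", "b"),
  text=c("x=1 y=2", "x=3"))
match.dt <- subject.dt[, nc::capture_all_str(
  text,
  variable="[a-z]",
  "=",
  value="[0-9]+", as.integer),
  by=id]
print(match.dt)
```

The second result has one row per match, with the id column identifying which subject each match came from.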
