Substrait is a cross-language specification for data compute operations. It defines a standard way to express compute operations, that is, what should be done to data.
The Substrait R package provides an R interface to Substrait, allowing users to construct a Substrait plan from R for evaluation by a Substrait consumer, such as Arrow or DuckDB. This is an experimental package that is under heavy development!
You can install the development version of substrait from GitHub with:
# install.packages("remotes")
remotes::install_github("voltrondata/substrait-r")
The following example constructs a basic Substrait plan and evaluates it using the DuckDB Substrait consumer:
library(substrait)
library(dplyr)
mtcars %>%
  duckdb_substrait_compiler() %>%
  mutate(mpg_plus_one = mpg + 1) %>%
  select(mpg, wt, mpg_plus_one) %>%
  collect()
#> # A tibble: 32 × 3
#> mpg wt mpg_plus_one
#> <dbl> <dbl> <dbl>
#> 1 21 2.62 22
#> 2 21 2.88 22
#> 3 22.8 2.32 23.8
#> 4 21.4 3.22 22.4
#> 5 18.7 3.44 19.7
#> 6 18.1 3.46 19.1
#> 7 14.3 3.57 15.3
#> 8 24.4 3.19 25.4
#> 9 22.8 3.15 23.8
#> 10 19.2 3.44 20.2
#> # … with 22 more rows
You can inspect the plan that will be generated by saving the result and
calling $plan():
compiler <- data.frame(col1 = 1L) %>%
  duckdb_substrait_compiler() %>%
  mutate(mpg_plus_one = col1 + 1)
compiler$plan()
#> message of type 'substrait.Plan' with 3 fields set
#> extension_uris {
#> extension_uri_anchor: 1
#> }
#> extensions {
#> extension_function {
#> extension_uri_reference: 1
#> function_anchor: 2
#> name: "+"
#> }
#> }
#> relations {
#> root {
#> input {
#> project {
#> common {
#> emit {
#> output_mapping: 1
#> output_mapping: 2
#> }
#> }
#> input {
#> read {
#> base_schema {
#> names: "col1"
#> struct_ {
#> types {
#> i32 {
#> nullability: NULLABILITY_NULLABLE
#> }
#> }
#> }
#> }
#> named_table {
#> names: "named_table_1"
#> }
#> }
#> }
#> expressions {
#> selection {
#> direct_reference {
#> struct_field {
#> }
#> }
#> root_reference {
#> }
#> }
#> }
#> expressions {
#> scalar_function {
#> function_reference: 2
#> output_type {
#> i32 {
#> nullability: NULLABILITY_NULLABLE
#> }
#> }
#> arguments {
#> value {
#> selection {
#> direct_reference {
#> struct_field {
#> }
#> }
#> root_reference {
#> }
#> }
#> }
#> }
#> arguments {
#> value {
#> literal {
#> fp64: 1
#> }
#> }
#> }
#> }
#> }
#> }
#> }
#> names: "col1"
#> names: "mpg_plus_one"
#> }
#> }
You can also construct a Substrait plan and evaluate it using the Acero
Substrait consumer. To use the Acero Substrait consumer you will need a
special build of the arrow package with Arrow configured using
-DARROW_SUBSTRAIT=ON.
library(substrait)
library(dplyr)
mtcars %>%
  arrow_substrait_compiler() %>%
  mutate(mpg_plus_one = mpg + 1) %>%
  select(mpg, wt, mpg_plus_one) %>%
  collect()
#> # A tibble: 32 × 3
#> mpg wt mpg_plus_one
#> <dbl> <dbl> <dbl>
#> 1 21 2.62 22
#> 2 21 2.88 22
#> 3 22.8 2.32 23.8
#> 4 21.4 3.22 22.4
#> 5 18.7 3.44 19.7
#> 6 18.1 3.46 19.1
#> 7 14.3 3.57 15.3
#> 8 24.4 3.19 25.4
#> 9 22.8 3.15 23.8
#> 10 19.2 3.44 20.2
#> # … with 22 more rows
You can create Substrait proto objects using the substrait base object
or using substrait_create():
substrait$Type$Boolean$create()
#> message of type 'substrait.Type.Boolean' with 0 fields set
substrait_create("substrait.Type.Boolean")
#> message of type 'substrait.Type.Boolean' with 0 fields set
You can convert an R object to a Substrait object using
as_substrait(object, type):
(msg <- as_substrait(4L, "substrait.Expression"))
#> message of type 'substrait.Expression' with 1 field set
#> literal {
#> i32: 4
#> }
The type can be either a string of the qualified name or an object
(which is needed to communicate certain types like
"substrait.Expression.Literal.Decimal", which has a precision and scale
in addition to the value).
Restore an R object from a Substrait object using
from_substrait(message, prototype):
from_substrait(msg, integer())
#> [1] 4
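The two conversions above can be combined into a round trip. The following is a minimal sketch that reuses only the calls already shown in this section; the variable names original and restored are illustrative, and the comparison at the end is a check you can run yourself rather than a documented guarantee.

```r
library(substrait)

# Convert an R integer to a Substrait expression (stored as a literal)
original <- 4L
msg <- as_substrait(original, "substrait.Expression")

# Restore it, passing integer() as the prototype for the result
restored <- from_substrait(msg, integer())

# Compare the restored value with the original
identical(original, restored)
```

The prototype argument tells from_substrait() what kind of R object to return, so it should match the type of the value that was converted.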
Substrait objects are list-like (i.e., methods are defined for [[ and
$), so you can get or set fields. Note that unset is different from NULL
(just like an R list).
msg$literal <- substrait$Expression$Literal$create(i32 = 5L)
msg
#> message of type 'substrait.Expression' with 1 field set
#> literal {
#> i32: 5
#> }
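Fields can be read back with the same accessors. The sketch below assumes that $ and [[ return the same value when getting a field, which is what the list-like description suggests; the names lit_dollar and lit_bracket are illustrative.

```r
library(substrait)

# Build the same message as in the example above
msg <- as_substrait(4L, "substrait.Expression")
msg$literal <- substrait$Expression$Literal$create(i32 = 5L)

# Read the field with `$` ...
lit_dollar <- msg$literal

# ... or with `[[` and the field name as a string
lit_bracket <- msg[["literal"]]

# Both accessors are expected to yield the same nested message
identical(lit_dollar, lit_bracket)
```

The [[ form is convenient when the field name is held in a variable, as with an ordinary R list.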
The constructors are currently implemented using about 1000 lines of
auto-generated code made by inspecting the nanopb-compiled .proto files
and RProtoBuf. This is probably not the best final approach but allows
us to get started writing good as_substrait() and from_substrait()
methods for various types of R objects.