Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Establish syntax for defining a translation/binding #194

Open
Tracked by #182
paleolimbot opened this issue Aug 31, 2022 · 1 comment
Open
Tracked by #182

Establish syntax for defining a translation/binding #194

paleolimbot opened this issue Aug 31, 2022 · 1 comment
Assignees

Comments

@paleolimbot
Copy link
Contributor

For #182, we are going to need to define mappings between R syntax and functions that are part of the Substrait spec. The functions we have now are defined in an ad-hoc way to make the tests pass, but the current syntax is not well-suited to defining a lot of these.

The Arrow package uses syntax like:

register_binding("base::as.logical", function(x) {
  build_expr("some_fun", all, the, args)
})

...we should maximize compatibility with that syntax because we'll need to copy those bindings at some point. On the flip side, we now have (will shortly have) the ability to use bindings within functions so that we can do stuff like:

register_binding_that_contains_bindings("some_binding", function(x) {
  nchar(x) + 1L
})

With the ability to define bindings/translations, we also need the ability to do the translation. Right now our evaluation strategy is a bit complicated and that could be revisited here.

@paleolimbot
Copy link
Contributor Author

I had to look into this a little in #181 because the substrait.Expression.ScalarFunction definition changed a little.

In Arrow, we define alternative versions of functions in an environment and use that environment to "mask" calls to functions we support. A simplified version is here:

library(rlang)

substrait_funcs <- new.env(parent = emptyenv())
substrait_funcs$nchar <- function(x) {
  # something returning a substrait.Expression
  message("The special function is getting called")
  nchar(x)
}

current_columns <- list(col1 = "some value")

# without our function redefines
eval_tidy(quo(nchar(col1)), data = current_columns)
#> [1] 10

# with our function redefines
eval_mask <- c(current_columns, as.list(substrait_funcs))
eval_tidy(quo(nchar(col1)), data = eval_mask)
#> The special function is getting called
#> [1] 10

Currently in Substrait, we do a custom evaluation strategy where we walk the syntax tree ourselves whilst carefully replacing function calls. This is much more complicated and more error-prone on our end, and I'd like to move to something more like Arrow does. Every time I've tried refactoring in this direction I run into some problems. I think these are solvable but at the time I didn't have a good idea of the big picture. With the new function work that @thisisnic is working on, I think there's a better way.

Part of what the custom evaluation thing does is keep track of which functions are actually used. The substrait.Plan contains a manifest of all the kernels (i.e., function name + input argument combinations) that are used in the plan (as opposed to which functions are available), so we need some way of keeping track of that. Instead of keeping track as we go, we can either use an active binding or post-process the substrait.Expression to make the list smaller before we send the Plan to the consumer.

Another part of what custom evaluation does is enable type resolution. When function arguments are evaluated using regular R evaluation, sometiemes we get an object (notably, the field reference) whose type can't be resolved unless we have access to a SubstraitCompiler. We can provide access to the compiler via some global variable (e.g., expose substrait::current_compiler() or something), although I think the reason that we needed access to the types was to be able to calculate the output type, which is a property of the expression that we have to return from our special translation of the function. Another thing we could do is leave this blank and walk the expression after evaluation and fill in the output types.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant