ARROW-13834: [R][Documentation] Document the process of creating R bindings for compute kernels and rationale behind conventions #11915

Closed
wants to merge 19 commits into from
Changes from 7 commits
2 changes: 2 additions & 0 deletions r/_pkgdown.yml
@@ -84,6 +84,8 @@ navbar:
href: articles/developers/install_details.html
- text: Docker
href: articles/developers/docker.html
- text: Writing Bindings
href: articles/developers/bindings.html
reference:
- title: Multi-file datasets
contents:
238 changes: 238 additions & 0 deletions r/vignettes/developers/bindings.Rmd
@@ -0,0 +1,238 @@
---
title: "Writing Bindings"
---

```{r, include=FALSE}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
```


When writing bindings between C++ compute functions and R functions, the aim is
to expose the C++ functionality via existing R functions. With some exceptions,
the syntax and functionality should match those of the existing R functions, so
that users can use existing tidyverse or base R syntax, or call existing S3
methods on objects, while taking advantage of the speed and functionality of
the underlying arrow package.

# Implementing bindings for S3 generics

If a function is an S3 generic, you may be able to define a method for it for
Arrow objects. Two base classes have been defined in the R package so that S3
methods don't have to be defined repeatedly for objects with similar behaviour:

* ArrowTabular - for RecordBatch and Table objects
* ArrowDatum - for Scalar, Array, and ChunkedArray objects

What this means is that any method defined for the base class also works for
its child classes. For example, the function `dim()` may be defined as:

```{r, eval = FALSE}
dim.ArrowTabular <- function(x) c(x$num_rows, x$num_columns)
```

This implements `dim()` for both RecordBatch and Table objects.

```{r}
arrow_table(x = c(1, 2, 3), y = c(4, 5, 6)) %>%
  dim()
```
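
To see why one method on the shared base class is enough, here is a minimal base R sketch of S3 dispatch. The classes `ToyTable`, `ToyRecordBatch`, and `ToyTabular` and the helper `new_toy()` are hypothetical stand-ins invented for this sketch. They are not part of arrow; only the dispatch mechanism is the point.

```r
# Hypothetical toy classes: each object lists its own class first,
# followed by the shared base class "ToyTabular".
new_toy <- function(subclass, num_rows, num_columns) {
  structure(
    list(num_rows = num_rows, num_columns = num_columns),
    class = c(subclass, "ToyTabular")
  )
}

# One method defined on the base class...
dim.ToyTabular <- function(x) c(x$num_rows, x$num_columns)

# ...is found by S3 dispatch for every child class.
toy_table <- new_toy("ToyTable", 3L, 2L)
toy_batch <- new_toy("ToyRecordBatch", 5L, 4L)

dim(toy_table)
dim(toy_batch)
```

Neither child class has a `dim()` method of its own, so dispatch falls through to the next entry in the class vector, which is the base class.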

# Implementing bindings to work within dplyr pipelines

One of the main ways in which users interact with arrow is via dplyr syntax called
on Arrow objects. For example, when a user calls `dplyr::mutate()` on an Arrow Tabular,
Dataset, or arrow data query object, the Arrow implementation of `mutate()` is
used, which under the hood translates the dplyr code into Arrow C++ code.

When using `dplyr::mutate()` or `dplyr::filter()`, you may want to use functions
from other packages. The example below uses `stringr::str_detect()`.

```{r}
library(dplyr)
library(stringr)
starwars %>%
  filter(str_detect(name, "Darth"))
```
This functionality has also been implemented in Arrow, e.g.:

```{r}
library(arrow)
arrow_table(starwars) %>%
  filter(str_detect(name, "Darth")) %>%
  collect()
```

This is possible as a **binding** has been created between the stringr function
Member

This bit has me questioning the term "binding"...whereas `str_detect()` and `match_substring_regex` is a 1:1 link, many "bindings" implement a few Arrow compute functions linked together. I'm not sure that either terminology, "we created an Arrow binding for `str_detect()`" or "we created an R binding for `match_substring_regex`", is correct.

Member Author

I love this point, yeah, I see what you mean; this could cause confusion. What about now that I've rephrased it?

This is possible as a binding has been created between the call to the
stringr function str_detect() and the Arrow C++ code, here as a direct mapping
to match_substring_regex. You can see this for yourself by inspecting the
arrow data query object without retrieving the results via collect().

Member

Sorry I missed this last week! I like how you've rephrased it.

`str_detect()` and the Arrow C++ function `match_substring_regex`. You can see
this for yourself by inspecting the arrow data query object without retrieving the
results via `collect()`.

```{r}
arrow_table(starwars) %>%
  filter(str_detect(name, "Darth"))
```

In the following sections, we'll walk through how to create a binding between an
R function and an Arrow C++ function.

## Walkthrough

Imagine you are writing the bindings for the C++ function
[`starts_with()`](https://arrow.apache.org/docs/cpp/compute.html#containment-tests)
and want to bind it to the (base) R function `startsWith()`.

First, take a look at the docs for both of those functions.

### Examining the R function

Here are the docs for R's `startsWith()` (also available at https://stat.ethz.ch/R-manual/R-devel/library/base/html/startsWith.html):

```{r, echo=FALSE, out.width="50%"}
knitr::include_graphics("./startswithdocs.png")
```

It takes two parameters: `x`, the input, and `prefix`, the characters to check
whether `x` starts with.
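
Before writing the binding, it can help to pin down the base R behaviour it has to reproduce. The short sketch below uses a made-up input vector to show that `prefix` is matched literally, that matching is case sensitive, and that missing values stay missing.

```r
x <- c("Apache", "Arrow", "R", "package", NA)

# `prefix` is a literal string, not a regular expression, and the
# comparison is case sensitive.
startsWith(x, "A")
#> [1]  TRUE  TRUE FALSE FALSE    NA

startsWith(x, "a")
#> [1] FALSE FALSE FALSE FALSE    NA
```

These are the kinds of cases worth covering in the unit tests for the binding.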

### Examining the C++ function

Now, go to
[the compute function documentation](https://arrow.apache.org/docs/cpp/compute.html#containment-tests)
and look for the Arrow C++ library's `starts_with()` function:

```{r, echo=FALSE, out.width="50%"}
knitr::include_graphics("./starts_with_docs.png")
```

The docs show that `starts_with()` is a unary function, which means that it takes a
single data input. The data input must be a string-like class, and the returned
value is boolean, both of which match up to R's `startsWith()`.

There is an options class associated with `starts_with()`, called
[`MatchSubstringOptions`](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute21MatchSubstringOptionsE),
so let's take a look at that.

```{r, echo=FALSE, out.width="50%"}
knitr::include_graphics("./matchsubstringoptions.png")
```

Options classes allow the user to control the behaviour of the function. In
this case, there are two possible options which can be supplied - `pattern` and
`ignore_case`, which are described in the docs shown above.
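
As a rough illustration of how the two options change the result for the same data input, the sketch below calls the compute function directly with `call_function()`. It assumes a build of the arrow R package in which `starts_with` is already linked to `MatchSubstringOptions`; the input values are made up.

```r
library(arrow, warn.conflicts = FALSE)

words <- Array$create(c("Apache", "arrow", "R", "package"))

# Only `pattern` is supplied, so matching is case sensitive.
case_sensitive <- call_function(
  "starts_with",
  words,
  options = list(pattern = "a")
)

# Supplying `ignore_case` changes how the same data input is matched.
case_insensitive <- call_function(
  "starts_with",
  words,
  options = list(pattern = "a", ignore_case = TRUE)
)

as.vector(case_sensitive)
as.vector(case_insensitive)
```

The data (`words`) is passed as an argument, while `pattern` and `ignore_case` travel separately in the `options` list.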

### Comparing the R and C++ functions

What conclusions can be drawn from what you've seen so far?

Base R's `startsWith()` and Arrow's `starts_with()` operate on equivalent data
types and return equivalent data types. As there are no options implemented in
R that Arrow doesn't have, this should be fairly simple to map without a great
deal of extra work.

As `starts_with()` has an options class associated with it, we'll need to make
sure that it's linked up with this in the R code.

In case you're wondering about the difference between arguments in R and options
in Arrow: in R, arguments to functions can include both the data to be
analysed and options governing how the function works, whereas in the
C++ compute functions, the arguments are the data to be analysed and the
options specify how exactly the function works.

So let's get started.

### Step 1 - add unit tests

Look up the R function that you want to bind the compute kernel to, and write a
set of unit tests that use a dplyr pipeline and `compare_dplyr_binding()` (and
perhaps even `compare_dplyr_error()` if necessary). These functions compare the
output of the original function with the dplyr bindings and make sure they match.

Make sure you're testing all parameters of the R function.

Below is a possible example test for `startsWith()`.

```{r, eval = FALSE}
test_that("startsWith", {
  df <- tibble(x = c("Foo", "bar", "baz", "qux"))

  compare_dplyr_binding(
    .input %>%
      filter(startsWith(x, "b")) %>%
      collect(),
    df
  )
})
```
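
Conceptually, the test checks that the same pipeline gives the same answer on a data frame and on an Arrow Table. The sketch below spells that comparison out by hand with `expect_equal()`. It is not how `compare_dplyr_binding()` is implemented, and it assumes an arrow build in which the `startsWith()` binding already exists.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(testthat)

df <- tibble(x = c("Foo", "bar", "baz", "qux"))

# Result from base R and dplyr on an ordinary data frame.
expected <- df %>%
  filter(startsWith(x, "b"))

# Result from the Arrow binding, pulled back into R with collect().
actual <- arrow_table(df) %>%
  filter(startsWith(x, "b")) %>%
  collect()

expect_equal(actual, expected)
```

Running a comparison like this interactively can be a quick way to debug a failing test before returning to the test helpers.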

### Step 2 - hook up the compute function with options class if necessary

If the C++ compute function can have options specified, make sure that the
function is linked with its options class in `make_compute_options()` in the
file `arrow/r/src/compute.cpp`. You can find out if a compute function requires
options by looking in the docs here: https://arrow.apache.org/docs/cpp/compute.html

In the case of `starts_with()`, it looks something like this:

```cpp
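  if (func_name == "starts_with") {
    using Options = arrow::compute::MatchSubstringOptions;
    // ignore_case has no counterpart in startsWith(), so default to false
    // unless a value was passed in from R.
    bool ignore_case = false;
    if (!Rf_isNull(options["ignore_case"])) {
      ignore_case = cpp11::as_cpp<bool>(options["ignore_case"]);
    }
    // pattern is passed straight through from the R options list.
    return std::make_shared<Options>(cpp11::as_cpp<std::string>(options["pattern"]),
                                     ignore_case);
  }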
```

You can usually copy and paste from a similar existing example. In this case,
as the option `ignore_case` doesn't map to any parameter of `startsWith()`, we
give it a default value of `false`, but use the supplied value if one has been
set. As the `pattern` option maps directly to `prefix` in `startsWith()`,
we can pass it straight through.

### Step 3 - see if direct mapping is appropriate
Compare the C++ function and R function. If they are simple functions with no
options, it might be possible to directly map between the C++ and R in
`unary_function_map`, in the case of compute functions that operate on single
columns of data, or `binary_function_map` for those which operate on 2 columns
of data.
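
A direct mapping is only suitable when the two functions agree without any extra arguments. As an illustration, the sketch below compares base R's `toupper()` with the compute function `utf8_upper`. This pairing is an assumption made for the sketch, not something taken from the text above. Both take a single input, need no options, and give matching results.

```r
library(arrow, warn.conflicts = FALSE)

x <- c("Apache", "Arrow")

# Base R function applied to an R vector.
r_result <- toupper(x)

# Compute function applied to the equivalent Arrow Array; no options needed.
arrow_result <- call_function("utf8_upper", Array$create(x))

r_result
as.vector(arrow_result)
```

When a check like this needs extra arguments or post-processing to make the two results agree, that is a sign a direct mapping won't be enough.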

As `startsWith()` requires options, direct mapping is not appropriate.

### Step 4 - if direct mapping is not possible, try a modified implementation
If the function cannot be mapped directly, some extra work may be needed to
ensure that calling the arrow version of the function gives the same result
as calling the R version of the function. In this case, the function will need
to be added to the `nse_funcs` list in `arrow/r/R/dplyr-functions.R`. Here is how
this might look for `startsWith()`:

```{r, eval = FALSE}
nse_funcs$startsWith <- function(x, prefix) {
  Expression$create(
    "starts_with",
    x,
    options = list(pattern = prefix)
  )
}
```

Hint: you can use `call_function()` to call a compute function directly from R.
This might be useful if you want to experiment with a compute function while
you're writing bindings for it, e.g.

```{r}
call_function(
  "starts_with",
  Array$create(c("Apache", "Arrow", "R", "package")),
  options = list(pattern = "A")
)
```

### Step 5 - run your tests

If they pass, you're done! Submit a PR. If you've modified the C++ code in the
R package (for example, when hooking up a binding to its options class), you
should make sure to run `arrow/r/lint.sh` to lint the code.
Binary file added r/vignettes/developers/matchsubstringoptions.png
Binary file added r/vignettes/developers/starts_with_docs.png
Binary file added r/vignettes/developers/startswithdocs.png