Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(r): Add bindings to IPC writer #608

Merged
merged 6 commits into from
Sep 17, 2024

Conversation

paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented Sep 15, 2024

This PR adds a basic level of support for IPC writing in the R package. This is basically a thin wrapper around ArrowIpcWriterWriteStream() and could be more feature-rich like the Python version (that allows schemas and batches to be written individually).

I also added a bit of code to handle interrupts (which should catch interrupts on read and write and wasn't handled before).

library(nanoarrow)
tf <- tempfile()

nycflights13::flights |> write_nanoarrow(tf)
(df <- tf |> read_nanoarrow() |> tibble::as_tibble())
#> # A tibble: 336,776 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      542            540         2      923            850
#>  4  2013     1     1      544            545        -1     1004           1022
#>  5  2013     1     1      554            600        -6      812            837
#>  6  2013     1     1      554            558        -4      740            728
#>  7  2013     1     1      555            600        -5      913            854
#>  8  2013     1     1      557            600        -3      709            723
#>  9  2013     1     1      557            600        -3      838            846
#> 10  2013     1     1      558            600        -2      753            745
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>
identical(df, nycflights13::flights)
#> [1] TRUE

Created on 2024-09-14 with reprex v2.1.1

@paleolimbot paleolimbot marked this pull request as ready for review September 15, 2024 04:20
Copy link
Member

@amoeba amoeba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Left some notes, feel free to ignore or open separate issues. I'll note that I couldn't do a thorough job reviewing the C parts of this.

r/R/ipc.R Show resolved Hide resolved
r/R/ipc.R Show resolved Hide resolved
@@ -107,6 +109,42 @@ read_nanoarrow.connection <- function(x, ..., lazy = FALSE) {
check_stream_if_requested(reader, lazy)
}

#' @rdname read_nanoarrow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This puts write_nanoarrow on read_nanoarrow's page which either needs to be updated or maybe write_nanoarrow just needs its own help page. I'm happy to contribute the latter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a basic level of inclusion on the read page but feel free to contribute something better!

r/src/ipc.c Outdated
static SEXP call_writebin(void* hdata) {
struct ConnectionInputStreamHandler* data = (struct ConnectionInputStreamHandler*)hdata;

// Write 16MB chunks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where does the 16MB value come from?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a note about this...it's somewhere between a big value (to reduce the number of R calls) and a small value (to minimize the copying/ensure there's an R call every once in a while to catch an interrupt).

r/src/ipc.c Outdated

int code = ArrowIpcWriterInit(writer, output_stream);
if (code != NANOARROW_OK) {
Rf_error("ArrowIpcWriterInit() failed");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could a more useful error message be produced here? It looks like code is an errno-style error so it seems like it'd be possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the errno value! Pretty much all that can happen here is a failure to allocate (rare, since it's a few bytes). I never quite got to adding a macro like ERROR_NOT_OK() to help with this.

@paleolimbot paleolimbot merged commit 396d851 into apache:main Sep 17, 2024
12 checks passed
@paleolimbot paleolimbot deleted the r-ipc-writer branch September 17, 2024 15:33
@paleolimbot paleolimbot added this to the nanoarrow 0.6.0 milestone Sep 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants