Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(r): Optimize conversion from sfc to ArrowArray #76

Merged
merged 10 commits into from
Nov 23, 2023

Conversation

paleolimbot
Copy link
Contributor

@paleolimbot paleolimbot commented Nov 19, 2023

This PR implements an optimized path for sf objects to ArrowArray when we know in advance all elements of the sfc are the same type and dimension (very common). This is ~2.3-3x faster than the visitor-based approach and results in very few datasets that are large enough for anybody to notice (~100ms). This version can also drop dimensions (e.g., for when you send a 3D geometry to S2 where the Z values will be ignored anyway).

I spent a reasonable amount of time attempting a two-pass conversion (one pass to count elements to perfectly preallocate, one pass to fill the buffers). This is slower than overallocating because (1) we have a pretty good idea of how many elements we have in advance (most multilinestrings have at least one element and at least one coordinate) and (2) the cost of reallocating is on par with the cost of calling the R API for every element in the sfc twice. The version here is more readable, too (e.g., can use GeoArrowBuilderOffsetAppend() instead of builder->view.buffers[i + 1].data.as_int32...).

library(sf)
#> Linking to GEOS 3.12.0, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE

# https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/ns-water-water_line.fgb.zip
multilinestrings <- read_sf("~/Downloads/ns-water-water_line.fgb")$geometry

bench::mark(
  geoarrow:::as_geoarrow_array.default(multilinestrings),
  geoarrow:::as_geoarrow_array.sfc(multilinestrings),
  iterations = 10,
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 geoarrow:::as_geoarrow_array.def… 264.5ms 283.3ms      3.56    2.87MB        0
#> 2 geoarrow:::as_geoarrow_array.sfc…  92.5ms  99.4ms      6.47   74.44KB        0

linestrings <- st_cast(multilinestrings, "LINESTRING")
bench::mark(
  geoarrow:::as_geoarrow_array.default(linestrings),
  geoarrow:::as_geoarrow_array.sfc(linestrings),
  iterations = 10,
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 geoarrow:::as_geoarrow_array.defaul… 294ms  308ms      3.28   302.1MB        0
#> 2 geoarrow:::as_geoarrow_array.sfc(li… 134ms  137ms      5.91    59.3KB        0


points <- wk::wk_vertices(linestrings)
bench::mark(
  geoarrow:::as_geoarrow_array.default(points),
  geoarrow:::as_geoarrow_array.sfc(points),
  iterations = 10,
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 geoarrow:::as_geoarrow_array.d…    1.42s    1.45s     0.680    59.1KB        0
#> 2 geoarrow:::as_geoarrow_array.s… 236.42ms 965.97ms     1.18     46.5KB        0

In all cases this is faster than generating WKB, although that might just be limitations of R or Rcpp that could be optimized in sf:

library(sf)
#> Linking to GEOS 3.12.0, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE

# https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/ns-water-water_line.fgb.zip
multilinestrings <- read_sf("~/Downloads/ns-water-water_line.fgb")$geometry

bench::mark(
  geoarrow:::as_geoarrow_array.default(multilinestrings),
  geoarrow:::as_geoarrow_array.sfc(multilinestrings),
  geoarrow:::as_geoarrow_array.default(multilinestrings, schema = geoarrow::na_extension_wkb()),
  sf:::CPL_write_wkb(multilinestrings),
  iterations = 10,
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 geoarrow:::as_geoarrow_array.d… 262.63ms 283.18ms     3.57     2.87MB    0    
#> 2 geoarrow:::as_geoarrow_array.s…  97.56ms 167.43ms     5.26    74.44KB    0    
#> 3 geoarrow:::as_geoarrow_array.d… 293.47ms  302.4ms     3.30     6.56KB    0    
#> 4 sf:::CPL_write_wkb(multilinest…    1.45s    1.64s     0.577  378.59MB    0.924

Created on 2023-11-20 with reprex v2.0.2

return(NextMethod())
}

if (class(x)[1] %in% c("sfc_POINT",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use wk::wk_vector_meta() here?

Suggested change
if (class(x)[1] %in% c("sfc_POINT",
# or use the labels instead
if (wk::wk_vector_meta(x)$geometry_type %in% 1:6) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call!


# Let the default method handle M values (the optimized path doesn't
# handle mixed XYZ/XYZM/XYM but can deal with mixed XY and XYZ)
if (!is.null(attr(x, "m_range"))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use wk::wk_vector_meta() here?

Suggested change
if (!is.null(attr(x, "m_range"))) {
if (wk::wk_vector_meta(x)$has_m) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call!

@paleolimbot paleolimbot marked this pull request as ready for review November 23, 2023 01:08
@paleolimbot paleolimbot merged commit f7c3aa5 into geoarrow:main Nov 23, 2023
6 checks passed
@paleolimbot paleolimbot deleted the r-sf branch November 23, 2023 01:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants