Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(r): Improve vctr class integration #79

Merged
merged 15 commits into from
Dec 2, 2023

Conversation

paleolimbot
Copy link
Contributor

@paleolimbot paleolimbot commented Nov 27, 2023

Closes #78.

Works, but because of a limitation in nanoarrow the R package, it can only convert one chunk at a time:

library(arrow, warn.conflicts = FALSE)
library(geoarrow)

tmp <- tempfile()
curl::curl_download(
  "https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/ns-water-water_junc.arrow",
  tmp
)

table <- read_feather(tmp, col_select = c("geometry"), as_data_frame = FALSE)
(stream <- nanoarrow::as_nanoarrow_array_stream(table))
#> <nanoarrow_array_stream struct<geometry: geoarrow.multipoint{list<points: struct<x: double, y: double>>}>>
#>  $ get_schema:function ()  
#>  $ get_next  :function (schema = x$get_schema(), validate = TRUE)  
#>  $ release   :function ()
tibble::as_tibble(stream$get_next())
#> # A tibble: 65,536 × 1
#>    geometry                                     
#>    <grrw_vct>                                   
#>  1 <MULTIPOINT (301431.9676173 4818251.5775598)>
#>  2 <MULTIPOINT (283766.8647915 4818265.9772479)>
#>  3 <MULTIPOINT (305785.7673356 4818288.1786451)>
#>  4 <MULTIPOINT (301149.2665608 4818332.6785636)>
#>  5 <MULTIPOINT (305721.3693198 4818378.3766595)>
#>  6 <MULTIPOINT (301778.4666771 4818385.3765993)>
#>  7 <MULTIPOINT (284873.7639941 4818408.0773191)>
#>  8 <MULTIPOINT (305661.9683072 4818412.3776635)>
#>  9 <MULTIPOINT (300857.8665021 4818425.9776693)>
#> 10 <MULTIPOINT (291681.7660744 4818435.4774984)>
#> # ℹ 65,526 more rows

# exported in zzz.R
st_as_sfc.geoarrow_vctr <- function(x, ..., promote_multi = FALSE) {
sfc <- wk::wk_handle(x, wk::sfc_writer(promote_multi))
sf::st_set_crs(sfc, sf::st_crs(wk::wk_crs(x)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we prefer wk::wk_set_crs(sfc, wk::wk_crs(x)) here?

In theory, this change reduces geoarrow's knowledge of sf by deferring that responsibility to wk. In reality, it's the same code and still requires sf to be installed, so maybe it doesn't matter.

@anthonynorth
Copy link
Contributor

I tried debugging this:
#> Error in if (first_chunk_index == last_chunk_index) {: missing value where TRUE/FALSE needed

But didn't get very far... What I think is happening, is the geoarrow_vctr attributes aren't being preserved somewhere in nanoarrow::convert_array_stream(). We end up with a vector of indices, but attributes are all defaults.

@paleolimbot
Copy link
Contributor Author

Thank you! That is incredibly useful...after reading it I'm pretty sure I know exactly what is happening: when converting an array stream with more than one chunk, nanoarrow does a "preallocate + fill" thing. Unfortunately it doesn't allow propagating any changes to the attributes and I'm pretty sure this requires some non-trivial changes to nanoarrow's stream conversion (they are, however, changes I've been meaning to make for a while, so this is good incentive to make them happen!).

@paleolimbot paleolimbot marked this pull request as ready for review November 29, 2023 02:17
r/geoarrow/R/vctr.R Outdated Show resolved Hide resolved
@anthonynorth
Copy link
Contributor

anthonynorth commented Dec 1, 2023

Maybe for another PR.
Do you think it's worth putting the geoarrow_vctr attribute getters in internal utility functions? e.g.

get_chunks <- function(x) attr(x, "chunks", exact = TRUE)
get_schema <- function(x) attr(x, "schema", exact = TRUE)
get_offsets <- function(x) attr(x, "offsets", exact = TRUE)

geo <- geoarrow_schema_parse(schema)
vctr <- as_geoarrow_vctr(array)

if (geo$extension_name == "geoarrow.wkt") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a future pr, I think we could extract some of this into as_wkt, as_wkb, as_xy methods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then these could be skeletons. Or the reverse, and as_ methods are just calling these methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They almost certainly should be optimized! I'm pretty sure they just go through the handler right now (but at least they work!):

library(wk)
library(geoarrow)

as_xy(as_geoarrow_vctr("POINT (0 1)"))
#> <wk_xy[1]>
#> [1] (0 1)
as_wkt(as_geoarrow_vctr("POINT (0 1)"))
#> <wk_wkt[1]>
#> [1] POINT (0 1)
as_wkb(as_geoarrow_vctr("POINT (0 1)"))
#> <wk_wkb[1]>
#> [1] <POINT (0 1)>

Created on 2023-12-01 with reprex v2.0.2

@@ -105,3 +105,51 @@ test_that("as_geoarrow_array() for wk generates the correct metadata", {
sprintf('{"edges":"spherical"}')
)
})

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relates to #79 (comment) and #79 (comment), add some geoedesic and crs checks.

@anthonynorth
Copy link
Contributor

I haven't much input on the nanoarrow specifics as I'm not familiar with that api.

Two identical bugs in convert_array.wk_wkt & convert_array.wk_wkb methods (see comments). Otherwise I think it's go to go.

paleolimbot and others added 5 commits December 1, 2023 21:20
Co-authored-by: Anthony North <anthony.jl.north@gmail.com>
Co-authored-by: Anthony North <anthony.jl.north@gmail.com>
Co-authored-by: Anthony North <anthony.jl.north@gmail.com>
@paleolimbot
Copy link
Contributor Author

Thank you for the detailed look!

@paleolimbot paleolimbot merged commit e3d748d into geoarrow:main Dec 2, 2023
6 checks passed
@paleolimbot paleolimbot deleted the r-vctr branch December 4, 2023 01:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[R] processing nanoarrow streams with geoarrow column
2 participants