
Experimental support for GDAL 3.6 columnar API #2036

Merged: 22 commits into r-spatial:main on Oct 1, 2023

Conversation

paleolimbot
Contributor

This PR is a proof of concept for future support of the new columnar access API introduced in GDAL 3.6. The API exposes a pull-style iterator as an ArrowArrayStream. The pyogrio package has a PR up to support this and has seen a ~2x improvement on its test data set ( geopandas/pyogrio#155 ). I've spent some time implementing conversions for Arrow C data interface objects ( apache/arrow-nanoarrow#65 ) and am curious to see whether we can get any of that speed in R too!
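For intuition, here's a minimal sketch of the pull-style iteration using nanoarrow alone (no GDAL required); the two data-frame batches stand in for what GDAL would produce:

library(nanoarrow)

# Build a two-batch stream as a stand-in for GDAL's output
stream <- basic_array_stream(
  list(data.frame(x = 1:3), data.frame(x = 4:6))
)

# Pull batches one at a time until the stream is exhausted
while (!is.null(batch <- stream$get_next())) {
  print(as.data.frame(batch))
}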

To give it a try:

# (Requires development GDAL!)
# remotes::install_github("apache/arrow-nanoarrow/r#65")
# remotes::install_github("paleolimbot/sf@stream-reading")
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.7.0dev-908498a4d8, PROJ 6.3.1; sf_use_s2() is
#> TRUE
read_sf(system.file("shape/nc.shp", package = "sf"), use_stream = TRUE)
#> Simple feature collection with 100 features and 15 fields
#> Geometry type: GEOMETRY
#> Dimension:     XY
#> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS:  NAD27
#> # A tibble: 100 × 16
#>    OGC_FID  AREA PERIMETER CNTY_ CNTY_ID NAME   FIPS  FIPSNO CRESS…¹ BIR74 SID74
#>      <dbl> <dbl>     <dbl> <dbl>   <dbl> <chr>  <chr>  <dbl>   <int> <dbl> <dbl>
#>  1       0 0.114      1.44  1825    1825 Ashe   37009  37009       5  1091     1
#>  2       1 0.061      1.23  1827    1827 Alleg… 37005  37005       3   487     0
#>  3       2 0.143      1.63  1828    1828 Surry  37171  37171      86  3188     5
#>  4       3 0.07       2.97  1831    1831 Curri… 37053  37053      27   508     1
#>  5       4 0.153      2.21  1832    1832 North… 37131  37131      66  1421     9
#>  6       5 0.097      1.67  1833    1833 Hertf… 37091  37091      46  1452     7
#>  7       6 0.062      1.55  1834    1834 Camden 37029  37029      15   286     0
#>  8       7 0.091      1.28  1835    1835 Gates  37073  37073      37   420     0
#>  9       8 0.118      1.42  1836    1836 Warren 37185  37185      93   968     4
#> 10       9 0.124      1.43  1837    1837 Stokes 37169  37169      85  1612     1
#> # … with 90 more rows, 5 more variables: NWBIR74 <dbl>, BIR79 <dbl>,
#> #   SID79 <dbl>, NWBIR79 <dbl>, wkb_geometry <GEOMETRY [°]>, and abbreviated
#> #   variable name ¹CRESS_ID
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Many options aren't supported yet, but I'm mostly trying to find out whether or not the nanoarrow conversions are going to be helpful in this context. I will follow up here with more experiments!

@rsbivand
Member

Thanks, interesting! Is it feasible to use libgdal without needing nanoarrow? I think https://gdal.org/development/rfc/rfc86_column_oriented_api.html implements this locally, without requiring access to Arrow itself?

@paleolimbot
Contributor Author

You will need either arrow or nanoarrow to get R objects...nanoarrow is brand new/in development and will be zero-dependency for exactly this use case! GDAL itself includes conversions to NumPy arrays for its Python bindings that are similar in spirit (and in dependency footprint) to the converters I'm writing in apache/arrow-nanoarrow#65 .
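To make the division of labour concrete: GDAL fills an ArrowArrayStream, and arrow/nanoarrow turns it into R vectors. A stand-in sketch of that final step, using nanoarrow alone (the batch here takes the place of one GDAL would hand over):

library(nanoarrow)

# A batch standing in for one that GDAL would produce
batch <- data.frame(FID = 0:1, NAME = c("Ashe", "Alleghany"))

# Converting the whole stream to a data frame is a single call
stream <- basic_array_stream(list(batch))
as.data.frame(stream)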

src/stars.cpp (outdated)
@@ -509,7 +509,7 @@ void CPL_write_gdal(NumericMatrix x, CharacterVector fname, CharacterVector driv
 		eType = GDT_Byte; // #nocov
 #if GDAL_VERSION_NUM >= 3070000
 	else if (Type[0] == "Int8")
-		eType = GDT_Int8; // #nocov
+		eType = GDT_Byte; // #nocov
Member

This doesn't look good IMO, see #2033 - do you actually run into this code? That would mean the #if condition is wrong.

Contributor Author

I'm sure it's wrong...but yes, this failed to compile for me with GDT_Int8.

Contributor Author

I fixed this - it's because I was running from the latest dev branch, which has moved on to 3.7.xx versioning because there's a release candidate.
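If it helps anyone reproducing this, sf_extSoftVersion() (standard sf) shows which GDAL sf was built against, which makes the dev versioning visible; the version string below is the one from my session above:

library(sf)
sf_extSoftVersion()[["GDAL"]]
#> [1] "3.7.0dev-908498a4d8"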

Contributor

The GDT_Int8 implementation has now landed in GDAL master.

@rsbivand
Member

Is @edzer's point related to the WIP RFC 87 (rspatial/terra#885 (comment))?

@paleolimbot
Contributor Author

It may also be related to my specific installation, where there may be some old headers sitting in the include path!

@edzer
Member

edzer commented Nov 10, 2022

We could also comment out the whole Int8 section and wait until that lands in GDAL.

@paleolimbot
Contributor Author

I'm sure one or more of the missing features will eat into this considerably, but in an initial test, reading ~400,000 features is about 4 times faster. Some of that might have to do with the use of st_as_sf(wk::as_wkb()), but probably nothing that can't be replicated with sf tooling - my recollection is that sf's WKB parser is slightly faster anyway (see the WKB sketch after the timings below).

library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.7.0dev-908498a4d8, PROJ 6.3.1; sf_use_s2() is
#> TRUE
# curl::curl_download(
#   "https://github.com/paleolimbot/geoarrow-data/releases/download/v0.0.1/nshn_water_line.gpkg",
#   "nshn_water_line.gpkg"
# )

system.time(
  tbl1 <- read_sf("nshn_water_line.gpkg", use_stream = FALSE)
)
#>    user  system elapsed 
#>  20.264   1.355  21.619 

system.time(
  tbl2 <- read_sf("nshn_water_line.gpkg", use_stream = TRUE)
)
#>    user  system elapsed 
#>   5.960   0.568   5.421 
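For reference, the WKB hand-off mentioned above can be sketched with sf alone: st_as_binary() produces the list of raw WKB blobs and st_as_sfc() parses them back. This stand-in skips GDAL entirely:

library(sf)

# Two points standing in for geometry blobs coming off the stream
pts <- st_sfc(st_point(c(0, 1)), st_point(c(2, 3)), crs = 4326)

wkb <- st_as_binary(pts)     # class "WKB": a list of raw vectors
st_as_sfc(wkb, crs = 4326)   # sf's WKB parser rebuilds the sfc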

@paleolimbot
Contributor Author

Still a ways to go here, but progress! Some TODOs:

  • Need to implement "promote to multi" somehow, and probably use sf's WKB reader rather than wk's
  • M coordinates cause segfaults?
  • If there is no geometry column, we get a segfault?

I'm currently testing using R_SF_ST_READ_USE_STREAM=true R -e 'devtools::test()'.

@paleolimbot
Contributor Author

Still faster, but I switched it to use sf's WKB reader instead. I think the speed is due to the underlying Arrow driver being fast for GeoPackage (not to nanoarrow).

Many of the tests fail if you flip the default to use the stream (and one segfaults)...there's currently no way to implement promote_to_multi (you'd have to do it yourself after reading; see the sketch after the benchmark below).

library(sf)
#> Linking to GEOS 3.12.0, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE

# curl::curl_download(
#   "https://github.com/geoarrow/geoarrow-data/releases/download/latest-dev/ns-water-water_line.gpkg",
#   "ns-water-water_line.gpkg"
# )

system.time(
    tbl1 <- read_sf("ns-water-water_line.gpkg", use_stream = FALSE)
)
#>    user  system elapsed 
#>   7.266   0.556   7.866

system.time(
    tbl2 <- read_sf("ns-water-water_line.gpkg", use_stream = TRUE)
)
#> Simple feature collection with 483268 features and 33 fields
#> Geometry type: GEOMETRY
#> Dimension:     XY
#> Bounding box:  xmin: 215869.1 ymin: 4790519 xmax: 781792.9 ymax: 5237312
#> Projected CRS: NAD_1983_CSRS_2010_UTM_20_Nova_Scotia + CGVD2013(CGG2013) height
#>    user  system elapsed 
#>   3.107   0.629   3.222

Created on 2023-09-21 with reprex v2.0.2
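A possible user-side workaround for the missing promote-to-multi, assuming the layer mixes LINESTRING and MULTILINESTRING (st_cast() is standard sf; the file is the one from the benchmark above):

library(sf)

tbl2 <- read_sf("ns-water-water_line.gpkg", use_stream = TRUE)

# Promote everything to MULTILINESTRING after reading, approximating
# what OGR's "promote to multi" option would do during the read
tbl2 <- st_cast(tbl2, "MULTILINESTRING")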

@edzer
Member

edzer commented Sep 21, 2023

Looks great - trains, airports and planes are always the best place for some decent development!

src/gdal_read_stream.cpp (outdated review thread, resolved)
Co-authored-by: Even Rouault <even.rouault@spatialys.com>
@paleolimbot
Contributor Author

@mdsumner gave this a go and found a segfault for a POINT layer (one that doesn't occur in Python): https://gist.github.com/mdsumner/21c5e74f8565487e3304dedf596a10c4 . There's also one segfault in the tests (for something with M geometries), but I haven't had a chance to try it with the debugger yet.

@paleolimbot
Contributor Author

This still needs tests for use_stream = TRUE, but it's getting close! Setting R_SF_ST_READ_USE_STREAM=true, I get a few failures because (1) nanoarrow will never return a timezone-less timestamp (it always sets UTC) and (2) the WKB parsing doesn't do "promote to multi", so the ubiquitous nc.gpkg gets converted to GEOMETRY, which causes some tests to fail.
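Failure mode (2) is easy to see with the nc.gpkg that ships with sf (assuming a build of this branch):

library(sf)

nc_stream <- read_sf(system.file("gpkg/nc.gpkg", package = "sf"),
                     use_stream = TRUE)

# Without promote-to-multi the declared type is GEOMETRY, not MULTIPOLYGON
st_geometry_type(nc_stream, by_geometry = FALSE)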

@mdsumner
Member

Also: a dialect option for when running SQL. It's just a string input and doesn't have any layer-cleanup implications (so you could run SQLITE st_* functions on drivers that default to OGRSQL, or vice versa, for example).
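A hypothetical sketch of what that could look like (the dialect argument does not exist in this PR; query is existing st_read/read_sf functionality):

# Hypothetical only -- 'dialect' is the suggested, not-yet-implemented option:
# read_sf("pts.gpkg",
#         query = "SELECT * FROM pts LIMIT 5",
#         dialect = "SQLITE",
#         use_stream = TRUE)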

@edzer merged commit 7a36142 into r-spatial:main on Oct 1, 2023 (7 checks passed)
@kadyb
Contributor

kadyb commented Oct 2, 2023

Thanks, great to see this feature! Did you try running some benchmarks on synthetic data? In the example below, the differences in timings are quite small. What could be the reason (or is this expected for this dataset)?

Benchmark
library("sf")

n = 500000
df = data.frame(x = rnorm(n), y = rnorm(n),
                col_1 = sample(c(TRUE, FALSE), n, replace = TRUE), # logical
                col_2 = sample(letters, n, replace = TRUE),        # character
                col_3 = runif(n),                                  # double
                col_4 = sample(1:100, n, replace = TRUE))          # integer

## points ##
pts = st_as_sf(df, coords = c("x", "y"), crs = "EPSG:2180")
write_sf(pts, "pts.gpkg")

bench::mark(
  check = FALSE, iterations = 10,
  stream = read_sf("pts.gpkg", use_stream = TRUE),
  non_stream = read_sf("pts.gpkg", use_stream = FALSE)
)
#>   expression      min   median `itr/sec` mem_alloc
#> 1 stream         1.8s    2.14s     0.491    87.8MB
#> 2 non_stream     1.7s    1.79s     0.545    72.5MB

## buffers ##
buff = st_buffer(pts, dist = 1000)
write_sf(buff, "buffers.gpkg")

bench::mark(
  check = FALSE, iterations = 5,
  stream = read_sf("buffers.gpkg", use_stream = TRUE),
  non_stream = read_sf("buffers.gpkg", use_stream = FALSE)
)
#>   expression      min   median `itr/sec` mem_alloc
#> 1 stream        4.22s    5.97s     0.181    1.93GB
#> 2 non_stream    4.21s    6.23s     0.182    1.91GB

@kadyb
Contributor

kadyb commented Oct 2, 2023

This doesn't work for me:

n = 10
df = data.frame(x = rnorm(n), y = rnorm(n)) # without attributes
pts = st_as_sf(df, coords = c("x", "y"), crs = "EPSG:2180")
write_sf(pts, "pts.gpkg")
x = read_sf("pts.gpkg", use_stream = TRUE)
#> Error in st_as_sfc.WKB(x) : 
#>   cannot read WKB object from zero-length raw vector

@paleolimbot
Contributor Author

This doesn't work for me:

Make an issue! I'll collect any stream-reading related issues into a follow-up PR in the next week or so.

Did you try do some benchmarks on synthetic data?

The initial PR is mostly about connecting the wires. I expect that optimizing the WKB reader will help (I'm planning to do that in the next week)...other than that, even with equivalent performance, this PR is more about long-term maintainability: it offloads the entire OGR-to-R conversion into code that sf doesn't have to maintain. There are a number of pending performance improvements there, too (e.g., ALTREP for strings) that sf can get for free in the long term.
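As one example of those pending improvements, nanoarrow's string conversion can use ALTREP, so character vectors materialize lazily. A sketch, assuming a nanoarrow build with ALTREP strings enabled:

library(nanoarrow)

arr <- as_nanoarrow_array(c("a", "b", "c"))
chr <- convert_array(arr, character())

# Inspecting the result may show an ALTREP wrapper rather than a plain
# character vector, meaning strings are only materialized on access
.Internal(inspect(chr))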

@kadyb
Contributor

kadyb commented Oct 2, 2023

Good to hear, I'll wait for further news! In which repository should I create the issue?

@edzer
Member

edzer commented Oct 2, 2023

In which repository should I create the issue?

here.
