
Experimental support for GDAL 3.6 columnar API #2036

Merged: 22 commits into r-spatial:main on Oct 1, 2023

Conversation

paleolimbot
Contributor

This PR is a proof of concept for future support of the new columnar access API introduced in GDAL 3.6. The API exposes a pull-style iterator as an ArrowArrayStream. The pyogrio package has a PR up to support this and has seen a ~2x improvement on its test data set ( geopandas/pyogrio#155 ). I've spent some time implementing conversions for Arrow C data interface objects ( apache/arrow-nanoarrow#65 ) and am curious to see whether we can get any of that speed in R too!
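For intuition, here's a minimal sketch of the pull-style iteration using nanoarrow alone (no GDAL required); the two data-frame batches stand in for what GDAL would produce:

library(nanoarrow)

# Build a two-batch stream as a stand-in for GDAL's output
stream <- basic_array_stream(
  list(data.frame(x = 1:3), data.frame(x = 4:6))
)

# Pull batches one at a time until the stream is exhausted
while (!is.null(batch <- stream$get_next())) {
  print(as.data.frame(batch))
}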

To give it a try:

# (Requires development GDAL!)
# remotes::install_github("apache/arrow-nanoarrow/r#65")
# remotes::install_github("paleolimbot/sf@stream-reading")
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.7.0dev-908498a4d8, PROJ 6.3.1; sf_use_s2() is
#> TRUE
read_sf(system.file("shape/nc.shp", package = "sf"), use_stream = TRUE)
#> Simple feature collection with 100 features and 15 fields
#> Geometry type: GEOMETRY
#> Dimension:     XY
#> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS:  NAD27
#> # A tibble: 100 × 16
#>    OGC_FID  AREA PERIMETER CNTY_ CNTY_ID NAME   FIPS  FIPSNO CRESS…¹ BIR74 SID74
#>      <dbl> <dbl>     <dbl> <dbl>   <dbl> <chr>  <chr>  <dbl>   <int> <dbl> <dbl>
#>  1       0 0.114      1.44  1825    1825 Ashe   37009  37009       5  1091     1
#>  2       1 0.061      1.23  1827    1827 Alleg… 37005  37005       3   487     0
#>  3       2 0.143      1.63  1828    1828 Surry  37171  37171      86  3188     5
#>  4       3 0.07       2.97  1831    1831 Curri… 37053  37053      27   508     1
#>  5       4 0.153      2.21  1832    1832 North… 37131  37131      66  1421     9
#>  6       5 0.097      1.67  1833    1833 Hertf… 37091  37091      46  1452     7
#>  7       6 0.062      1.55  1834    1834 Camden 37029  37029      15   286     0
#>  8       7 0.091      1.28  1835    1835 Gates  37073  37073      37   420     0
#>  9       8 0.118      1.42  1836    1836 Warren 37185  37185      93   968     4
#> 10       9 0.124      1.43  1837    1837 Stokes 37169  37169      85  1612     1
#> # … with 90 more rows, 5 more variables: NWBIR74 <dbl>, BIR79 <dbl>,
#> #   SID79 <dbl>, NWBIR79 <dbl>, wkb_geometry <GEOMETRY [°]>, and abbreviated
#> #   variable name ¹CRESS_ID
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Many options aren't supported yet, but I'm mostly trying to find out whether or not the nanoarrow conversions are going to be helpful in this context. I will follow up here with more experiments!

@rsbivand
Member

Thanks, interesting! Is it feasible to use libgdal without needing nanoarrow? I think https://gdal.org/development/rfc/rfc86_column_oriented_api.html implements this locally, without requiring access to Arrow itself?

@paleolimbot
Contributor Author

You will need either arrow or nanoarrow to get R objects...nanoarrow is brand new/in development and will be zero-dependency for exactly this use case! GDAL itself includes conversions to NumPy arrays for its Python bindings that are similar in spirit (and in dependency footprint) to the converters I'm writing in apache/arrow-nanoarrow#65 .
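To make the division of labour concrete: GDAL fills an ArrowArrayStream, and arrow/nanoarrow turns it into R vectors. A stand-in sketch of that final step, using nanoarrow alone (the batch here takes the place of one GDAL would hand over):

library(nanoarrow)

# A batch standing in for one that GDAL would produce
batch <- data.frame(FID = 0:1, NAME = c("Ashe", "Alleghany"))

# Converting the whole stream to a data frame is a single call
stream <- basic_array_stream(list(batch))
as.data.frame(stream)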

src/stars.cpp (outdated)
@@ -509,7 +509,7 @@ void CPL_write_gdal(NumericMatrix x, CharacterVector fname, CharacterVector driv
 		eType = GDT_Byte; // #nocov
 #if GDAL_VERSION_NUM >= 3070000
 	else if (Type[0] == "Int8")
-		eType = GDT_Int8; // #nocov
+		eType = GDT_Byte; // #nocov
Member

This doesn't look good IMO, see #2033 - do you actually run into this code? That would mean the #if condition is wrong.

Contributor Author

I'm sure it's wrong...but yes, this failed to compile for me with GDT_Int8.

Contributor Author

I fixed this - it's because I was running from the latest dev branch, which has moved on to 3.7.xx versioning because there's a release candidate.
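If it helps anyone reproducing this, sf_extSoftVersion() (standard sf) shows which GDAL sf was built against, which makes the dev versioning visible; the version string below is the one from my session above:

library(sf)
sf_extSoftVersion()[["GDAL"]]
#> [1] "3.7.0dev-908498a4d8"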

Contributor

The GDT_Int8 implementation has now landed in GDAL master.

@rsbivand
Member

Is @edzer's point related to the WIP RFC 87 (rspatial/terra#885 (comment))?

@paleolimbot
Contributor Author

It may also be related to my specific installation, where there may be some old headers sitting in the include path!

@edzer
Member

edzer commented Nov 10, 2022

We could also comment out the whole Int8 section and wait until that lands in GDAL.

@paleolimbot
Contributor Author

I'm sure one or more of the missing features will eat into this considerably, but in an initial test, reading ~400,000 features is about 4 times faster. Some of that might have to do with the use of st_as_sf(wk::as_wkb()), but probably nothing that can't be replicated with sf tooling - my recollection is that sf's WKB parser is slightly faster anyway (see the WKB sketch after the timings below).

library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.7.0dev-908498a4d8, PROJ 6.3.1; sf_use_s2() is
#> TRUE
# curl::curl_download(
#   "https://github.com/paleolimbot/geoarrow-data/releases/download/v0.0.1/nshn_water_line.gpkg",
#   "nshn_water_line.gpkg"
# )

system.time(
  tbl1 <- read_sf("nshn_water_line.gpkg", use_stream = FALSE)
)
#>    user  system elapsed 
#>  20.264   1.355  21.619 

system.time(
  tbl2 <- read_sf("nshn_water_line.gpkg", use_stream = TRUE)
)
#>    user  system elapsed 
#>   5.960   0.568   5.421 
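For reference, the WKB hand-off mentioned above can be sketched with sf alone: st_as_binary() produces the list of raw WKB blobs and st_as_sfc() parses them back. This stand-in skips GDAL entirely:

library(sf)

# Two points standing in for geometry blobs coming off the stream
pts <- st_sfc(st_point(c(0, 1)), st_point(c(2, 3)), crs = 4326)

wkb <- st_as_binary(pts)     # class "WKB": a list of raw vectors
st_as_sfc(wkb, crs = 4326)   # sf's WKB parser rebuilds the sfc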

@paleolimbot
Contributor Author

Still a ways to go here, but progress! Some TODOs:

  • Need to implement "promote to multi" somehow, and probably use sf's WKB reader rather than wk's
  • M coordinates cause segfaults?
  • If there is no geometry column, we get a segfault?

I'm currently testing using R_SF_ST_READ_USE_STREAM=true R -e 'devtools::test()'.

@paleolimbot
Contributor Author

Still faster, but I switched it to use sf's WKB reader instead. I think the speed is due to the underlying Arrow driver being fast for GeoPackage (not to nanoarrow).

Many of the tests fail if you flip the default to use the stream (and one segfaults)...there's currently no way to implement promote_to_multi (you'd have to do it yourself after reading; see the sketch after the benchmark below).

library(sf)
#> Linking to GEOS 3.12.0, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE

# curl::curl_download(
#   "https://github.com/geoarrow/geoarrow-data/releases/download/latest-dev/ns-water-water_line.gpkg",
#   "ns-water-water_line.gpkg"
# )

system.time(
    tbl1 <- read_sf("ns-water-water_line.gpkg", use_stream = FALSE)
)
#>    user  system elapsed 
#>   7.266   0.556   7.866

system.time(
    tbl2 <- read_sf("ns-water-water_line.gpkg", use_stream = TRUE)
)
#> Simple feature collection with 483268 features and 33 fields
#> Geometry type: GEOMETRY
#> Dimension:     XY
#> Bounding box:  xmin: 215869.1 ymin: 4790519 xmax: 781792.9 ymax: 5237312
#> Projected CRS: NAD_1983_CSRS_2010_UTM_20_Nova_Scotia + CGVD2013(CGG2013) height
#>    user  system elapsed 
#>   3.107   0.629   3.222

Created on 2023-09-21 with reprex v2.0.2
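A possible user-side workaround for the missing promote-to-multi, assuming the layer mixes LINESTRING and MULTILINESTRING (st_cast() is standard sf; the file is the one from the benchmark above):

library(sf)

tbl2 <- read_sf("ns-water-water_line.gpkg", use_stream = TRUE)

# Promote everything to MULTILINESTRING after reading, approximating
# what OGR's "promote to multi" option would do during the read
tbl2 <- st_cast(tbl2, "MULTILINESTRING")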

@edzer
Member

edzer commented Sep 21, 2023

Looks great - trains, airports and planes are always the best place for some decent development!

src/gdal_read_stream.cpp (outdated review thread, resolved)
Co-authored-by: Even Rouault <even.rouault@spatialys.com>
@paleolimbot
Contributor Author

@mdsumner gave this a go and found a segfault for a POINT layer (one that doesn't occur in Python): https://gist.github.com/mdsumner/21c5e74f8565487e3304dedf596a10c4 . There's also one segfault in the tests (for something with M geometries), but I haven't had a chance to try it with the debugger yet.

@paleolimbot
Contributor Author

This still needs tests for use_stream = TRUE, but it's getting close! Setting R_SF_ST_READ_USE_STREAM=true, I get a few failures because (1) nanoarrow will never return a timezone-less timestamp (it always sets UTC) and (2) the WKB parsing doesn't do "promote to multi", so the ubiquitous nc.gpkg gets converted to GEOMETRY, which causes some tests to fail.
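Failure mode (2) is easy to see with the nc.gpkg that ships with sf (assuming a build of this branch):

library(sf)

nc_stream <- read_sf(system.file("gpkg/nc.gpkg", package = "sf"),
                     use_stream = TRUE)

# Without promote-to-multi the declared type is GEOMETRY, not MULTIPOLYGON
st_geometry_type(nc_stream, by_geometry = FALSE)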

@mdsumner
Member

Also: a dialect option for when running SQL. It's just a string input and doesn't have any layer-cleanup implications (so you could run SQLITE st_* functions on drivers that default to OGRSQL, or vice versa, for example).
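A hypothetical sketch of what that could look like (the dialect argument does not exist in this PR; query is existing st_read/read_sf functionality):

# Hypothetical only -- 'dialect' is the suggested, not-yet-implemented option:
# read_sf("pts.gpkg",
#         query = "SELECT * FROM pts LIMIT 5",
#         dialect = "SQLITE",
#         use_stream = TRUE)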

@edzer merged commit 7a36142 into r-spatial:main on Oct 1, 2023 (7 checks passed)
@kadyb
Contributor

kadyb commented Oct 2, 2023

Thanks, great to see this feature! Did you try running some benchmarks on synthetic data? In the example below, the differences in timings are quite small. What could be the reason (or is this expected for this dataset)?

Benchmark
library("sf")

n = 500000
df = data.frame(x = rnorm(n), y = rnorm(n),
                col_1 = sample(c(TRUE, FALSE), n, replace = TRUE), # logical
                col_2 = sample(letters, n, replace = TRUE),        # character
                col_3 = runif(n),                                  # double
                col_4 = sample(1:100, n, replace = TRUE))          # integer

## points ##
pts = st_as_sf(df, coords = c("x", "y"), crs = "EPSG:2180")
write_sf(pts, "pts.gpkg")

bench::mark(
  check = FALSE, iterations = 10,
  stream = read_sf("pts.gpkg", use_stream = TRUE),
  non_stream = read_sf("pts.gpkg", use_stream = FALSE)
)
#>   expression      min   median `itr/sec` mem_alloc
#> 1 stream         1.8s    2.14s     0.491    87.8MB
#> 2 non_stream     1.7s    1.79s     0.545    72.5MB

## buffers ##
buff = st_buffer(pts, dist = 1000)
write_sf(buff, "buffers.gpkg")

bench::mark(
  check = FALSE, iterations = 5,
  stream = read_sf("buffers.gpkg", use_stream = TRUE),
  non_stream = read_sf("buffers.gpkg", use_stream = FALSE)
)
#>   expression      min   median `itr/sec` mem_alloc
#> 1 stream        4.22s    5.97s     0.181    1.93GB
#> 2 non_stream    4.21s    6.23s     0.182    1.91GB

@kadyb
Contributor

kadyb commented Oct 2, 2023

This doesn't work for me:

n = 10
df = data.frame(x = rnorm(n), y = rnorm(n)) # without attributes
pts = st_as_sf(df, coords = c("x", "y"), crs = "EPSG:2180")
write_sf(pts, "pts.gpkg")
x = read_sf("pts.gpkg", use_stream = TRUE)
#> Error in st_as_sfc.WKB(x) : 
#>   cannot read WKB object from zero-length raw vector

@paleolimbot
Contributor Author

This doesn't work for me:

Make an issue! I'll collect any stream-reading related issues into a follow-up PR in the next week or so.

Did you try do some benchmarks on synthetic data?

The initial PR is mostly about connecting the wires. I expect that optimizing the WKB reader will help (I'm planning to do that in the next week)...other than that, even with equivalent performance, this PR is more about long-term maintainability: it offloads the entire OGR-to-R conversion into code that sf doesn't have to maintain. There are a number of pending performance improvements there, too (e.g., ALTREP for strings) that sf can get for free in the long term.
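As one example of those pending improvements, nanoarrow's string conversion can use ALTREP, so character vectors materialize lazily. A sketch, assuming a nanoarrow build with ALTREP strings enabled:

library(nanoarrow)

arr <- as_nanoarrow_array(c("a", "b", "c"))
chr <- convert_array(arr, character())

# Inspecting the result may show an ALTREP wrapper rather than a plain
# character vector, meaning strings are only materialized on access
.Internal(inspect(chr))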

@kadyb
Contributor

kadyb commented Oct 2, 2023

Good to hear, I'll wait for further news! In which repository should I create the issue?

@edzer
Member

edzer commented Oct 2, 2023

In which repository should I create the issue?

here.
