Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve reading with use_stream = TRUE #2247

Merged
merged 7 commits into from
Oct 30, 2023

Conversation

paleolimbot
Copy link
Contributor

This PR seeks to eliminate differences between use_stream = TRUE and use_stream = FALSE. It (1) uses wk's WKB -> sfc conversion when using use_stream = TRUE because it (now) supports promote_multi = TRUE, and (2) prefers "geometry" as the geometry column name.

@edzer I know you commented this out on the last PR, but it seems that when using the usual read_sf() we always get "geometry" as the geometry column name for shapefiles and flatgeobuf (but not for gpkg?). I think I am just missing where exactly this value gets set, but the hack I have here just converts "wkb_geometry" to "geometry".

Before this PR:

library(sf)
#> Linking to GEOS 3.12.0, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE

nc_file <- system.file("shape/nc.shp", package = "sf")
waldo::compare(
    read_sf(nc_file, use_stream = TRUE),
    read_sf(nc_file, use_stream = FALSE),
    ignore_attr = "crs"
)
#> `names(old)[12:15]`: "BIR79" "SID79" "NWBIR79" "wkb_geometry"
#> `names(new)[12:15]`: "BIR79" "SID79" "NWBIR79" "geometry"    
#> 
#> `attr(old, 'sf_column')`: "wkb_geometry"
#> `attr(new, 'sf_column')`: "geometry"    
#> 
#> `old$wkb_geometry` is an S3 object of class <sfc_GEOMETRY/sfc>, a list
#> `new$wkb_geometry` is absent
#> 
#> `old$geometry` is absent
#> `new$geometry` is an S3 object of class <sfc_MULTIPOLYGON/sfc>, a list

After this PR:

library(sf)
#> Linking to GEOS 3.12.0, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE

nc_file <- system.file("shape/nc.shp", package = "sf")
waldo::compare(
    read_sf(nc_file, use_stream = TRUE),
    read_sf(nc_file, use_stream = FALSE),
    ignore_attr = "crs"
)
#> ✔ No differences

nc_file <- system.file("gpkg/nc.gpkg", package = "sf")
waldo::compare(
    read_sf(nc_file, use_stream = TRUE),
    read_sf(nc_file, use_stream = FALSE),
    ignore_attr = "crs"
)
#> ✔ No differences

This seems to improve performance for points, but decrease performance for polygons with many vertices. A better long-term solution would be to support promote_multi in sf's WKB reader, or, even better, optimize the Arrow WKB representation (basically: one big long buffer with all the wkb packed together rather than many tiny buffers all within their own raw vector) -> sfc. From the benchmark provided by @kadyb in #2036 (comment) :

library("sf")
#> Linking to GEOS 3.12.0, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE

n = 500000
df = data.frame(x = rnorm(n), y = rnorm(n),
                col_1 = sample(c(TRUE, FALSE), n, replace = TRUE), # logical
                col_2 = sample(letters, n, replace = TRUE),        # character
                col_3 = runif(n),                                  # double
                col_4 = sample(1:100, n, replace = TRUE))          # integer

## points ##
pts = st_as_sf(df, coords = c("x", "y"), crs = "EPSG:2180")
write_sf(pts, "pts.gpkg")

bench::mark(
    check = FALSE, iterations = 10,
    stream = read_sf("pts.gpkg", use_stream = TRUE),
    non_stream = read_sf("pts.gpkg", use_stream = FALSE)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 stream     743.26ms    1.13s     0.947    53.2MB     2.75
#> 2 non_stream    1.13s    1.36s     0.751    75.9MB     4.58

## buffers ##
buff = st_buffer(pts, dist = 1000)
write_sf(buff, "buffers.gpkg")

bench::mark(
    check = FALSE, iterations = 5,
    stream = read_sf("buffers.gpkg", use_stream = TRUE),
    non_stream = read_sf("buffers.gpkg", use_stream = FALSE)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 stream         4.5s    6.31s     0.127     1.9GB    0.152
#> 2 non_stream    3.91s    4.52s     0.112    1.91GB    0.269

As a side note, I find these benchmark numbers to be rather variable, perhaps due to the garbage collection of tens of thousands of geometries, or perhaps due to my computer running on battery.

@edzer
Copy link
Member

edzer commented Oct 29, 2023

LGTM - ready to go?

@paleolimbot
Copy link
Contributor Author

I think so! I think stream reading could still use some dedicated tests (but I will follow up with those later!).

@edzer edzer merged commit 1bc3af4 into r-spatial:main Oct 30, 2023
7 checks passed
@paleolimbot paleolimbot deleted the use-promote-multi branch October 30, 2023 13:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants