Improve reading with use_stream = TRUE
#2247
Merged
This PR seeks to eliminate differences between `use_stream = TRUE` and `use_stream = FALSE`. It (1) uses wk's WKB -> sfc conversion when using `use_stream = TRUE` because it (now) supports `promote_multi = TRUE`, and (2) prefers "geometry" as the geometry column name.

@edzer I know you commented this out on the last PR, but it seems that when using the usual `read_sf()` we always get "geometry" as the geometry column name for shapefiles and flatgeobuf (but not for gpkg?). I think I am just missing where exactly this value gets set, but the hack I have here just converts "wkb_geometry" to "geometry".

Before this PR:

After this PR:

This seems to improve performance for points, but decrease performance for polygons with many vertices. A better long-term solution would be to support `promote_multi` in sf's WKB reader, or, even better, optimize the Arrow WKB representation (basically: one big long buffer with all the WKB packed together rather than many tiny buffers all within their own raw vector) -> sfc conversion.

From the benchmark provided by @kadyb in #2036 (comment):

As a side note, I find these benchmark numbers to be rather variable, perhaps due to the garbage collection of tens of thousands of geometries, or perhaps due to my computer running on battery.
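
To make the "one big long buffer" idea concrete, here is a minimal base R sketch contrasting the two layouts for a few points. It is only an illustration: `point_wkb()` and `read_point()` are hypothetical helpers written for this example, not functions from sf, wk, or nanoarrow, and the offsets-plus-data layout is assumed to resemble how an Arrow binary array stores its values.

```r
# Hypothetical helper: encode one 2D point as little-endian WKB (21 bytes).
point_wkb <- function(x, y) {
  c(
    as.raw(1L),                                            # byte order: 1 = little endian
    writeBin(1L, raw(), size = 4, endian = "little"),      # geometry type: 1 = Point
    writeBin(c(x, y), raw(), size = 8, endian = "little")  # two doubles: x, y
  )
}

# Layout A: many tiny buffers, one raw vector per feature.
wkb_list <- list(point_wkb(1, 2), point_wkb(3, 4), point_wkb(5, 6))

# Layout B: a single packed buffer plus 0-based offsets marking where
# each feature starts and ends (assumed to mirror an Arrow binary array).
sizes <- lengths(wkb_list)
offsets <- c(0L, cumsum(sizes))
packed <- unlist(wkb_list)

# Hypothetical helper: decode the coordinates of point i from the packed buffer.
read_point <- function(buf, offsets, i) {
  bytes <- buf[(offsets[i] + 1L):offsets[i + 1L]]
  # Skip the 5-byte header (byte order + geometry type), then read x and y.
  readBin(bytes[6:21], what = "double", n = 2, size = 8, endian = "little")
}

read_point(packed, offsets, 2L)
```

With layout B, a reader can walk the offsets over one allocation instead of touching tens of thousands of separate R objects, which is the kind of saving the long-term solution above is aiming for.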