GeoArrow Support? #2098

Open
techmavengeospatial opened this issue Aug 25, 2024 · 3 comments
Labels
Type: enhancement New feature or request

Comments

@techmavengeospatial

What feature or improvement would you like to see?

Is GeoArrow Supported?

@techmavengeospatial techmavengeospatial added the Type: enhancement New feature or request label Aug 25, 2024
@lidavidm
Member

Which driver specifically? I think it would be reasonable to map the PostGIS types to GeoArrow types in the PostgreSQL driver, at least.

@paleolimbot
Member

For PostGIS, we are reasonably close to being able to export geometry columns with the "geoarrow.wkb" extension type. The main barrier is that Postgres embeds the spatial reference identifier (SRID) as an integer key at the row level, whereas GeoArrow embeds the coordinate reference system (CRS) as a string at the type level.
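
To make the type-level side concrete: an Arrow extension type travels as field metadata under the keys `ARROW:extension:name` and `ARROW:extension:metadata`, and GeoArrow stores the CRS inside the latter as JSON. A minimal sketch in Python, where the helper name and the example CRS string are illustrative assumptions and not part of any driver:

```python
import json


def geoarrow_wkb_field_metadata(crs=None):
    """Build Arrow field metadata marking a binary column as geoarrow.wkb.

    `crs` is whatever CRS representation the producer has (for example a
    WKT or PROJJSON string); it is stored once for the whole column.
    """
    ext_metadata = {} if crs is None else {"crs": crs}
    return {
        "ARROW:extension:name": "geoarrow.wkb",
        "ARROW:extension:metadata": json.dumps(ext_metadata),
    }


# Hypothetical example value; a real driver would look the CRS up.
meta = geoarrow_wkb_field_metadata("EPSG:32620")
```

Because the CRS is attached to the field and not to each value, a column whose rows carry different SRIDs has no faithful representation here, which is why the per-row SRID in Postgres is the sticking point.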

Various tools work around this differently. GeoPandas, for example, pulls the SRID out of the first element, assumes that it is an EPSG code, and assumes that all following values have the same SRID. This is frequently the case; however, I would like to do something safer by default at the driver level. There are a few options:

  • We actually do have the ability to inspect one or more rows of data at the time that we pull the first data from COPY (which includes the header and the first row of data). We can record the SRID there, store it as type metadata, and check that all subsequent items have an identical SRID. This won't work if the first item is NULL, but that doesn't happen frequently for geometry columns.
  • I think we can get the table/column or object identifier if we're pulling straight from a table (or filtered table), in which case we might be able to issue a query to the SRID table: https://postgis.net/docs/using_postgis_dbmanagement.html#geometry_columns / https://postgis.net/docs/using_postgis_dbmanagement.html#spatial_ref_sys_table .
  • For non-COPY queries, we get the whole result at once and we can do the same thing, but possibly doing a better job for queries where the first value(s) are null.
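
The "record the first SRID, then check the rest" idea from the options above can be sketched as follows. This is a rough Python illustration and not the driver's implementation (the driver is written in C); the function names are hypothetical. It relies on the EWKB layout in which the geometry type word carries an SRID flag (`0x20000000`), followed by a 4-byte SRID when the flag is set.

```python
import struct

EWKB_SRID_FLAG = 0x20000000
_UNSET = object()  # sentinel: no non-null value seen yet


def ewkb_srid(blob):
    """Return the SRID embedded in one EWKB value, or None if it has none."""
    if len(blob) < 5 or blob[0] not in (0, 1):
        raise ValueError("not a WKB value")
    endian = "<" if blob[0] == 1 else ">"  # 1 = little endian, 0 = big endian
    (geom_type,) = struct.unpack_from(endian + "I", blob, 1)
    if not geom_type & EWKB_SRID_FLAG:
        return None
    (srid,) = struct.unpack_from(endian + "i", blob, 5)
    return srid


def column_srid(blobs):
    """Take the SRID of the first non-null value; require the rest to match."""
    seen = _UNSET
    for i, blob in enumerate(blobs):
        if blob is None:  # NULL geometries carry no SRID; skip them
            continue
        srid = ewkb_srid(blob)
        if seen is _UNSET:
            seen = srid
        elif srid != seen:
            raise ValueError(f"row {i}: SRID {srid} differs from {seen}")
    return None if seen is _UNSET else seen
```

Skipping leading NULLs, as done here, is only possible when more than the first row is available, which matches the non-COPY case described in the last bullet.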

In PostGIS, the values themselves are emitted as EWKB, which is currently "valid" for the geoarrow.wkb extension type (although we prefer ISO WKB). Re-encoding EWKB as ISO WKB is probably not worth it since most readers can handle both.

For SQLite, the type we're dealing with is probably GeoPackage's GEOMETRY (as reported by the decltype, which I don't think we inspect at all at the moment). This has a similar setup to PostGIS: the GEOMETRY column has an embedded item-level SRID, but all elements in a single column have the same value, whose textual representation is stored in a dedicated table. The BLOB value that comes through from GeoPackage is WKB with a prefix ( https://www.geopackage.org/spec/#gpb_data_blob_format ), which can be stripped before appending to the output ( https://github.com/paleolimbot/gpkg/blob/4d18da00b753f79b225d0e3ac7dac4e517733b06/src/gpkg.cpp#L196-L232 ).
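
As an illustration of the prefix stripping, here is a minimal Python sketch based on my reading of the GeoPackage binary header: a 2-byte magic `GP`, a version byte, a flags byte, a 4-byte `srs_id`, and an optional envelope whose size is given by bits 1 to 3 of the flags. The function name is hypothetical, and the empty-geometry flag is ignored here.

```python
import struct

# Envelope size in bytes for each envelope indicator code in the flags byte.
ENVELOPE_BYTES = {0: 0, 1: 32, 2: 48, 3: 48, 4: 64}


def split_gpkg_blob(blob):
    """Split a GeoPackage geometry BLOB into (srs_id, wkb_bytes)."""
    if len(blob) < 8 or blob[:2] != b"GP":
        raise ValueError("not a GeoPackage geometry BLOB")
    flags = blob[3]
    endian = "<" if flags & 0x01 else ">"  # bit 0: byte order of the header
    envelope_code = (flags >> 1) & 0x07    # bits 1-3: envelope contents
    if envelope_code not in ENVELOPE_BYTES:
        raise ValueError(f"invalid envelope indicator {envelope_code}")
    (srs_id,) = struct.unpack_from(endian + "i", blob, 4)
    header_len = 8 + ENVELOPE_BYTES[envelope_code]
    return srs_id, bytes(blob[header_len:])
```

The returned `srs_id` could feed the same "all values must agree" check as for PostGIS, while the remaining bytes are plain WKB that can be appended to the output unchanged.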

For SQLite, there is also SpatiaLite, which has its own encoding ( https://www.gaia-gis.it/gaia-sins/BLOB-Geometry.html ). Annoyingly, I think this also has a decltype of GEOMETRY.
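
Since the decltype may not distinguish the two, one possible approach is to sniff the BLOB itself. The sketch below is a heuristic, not something the driver does today: it assumes the GeoPackage magic bytes `GP`, and the SpatiaLite layout as I understand it from the linked page (a `0x00` start byte, an endianness byte, a 4-byte SRID, a 32-byte bounding box, a `0x7C` marker at offset 38, and a trailing `0xFE`). The function name is hypothetical.

```python
def sniff_sqlite_geometry(blob):
    """Guess which geometry encoding a SQLite BLOB uses.

    Returns "geopackage", "spatialite", or "unknown".
    """
    if len(blob) >= 8 and blob[:2] == b"GP":
        return "geopackage"
    if (
        len(blob) >= 44
        and blob[0] == 0x00        # SpatiaLite start marker
        and blob[1] in (0, 1)      # endianness byte
        and blob[38] == 0x7C       # end-of-bounding-box marker
        and blob[-1] == 0xFE       # end marker
    ):
        return "spatialite"
    return "unknown"
```

A check like this only inspects markers; it does not validate the geometry payload, so a driver would still want to fail loudly on anything that later refuses to parse.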

I'll have to double-check for Snowflake and BigQuery, but I believe that Snowflake emits GeoJSON with an item-level SRID and BigQuery emits well-known text. BigQuery only deals with global "geographies", so item-level SRID is not a problem (i.e., it's always OGC:CRS84).

All of this is possible to stick in ADBC, but there's enough geometry-specific formatting that it might be worth figuring out how to make a wrapper driver (or make it easy to make a wrapper driver!) so that the geometry-specific details could live elsewhere. Also useful would be a driver embedded into GDAL (since GDAL already implements all of this logic, including export to Arrow and import from Arrow).

@paleolimbot
Member

An example of a current workaround for getting PostGIS to GeoArrow IPC in R (after a request for an example on Mastodon!):

# docker run -it --rm \
#   -e POSTGRES_PASSWORD=password \
#   -e POSTGRES_USER=postgres \
#   -p 5432:5432 \
#   postgis/postgis:latest
# ogr2ogr \
#   -f PostgreSQL PG:"host=127.0.0.1 user=postgres password=password dbname=postgres" \
#   /vsicurl/https://github.com/geoarrow/geoarrow-data/releases/download/latest-dev/ns-water-basin_point.gpkg \
#   -nln basin_point

library(adbcdrivermanager)
library(arrow, warn.conflicts = FALSE)
library(geoarrow)

uri <- "postgresql://localhost:5432/postgres?user=postgres&password=password"
db <- adbc_database_init(adbcpostgresql::adbcpostgresql(), uri = uri)
con <- db |> 
  adbc_connection_init()

# Have to look up the CRS separately, unfortunately
crs_info <- con |> 
  read_adbc(
    paste(
      "SELECT geometry_columns.f_table_name AS tbl, spatial_ref_sys.srtext",
      "AS crs FROM geometry_columns INNER JOIN spatial_ref_sys ON ", 
      "geometry_columns.srid = spatial_ref_sys.srid",
      "WHERE geometry_columns.f_table_name = 'basin_point'"
    )
  ) |> 
  tibble::as_tibble() |> 
  print()
#> # A tibble: 1 × 2
#>   tbl         crs                                                               
#>   <chr>       <chr>                                                             
#> 1 basin_point "COMPD_CS[\"NAD_1983_CSRS_2010_UTM_20_Nova_Scotia + CGVD2013(CGG2…

# Read into an arrow::Table
tbl <- con |> 
  read_adbc("SELECT * FROM basin_point") |> 
  as_arrow_table()

# Convert the geometry column to geoarrow (probably zero copy although I forget the details).
# Explicitly attaching a CRS here doesn't seem to work but should.
# You can also pick another geoarrow encoding here (e.g., geoarrow_point()).
vctr <- as_geoarrow_vctr(tbl$geom, schema = geoarrow_wkb())
attr(vctr, "schema") <- nanoarrow::as_nanoarrow_schema(geoarrow_wkb(crs = crs_info$crs))
tbl$geom <- vctr


# Write IPC
out <- tempfile()
arrow::write_ipc_stream(tbl, out)

# Should be able to read with crs
df <- nanoarrow::read_nanoarrow(out) |> 
  tibble::as_tibble()
df$geom <- sf::st_as_sfc(df$geom)
sf::st_as_sf(df)
#> Simple feature collection with 46 features and 6 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 245206.7 ymin: 4820886 xmax: 739542 ymax: 5197865
#> Projected CRS: NAD_1983_CSRS_2010_UTM_20_Nova_Scotia + CGVD2013(CGG2013) height
#> # A tibble: 46 × 7
#>      fid objectid feat_code basin_name river     hid                 geom
#>    <int>    <dbl> <chr>     <chr>      <chr>     <chr>        <POINT [m]>
#>  1     1        1 WABA30    01EB000    BARRINGT… 9192… (277022.6 4820886)
#>  2     2        2 WABA30    01EC000    ROSEWAY/… 5293… (315701.1 4855050)
#>  3     3        3 WABA30    01EA000    TUSKET R… A759… (255728.5 4851021)
#>  4     4        4 WABA30    01DA000    METEGHAN  47EF… (245206.7 4895608)
#>  5     5        5 WABA30    01ED000    MERSEY    425C…   (337143 4860311)
#>  6     6        6 WABA30    01EE000    HERRING … C7BF… (370526.6 4896768)
#>  7     7        7 WABA30    01EG000    GOLD      7C6E… (396155.5 4928849)
#>  8     8        8 WABA30    01EF000    LAHAVE    B3D6… (380065.5 4918188)
#>  9     9        9 WABA30    01EJ000    SACKVILLE 4FA4… (449560.5 4927417)
#> 10    10       10 WABA30    01EH000    EAST/IND… 72FE… (417856.9 4932181)
