Is there any plan to support reading from postgis via Arrow Flight SQL? #639
Or, is it really worth supporting that? The question is based on some very simple performance-testing results. I prepared two polygon datasets extracted from OSM:
And created two GeoParquet files from them with QGIS. I first tried the reader from geoarrow-rs/io/postgis and measured the running time of:

```rust
use std::time::Instant;

use geoarrow::io::postgis::read_postgis;
use sqlx::postgres::PgPoolOptions;

#[tokio::main]
async fn main() {
    println!("Start reading geometries from postgis...");
    let connection_url = "postgresql://user:pass@localhost:5432/testdb";
    let pool = PgPoolOptions::new().connect(connection_url).await.unwrap();

    let sql1 = "SELECT * FROM munich_osm_buildings;";
    let mut start = Instant::now();
    let _table1 = read_postgis(&pool, sql1).await.unwrap();
    let mut duration = start.elapsed();
    println!(
        "Done. Time elapsed in read_postgis() for munich_osm_buildings is: {:?}",
        duration
    );

    let sql2 = "SELECT * FROM bayern_osm_buildings;";
    start = Instant::now();
    let _table2 = read_postgis(&pool, sql2).await.unwrap();
    duration = start.elapsed();
    println!(
        "Done. Time elapsed in read_postgis() for bayern_osm_buildings is: {:?}",
        duration
    );
}
```

and got:

```console
❯ cargo run
   Compiling geoarrow-postgis v0.1.0 (/home/me/projects/geoarrow-postgis)
    Finished dev [unoptimized + debuginfo] target(s) in 3.70s
     Running `target/debug/geoarrow-postgis`
Start reading geometries from postgis...
Done. Time elapsed in read_postgis() for munich_osm_buildings is: 4.349586901s
Done. Time elapsed in read_postgis() for bayern_osm_buildings is: 37.877833901s
```

After that, I tried loading the two GeoParquet files with:

```python
from time import perf_counter as pc

import pyarrow.parquet as parquet

t0 = pc()
# 674 thousand polygons
# 4.33 million coords
gdf = parquet.read_pandas('munich_buildings_osm.parquet')
print(f"time for reading Munich buildings geoparquet: {pc()-t0}")

t0 = pc()
# 5.98 million polygons
# 37.36 million coords
gdf = parquet.read_pandas('bayern_buildings_osm.parquet')
print(f"time for reading Bayern buildings geoparquet: {pc()-t0}")
```

and got:

```
time for reading Munich buildings geoparquet: 0.10657382477074862
time for reading Bayern buildings geoparquet: 0.2329485327936709
```

What I actually observed is that sometimes loading the Bayern GeoParquet file could even be faster than loading the Munich one; I don't know why. So here are my questions:
Thank you very much.

P.S. I actually have the Apache Arrow Flight SQL adapter for PostgreSQL up and running already, and can successfully fetch some rows of the polygons in WKT from the above database with flight_sql_client. But I'm not sure whether it makes sense to continue with implementing something like a …
Replies: 1 comment 3 replies
Benchmarks run in debug mode are completely meaningless. You need to compile and run with a release build (`cargo run --release`).
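As a side note on what a release build changes: the optimizations it enables can also be tuned through standard Cargo profile settings in `Cargo.toml`. A minimal sketch showing Cargo's built-in release-profile defaults (no project-specific values assumed):

```toml
# Defaults of Cargo's built-in release profile;
# `cargo run --release` compiles with these.
[profile.release]
opt-level = 3   # full optimizations (debug builds use opt-level 0)
debug = false   # omit debug info
```

Debug builds of geometry-heavy code routinely run an order of magnitude slower than release builds, so the timings above should be re-measured before drawing conclusions.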
The existing PostGIS reader is not particularly efficient. My plan is to wait for the Rust ADBC driver to become more stable, then use that for reading from Postgres, with just a thin wrapper around it for parsing geometries.
I'm not intimately familiar with Arrow Flight and Flight SQL, but I see them as pretty complex and out of scope for geoarrow-rs at the moment. It should be possible to use Arrow Flight yourself and convert an Arrow object to a geoarrow object with one or two lines of code.