POC (do not merge): Add row-group and column select #222
base: main
Thanks for the PR!
Right, this is where it's nicer to have a stateful reader like
The Rust |
I suppose I am a little confused about how you see these different packages relating to each other. Clearly geoarrow cares about parquet, but it also uses arro3 internally, so why wouldn't dataset live here? I see it also supports remote stores... geoarrow, of course, depends on the whole of pyarrow (all that C++ stuff). Not to mention GEOS and GDAL!!
Because I don't know of a good way (yet) to have the main Python binding live here but inject spatial-specific extensions from geoarrow-rs's Python bindings. If we figured that out then I would put the core Python parquet bindings here. It's the same story on the JS side, where I have WebAssembly bindings to Parquet, but then reimplement a bunch of that stuff for GeoParquet in JS. https://github.com/geoarrow/geoarrow-rs/tree/main/js/src/io/parquet
No. geoarrow-rs does not depend on pyarrow, nor GEOS nor GDAL. geoarrow-rs is pure Rust, and especially the GeoParquet reader and all the core dependencies are pure Rust. That's what allows it to go in WebAssembly so easily: https://explore.overturemaps.org (clicking Download Visible uses my Rust GeoParquet reader). geoarrow-rs is separate from geoarrow-c and geoarrow-pyarrow, which depend on pyarrow.
To be pedantic, the |
pub fn read_parquet(
    py: Python,
    file: FileReader,
    rgs: Option<Vec<usize>>,
I'd call this row_groups for symmetry with pyarrow: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile.read_row_groups
Certainly! It really ought to be on the reader (.select() or such) anyway, so that you only read the parquet metadata once. Or maybe have a separate "fetch details" function. This was only a minimalist POC to show that you can pick data portions like this.
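To make the "read the metadata once" idea concrete, here is a minimal Python sketch of what a stateful reader with a .select() method might look like. Everything in it is hypothetical: LazyParquetReader, load_metadata, select and plan are illustrative names, not part of arro3 or this PR, and the footer is stood in for by a plain dict.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class LazyParquetReader:
    """Hypothetical stateful reader that fetches the footer at most once."""

    # Stand-in for whatever actually parses the Parquet footer (assumption).
    load_metadata: Callable[[], dict]
    _metadata: Optional[dict] = field(default=None, init=False)
    _row_groups: Optional[List[int]] = field(default=None, init=False)
    _columns: Optional[List[int]] = field(default=None, init=False)

    @property
    def metadata(self) -> dict:
        # Cache the footer so repeated select() calls do not re-read it.
        if self._metadata is None:
            self._metadata = self.load_metadata()
        return self._metadata

    def select(
        self,
        row_groups: Optional[Sequence[int]] = None,
        columns: Optional[Sequence[int]] = None,
    ) -> "LazyParquetReader":
        """Record a selection; validate row groups against the cached footer."""
        n = self.metadata["num_row_groups"]
        if row_groups is not None:
            bad = [i for i in row_groups if not 0 <= i < n]
            if bad:
                raise IndexError(f"row groups out of range: {bad}")
            self._row_groups = list(row_groups)
        if columns is not None:
            self._columns = list(columns)
        return self

    def plan(self) -> dict:
        """Return what a subsequent read would fetch (None means all columns)."""
        n = self.metadata["num_row_groups"]
        row_groups = self._row_groups if self._row_groups is not None else list(range(n))
        return {"row_groups": row_groups, "columns": self._columns}
```

With this shape, chained calls such as reader.select(row_groups=[0, 2]).select(columns=[1]) touch the footer only on first access, which is the property the comment above is asking for.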
This makes the signature read_parquet(filename, rgs=None, columns=None), where the optional inputs are integer lists. To be useful, the user needs to first know how many row-groups there are (this is easy) and how the columns they want map onto schema indices (this is hard for the nested case).
I note that loading refuses (correctly) in the case that you have a MAP type and you only specify the keys or values but not both.
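To illustrate why the nested case is awkward: Parquet numbers its columns by leaf, in depth-first schema order, so one top-level name can correspond to several integer indices. The sketch below is pure Python and uses a made-up nested-dict representation of a schema (None marks a primitive leaf); leaf_paths and leaf_indices are hypothetical helpers, not functions from arro3 or pyarrow.

```python
from typing import Iterator, List, Optional, Tuple


def leaf_paths(schema: dict, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield the path of every leaf column in depth-first schema order."""
    for name, child in schema.items():
        path = prefix + (name,)
        if child is None:
            yield path  # primitive leaf
        else:
            yield from leaf_paths(child, path)  # group: descend


def leaf_indices(schema: dict, column: str) -> List[int]:
    """Return the leaf indices covered by a dotted column name."""
    wanted = tuple(column.split("."))
    return [
        i
        for i, path in enumerate(leaf_paths(schema))
        if path[: len(wanted)] == wanted
    ]


# Illustrative schema (an assumption, not taken from any real file):
# a flat id, a MAP column, and a struct column.
schema = {
    "id": None,
    "tags": {"key_value": {"key": None, "value": None}},
    "bbox": {"xmin": None, "ymin": None},
}

print(leaf_indices(schema, "tags"))       # both MAP leaves: key and value
print(leaf_indices(schema, "bbox.ymin"))  # a single struct field
```

Selecting "tags" by name expands to both the key and the value leaf, which is the combination the loader insists on for a MAP column.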
I also found that "filename" must be a real file, not a directory, which I found surprising, since parquet datasets are usually multi-file. Of course dask et al already have code for extracting the potentially many files of a dataset, first constructing an aggregate schema including any partitioning keys. I suppose this paragraph belongs in #195.
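For reference, the file-discovery half of that is small. The following is a rough standard-library sketch, assuming the dataset is a directory tree of *.parquet files with hive-style key=value partition directories; expand_dataset and partition_keys are hypothetical names, and this does not attempt the aggregate-schema step.

```python
from pathlib import Path
from typing import Dict, List, Union

PathLike = Union[str, Path]


def expand_dataset(path: PathLike) -> List[Path]:
    """Return the data files behind a path: the file itself, or every
    *.parquet file found under a directory, in sorted order."""
    root = Path(path)
    if root.is_file():
        return [root]
    if not root.is_dir():
        raise FileNotFoundError(root)
    return sorted(p for p in root.rglob("*.parquet") if p.is_file())


def partition_keys(root: PathLike, file: PathLike) -> Dict[str, str]:
    """Extract hive-style key=value pairs from the directories between
    the dataset root and the file (assumes that naming convention)."""
    parts = Path(file).relative_to(root).parent.parts
    return dict(part.split("=", 1) for part in parts if "=" in part)
```

Each returned file could then be passed to read_parquet one at a time, with the partition values attached as extra columns by the caller.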