
TODOs to get zarr arrow in a reasonable, usable state #21

Open
maximedion2 opened this issue Jun 30, 2024 · 2 comments

Comments

@maximedion2
Collaborator

maximedion2 commented Jun 30, 2024

This will be a list of TODOs for the overall project of writing a query engine for Zarr files (and eventually other raster formats... maybe). I'm going to split the overall project into 3 phases, numbered 0, 1 and 2. Each TODO on the list will eventually be assigned an issue with more details and a PR for the implementation.

@maximedion2
Collaborator Author

maximedion2 commented Jun 30, 2024

Phase 0:
This phase is about implementing the foundation for a query engine that shamelessly leverages two facts: 1) Zarr is a heavily chunked-up storage format, and 2) raster data typically includes arrays that represent some sort of coordinates, with most queries involving filtering on those coordinates. As I'm making this list, I already have the basics implemented; what's left is:

  • Use io_uring for reading local files. Issue for the task.
  • Implement reading 1D arrays and broadcasting them to 2D or 3D arrays to minimize I/O. Issue for the task.
  • Fix bug with chunk alignment and sharding. Issue for the task.
  • Implement a proper non-blocking async reader. Issue for the task.
  • Implement metrics for filter push down optimization.
  • Somehow test queries against something other than local files, e.g. AWS S3.
  • Investigate the performance profile of the whole process, in particular why reading the data seemingly only takes a small fraction of the time it takes for a simple "select *" query (what is taking up most of the time?!). Issue for the task.
  • Implement a cache mechanism based on spatial indices to improve the performance of some filter push downs.
  • Clean up/refactor code after I implement everything else. The overall design and goals have changed a bit since I started writing the code.
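
The 1D-broadcast item above can be sketched in a few lines. This is a minimal illustration of the idea (the function name and shapes are hypothetical, not the crate's actual API): a 1D coordinate array is read once and expanded into a 2D row-major buffer in memory, so the coordinate data never has to be stored or fetched fully materialized.

```rust
/// Broadcast a 1D array to a row-major 2D buffer, either so that each value
/// fills an entire row (`values.len() x other_dim`) or so that the whole
/// array is repeated as every row (`other_dim x values.len()`).
fn broadcast_1d(values: &[f64], other_dim: usize, along_rows: bool) -> Vec<f64> {
    if along_rows {
        // each value is repeated `other_dim` times to fill one row
        values
            .iter()
            .flat_map(|&v| std::iter::repeat(v).take(other_dim))
            .collect()
    } else {
        // the whole 1D array becomes one row, tiled `other_dim` times
        (0..other_dim).flat_map(|_| values.iter().copied()).collect()
    }
}

fn main() {
    let lats = [10.0, 20.0];
    // 2 x 3 grid where each latitude fills one row
    let grid = broadcast_1d(&lats, 3, true);
    assert_eq!(grid, vec![10.0, 10.0, 10.0, 20.0, 20.0, 20.0]);
    println!("{:?}", grid);
}
```

The point of doing this at read time, rather than storing 2D coordinate arrays, is that the I/O cost stays proportional to the 1D array's length.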

Phase 1:
This phase will be about implementing a more generic version of the query engine that can support various raster formats. The broad steps will be:

  • Define a trait with all the methods needed for a "raster reader", which file/store wrappers will need to implement.
  • Implement a "raster reader".
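
As a rough sketch of what that trait could look like (all names here are hypothetical, nothing in the crate is defined yet), the engine would only depend on a handful of methods that any format-specific wrapper can provide:

```rust
/// A chunk of raster data decoded into a flat buffer plus its shape.
struct RasterChunk {
    shape: Vec<usize>,
    data: Vec<f64>,
}

/// What a file/store wrapper would implement so the generic query engine
/// can drive it, regardless of the underlying format (Zarr, etc.).
trait RasterReader {
    /// Names of the variables (arrays) available in the store.
    fn variables(&self) -> Vec<String>;
    /// Shape of one variable, e.g. [time, y, x].
    fn shape(&self, variable: &str) -> Vec<usize>;
    /// Read one chunk of one variable by its chunk index.
    fn read_chunk(&self, variable: &str, chunk_index: &[usize]) -> RasterChunk;
}

/// Toy in-memory store with a single 2x2 variable, just to show the trait in use.
struct InMemoryStore;

impl RasterReader for InMemoryStore {
    fn variables(&self) -> Vec<String> {
        vec!["temperature".to_string()]
    }
    fn shape(&self, _variable: &str) -> Vec<usize> {
        vec![2, 2]
    }
    fn read_chunk(&self, _variable: &str, _chunk_index: &[usize]) -> RasterChunk {
        RasterChunk { shape: vec![2, 2], data: vec![1.0, 2.0, 3.0, 4.0] }
    }
}

fn main() {
    let store = InMemoryStore;
    let chunk = store.read_chunk("temperature", &[0, 0]);
    assert_eq!(chunk.shape, vec![2, 2]);
    assert_eq!(chunk.data.len(), 4);
    println!("read {} values", chunk.data.len());
}
```

In practice the real trait would presumably be async and return Arrow arrays rather than plain `Vec<f64>`, but the chunk-oriented shape of the interface is the part that carries over across formats.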

Phase 2:
This phase will be about implementing efficient geospatial queries that will work off of WKT strings. Realistically, I'm not going to implement a completely new data type in DataFusion; I will have to rely on passing strings to geospatial functions, or on transforming data (like the 2 floats for a point) into a string that can then be passed to geospatial functions. The steps would be:

  • Make sure that the geo Rust library supports spatial indexing and operations on multiple geometries at once; if not, see if I can help out the project and implement it.
  • Implement within and intersect operations.
  • Implement distance and intersection operations.
  • Implement more "exotic" geospatial operations.
  • Implement a hypothetical "nearest" operation to allow for a "join nearest...".
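
The "transform data into a string" idea above is simple enough to sketch directly (the function name is hypothetical): two floats for a point get encoded as a WKT string, which is what would then be handed to the geospatial functions.

```rust
/// Encode an (x, y) pair as a WKT POINT string, so that geospatial
/// functions operating on WKT can consume it without a native geometry type.
fn point_to_wkt(x: f64, y: f64) -> String {
    format!("POINT({} {})", x, y)
}

fn main() {
    let wkt = point_to_wkt(-73.56, 45.5);
    assert_eq!(wkt, "POINT(-73.56 45.5)");
    println!("{}", wkt);
}
```

The cost of this approach is a parse/format round trip on every call, which is part of why relying on strings rather than a native DataFusion geometry type is framed as a pragmatic compromise here.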

@maximedion2
Collaborator Author

@tshauck feel free to add anything here of course.
