I created an issue on the official aws-sdk-rust repo to add support for Glue. I believe they prioritize based on 👍 reactions, so if you're interested, please upvote there when you get the chance.
I suppose this would mean using rusoto in the meantime. I'm unsure what would happen if we tried to use rusoto for Glue and aws-sdk-rust for S3 (via datafusion-objectstore-s3). If there turns out to be any incompatibility, perhaps we could add a rusoto feature to datafusion-objectstore-s3 that could be used until official support lands.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This has been discussed in various places, #907 and datafusion-contrib/datafusion-objectstore-s3#53 to name a few, so creating an issue for visibility.
Describe the solution you'd like
I would propose creating a new datafusion-contrib crate, perhaps datafusion-catalog-glue, which communicates with an AWS Glue Data Catalog.
I'll leave the exact design for whoever picks this up, but I might expect something along the following lines.
Create a GlueCatalog with an optional catalog ID
Provide an async fn GlueCatalog::list_databases(&self) -> Vec<String> to list the databases
Provide an async fn GlueCatalog::get_database(&self, name: &str) -> Result<GlueDatabase> to get a database
Implement SchemaProvider for GlueDatabase
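To make the outline above a bit more concrete, here is a minimal Rust sketch of the API shape. It is an illustration only, not a proposed implementation: GlueError, GlueTable, the with_* builder methods and the in-memory maps standing in for Glue API calls are all hypothetical. The methods are synchronous so the sketch runs without an async runtime, whereas the outline has them as async fn. The table_names and table methods stand in for the lookups a SchemaProvider implementation would serve.

```rust
use std::collections::BTreeMap;

/// Hypothetical error type; the outline only says `Result<GlueDatabase>`.
#[derive(Debug, PartialEq)]
pub enum GlueError {
    DatabaseNotFound(String),
}

/// Hypothetical table metadata: where the data lives and how it is partitioned.
#[derive(Debug, Clone, PartialEq)]
pub struct GlueTable {
    pub location: String,
    pub partition_keys: Vec<String>,
}

/// One Glue database. A real crate would implement DataFusion's
/// `SchemaProvider` for this type; the two lookups below stand in for that.
#[derive(Debug, Clone, PartialEq)]
pub struct GlueDatabase {
    pub name: String,
    tables: BTreeMap<String, GlueTable>,
}

impl GlueDatabase {
    pub fn new(name: &str) -> Self {
        GlueDatabase { name: name.to_string(), tables: BTreeMap::new() }
    }

    /// Hypothetical builder used to populate the in-memory stand-in.
    pub fn with_table(mut self, name: &str, table: GlueTable) -> Self {
        self.tables.insert(name.to_string(), table);
        self
    }

    /// Table names in sorted order.
    pub fn table_names(&self) -> Vec<String> {
        self.tables.keys().cloned().collect()
    }

    pub fn table(&self, name: &str) -> Option<&GlueTable> {
        self.tables.get(name)
    }
}

/// Catalog handle with an optional catalog ID, as in the outline.
pub struct GlueCatalog {
    catalog_id: Option<String>,
    // Assumption: an in-memory map replaces real calls to the Glue API.
    databases: BTreeMap<String, GlueDatabase>,
}

impl GlueCatalog {
    pub fn new(catalog_id: Option<String>) -> Self {
        GlueCatalog { catalog_id, databases: BTreeMap::new() }
    }

    /// Hypothetical builder used to populate the in-memory stand-in.
    pub fn with_database(mut self, db: GlueDatabase) -> Self {
        self.databases.insert(db.name.clone(), db);
        self
    }

    pub fn catalog_id(&self) -> Option<&str> {
        self.catalog_id.as_deref()
    }

    /// `async fn` in the outline; synchronous here so no runtime is needed.
    pub fn list_databases(&self) -> Vec<String> {
        self.databases.keys().cloned().collect()
    }

    /// `async fn` in the outline; synchronous here so no runtime is needed.
    pub fn get_database(&self, name: &str) -> Result<GlueDatabase, GlueError> {
        self.databases
            .get(name)
            .cloned()
            .ok_or_else(|| GlueError::DatabaseNotFound(name.to_string()))
    }
}
```

A caller would build a GlueCatalog, call get_database to obtain a GlueDatabase, and then register that database with the query engine as a schema. How registration would work is left open here, as it is in the outline.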
I think it should be possible to reuse the FileScanConfig structure used by ListingTable to simplify implementation of the TableProvider.
Describe alternatives you've considered
We could choose not to support AWS Glue.
Additional context
This will help with datafusion-contrib/datafusion-objectstore-s3#53 by alleviating the need to infer the schema from the files on every query, and only listing files in non-pruned partitions.
This may need to depend on https://github.com/datafusion-contrib/datafusion-objectstore-s3 as I think it will still need to list S3 in order to get the files within a given partition.
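A rough sketch of the pruning idea, under some assumptions: if the catalog already knows each partition's column values and storage location, a query filter can be applied to that metadata first, so only the surviving locations need to be listed in S3. The Partition type, its fields and the prefixes_to_list function are hypothetical names for illustration, and only a single equality filter is handled.

```rust
use std::collections::HashMap;

/// Hypothetical view of one catalog partition: its column values
/// and the storage prefix that holds its files.
#[derive(Debug, Clone, PartialEq)]
pub struct Partition {
    pub values: HashMap<String, String>,
    pub location: String,
}

impl Partition {
    pub fn new(values: &[(&str, &str)], location: &str) -> Self {
        Partition {
            values: values
                .iter()
                .map(|(k, v)| (k.to_string(), v.to_string()))
                .collect(),
            location: location.to_string(),
        }
    }
}

/// Returns the locations of partitions whose `column` equals `wanted`.
/// Partitions that lack the column, or hold another value, are pruned,
/// so their files never need to be listed.
pub fn prefixes_to_list(partitions: &[Partition], column: &str, wanted: &str) -> Vec<String> {
    partitions
        .iter()
        .filter(|p| p.values.get(column).map(String::as_str) == Some(wanted))
        .map(|p| p.location.clone())
        .collect()
}
```

The returned prefixes are what would then be handed to the object store for listing.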
The Glue API is not the snappiest of things, so a future extension might be to cache the metadata returned, as is done by the Java client.
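One way such caching might look is a small time-to-live cache in front of the metadata calls. This is a sketch under assumptions, not a description of how the Java client does it: MetadataCache, get_or_fetch and the TTL policy are hypothetical, and the current time is passed in explicitly so the behaviour is easy to test.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical TTL cache for catalog metadata, keyed by a string
/// such as "database.table".
pub struct MetadataCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> MetadataCache<V> {
    pub fn new(ttl: Duration) -> Self {
        MetadataCache { ttl, entries: HashMap::new() }
    }

    /// Returns the cached value if it was stored less than `ttl` before `now`;
    /// otherwise calls `fetch` (standing in for a Glue API call), stores the
    /// result with timestamp `now`, and returns it.
    pub fn get_or_fetch<F>(&mut self, key: &str, now: Instant, fetch: F) -> V
    where
        F: FnOnce() -> V,
    {
        if let Some((stored_at, value)) = self.entries.get(key) {
            if now.duration_since(*stored_at) < self.ttl {
                return value.clone();
            }
        }
        let value = fetch();
        self.entries.insert(key.to_string(), (now, value.clone()));
        value
    }
}
```

A real implementation would also need to decide how to invalidate entries when the catalog changes. That is not addressed here.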