[datafusion-contrib] AWS Glue Integration #2206

Open
tustvold opened this issue Apr 12, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

@tustvold
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

This has been discussed in various places (#907 and datafusion-contrib/datafusion-objectstore-s3#53, to name a few), so I am creating an issue for visibility.

Describe the solution you'd like

I would propose creating a new datafusion-contrib crate, perhaps datafusion-catalog-glue, which communicates with an AWS Glue Data Catalog.

I'll leave the exact design for whoever picks this up, but I might expect something along the following lines.

  • Create a GlueCatalog with an optional catalog ID
  • Provide an async fn GlueCatalog::list_databases(&self) -> Vec<String> to list the databases
  • Provide an async fn GlueCatalog::get_database(&self, name: &str) -> Result<GlueDatabase> to get a database
  • Implement SchemaProvider for GlueDatabase

I think it should be possible to reuse the FileScanConfig structure used by ListingTable to simplify implementation of the TableProvider.
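
To make the list above a bit more concrete, here is a rough, std-only sketch of that API surface. It is only an illustration under some loud assumptions: the in-memory map stands in for real Glue calls, `GlueError`, `table_names` and `block_on` are hypothetical names that exist only for this example, and `SchemaProvider` is deliberately not implemented because its exact signature depends on the DataFusion version.

```rust
use std::collections::BTreeMap;
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical error type; a real crate would more likely reuse DataFusion's.
#[derive(Debug, PartialEq)]
pub enum GlueError {
    DatabaseNotFound(String),
}

pub type Result<T> = std::result::Result<T, GlueError>;

/// Assumption: an in-memory map (database -> table names) replaces Glue calls.
pub struct GlueCatalog {
    catalog_id: Option<String>,
    databases: BTreeMap<String, Vec<String>>,
}

pub struct GlueDatabase {
    name: String,
    tables: Vec<String>,
}

impl GlueCatalog {
    /// Create a catalog with an optional catalog ID.
    pub fn new(catalog_id: Option<String>, databases: BTreeMap<String, Vec<String>>) -> Self {
        Self { catalog_id, databases }
    }

    pub fn catalog_id(&self) -> Option<&str> {
        self.catalog_id.as_deref()
    }

    /// List the database names (sorted, because the stand-in is a BTreeMap).
    pub async fn list_databases(&self) -> Vec<String> {
        self.databases.keys().cloned().collect()
    }

    /// Look up a single database by name.
    pub async fn get_database(&self, name: &str) -> Result<GlueDatabase> {
        match self.databases.get(name) {
            Some(tables) => Ok(GlueDatabase { name: name.to_string(), tables: tables.clone() }),
            None => Err(GlueError::DatabaseNotFound(name.to_string())),
        }
    }
}

impl GlueDatabase {
    pub fn name(&self) -> &str {
        &self.name
    }

    /// Stand-in for the table lookup a SchemaProvider impl would expose.
    pub fn table_names(&self) -> Vec<String> {
        self.tables.clone()
    }
}

/// Waker that does nothing; enough for futures that never actually wait.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal executor for this example only; real code would use e.g. tokio.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = std::pin::pin!(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

With something like this, `block_on(catalog.get_database("sales"))` returns a `GlueDatabase` whose tables can then be registered with DataFusion, and an unknown name surfaces as an error rather than an empty database.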

Describe alternatives you've considered

We could not support AWS Glue.

Additional context

This will help with datafusion-contrib/datafusion-objectstore-s3#53 by removing the need to infer the schema from the files on every query, and by listing only the files in non-pruned partitions.

This may need to depend on https://github.com/datafusion-contrib/datafusion-objectstore-s3 as I think it will still need to list S3 in order to get the files within a given partition.
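
As a rough illustration of the "only list non-pruned partitions" idea: once the catalog has returned the partitions, equality filters on partition columns can select the storage prefixes that still need an S3 listing. The `Partition` struct and `prune_partitions` function below are hypothetical and greatly simplified (string values, equality only); this is not how Glue or DataFusion represent partitions.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified view of a partition as recorded in a catalog.
#[derive(Debug, Clone, PartialEq)]
pub struct Partition {
    /// Partition column name -> value, e.g. "year" -> "2022".
    pub values: HashMap<String, String>,
    /// Storage prefix that holds this partition's files.
    pub location: String,
}

/// Return the locations of the partitions that satisfy every equality filter.
/// Only these prefixes would then need to be listed in the object store.
pub fn prune_partitions<'a>(
    partitions: &'a [Partition],
    filters: &[(&str, &str)],
) -> Vec<&'a str> {
    partitions
        .iter()
        .filter(|p| {
            filters
                .iter()
                .all(|(col, want)| p.values.get(*col).map(String::as_str) == Some(*want))
        })
        .map(|p| p.location.as_str())
        .collect()
}
```

An empty filter list keeps every partition, so the worst case degrades to listing everything, which is the behaviour without a catalog.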

The Glue API is not the snappiest of things, so a future extension might be to cache the metadata returned, as is done by the Java client.
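
One possible shape for such a cache is a small time-to-live map in front of the metadata calls. This is only a sketch: `TtlCache` and `get_or_fetch` are hypothetical names, the `fetch` closure stands in for the actual Glue request, and it makes no claim about how the Java client does its caching.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical time-to-live cache for catalog metadata, keyed by name.
pub struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Return the cached value for `key` while it is younger than the TTL;
    /// otherwise call `fetch` (the stand-in for a Glue request), store the
    /// result with the current time, and return it.
    pub fn get_or_fetch<F>(&mut self, key: &str, fetch: F) -> V
    where
        F: FnOnce() -> V,
    {
        if let Some((stored_at, value)) = self.entries.get(key) {
            if stored_at.elapsed() < self.ttl {
                return value.clone();
            }
        }
        let value = fetch();
        self.entries.insert(key.to_string(), (Instant::now(), value.clone()));
        value
    }
}
```

The trade-off is staleness: a table altered in Glue will not be seen until its entry expires, so the TTL would probably need to be configurable.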

@matthewmturner
Contributor

I created an issue on the official aws-sdk-rust repo to add support for Glue. I believe they prioritize based on 👍 reactions, so anyone interested in this, please upvote there if you get the chance.

I suppose this would mean using rusoto in the meantime. I'm unsure what it would mean if we tried to use rusoto for Glue and aws-sdk-rust for S3 (via datafusion-objectstore-s3). If needed, for example because of some incompatibility, perhaps we could add a rusoto feature to datafusion-objectstore-s3 that could be used until official support lands.

@timvw
Contributor

timvw commented May 3, 2022

I have some sample code which uses the aws-sdk-glue client, iterates over all databases and tables, and registers them in a MemoryCatalogProvider:

https://gist.github.com/timvw/84246389d9c79fc0bf07570c625fdaf4
