Feat: Athena adapter #3154

Merged: 3 commits into main on Sep 23, 2024
Conversation

@erindru (Collaborator) commented Sep 19, 2024:

Initial implementation of an Athena adapter. Addresses #1315

@georgesittas (Contributor) commented:

Released v25.22.0, so you should be able to upgrade and get the CI working now.

```yaml
branches:
  only:
    - main
    #- snowflake
```
@erindru (Collaborator, Author) commented:

Note: I'll uncomment these immediately prior to merging. It's helpful to be able to run the Athena integration tests for this PR.

@erindru force-pushed the erin/athena-adapter branch 5 times, most recently from 88e2e48 to 3110b64, on September 19, 2024 at 23:15
```python
exp.select(
    exp.case()
    .when(
        # 'awsdatacatalog' is the default catalog that is invisible for all intents and purposes
```
A Member commented:

This doesn't quite explain why we set catalog to NULL if the actual value is awsdatacatalog. What does "invisible" actually mean here?

@erindru (Collaborator, Author) replied:

It's because the integration test test_get_data_objects expects that if it passes a schema like test_schema_x (as opposed to a catalog-schema combo like test_catalog.test_schema_x) to get_data_objects(), the resulting data objects should have None set on the catalog property.

I'll amend the comment
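
(For context, a rough sketch of the kind of expression under discussion, using sqlglot's builder API; the column and table names below are illustrative, not the PR's exact code:)

```python
from sqlglot import exp

# Map the implicit default catalog to NULL so get_data_objects() reports catalog=None
# when the caller only passed a schema. Column/table names are illustrative.
catalog_expr = (
    exp.case()
    .when(exp.column("table_catalog").eq("awsdatacatalog"), exp.null())
    .else_(exp.column("table_catalog"))
    .as_("catalog")
)

query = exp.select(catalog_expr, "table_schema", "table_name").from_("information_schema.tables")
print(query.sql(dialect="athena"))
```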


```python
is_hive = self._table_type(table_properties) == "hive"

# Filter any PARTITIONED BY properties from the main column list since they can't be specified in both places
```
A Member commented:

good old hive
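
(A minimal sketch of the filtering being referred to, with hypothetical names. Hive-style DDL rejects a column that appears both in the main column list and in PARTITIONED BY, so partition columns are dropped from the former:)

```python
from sqlglot import exp

# Hypothetical inputs: the model's columns and the user's PARTITIONED BY expressions.
columns_to_types = {
    "id": exp.DataType.build("int"),
    "ds": exp.DataType.build("varchar"),
}
partitioned_by = [exp.to_column("ds")]

# Drop partition columns from the main column list; they are emitted only in PARTITIONED BY.
partition_names = {col.name for col in partitioned_by}
main_columns = {
    name: dtype for name, dtype in columns_to_types.items() if name not in partition_names
}
# main_columns now contains only "id"; "ds" appears solely in the PARTITIONED BY clause.
```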

```python
    Use the user-specified table_properties to figure out if this is a Hive or an Iceberg table
    """
    # if table_type is not defined or is not set to "iceberg", this is a Hive table
    if table_properties and (table_type := table_properties.get("table_type", None)):
```
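
(For context, a simplified sketch of the check this hunk belongs to; the helper name comes from the snippet above, but the body is illustrative and uses plain strings rather than expression objects:)

```python
from typing import Dict, Optional


def _table_type(table_properties: Optional[Dict[str, str]] = None) -> str:
    # Anything other than an explicit table_type=iceberg is treated as a Hive table.
    if table_properties and (table_type := table_properties.get("table_type", None)):
        if str(table_type).lower() == "iceberg":
            return "iceberg"
    return "hive"
```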
A Member commented:

Why are we not using storage_format for this instead?

@izeigerman (Member) commented Sep 20, 2024:

Basically any value that is not iceberg should be treated as hive.

@erindru (Collaborator, Author) replied Sep 20, 2024:

I thought about storage_format but decided not to use it because it describes a different concept. Both Hive and Iceberg tables support different storage formats.

For example, a Hive table can be STORED AS PARQUET or STORED AS ORC or, if you really don't like your colleagues, STORED AS TEXTFILE.

Same for Iceberg, the internal format can be set to parquet or orc or whatever the engine supports.

So storage_format=hive / storage_format=iceberg doesn't make sense because they're table formats that can encompass a particular storage format.

We don't have a top-level table_format property and I didn't want to add one just for Athena.

A Member commented:

FWIW, when used with Spark, iceberg is provided through storage_format because the SparkSQL syntax looks like:

CREATE TABLE ... USING [iceberg|parquet|etc]

Should we be consistent?

@erindru (Collaborator, Author) replied:

If you did that, how would you specify "Hive + Parquet", "Hive + ORC", "Iceberg + Parquet", "Iceberg + ORC" etc?

The table format is independent of the storage format.

storage_format is currently defined in the docs as:

"Storage format is a property for engines such as Spark or Hive that support storage formats such as parquet and orc."

which I think makes sense. It refers to the actual format of the files on disk, e.g. Parquet or ORC, not the type of table that is managing those files.

The Hive syntax is:

CREATE TABLE x (i int) STORED BY ICEBERG STORED AS ORC;

which differentiates STORED BY (table format) from STORED AS (storage format). STORED AS ICEBERG is invalid.

I'm not super familiar with Spark, but it looks like USING triggers a "Data Source", and then if Iceberg is available on the classpath you can USING iceberg and set the storage format in TBLPROPERTIES (or in the Iceberg catalog config). So it definitely blurs these concepts compared to other engines.

A Member commented:

Honestly, I haven't seen anyone using Iceberg with anything other than Parquet, so the values parquet and orc would automatically assume Hive, while iceberg would just assume Iceberg with Parquet, as per the CREATE TABLE syntax in Spark: https://iceberg.apache.org/docs/nightly/spark-ddl/#create-table

A Member commented:

IMHO, I wouldn't worry too much about ORC, at least in the context of Iceberg.

@erindru (Collaborator, Author) replied:

I think the bigger picture here is that table formats and storage formats are two separate concepts and we should not conflate the two.

We also shouldn't just assume that people will only ever want to use Parquet and hardcode it; isn't the point of physical_properties to expose features of the underlying database system?

I can already imagine someone with an established data lake that chose ORC for whatever reason and has downstream consumers that expect Iceberg/ORC tables, and now they can't easily use SQLMesh because SQLMesh will only create Iceberg/Parquet tables.

We also need to add dbt support, and even dbt clearly separates these concepts.

@nicor88 commented Sep 25, 2024:

@erindru @izeigerman a few comments from my side about this, after maintaining dbt-athena for 2 years.

People want to be in control of the Iceberg tables that they use. What @izeigerman raised:

"I haven't seen anyone using iceberg with anything other than parquet"

is correct, but as a user I want to be in control of the data format used by Iceberg: there might be teams that want to use ORC and others that want to use Parquet, but the user must be in control and have the possibility to decide.

This also holds for table properties (the full list is available here). Certain properties can be templated of course, but for other properties, like vacuum_min_snapshots_to_keep (just to mention one), the final user must be in control.

@erindru (Collaborator, Author) replied:

Thanks Nicola, appreciate your insight!

```python
# To make a CTAS expression persist as iceberg, alongside setting `table_type=iceberg` (which the user has already
# supplied in physical_properties and is thus set above), you also need to set:
# - is_external=false
# - table_location='s3://<path>'
```
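
(Illustratively, expressed as sqlglot property expressions; the exact construction in the PR may differ and the S3 path is a placeholder:)

```python
from sqlglot import exp

# The extra properties a CTAS needs for Athena to persist the result as an Iceberg table.
ctas_properties = [
    exp.Property(this=exp.var("table_type"), value=exp.Literal.string("iceberg")),
    exp.Property(this=exp.var("is_external"), value=exp.false()),
    exp.Property(this=exp.var("table_location"), value=exp.Literal.string("s3://bucket/prefix/")),
]
```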
A Member commented:

Should we ensure somehow that the location has already been set at this point?

@erindru (Collaborator, Author) replied:

Oh nice catch!

@erindru (Collaborator, Author) added Sep 20, 2024:

Actually, the location will already be set a few lines up if the user supplied it (or if s3_warehouse_location was set in the config). The original idea was that if it wasn't set at all, Athena could figure out what to do.

But I've just done some tests and, unlike Trino, it looks like Athena will not automatically generate table locations for you even if the schema the table is in was created with a location set.

I created a schema using CREATE SCHEMA foo LOCATION 's3://path' and then tried to create both Hive and Iceberg tables in that schema without setting a location explicitly. Both times it failed with an error asking to set the location.

So I'll tighten this up and throw an error if SQLMesh can't figure out the table location.
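
(A minimal sketch of that tightened behaviour, with hypothetical names; the PR's actual implementation and error type may differ:)

```python
import posixpath
from typing import Optional


def _resolve_table_location(
    table_name: str,
    explicit_location: Optional[str],
    s3_warehouse_location: Optional[str],
) -> str:
    # Prefer the location the user supplied; otherwise derive one from the configured
    # warehouse location. Athena won't pick a location for us, so fail loudly if neither is set.
    if explicit_location:
        return explicit_location
    if s3_warehouse_location:
        return posixpath.join(s3_warehouse_location, table_name)
    raise ValueError(
        f"Cannot determine an S3 location for table '{table_name}'. "
        "Supply one in physical_properties or set s3_warehouse_location in the connection config."
    )
```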

@erindru force-pushed the erin/athena-adapter branch 5 times, most recently from 87048d1 to d4ebd63, on September 22, 2024 at 23:55
@erindru merged commit 700b679 into main on Sep 23, 2024
23 checks passed
@erindru deleted the erin/athena-adapter branch on September 23, 2024 at 21:32
@georgesittas mentioned this pull request on Sep 25, 2024