Feat: Athena adapter #3154

Merged: 3 commits into main on Sep 23, 2024
Conversation

@erindru (Collaborator) commented Sep 19, 2024:

Initial implementation of an Athena adapter. Addresses #1315

@georgesittas (Contributor) commented:

Released v25.22.0, so you should be able to upgrade and get the CI working now.

```yaml
branches:
  only:
    - main
    #- snowflake
```
@erindru (Collaborator, Author) commented:

Note: I'll uncomment these immediately prior to merging. It's helpful to be able to run the Athena integration tests for this PR.

@erindru force-pushed the erin/athena-adapter branch 5 times, most recently from 88e2e48 to 3110b64, on September 19, 2024 at 23:15
```python
exp.select(
    exp.case()
    .when(
        # 'awsdatacatalog' is the default catalog that is invisible for all intents and purposes
```
A Member commented:

This doesn't quite explain why we set catalog to NULL if the actual value is awsdatacatalog. What does "invisible" actually mean here?

@erindru (Collaborator, Author) replied:

It's because the integration test test_get_data_objects expects that if it passes a schema like test_schema_x (as opposed to a catalog-schema combo like test_catalog.test_schema_x) to get_data_objects(), the resulting data objects should have None set on the catalog property.

I'll amend the comment
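
(For context, a rough sketch of the kind of expression under discussion, using sqlglot's builder API; the column and table names below are illustrative, not the PR's exact code:)

```python
from sqlglot import exp

# Map the implicit default catalog to NULL so get_data_objects() reports catalog=None
# when the caller only passed a schema. Column/table names are illustrative.
catalog_expr = (
    exp.case()
    .when(exp.column("table_catalog").eq("awsdatacatalog"), exp.null())
    .else_(exp.column("table_catalog"))
    .as_("catalog")
)

query = exp.select(catalog_expr, "table_schema", "table_name").from_("information_schema.tables")
print(query.sql(dialect="athena"))
```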


```python
is_hive = self._table_type(table_properties) == "hive"

# Filter any PARTITIONED BY properties from the main column list since they can't be specified in both places
```
A Member commented:

good old hive
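
(A minimal sketch of the filtering being referred to, with hypothetical names. Hive-style DDL rejects a column that appears both in the main column list and in PARTITIONED BY, so partition columns are dropped from the former:)

```python
from sqlglot import exp

# Hypothetical inputs: the model's columns and the user's PARTITIONED BY expressions.
columns_to_types = {
    "id": exp.DataType.build("int"),
    "ds": exp.DataType.build("varchar"),
}
partitioned_by = [exp.to_column("ds")]

# Drop partition columns from the main column list; they are emitted only in PARTITIONED BY.
partition_names = {col.name for col in partitioned_by}
main_columns = {
    name: dtype for name, dtype in columns_to_types.items() if name not in partition_names
}
# main_columns now contains only "id"; "ds" appears solely in the PARTITIONED BY clause.
```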

```python
    Use the user-specified table_properties to figure out if this is a Hive or an Iceberg table
    """
    # if table_type is not defined or is not set to "iceberg", this is a Hive table
    if table_properties and (table_type := table_properties.get("table_type", None)):
```
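
(For context, a simplified sketch of the check this hunk belongs to; the helper name comes from the snippet above, but the body is illustrative and uses plain strings rather than expression objects:)

```python
from typing import Dict, Optional


def _table_type(table_properties: Optional[Dict[str, str]] = None) -> str:
    # Anything other than an explicit table_type=iceberg is treated as a Hive table.
    if table_properties and (table_type := table_properties.get("table_type", None)):
        if str(table_type).lower() == "iceberg":
            return "iceberg"
    return "hive"
```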
A Member commented:

Why are we not using storage_format for this instead?

@izeigerman (Member) commented Sep 20, 2024:

Basically any value that is not iceberg should be treated as hive.

@erindru (Collaborator, Author) replied Sep 20, 2024:

I thought about storage_format but decided not to use it because it describes a different concept. Both Hive and Iceberg tables support different storage formats.

For example, a Hive table can be STORED AS PARQUET or STORED AS ORC or, if you really don't like your colleagues, STORED AS TEXTFILE.

Same for Iceberg, the internal format can be set to parquet or orc or whatever the engine supports.

So storage_format=hive / storage_format=iceberg doesn't make sense because they're table formats that can encompass a particular storage format.

We don't have a top-level table_format property and I didn't want to add one just for Athena.

A Member commented:

FWIW, when used with Spark, iceberg is provided through storage_format because the SparkSQL syntax looks like:

CREATE TABLE ... USING [iceberg|parquet|etc]

Should we be consistent?

@erindru (Collaborator, Author) replied:

If you did that, how would you specify "Hive + Parquet", "Hive + ORC", "Iceberg + Parquet", "Iceberg + ORC" etc?

The table format is independent of the storage format.

storage_format is currently defined in the docs as:

"Storage format is a property for engines such as Spark or Hive that support storage formats such as parquet and orc."

which I think makes sense. It refers to the actual format of the files on disk, e.g. Parquet or ORC, not the type of table that is managing those files.

The Hive syntax is:

CREATE TABLE x (i int) STORED BY ICEBERG STORED AS ORC;

which differentiates STORED BY (table format) from STORED AS (storage format). STORED AS ICEBERG is invalid.

I'm not super familiar with Spark, but it looks like USING triggers a "Data Source", and then if Iceberg is available on the classpath you can USING iceberg and set the storage format in TBLPROPERTIES (or in the Iceberg catalog config). So it definitely blurs these concepts compared to other engines.

A Member commented:

Honestly, I haven't seen anyone using Iceberg with anything other than Parquet, so the values parquet and orc would automatically assume Hive, while iceberg would just assume Iceberg with Parquet, as per the CREATE TABLE syntax in Spark: https://iceberg.apache.org/docs/nightly/spark-ddl/#create-table

A Member commented:

IMHO, I wouldn't worry too much about ORC, at least in the context of Iceberg.

@erindru (Collaborator, Author) replied:

I think the bigger picture here is that table formats and storage formats are two separate concepts and we should not conflate the two.

We also shouldn't just assume that people will only ever want to use Parquet and hardcode it; isn't the point of physical_properties to expose features of the underlying database system?

I can already imagine someone with an established data lake that chose ORC for whatever reason and has downstream consumers that expect Iceberg/ORC tables, and now they can't easily use SQLMesh because SQLMesh will only create Iceberg/Parquet tables.

We also need to add dbt support, and even dbt clearly separates these concepts.

@nicor88 commented Sep 25, 2024:

@erindru @izeigerman a few comments from my side about this, after maintaining dbt-athena for 2 years.

People want to be in control of the Iceberg tables that they use. What @izeigerman raised:

"I haven't seen anyone using iceberg with anything other than parquet"

is correct, but as a user I want to be in control of the data format used by Iceberg: there might be teams that want to use ORC and others that want to use Parquet, but the user must be in control and have the possibility to decide.

This also holds for table properties (the full list is available here). Certain properties can be templated of course, but for other properties, like vacuum_min_snapshots_to_keep (just to mention one), the final user must be in control.

@erindru (Collaborator, Author) replied:

Thanks Nicola, appreciate your insight!

```python
# To make a CTAS expression persist as iceberg, alongside setting `table_type=iceberg` (which the user has already
# supplied in physical_properties and is thus set above), you also need to set:
# - is_external=false
# - table_location='s3://<path>'
```
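
(Illustratively, expressed as sqlglot property expressions; the exact construction in the PR may differ and the S3 path is a placeholder:)

```python
from sqlglot import exp

# The extra properties a CTAS needs for Athena to persist the result as an Iceberg table.
ctas_properties = [
    exp.Property(this=exp.var("table_type"), value=exp.Literal.string("iceberg")),
    exp.Property(this=exp.var("is_external"), value=exp.false()),
    exp.Property(this=exp.var("table_location"), value=exp.Literal.string("s3://bucket/prefix/")),
]
```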
A Member commented:

Should we ensure somehow that the location has already been set at this point?

@erindru (Collaborator, Author) replied:

Oh nice catch!

@erindru (Collaborator, Author) added Sep 20, 2024:

Actually, the location will already be set a few lines up if the user supplied it (or if s3_warehouse_location was set in the config). The original idea was that if it wasn't set at all, Athena could figure out what to do.

But I've just done some tests and, unlike Trino, it looks like Athena will not automatically generate table locations for you even if the schema the table is in was created with a location set.

I created a schema using CREATE SCHEMA foo LOCATION 's3://path' and then tried to create both Hive and Iceberg tables in that schema without setting a location explicitly. Both times it failed with an error asking to set the location.

So I'll tighten this up and throw an error if SQLMesh can't figure out the table location.
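
(A minimal sketch of that tightened behaviour, with hypothetical names; the PR's actual implementation and error type may differ:)

```python
import posixpath
from typing import Optional


def _resolve_table_location(
    table_name: str,
    explicit_location: Optional[str],
    s3_warehouse_location: Optional[str],
) -> str:
    # Prefer the location the user supplied; otherwise derive one from the configured
    # warehouse location. Athena won't pick a location for us, so fail loudly if neither is set.
    if explicit_location:
        return explicit_location
    if s3_warehouse_location:
        return posixpath.join(s3_warehouse_location, table_name)
    raise ValueError(
        f"Cannot determine an S3 location for table '{table_name}'. "
        "Supply one in physical_properties or set s3_warehouse_location in the connection config."
    )
```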

@erindru force-pushed the erin/athena-adapter branch 5 times, most recently from 87048d1 to d4ebd63, on September 22, 2024 at 23:55
@erindru merged commit 700b679 into main on Sep 23, 2024
23 checks passed
@erindru deleted the erin/athena-adapter branch on September 23, 2024 at 21:32
@georgesittas mentioned this pull request on Sep 25, 2024