Feat: Athena adapter #3154
Conversation
Released v25.22.0, should be able to upgrade and get the CI working now.
Force-pushed from 4f8a160 to 10f91ba
.circleci/continue_config.yml (outdated):

    branches:
      only:
        - main
        #- snowflake
Note: I'll uncomment these immediately prior to merging. It's helpful to be able to run the Athena integration tests for this PR.
Force-pushed from 88e2e48 to 3110b64
    exp.select(
        exp.case()
        .when(
            # 'awsdatacatalog' is the default catalog that is invisible for all intents and purposes
This doesn't quite explain why we set the catalog to NULL if the actual value is awsdatacatalog. What does "invisible" actually mean here?
It's because the integration test test_get_data_objects expects that if it passes a schema like test_schema_x (as opposed to a catalog-schema combo like test_catalog.test_schema_x) to get_data_objects(), the resulting data objects should have None set on the catalog property.

I'll amend the comment.
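A minimal sketch of the behaviour described above (the helper name is hypothetical, not the adapter's actual code): Athena's implicit default catalog, awsdatacatalog, is normalized away so that data objects resolved by bare schema name carry catalog=None.

```python
from typing import Optional


def normalize_catalog(catalog: Optional[str]) -> Optional[str]:
    # 'awsdatacatalog' is Athena's implicit default catalog, so treat it
    # as if no catalog was specified at all.
    return None if catalog == "awsdatacatalog" else catalog
```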
    is_hive = self._table_type(table_properties) == "hive"

    # Filter any PARTITIONED BY properties from the main column list since they can't be specified in both places
good old hive
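For context, a hedged sketch of the filtering the comment above refers to (the function name is hypothetical): Hive DDL declares partition columns only in the PARTITIONED BY clause, so they have to be removed from the main column list before building the CREATE TABLE statement.

```python
from typing import Dict, List, Tuple


def split_partition_columns(
    columns: Dict[str, str], partitioned_by: List[str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    # Columns named in PARTITIONED BY go only into the partition clause;
    # everything else stays in the main column list.
    main = {name: typ for name, typ in columns.items() if name not in partitioned_by}
    partitions = {name: typ for name, typ in columns.items() if name in partitioned_by}
    return main, partitions
```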
    Use the user-specified table_properties to figure out if this is a Hive or an Iceberg table
    """
    # if table_type is not defined or is not set to "iceberg", this is a Hive table
    if table_properties and (table_type := table_properties.get("table_type", None)):
Why are we not using storage_format for this instead?
Basically, any value that is not iceberg should be treated as hive.
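The classification rule stated above can be sketched as follows (a simplified illustration, not the adapter's actual `_table_type` implementation): anything other than table_type=iceberg, including a missing table_type, is treated as a Hive table.

```python
from typing import Mapping, Optional


def table_type(table_properties: Optional[Mapping[str, str]]) -> str:
    # Only an explicit table_type of "iceberg" produces an Iceberg table;
    # any other value, or no table_properties at all, means Hive.
    if table_properties and str(table_properties.get("table_type", "")).lower() == "iceberg":
        return "iceberg"
    return "hive"
```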
I thought about storage_format but decided not to use it because it describes a different concept. Both Hive and Iceberg tables support different storage formats.

For example, a Hive table can be STORED AS PARQUET or STORED AS ORC, or, if you really don't like your colleagues, STORED AS TEXTFILE. Same for Iceberg: the internal format can be set to parquet or orc or whatever the engine supports.

So storage_format=hive / storage_format=iceberg doesn't make sense because they're table formats that can encompass a particular storage format. We don't have a top-level table_format property and I didn't want to add one just for Athena.
FWIW, when used with Spark, iceberg is provided through storage_format because the SparkSQL syntax looks like:

    CREATE TABLE ... USING [iceberg|parquet|etc]

Should we be consistent?
If you did that, how would you specify "Hive + Parquet", "Hive + ORC", "Iceberg + Parquet", "Iceberg + ORC", etc? The table format is independent of the storage format.

storage_format is currently defined in the docs as:

    Storage format is a property for engines such as Spark or Hive that support storage formats such as parquet and orc.

which I think makes sense. It refers to the actual format of the files on disk, e.g. Parquet or ORC, not the type of table that is managing those files.

The Hive syntax is:

    CREATE TABLE x (i int) STORED BY ICEBERG STORED AS ORC;

which differentiates STORED BY (table format) and STORED AS (storage format). STORED AS ICEBERG is invalid.

I'm not super familiar with Spark, but it looks like USING triggers a "Data Source", and then if Iceberg is available on the classpath you can USING iceberg and set the storage format in TBLPROPERTIES (or in the Iceberg catalog config). So it definitely blurs these concepts compared to other engines.
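To illustrate the orthogonality argument, here is a hedged sketch (the function name and exact clause rendering are assumptions for illustration, not the adapter's code) of how independent table and storage formats could each map to their own part of the DDL:

```python
def ddl_clauses(table_format: str, storage_format: str) -> str:
    # Iceberg tables carry the file format inside table properties,
    # while Hive tables use the STORED AS clause directly.
    if table_format == "iceberg":
        return f"TBLPROPERTIES ('table_type'='ICEBERG', 'format'='{storage_format.upper()}')"
    return f"STORED AS {storage_format.upper()}"
```

With this shape, "Hive + ORC" and "Iceberg + ORC" are both expressible, which a single merged storage_format value could not distinguish.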
Honestly, I haven't seen anyone using Iceberg with anything other than Parquet. So the values parquet and orc automatically assume Hive, while iceberg just assumes Iceberg with Parquet, as per the CREATE TABLE syntax in Spark: https://iceberg.apache.org/docs/nightly/spark-ddl/#create-table
IMHO, I wouldn't worry too much about ORC, at least in the context of Iceberg.
I think the bigger picture here is that table formats and storage formats are two separate concepts and we should not conflate the two.

We also shouldn't just assume that people will only ever want to use Parquet and hardcode it; isn't the point of physical_properties to expose features of the underlying database system? I can already imagine someone with an established data lake that chose ORC for whatever reason and has downstream consumers that expect Iceberg/ORC tables, and now they can't easily use SQLMesh because SQLMesh will only create Iceberg/Parquet tables.

We also need to add DBT support, and even DBT clearly separates these concepts.
@erindru @izeigerman a few comments from my side about this, after maintaining dbt-athena for 2 years.

People want to be in control of the Iceberg tables that they use. What @izeigerman raised ("I haven't seen anyone using iceberg with anything other than parquet") is correct, but as a user I want to be in control of the data format used by Iceberg. Some teams might want to use ORC, others Parquet; the user must be in control and have the possibility to decide.

This also applies to table properties; the full list is available here. Certain properties can be templated, of course, but for some others, like vacuum_min_snapshots_to_keep (just to mention one), the final user must be in control.
Thanks Nicola, appreciate your insight!
    # To make a CTAS expression persist as Iceberg, alongside setting `table_type=iceberg` (which the user has already
    # supplied in physical_properties and is thus set above), you also need to set:
    # - is_external=false
    # - table_location='s3://<path>'
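A hedged sketch of the CTAS requirements listed in the comment above (the function name and exact clause rendering are assumptions; the properties themselves come from the comment):

```python
def iceberg_ctas(table: str, location: str, select_sql: str) -> str:
    props = ", ".join(
        [
            "table_type = 'ICEBERG'",    # user-supplied via physical_properties
            "is_external = false",       # required for the CTAS result to persist as Iceberg
            f"location = '{location}'",  # explicit S3 table location
        ]
    )
    return f"CREATE TABLE {table} WITH ({props}) AS {select_sql}"
```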
Should we ensure somehow that the location has already been set at this point?
Oh nice catch!
Actually, the location will already be set a few lines up if the user supplied it (or s3_warehouse_location was set in the config). The original idea was that if it wasn't set at all, Athena could figure out what to do.

But I've just done some tests and, unlike Trino, it looks like Athena will not automatically generate table locations for you even if the schema the table is in was created with a location set. I created a schema using CREATE SCHEMA foo LOCATION 's3://path' and then tried to create both Hive and Iceberg tables in that schema without setting a location explicitly. Both times it failed with an error asking me to set the location.

So I'll tighten this up and throw an error if SQLMesh can't figure out the table location.
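The tightened behaviour described above could look roughly like this (a sketch under assumed names; the error message and fallback-joining logic are illustrative, not the adapter's actual code):

```python
from typing import Optional


def resolve_table_location(
    table: str,
    explicit_location: Optional[str] = None,
    s3_warehouse_location: Optional[str] = None,
) -> str:
    # Prefer an explicit per-model location, then fall back to the
    # configured warehouse root; fail fast if neither is available,
    # since Athena will not generate a location on its own.
    if explicit_location:
        return explicit_location
    if s3_warehouse_location:
        return f"{s3_warehouse_location.rstrip('/')}/{table}"
    raise ValueError(
        f"Cannot determine an S3 location for table '{table}'; "
        "set a location on the model or s3_warehouse_location in the connection config."
    )
```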
Force-pushed from 87048d1 to d4ebd63, then from d4ebd63 to 0154813
Initial implementation of an Athena adapter. Addresses #1315