Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat!: Add new table_format property alongside storage_format #3175

Merged
merged 3 commits into from
Sep 25, 2024

Conversation

erindru
Copy link
Collaborator

@erindru erindru commented Sep 25, 2024

Context: #3154 (comment)

Currently, the storage_format property is a bit overloaded. It's defined as setting the on-disk file format (eg parquet, orc) which makes sense but it's also been overloaded in various places to include table formats such as iceberg.

This PR introduces a table_format property to clearly separate table formats from storage formats.

The problem with the current setup of a single property is that it makes it difficult to describe concepts like "Iceberg + ORC", "Iceberg + Parquet", "Hive + Ion" etc.

Since orc and parquet are already valid values for storage_format, table_format is a new property to hold the table format (eg hive or iceberg). To keep backwards compatibility, it's opt-in, so engine adapters need to explicitly take advantage of it.

Implementing this is a precursor for compatibility with models created using dbt-athena which cleanly separates these concepts.

It also means that any users that may have gone "all in" on a format like Iceberg/ORC can use SQLMesh to create tables that are compatible with their existing downstream consumers. Currently, SQLMesh assumes Parquet when storage_format=iceberg

"table_type": exp.Literal.string("iceberg")
},
# it's painfully slow, but it works
table_format="iceberg",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Promoting it to a top-level property makes it explicit and consistent across adapters

# if we cant detect any indication of Iceberg, this is a Hive table
if table_properties and (table_type := table_properties.get("table_type", None)):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This became a lot simpler, we now dont need to try and parse the table type out of a physical_properties entry

@@ -149,9 +149,9 @@ def test_create_table_iceberg(adapter: AthenaEngineAdapter) -> None:
name test_table,
kind FULL,
partitioned_by (colc, bucket(16, cola)),
table_format iceberg,
storage_format parquet,
physical_properties (
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ this is what it looks like inside a model definition

@erindru erindru force-pushed the erin/table-format branch 2 times, most recently from 52724ca to 78e63b4 Compare September 25, 2024 01:28
properties.append(
exp.Property(this="write.format.default", value=exp.Literal.string(storage_format))
)
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

elif storage_format?

branches:
only:
- main
#filters:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to self: Uncomment these prior to merging

@erindru erindru merged commit 8996278 into main Sep 25, 2024
23 checks passed
@erindru erindru deleted the erin/table-format branch September 25, 2024 23:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants