Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: Managed models #2809

Merged
merged 3 commits into from
Jul 1, 2024
Merged

Feat: Managed models #2809

merged 3 commits into from
Jul 1, 2024

Conversation

erindru
Copy link
Collaborator

@erindru erindru commented Jun 24, 2024

Initial implementation of Managed models. This defines the general semantics and contains an implementation for Snowflake, based on Dynamic Tables.

As a side effect, this PR also fixes issue 1049.

Follow-up PR's:

  • Support for other engines
  • Cleanup / compaction support, essentially the ability for something like sqlmesh clean to drop any objects that aren't referenced in a specific environment. The assumption up until now is that storage is cheap so keeping old snapshots around for a while isn't a problem. However, costs could add up quickly with managed models because users would be paying for models to be refreshed that aren't actually being referenced

@erindru erindru marked this pull request as draft June 24, 2024 00:58
else None
)
return exp.Create(
this=table_name_or_schema,
kind="TABLE",
kind=table_kind or "TABLE",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that I could leverage all the existing CREATE TABLE machinery for Snowflake if I was just able to change CREATE TABLE to CREATE DYNAMIC TABLE.

So I made the kind a parameter rather than being hardcoded to "TABLE"

@tobymao
Copy link
Contributor

tobymao commented Jun 24, 2024

not sure i like engine managed.

what about dynamic, or live, or auto

@erindru erindru force-pushed the erin/managed-tables branch 2 times, most recently from 9d5a107 to 2767e41 Compare June 24, 2024 02:04
@erindru
Copy link
Collaborator Author

erindru commented Jun 24, 2024

not sure i like engine managed.

what about dynamic, or live, or auto

Dynamic = Snowflake terminology
Live = Databricks terminology

I'm not against those names per se, but I did think we were trying to remain vendor neutral.

Auto = could work? "Automatic Models"?

The other way to do this could be to drop any pretense of a generic implementation have a Kind per Engine. That way, we can name them accordingly and also "strongly type" the physical_properties and make them part of the Kind instead of arbitrary key-value pairs, which would allow easier validation via pydantic.

For example:

MODEL (
   name foo,
   kind SNOWFLAKE_DYNAMIC_TABLE (
     warehouse "small",
     target_lag '20 minutes'
   ),
  ...

The downside is that engine-specific Kind's sort of defeat the purpose, every other Kind defines something generic and the engine-specific parts are implemented in the adapters

@tobymao
Copy link
Contributor

tobymao commented Jun 24, 2024

maybe auto then

# vendors providing managed tables (such as Snowflake's dynamic tables) tend to charge more for them, they arent as cheap as normal tables
if is_table_deployable:
logger.info("Creating managed table: %s", table_name)
self.adapter.create_engine_managed_table(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we defer the creation of the deployable table until the change is promoted to production?

@erindru erindru changed the title Feat: Engine-managed models Feat: Managed models Jun 25, 2024
@@ -1661,6 +1663,93 @@ def get_custom_materialization_type(name: str) -> t.Type[CustomMaterialization]:
return strategy_type


class EngineManagedStrategy(PromotableStrategy):
Copy link
Collaborator Author

@erindru erindru Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, so, I think I have figured out what to do here.

On create(), because the DeployabilityIndex always returns False to allow creating self-referential models, we just create the dev table and carry on. The call to create the prod table is a no-op because we dont yet know if it will be deployable.

On insert(), the DeployabilityIndex works as expected. If the current snapshot is deployable, we call create_managed_table. Otherwise, we replace the dev table with the preview data. We only do this for the "first insert" because managed models are not incremental, any subsequent calls will be a no-op.

This approach means that promotion works as expected with no need for a table creation side-effect and we dont create unnecessary managed tables if the plan is forward-only / the snapshot is not deployable

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds really good 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this won't work if the "MANAGED" model snapshot IS deployable and has downstream dependencies that are part of the same plan. The downstream models will attempt to create their own tables referencing the deployable version of the "MANAGED" model and it will be missing at that point.

Copy link
Member

@izeigerman izeigerman Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you probably need to do is to check whether the snapshot is deployable before we modify the deployability index to make the current snapshot non-deployable. Eg. add something like this:

is_snapshot_deployable = deployability_index.is_deployable(snapshot)

after this line: https://github.com/TobikoData/sqlmesh/blob/main/sqlmesh/core/snapshot/evaluator.py#L640

And then pass the flag into the create method of the strategy.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course 😞 , I was so focused on upstream models and if they were forward-only or not that I completely forgot about downstream models.

I've implemented your suggestion to pass the snapshot deployability to create() since we cant rely on the DeployabilityIndex there.

This changes the logic to:

  • on create(), if both the table and snapshot are deployable, we create a managed table. Otherwise, we fallback to the superclass logic which just creates a normal table
  • on insert(), we populate the table only if the snapshot is NOT deployable (which would have resulted in a normal table being created in create()). If the snapshot IS deployable, then it's a no-op because the data would have been populated in create().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's now an extra case added to insert(). If a snapshot that was not deployable at creation (and thus would have been created as a normal table) suddenly becomes deployable - then we "upgrade" it to a managed table by dropping it and recreating it as a managed table

@georgesittas
Copy link
Contributor

FYI in case it's somehow relevant to this PR: tobymao/sqlglot@60fa5e3.

@erindru
Copy link
Collaborator Author

erindru commented Jun 25, 2024

FYI in case it's somehow relevant to this PR: tobymao/sqlglot@60fa5e3.

Ahh, thanks, good to know that exists now.

I think it's more relevant in a future PR. Using it now would involve modifying a bunch of method signatures (for example, drop_table would need to be modified to take table properties and the current mechanism of specifying table properties elsewhere assumes key-value pairs instead of individual properties for anything that isnt "top level")

@erindru erindru marked this pull request as ready for review June 25, 2024 22:35
@@ -1071,6 +1081,7 @@ def create(
table_name: str,
model: Model,
is_table_deployable: bool,
is_snapshot_deployable: bool,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not make it a part of the public API to avoid making it more complex for users willing to implement a custom strategy. Can we hide this in kwargs?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, it's been banished back to kwargs

snapshot: Snapshot = kwargs["snapshot"]
is_snapshot_deployable = deployability_index.is_deployable(snapshot)

if is_first_insert:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if is_first_insert and not is_snapshot_deployable:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, why do we do this only on the first insert? Shouldn't we always update the dev table in dev?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My lack of knowledge around exactly when and how insert() is called multiple times.

Managed models are not incremental. The main model would only be created once if it was deployable, so I figured we should only populate the dev preview table once to match?

If the user changes something, isnt that a new snapshot / new table?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there is no harm is doing it for all insert()'s, i've adjusted the logic

if not is_snapshot_deployable:
# Snapshot isnt deployable; update the preview table instead
# If the snapshot was deployable, then data would have already been loaded in create() because a managed table would have been created
logger.info("Updating preview table: %s", table_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's clarify that the preview table is related to a managed model specifically.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

) -> None:
# Not entirely true, many engines support modifying some of the metadata fields on a managed table
# eg Snowflake allows you to ALTER DYNAMIC TABLE foo SET WAREHOUSE=my_other_wh;
raise ConfigError("Cannot mutate a managed table")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add the target table name to the message

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

f"Table {table_name} exists but isnt being returned in the metadata. Cannot determine if it's a managed table or not"
)

if table_metadata[0].type != DataObjectType.MANAGED_TABLE:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where the "forward-only in a virtual environment that might eventually get deployed" edge case is resolved.

I added an integration test test_managed_model_upstream_forward_only to exercise this code

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand how this scenario can happen.

Copy link
Collaborator Author

@erindru erindru Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To summarise: it was due to unecessarily creating the "main" table next to the dev table, when we only needed to create the dev table. This has now been adjusted

@@ -90,6 +90,7 @@ class EngineAdapter:
SUPPORTS_MATERIALIZED_VIEWS = False
SUPPORTS_MATERIALIZED_VIEW_SCHEMA = False
SUPPORTS_CLONING = False
SUPPORTS_MANAGED_MODELS = False
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this should be SUPPORTS_MANAGED_TABLES instead. It's mainly just used to tell pytest to skip the managed model integration tests for adapters that dont support managed models

@erindru erindru force-pushed the erin/managed-tables branch 2 times, most recently from fb07bfa to 379153f Compare June 30, 2024 22:10
@erindru erindru requested a review from izeigerman June 30, 2024 22:20
target_lag = '2 minutes',
data_retention_time_in_days = 2
)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
)
);

column_descriptions: t.Optional[t.Dict[str, str]] = None,
**kwargs: t.Any,
) -> None:
"""Create an managed table using a query.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Create an managed table using a query.
"""Create a managed table using a query.

)

self._create_table_from_source_queries(
quote_identifiers(target_table),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we quoting here? Note that by default _to_sql quotes the identifiers in the expression.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I saw it used elsewhere in a similar situation. Monkey see, monkey do :)

But you're right, it appears to be unnecessary. I've removed it

@@ -2242,3 +2293,199 @@ def _mutate_config(current_gateway_name: str, config: Config):

finally:
ctx.cleanup(context)


def test_managed_model_upstream_forward_only(ctx: TestContext):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I liked how detailed this test is, thanks for providing context 👍

render_kwargs: t.Dict[str, t.Any],
**kwargs: t.Any,
) -> None:
is_snapshot_deployable: bool = render_kwargs["is_snapshot_deployable"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be kwargs, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, since the merge that broke out render_kwargs and kwargs into two separate things. It looks like there is a missing test case, i'll add one

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this was in the render_kwargs, but i'll move it

snapshot: Snapshot = kwargs["snapshot"]
is_snapshot_deployable = deployability_index.is_deployable(snapshot)

if is_first_insert and is_snapshot_deployable and self.adapter.table_exists(table_name):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we handle the case when the deployable (prod) table doesn't exist?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic has been adjusted to create the managed table if it doesnt exist, rather than upgrading an existing non-managed table that didnt need to be created in the first place

table_description=model.description,
column_descriptions=model.column_descriptions,
)
else:
Copy link
Member

@izeigerman izeigerman Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else not is_table_deployable:

@@ -652,6 +652,8 @@ def _create_snapshot(
**common_render_kwargs,
# Refers to self as non-deployable to successfully create self-referential tables / views.
deployability_index=deployability_index.with_non_deployable(snapshot),
# However, it can still be useful to know if the snapshot was actually deployable
is_snapshot_deployable=deployability_index.is_deployable(snapshot),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deployability index is already in the render kwargs which is why I put is_snapshot_deployable right next to it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to kwargs

@@ -755,6 +765,15 @@ def to_expression(
)


class ManagedKind(_ModelKind):
name: Literal[ModelKindName.MANAGED] = ModelKindName.MANAGED
disable_restatement: SQLGlotBool = True
Copy link
Member

@izeigerman izeigerman Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make it t.Literal[True] = True so that the value can't be overridden by a user.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or make it a property method which returns True


- Creating a Virtual Environment creates a pointer to the current model snapshot
- Modifying the model causes a new snapshot to be created
- Breaking upstream changes cause a new snapshot to be created
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit] Any upstream changes cause a new snapshot to be created

Copy link
Member

@izeigerman izeigerman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more smaller comments. LGTM otherwise!

@erindru erindru merged commit 606a9b4 into main Jul 1, 2024
15 checks passed
@erindru erindru deleted the erin/managed-tables branch July 1, 2024 23:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support cluster_by in snowflake engine adapter
4 participants