-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: Managed models #2809
Feat: Managed models #2809
Conversation
else None | ||
) | ||
return exp.Create( | ||
this=table_name_or_schema, | ||
kind="TABLE", | ||
kind=table_kind or "TABLE", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that I could leverage all the existing CREATE TABLE
machinery for Snowflake if I was just able to change CREATE TABLE
to CREATE DYNAMIC TABLE
.
So I made the kind
a parameter rather than being hardcoded to "TABLE"
not sure i like engine managed. what about dynamic, or live, or auto |
9d5a107
to
2767e41
Compare
Dynamic = Snowflake terminology I'm not against those names per se, but I did think we were trying to remain vendor neutral. Auto = could work? "Automatic Models"? The other way to do this could be to drop any pretense of a generic implementation have a Kind per Engine. That way, we can name them accordingly and also "strongly type" the For example:
The downside is that engine-specific Kind's sort of defeat the purpose, every other Kind defines something generic and the engine-specific parts are implemented in the adapters |
maybe auto then |
sqlmesh/core/snapshot/evaluator.py
Outdated
# vendors providing managed tables (such as Snowflake's dynamic tables) tend to charge more for them, they arent as cheap as normal tables | ||
if is_table_deployable: | ||
logger.info("Creating managed table: %s", table_name) | ||
self.adapter.create_engine_managed_table( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we defer the creation of the deployable table until the change is promoted to production?
2767e41
to
9c9e7f8
Compare
sqlmesh/core/snapshot/evaluator.py
Outdated
@@ -1661,6 +1663,93 @@ def get_custom_materialization_type(name: str) -> t.Type[CustomMaterialization]: | |||
return strategy_type | |||
|
|||
|
|||
class EngineManagedStrategy(PromotableStrategy): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, so, I think I have figured out what to do here.
On create()
, because the DeployabilityIndex always returns False to allow creating self-referential models, we just create the dev table and carry on. The call to create the prod table is a no-op because we dont yet know if it will be deployable.
On insert()
, the DeployabilityIndex works as expected. If the current snapshot is deployable, we call create_managed_table
. Otherwise, we replace the dev table with the preview data. We only do this for the "first insert" because managed models are not incremental, any subsequent calls will be a no-op.
This approach means that promotion works as expected with no need for a table creation side-effect and we dont create unnecessary managed tables if the plan is forward-only / the snapshot is not deployable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds really good 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this won't work if the "MANAGED" model snapshot IS deployable and has downstream dependencies that are part of the same plan. The downstream models will attempt to create their own tables referencing the deployable version of the "MANAGED" model and it will be missing at that point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What you probably need to do is to check whether the snapshot is deployable before we modify the deployability index to make the current snapshot non-deployable. Eg. add something like this:
is_snapshot_deployable = deployability_index.is_deployable(snapshot)
after this line: https://github.com/TobikoData/sqlmesh/blob/main/sqlmesh/core/snapshot/evaluator.py#L640
And then pass the flag into the create
method of the strategy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course 😞 , I was so focused on upstream models and if they were forward-only or not that I completely forgot about downstream models.
I've implemented your suggestion to pass the snapshot deployability to create()
since we cant rely on the DeployabilityIndex there.
This changes the logic to:
- on
create()
, if both the table and snapshot are deployable, we create a managed table. Otherwise, we fallback to the superclass logic which just creates a normal table - on
insert()
, we populate the table only if the snapshot is NOT deployable (which would have resulted in a normal table being created increate()
). If the snapshot IS deployable, then it's a no-op because the data would have been populated increate()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's now an extra case added to insert()
. If a snapshot that was not deployable at creation (and thus would have been created as a normal table) suddenly becomes deployable - then we "upgrade" it to a managed table by dropping it and recreating it as a managed table
FYI in case it's somehow relevant to this PR: tobymao/sqlglot@60fa5e3. |
b69f2f1
to
d9753c4
Compare
Ahh, thanks, good to know that exists now. I think it's more relevant in a future PR. Using it now would involve modifying a bunch of method signatures (for example, |
sqlmesh/core/snapshot/evaluator.py
Outdated
@@ -1071,6 +1081,7 @@ def create( | |||
table_name: str, | |||
model: Model, | |||
is_table_deployable: bool, | |||
is_snapshot_deployable: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not make it a part of the public API to avoid making it more complex for users willing to implement a custom strategy. Can we hide this in kwargs
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, it's been banished back to kwargs
sqlmesh/core/snapshot/evaluator.py
Outdated
snapshot: Snapshot = kwargs["snapshot"] | ||
is_snapshot_deployable = deployability_index.is_deployable(snapshot) | ||
|
||
if is_first_insert: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if is_first_insert and not is_snapshot_deployable:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, why do we do this only on the first insert? Shouldn't we always update the dev table in dev?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My lack of knowledge around exactly when and how insert()
is called multiple times.
Managed models are not incremental. The main model would only be created once if it was deployable, so I figured we should only populate the dev preview table once to match?
If the user changes something, isnt that a new snapshot / new table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess there is no harm is doing it for all insert()
's, i've adjusted the logic
sqlmesh/core/snapshot/evaluator.py
Outdated
if not is_snapshot_deployable: | ||
# Snapshot isnt deployable; update the preview table instead | ||
# If the snapshot was deployable, then data would have already been loaded in create() because a managed table would have been created | ||
logger.info("Updating preview table: %s", table_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's clarify that the preview table is related to a managed model specifically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
sqlmesh/core/snapshot/evaluator.py
Outdated
) -> None: | ||
# Not entirely true, many engines support modifying some of the metadata fields on a managed table | ||
# eg Snowflake allows you to ALTER DYNAMIC TABLE foo SET WAREHOUSE=my_other_wh; | ||
raise ConfigError("Cannot mutate a managed table") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add the target table name to the message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
502b67c
to
0bed0c8
Compare
sqlmesh/core/snapshot/evaluator.py
Outdated
f"Table {table_name} exists but isnt being returned in the metadata. Cannot determine if it's a managed table or not" | ||
) | ||
|
||
if table_metadata[0].type != DataObjectType.MANAGED_TABLE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where the "forward-only in a virtual environment that might eventually get deployed" edge case is resolved.
I added an integration test test_managed_model_upstream_forward_only
to exercise this code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand how this scenario can happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To summarise: it was due to unecessarily creating the "main" table next to the dev table, when we only needed to create the dev table. This has now been adjusted
@@ -90,6 +90,7 @@ class EngineAdapter: | |||
SUPPORTS_MATERIALIZED_VIEWS = False | |||
SUPPORTS_MATERIALIZED_VIEW_SCHEMA = False | |||
SUPPORTS_CLONING = False | |||
SUPPORTS_MANAGED_MODELS = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if this should be SUPPORTS_MANAGED_TABLES
instead. It's mainly just used to tell pytest to skip the managed model integration tests for adapters that dont support managed models
fb07bfa
to
379153f
Compare
target_lag = '2 minutes', | ||
data_retention_time_in_days = 2 | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
) | |
); |
sqlmesh/core/engine_adapter/base.py
Outdated
column_descriptions: t.Optional[t.Dict[str, str]] = None, | ||
**kwargs: t.Any, | ||
) -> None: | ||
"""Create an managed table using a query. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Create an managed table using a query. | |
"""Create a managed table using a query. |
) | ||
|
||
self._create_table_from_source_queries( | ||
quote_identifiers(target_table), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we quoting here? Note that by default _to_sql
quotes the identifiers in the expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I saw it used elsewhere in a similar situation. Monkey see, monkey do :)
But you're right, it appears to be unnecessary. I've removed it
@@ -2242,3 +2293,199 @@ def _mutate_config(current_gateway_name: str, config: Config): | |||
|
|||
finally: | |||
ctx.cleanup(context) | |||
|
|||
|
|||
def test_managed_model_upstream_forward_only(ctx: TestContext): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: I liked how detailed this test is, thanks for providing context 👍
sqlmesh/core/snapshot/evaluator.py
Outdated
render_kwargs: t.Dict[str, t.Any], | ||
**kwargs: t.Any, | ||
) -> None: | ||
is_snapshot_deployable: bool = render_kwargs["is_snapshot_deployable"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be kwargs
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, since the merge that broke out render_kwargs
and kwargs
into two separate things. It looks like there is a missing test case, i'll add one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this was in the render_kwargs
, but i'll move it
sqlmesh/core/snapshot/evaluator.py
Outdated
snapshot: Snapshot = kwargs["snapshot"] | ||
is_snapshot_deployable = deployability_index.is_deployable(snapshot) | ||
|
||
if is_first_insert and is_snapshot_deployable and self.adapter.table_exists(table_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When do we handle the case when the deployable (prod) table doesn't exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic has been adjusted to create the managed table if it doesnt exist, rather than upgrading an existing non-managed table that didnt need to be created in the first place
sqlmesh/core/snapshot/evaluator.py
Outdated
table_description=model.description, | ||
column_descriptions=model.column_descriptions, | ||
) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
else not is_table_deployable:
sqlmesh/core/snapshot/evaluator.py
Outdated
@@ -652,6 +652,8 @@ def _create_snapshot( | |||
**common_render_kwargs, | |||
# Refers to self as non-deployable to successfully create self-referential tables / views. | |||
deployability_index=deployability_index.with_non_deployable(snapshot), | |||
# However, it can still be useful to know if the snapshot was actually deployable | |||
is_snapshot_deployable=deployability_index.is_deployable(snapshot), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deployability index is already in the render kwargs which is why I put is_snapshot_deployable
right next to it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to kwargs
4307201
to
cb81b48
Compare
sqlmesh/core/model/kind.py
Outdated
@@ -755,6 +765,15 @@ def to_expression( | |||
) | |||
|
|||
|
|||
class ManagedKind(_ModelKind): | |||
name: Literal[ModelKindName.MANAGED] = ModelKindName.MANAGED | |||
disable_restatement: SQLGlotBool = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make it t.Literal[True] = True
so that the value can't be overridden by a user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or make it a property method which returns True
|
||
- Creating a Virtual Environment creates a pointer to the current model snapshot | ||
- Modifying the model causes a new snapshot to be created | ||
- Breaking upstream changes cause a new snapshot to be created |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Nit] Any upstream changes cause a new snapshot to be created
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more smaller comments. LGTM otherwise!
Initial implementation of Managed models. This defines the general semantics and contains an implementation for Snowflake, based on Dynamic Tables.
As a side effect, this PR also fixes issue 1049.
Follow-up PR's:
sqlmesh clean
to drop any objects that aren't referenced in a specific environment. The assumption up until now is that storage is cheap so keeping old snapshots around for a while isn't a problem. However, costs could add up quickly with managed models because users would be paying for models to be refreshed that aren't actually being referenced