From db3da1c9f8b36ba7b25fccc9dc9856d11cdf98e1 Mon Sep 17 00:00:00 2001 From: Trey Spiller <1831878+treysp@users.noreply.github.com> Date: Mon, 30 Sep 2024 08:09:39 -0500 Subject: [PATCH] Docs: add start/end options to plan concepts doc (#3197) --- docs/concepts/plans.md | 215 ++++++++++++++++++++++++++++++++++++----- 1 file changed, 192 insertions(+), 23 deletions(-) diff --git a/docs/concepts/plans.md b/docs/concepts/plans.md index 1063c9195..882f4873d 100644 --- a/docs/concepts/plans.md +++ b/docs/concepts/plans.md @@ -39,12 +39,16 @@ Choose this option when a change has been made to a model's logic that has a fun ### Non-breaking change A directly-modified model that is classified as non-breaking will be backfilled, but its downstream dependencies will not. -This is a common choice in scenarios such as an addition of a new column, an action which doesn't affect downstream models, as new columns can't be used by downstream models without modifying them directly to select the column. If any downstream models contain a `select *` from the model, SQLMesh attempts to infer breaking status on a best-effort basis. We recommend explicitly specifying a query's columns to avoid unnecessary recomputation. +This is a common choice in scenarios such as an addition of a new column, an action which doesn't affect downstream models, as new columns can't be used by downstream models without modifying them directly to select the column. + +If any downstream models contain a `select *` from the model, SQLMesh attempts to infer breaking status on a best-effort basis. We recommend explicitly specifying a query's columns to avoid unnecessary recomputation. ### Forward-only change A modified (either directly or indirectly) model that is categorized as forward-only will continue to use the existing physical table once the change is deployed to production (the `prod` environment). This means that no backfill will take place. -While iterating on forward-only changes in the development environment, the model's output will be stored in either a temporary table or a shallow clone of the production table if supported by the engine. In either case the data produced this way in the development environment can only be used for preview and will **not** be reused once the change is deployed to production. See [Forward-only Plans](#forward-only-plans) for more details. +While iterating on forward-only changes in the development environment, the model's output will be stored in either a temporary table or a shallow clone of the production table if supported by the engine. + +In either case the data produced this way in the development environment can only be used for preview and will **not** be reused once the change is deployed to production. See [Forward-only Plans](#forward-only-plans) for more details. This category is assigned by SQLMesh automatically either when a user opts into using a [forward-only plan](#forward-only-plans) or when a model is explicitly configured to be forward-only. @@ -68,12 +72,18 @@ When a plan is applied to an environment, the environment gets associated with t *Each model variant gets its own physical table while environments only contain references to these tables.* -This unique approach to understanding and applying changes is what enables SQLMesh's Virtual Environments. This technology allows SQLMesh to ensure complete isolation between environments while allowing it to share physical data assets between environments when appropriate and safe to do so. Additionally, since each model change is captured in a separate physical table, reverting to a previous version becomes a simple and quick operation (refer to [Virtual Update](#virtual-update)) as long as its physical table hasn't been garbage collected by the janitor process. SQLMesh makes it easy to be correct and really hard to accidentally and irreversibly break things. +This unique approach to understanding and applying changes is what enables SQLMesh's Virtual Environments. It allows SQLMesh to ensure complete isolation between environments while allowing it to share physical data assets between environments when appropriate and safe to do so. + +Additionally, since each model change is captured in a separate physical table, reverting to a previous version becomes a simple and quick operation (refer to [Virtual Update](#virtual-update)) as long as its physical table hasn't been garbage collected by the janitor process. + +SQLMesh makes it easy to be correct and really hard to accidentally and irreversibly break things. ### Backfilling -Despite all the benefits, the approach described above is not without trade-offs. When a new model version is just created, a physical table assigned to it is empty. Therefore, SQLMesh needs to re-apply the logic of the new model version to the entire date range of this model in order to populate the new version's physical table. This process is called backfilling. +Despite all the benefits, the approach described above is not without trade-offs. -At the moment, we are using the term backfilling broadly to describe any situation in which a model is updated. That includes these operations: +When a new model version is just created, a physical table assigned to it is empty. Therefore, SQLMesh needs to re-apply the logic of the new model version to the entire date range of this model in order to populate the new version's physical table. This process is called backfilling. + +We use the term backfilling broadly to describe any situation in which a model is updated. That includes these operations: * When a VIEW model is created * When a FULL model is built @@ -81,16 +91,171 @@ At the moment, we are using the term backfilling broadly to describe any situati * When an INCREMENTAL model has recent data appended to it * When an INCREMENTAL model has older data inserted (i.e., resolving a data gap or prepending historical data) -We will be iterating on terminology to better capture the nuances of each type in future versions. +Note for incremental models: despite the fact that backfilling can happen incrementally (see `batch_size` parameter on models), there is an extra cost associated with this operation due to additional runtime involved. If the runtime cost is a concern, use a [forward-only plan](#forward-only-plans) instead. + +### Virtual Update +A benefit of SQLMesh's approach is that data for a new model version can be fully pre-built while still in a development environment. That way all changes and their downstream dependencies can be fully previewed before they are promoted to the production environment. + +With this approach, the process of promoting a change to production is reduced to reference swapping. + +If during plan creation no data gaps have been detected and only references to new model versions need to be updated, then the update is referred to as a Virtual Update. Virtual Updates impose no additional runtime overhead or cost. + +### Start and end dates + +The `plan` command provides two temporal options: `--start` and `--end`. These options are only applicable to plans for non-prod environments. + +For context, every model has a start date. The start can be specified in [the model definition](./models/overview.md#start), in the [project configuration's `model_defaults`](../guides/configuration.md#model-defaults), or by SQLMesh's default value of yesterday. + +Because the prod environment supports business operations, prod plans ensure every model is backfilled from its start date until the most recent completed time interval. Due to that restriction, the `plan` command's `--start` and `--end` options are not supported for regular plans against prod. The options are supported for [restatement plans](#restatement-plans) against prod to allow re-processing a subset of existing data. + +Non-prod plans are typically used for development, so their models can optionally be backfilled for any date range with the `--start` and `--end` options. Limiting the date range makes backfills faster and development more efficient, especially for incremental models using large tables. + +#### Model kind limitations + +Some model kinds do not support backfilling a limited date range. + +For context, SQLMesh strives to make models _idempotent_, meaning that if we ran them multiple times we would get the same correct result every time. + +However, some model kinds are inherently non-idempotent: + +- [INCREMENTAL_BY_UNIQUE_KEY](models/model_kinds.md#incremental_by_unique_key) +- [INCREMENTAL_BY_PARTITION](models/model_kinds.md#incremental_by_partition) +- [SCD_TYPE_2_BY_TIME](models/model_kinds.md#scd-type-2-by-time-recommended) +- [SCD_TYPE_2_BY_COLUMN](models/model_kinds.md#scd-type-2-by-column) +- Any model whose query is self-referential (i.e., the contents of new data rows are affected by the data rows already present in the table) + +Those model kinds will behave as follows in a non-prod plan that specifies a limited date range: + +- If the `--start` option date is the same as or before the model's start date, the model is fully refreshed for all of time +- If the `--start` option date is after the model's start date, the model is ignored by the plan and is not backfilled + +#### Example + +Consider a SQLMesh project with a default start date of 2024-09-20. + +It contains the following `INCREMENTAL_BY_UNIQUE_KEY` model that specifies an explicit start date of 2024-09-23: + +```sql linenums="1" hl_lines="6" +MODEL ( + name sqlmesh_example.start_end_model, + kind INCREMENTAL_BY_UNIQUE_KEY ( + unique_key item_id + ), + start '2024-09-23' +); + +SELECT + item_id, + num_orders +FROM + sqlmesh_example.full_model +``` + +When we run the project's first plan, we see that SQLMesh correctly detected a different start date for our `start_end_model` than the other models (which have the project default start of 2024-09-20): + +```bash linenums="1" hl_lines="17" +❯ sqlmesh plan +====================================================================== +Successfully Ran 1 tests against duckdb +---------------------------------------------------------------------- +New environment `prod` will be created from `prod` +Summary of differences against `prod`: +Models: +└── Added: + ├── sqlmesh_example.full_model + ├── sqlmesh_example.incremental_model + ├── sqlmesh_example.seed_model + └── sqlmesh_example.start_end_model +Models needing backfill (missing dates): +├── sqlmesh_example.full_model: 2024-09-20 - 2024-09-26 +├── sqlmesh_example.incremental_model: 2024-09-20 - 2024-09-26 +├── sqlmesh_example.seed_model: 2024-09-20 - 2024-09-26 +└── sqlmesh_example.start_end_model: 2024-09-23 - 2024-09-26 +Apply - Backfill Tables [y/n]: +``` + +After executing that plan, we add columns to both the `incremental_model` and `start_end_model` queries. -Note for incremental models: despite the fact that backfilling can happen incrementally (see `batch_size` parameter on models), there is an extra cost associated with this operation due to additional runtime involved. If the runtime cost is a concern, a [forward-only plan](#forward-only-plans) can be used instead. +We then execute `sqlmesh plan dev` to create the new `dev` environment: -#### Data preview -As mentioned earlier, the data output produced by [forward-only changes](#forward-only-change) in the development environment can only be used for preview and will not be reused upon deployment to production. +```bash linenums="1" hl_lines="23-26" + +❯ sqlmesh plan dev +====================================================================== +Successfully Ran 1 tests against duckdb +---------------------------------------------------------------------- +New environment `dev` will be created from `prod` +Summary of differences against `dev`: +Models: +├── Directly Modified: +│ ├── sqlmesh_example__dev.start_end_model +│ └── sqlmesh_example__dev.incremental_model +└── Indirectly Modified: + └── sqlmesh_example__dev.full_model + +[...model diff omitted...] + +Directly Modified: sqlmesh_example__dev.incremental_model (Non-breaking) +└── Indirectly Modified Children: + └── sqlmesh_example__dev.full_model (Indirect Non-breaking) + +[...model diff omitted...] + +Directly Modified: sqlmesh_example__dev.start_end_model (Non-breaking) +Models needing backfill (missing dates): +├── sqlmesh_example__dev.incremental_model: 2024-09-20 - 2024-09-26 +└── sqlmesh_example__dev.start_end_model: 2024-09-23 - 2024-09-26 +Enter the backfill start date (eg. '1 year', '2020-01-01') or blank to backfill from the beginning of history: +``` + +Note two things about the output: + +1. As before, SQLMesh displays the complete backfill time range for each model, using the project default start of 2024-09-20 for `incremental_model` and 2024-09-23 for `start_end_model` +2. SQLMesh prompted us for a backfill start date because we didn't pass the `--start` option to the `sqlmesh plan dev` command + +Let's cancel that plan and start a new one, passing a start date of 2024-09-24. + +The `start_end_model` is of kind `INCREMENTAL_BY_UNIQUE_KEY`, which is non-idempotent and cannot be backfilled for a limited time range. + +Because the command's `--start` of 2024-09-24 is after `start_end_model`'s start date 2024-09-23, `start_end_model` is ignored: + +``` bash linenums="1" hl_lines="12-13 20-21" +❯ sqlmesh plan dev --start 2024-09-24 +====================================================================== +Successfully Ran 1 tests against duckdb +---------------------------------------------------------------------- +New environment `dev` will be created from `prod` +Summary of differences against `dev`: +Models: +├── Directly Modified: +│ └── sqlmesh_example__dev.incremental_model +├── Indirectly Modified: +│ └── sqlmesh_example__dev.full_model +└── Ignored Models (Expected Plan Start): + └── sqlmesh_example__dev.start_end_model (2024-09-23 00:00:00+00:00) + +[...model diff omitted...] + +Directly Modified: sqlmesh_example__dev.incremental_model (Non-breaking) +└── Indirectly Modified Children: + └── sqlmesh_example__dev.full_model (Indirect Non-breaking) +Models needing backfill (missing dates): +└── sqlmesh_example__dev.incremental_model: 2024-09-24 - 2024-09-26 +Enter the backfill end date (eg. '1 month ago', '2020-01-01') or blank to backfill up until '2024-09-27 00:00:00': +``` + +The plan output contains a new `Ignored Models` entry telling us that `sqlmesh_example__dev.start_end_model` was ignored. + +It also displays the expected plan start date `2024-09-23`, which is the date our `--start` should include if we want the model to be included in the plan. + +Because it is ignored, `start_end_model` is not in the list of `Models needing backfill` at the bottom. + +### Data preview for forward-only changes +As mentioned earlier, the data output produced by [forward-only changes](#forward-only-change) in a development environment can only be used for preview and will not be reused in production. The same holds true for any subsequent changes that depend on undeployed forward-only changes - data can be previewed but can't be reused in production. -Backfills that are exclusively for preview purposes and will not be reused upon deployment to production are explicitly labeled as such in the plan summary: +Backfills that are exclusively for preview purposes and will not be reused upon deployment to production are explicitly labeled with `(preview)` in the plan summary: ```bash Models needing backfill (missing dates): ├── sushi__dev.customers: 2023-12-22 - 2023-12-28 (preview) @@ -99,19 +264,18 @@ Models needing backfill (missing dates): └── sushi__dev.waiter_as_customer_by_day: 2023-12-22 - 2023-12-28 (preview) ``` -### Virtual Update -Another benefit of the SQLMesh approach is that data for a new model version can be fully pre-built while still in a development environment. That way all changes and their downstream dependencies can be fully previewed before they are promoted to the production environment. - -With this approach, the process of promoting a change to production is reduced to reference swapping. If during plan creation no data gaps have been detected and only references to new model versions need to be updated, then the update is referred to as a Virtual Update. Virtual Updates impose no additional runtime overhead or cost. - ## Forward-only plans Sometimes the runtime cost associated with rebuilding an entire physical table is too high and outweighs the benefits a separate table provides. This is when a forward-only plan comes in handy. -When a forward-only plan is applied to the `prod` environment, none of the plan's changed models will have new physical tables created for them. Instead, physical tables from previous model versions are reused. The benefit of this is that no backfilling is required, so there is no runtime overhead or cost. The drawback is that reverting to a previous version is no longer simple and requires a combination of additional forward-only changes and [restatements](#restatement-plans). +When a forward-only plan is applied to the `prod` environment, none of the plan's changed models will have new physical tables created for them. Instead, physical tables from previous model versions are reused. + +The benefit of this is that no backfilling is required, so there is no runtime overhead or cost. The drawback is that reverting to a previous version is no longer simple and requires a combination of additional forward-only changes and [restatements](#restatement-plans). Note that once a forward-only change is applied to `prod`, all development environments that referred to the previous versions of the updated models will be impacted. -A core component of the development process is to execute code and verify its behavior. To enable this while preserving isolation between environments, `sqlmesh plan [environment name]` evaluates code in non-`prod` environments while targeting shallow (a.k.a. "zero-copy") clones of production tables for engines that support them or newly created temporary physical tables for engines that don't. This means that only a limited preview of changes is available in the development environment before the change is promoted to `prod`. The date range of the preview is provided as part of plan creation command. +A core component of the development process is to execute code and verify its behavior. To enable this while preserving isolation between environments, `sqlmesh plan [environment name]` evaluates code in non-`prod` environments while targeting shallow (a.k.a. "zero-copy") clones of production tables for engines that support them or newly created temporary physical tables for engines that don't. + +This means that only a limited preview of changes is available in the development environment before the change is promoted to `prod`. The date range of the preview is provided as part of plan creation command. Engines for which table cloning is supported include: @@ -126,7 +290,10 @@ To create a forward-only plan, add the `--forward-only` option to the `plan` com sqlmesh plan [environment name] --forward-only ``` -**Note:** The `--forward-only` flag is not required when applying changes to models that have been explicitly configured as [forward-only](models/overview.md#forward_only). Use it only if you need to provide a time range for the preview window or the [effective date](#effective-date). +!!! note + The `--forward-only` flag is not required when applying changes to models that have been explicitly configured as [forward-only](models/overview.md#forward_only). + + Use it only if you need to provide a time range for the preview window or the [effective date](#effective-date). ### Destructive changes @@ -140,25 +307,27 @@ If you want to temporarily allow destructive changes to models that don't allow ### Effective date Changes that are part of the forward-only plan can also be applied retroactively to the production environment by specifying the effective date: + ```bash sqlmesh plan --forward-only --effective-from 2023-01-01 ``` + This way SQLMesh will know to recompute data intervals starting from the specified date once forward-only changes are deployed to production. ## Restatement plans -There are cases when models need to be re-evaluated for a given time range, even though changes may not have been made to those model definitions. This could be due to an upstream issue with a dataset defined outside the SQLMesh platform, or when a [forward-only plan](#forward-only-plans) change needs to be applied retroactively to a bounded interval of historical data. -For this reason, the `plan` command supports the `--restate-model`, which allows users to specify one or more names of a model or model tag (using `tag:` syntax) to be reprocessed. These can also refer to an external table defined outside SQLMesh. +Models sometimes need to be re-evaluated for a given time range, even though the model definition has not changed. + +This could be due to an upstream issue with a dataset defined outside of SQLMesh, or when a [forward-only plan](#forward-only-plans) change needs to be applied retroactively to a bounded interval of historical data. -Application of a plan will trigger a cascading backfill for all specified models (other than external tables), as well as all models downstream from them. The plan's date range determines the data intervals that will be affected. +For this reason, the `plan` command supports the `--restate-model` selector, which allows specifying one or more model names or tags (using `tag:` syntax) to be reprocessed. These can also refer to an external table defined outside SQLMesh. -Please note that models of kinds [INCREMENTAL_BY_UNIQUE_KEY](models/model_kinds.md#INCREMENTAL_BY_UNIQUE_KEY), [SCD_TYPE_2_BY_TIME](models/model_kinds.md#scd-type-2), and [SCD_TYPE_2_BY_COLUMN](models/model_kinds.md#scd-type-2) cannot be partially restated. Therefore, such models will be fully refreshed regardless of the start/end dates provided by a user in the plan. +Applying a plan will trigger a cascading backfill for all specified models (other than external tables), as well as all models downstream from them. The plan's date range determines the data intervals that will be affected (learn more about the limitations of some model kinds [below](#model-kind-limitations)). To prevent models from ever being restated, set the [disable_restatement](models/overview.md#disable_restatement) attribute to `true`. See examples below for how to restate both based on model names and model tags. - === "Names Only" ```bash