Add Support for Consistent SQL Query Generation #1015

plypaul · 2024-01-30T16:28:37Z

Resolves #1020

Description

This PR makes the SQL generated by MetricFlowEngine more consistent. Previously, the output was not consistent because some SQL elements (e.g. sub-query aliases) were generated via a counter that carried state from query to query. This PR adds facilities for resetting the ID generation counters, and also streamlines / simplifies how IDs are handled. Please view by commit.
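
To illustrate the problem described above, here is a minimal sketch of a prefix-based ID generator whose counter is shared across queries, along with a reset. `SketchIdGenerator`, `render_query`, and the `subq` prefix are hypothetical names chosen for illustration; they are not the classes touched by this PR.

```python
from typing import Dict


class SketchIdGenerator:
    """Hypothetical stand-in: hands out IDs like 'subq_0', 'subq_1' per prefix."""

    _start_value: int = 0
    _next_index: Dict[str, int] = {}

    @classmethod
    def create_next_id(cls, prefix: str) -> str:
        # The counter lives on the class, so it survives from one query to the next.
        index = cls._next_index.get(prefix, cls._start_value)
        cls._next_index[prefix] = index + 1
        return f"{prefix}_{index}"

    @classmethod
    def reset(cls, start_value: int = 0) -> None:
        # Forget all counters so the next ID for every prefix is start_value.
        cls._start_value = start_value
        cls._next_index = {}


def render_query() -> str:
    """Render a toy query whose sub-query alias comes from the generator."""
    alias = SketchIdGenerator.create_next_id("subq")
    return f"SELECT x FROM (SELECT 1 AS x) {alias}"


if __name__ == "__main__":
    first = render_query()
    second = render_query()
    print(first == second)  # False: the alias number advanced between calls

    SketchIdGenerator.reset()
    third = render_query()
    print(first == third)  # True: resetting restores the original numbering
```

Without the reset, the same request rendered twice in one process yields different alias numbers, which is the kind of inconsistency this PR targets.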

tlento · 2024-01-30T18:03:13Z

Nit: I think we should have a changelog entry for this since it was a community request. Also it affects output, however slightly.

tlento

For the most part the changes look reasonable apart from a few things that don't seem right (the enum.value call, the semantic_model.name removal, and the info log with no corresponding action).

Given the extent of the snapshot changes due to renames (several files will simply move to new locations and then get updated) I don't see how we can evaluate the output difference from them.

Do you have concrete examples you can point to/selectively add to this PR that illustrate the expected nature of these changes? It'd be nice to double-check things like "most prefixes aren't changing" and "semantic model names aren't really important" and "numbering is/is not shifting".

tlento · 2024-01-30T20:01:01Z

metricflow/dag/id_prefix.py

@@ -9,9 +9,61 @@ class IdPrefix(Enum):
    TODO: Move all ID prefixes here.
    """

+    DATAFLOW_NODE_AGGREGATE_MEASURES_ID_PREFIX = "am"


This commit is a thing of beauty.

tlento · 2024-01-30T20:02:56Z

metricflow/dataset/convert_semantic_model.py

@@ -423,8 +424,8 @@ def create_sql_source_data_set(self, semantic_model: SemanticModel) -> SemanticM
        all_entity_instances: List[EntityInstance] = []

        all_select_columns: List[SqlSelectColumn] = []
-        from_source_alias = IdGeneratorRegistry.for_class(self.__class__).create_id(f"{semantic_model.name}_src")
-
+        # from_source_alias = IdGeneratorRegistry.for_class(self.__class__).create_id(f"{semantic_model.name}_src")


Removed. For context, when I rewrite lines, I often comment out the old line, write the revised line, then remove the commented line. In this case, I forgot to remove the commented line.

tlento · 2024-01-30T20:03:49Z

metricflow/dag/id_prefix.py

@@ -67,3 +67,7 @@ class IdPrefix(Enum):
    EXEC_PLAN_PREFIX = "ep"

    MF_DAG = "mfd"
+
+    TIME_SPINE_SOURCE = "spine"


Why make this change? Wouldn't keeping it time_spine_src reduce snapshot thrash considerably?

I changed it in the interest of making the aliases shorter. I thought the snapshot thrash was fine.

tlento · 2024-01-30T20:05:12Z

metricflow/query/group_by_item/resolution_dag/dag.py

-            dag_id=DagId.from_str(
-                IdGeneratorRegistry.for_class(self.__class__).create_id(IdPrefix.GROUP_BY_ITEM_RESOLUTION_DAG.value)
-            ),
+            dag_id=DagId.from_str(PrefixIdGenerator.create_next_id(IdPrefix.GROUP_BY_ITEM_RESOLUTION_DAG.value)),


Is this right, or should it be create_next_id(IdPrefix.GROUP_BY_ITEM_RESOLUTION_DAG) instead?

Yeah, this is incorrect and has been fixed in a later commit that rolled in the fixes for the type-checker errors.
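
As a rough illustration of the mismatch discussed in this thread, the sketch below assumes the generator is typed to accept an `IdPrefix` member rather than a string. The `"gbir"` value and this `create_next_id` function are hypothetical; only the enum member name comes from the diff.

```python
from enum import Enum


class IdPrefix(Enum):
    # Hypothetical prefix value, chosen only for this sketch.
    GROUP_BY_ITEM_RESOLUTION_DAG = "gbir"


def create_next_id(id_prefix: IdPrefix, index: int = 0) -> str:
    """Build an ID from an IdPrefix member (not from its raw string value)."""
    if not isinstance(id_prefix, IdPrefix):
        # A static type checker would report this before runtime;
        # the explicit check here just makes the mismatch visible.
        raise TypeError(f"expected IdPrefix, got {type(id_prefix).__name__}")
    return f"{id_prefix.value}_{index}"


if __name__ == "__main__":
    # Passing the enum member matches the signature.
    print(create_next_id(IdPrefix.GROUP_BY_ITEM_RESOLUTION_DAG))

    # Passing `.value` hands over a plain str, which the signature does not accept.
    try:
        create_next_id(IdPrefix.GROUP_BY_ITEM_RESOLUTION_DAG.value)  # type: ignore[arg-type]
    except TypeError as error:
        print(error)
```

Under that assumption, the `.value` call is the kind of error a type checker flags, which is consistent with the fix landing alongside the type-checker fixes.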

tlento · 2024-01-30T20:07:53Z

metricflow/dag/id_prefix.py

@@ -67,3 +67,7 @@ class IdPrefix(Enum):
    EXEC_PLAN_PREFIX = "ep"

    MF_DAG = "mfd"
+
+    TIME_SPINE_SOURCE = "spine"
+    SEMANTIC_MODEL_SOURCE = "src"


Hm. This is really nondescript. It's also not used as a prefix.

It's in a later commit: https://github.com/dbt-labs/metricflow/pull/1015/files/5d0fc8d7a83e646d55f076fc54906d03ba2670c5..e358b0278e5d6414f4aadfa7c53b485777155cb7#diff-30a568c2be1c7b36f5b505e3ddba538779991dbc2079d052c3cd8cbe62746dd7R428

tlento · 2024-01-30T20:08:29Z

metricflow/dataset/convert_semantic_model.py

-        from_source_alias = IdGeneratorRegistry.for_class(self.__class__).create_id(f"{semantic_model.name}_src")
-
+        # from_source_alias = IdGeneratorRegistry.for_class(self.__class__).create_id(f"{semantic_model.name}_src")
+        from_source_alias = PrefixIdGenerator.create_next_id(IdPrefix.SEMANTIC_MODEL_SOURCE).str_value


This is a big change since the constant now formats in src instead of {semantic_model.name}_src

Yeah, I thought it simplified the SQL output with shorter aliases. What do you think?
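
To make the trade-off concrete, here is a small sketch comparing the two alias styles. The `{prefix}_{index}` format, the helper names, and the `bookings_source` model and table names are assumptions for illustration, not taken from the PR.

```python
def old_style_alias(semantic_model_name: str, index: int) -> str:
    """Alias that embeds the semantic model name (assumed format)."""
    return f"{semantic_model_name}_src_{index}"


def new_style_alias(index: int) -> str:
    """Shorter alias built from the constant 'src' prefix (assumed format)."""
    return f"src_{index}"


def render_from_clause(table: str, alias: str) -> str:
    """Render a FROM clause using the given table alias."""
    return f"FROM {table} {alias}"


if __name__ == "__main__":
    # Hypothetical semantic model and table names.
    print(render_from_clause("demo.fct_bookings", old_style_alias("bookings_source", 0)))
    print(render_from_clause("demo.fct_bookings", new_style_alias(0)))
```

The shorter form is easier to scan, but the alias no longer says which semantic model the rows came from, and every snapshot containing the old alias changes.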

tlento · 2024-01-30T20:22:09Z

metricflow/engine/metricflow_engine.py

@@ -376,6 +388,7 @@ def __init__(
    @log_call(module_name=__name__, telemetry_reporter=_telemetry_reporter)
    def query(self, mf_request: MetricFlowQueryRequest) -> MetricFlowQueryResult:  # noqa: D
        logger.info(f"Starting query request:\n{indent(mf_pformat(mf_request))}")
+        logger.info(f"Setting ID generation to start at: {MetricFlowEngine._QUERY_ID_START_VALUE}")


There's no reset call anywhere that I can find. Either remove the logging (and the constant, which is otherwise unused) or add the reset, whichever is appropriate.

Good catch - I was shuffling some things around and did not complete the change. Please see the added commits.
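
A minimal sketch of pairing the log line with the action it announces might look like the following. `SketchEngine`, its counter, and the generated SQL are hypothetical; only the `_QUERY_ID_START_VALUE` name and the log message come from the diff.

```python
import itertools
import logging
from typing import Iterator

logger = logging.getLogger(__name__)


class SketchEngine:
    """Hypothetical engine that restarts ID numbering at the top of every query."""

    _QUERY_ID_START_VALUE = 0
    # Shared across all queries, like a process-wide ID counter.
    _id_counter: Iterator[int] = itertools.count(_QUERY_ID_START_VALUE)

    def query(self, metric_name: str) -> str:
        logger.info(f"Setting ID generation to start at: {self._QUERY_ID_START_VALUE}")
        # The action that the log line announces: restart the shared counter.
        SketchEngine._id_counter = itertools.count(self._QUERY_ID_START_VALUE)
        alias = f"subq_{next(SketchEngine._id_counter)}"
        return f"SELECT SUM({metric_name}) AS {metric_name} FROM source_table {alias}"


if __name__ == "__main__":
    engine = SketchEngine()
    # Both calls produce identical SQL because numbering restarts each time.
    print(engine.query("bookings") == engine.query("bookings"))  # True
```

With the reset removed, the second call would continue from the counter left by the first, so the log message would describe something that never happens.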

plypaul

Actually, maybe it's simpler to see if I separated out the snapshot changes. I've added commits that revert all of them except the numbering change and we can take a look at the other changes in a later PR.

plypaul · 2024-01-31T17:13:49Z

metricflow/dataset/convert_semantic_model.py

@@ -423,8 +424,8 @@ def create_sql_source_data_set(self, semantic_model: SemanticModel) -> SemanticM
        all_entity_instances: List[EntityInstance] = []

        all_select_columns: List[SqlSelectColumn] = []
-        from_source_alias = IdGeneratorRegistry.for_class(self.__class__).create_id(f"{semantic_model.name}_src")
-
+        # from_source_alias = IdGeneratorRegistry.for_class(self.__class__).create_id(f"{semantic_model.name}_src")


Removed. For context, when I rewrite lines, I often comment out the old line, write the revised line, then remove the commented line. In this case, I forgot to remove the commented line.

plypaul · 2024-01-31T17:15:01Z

metricflow/dataset/convert_semantic_model.py

-        from_source_alias = IdGeneratorRegistry.for_class(self.__class__).create_id(f"{semantic_model.name}_src")
-
+        # from_source_alias = IdGeneratorRegistry.for_class(self.__class__).create_id(f"{semantic_model.name}_src")
+        from_source_alias = PrefixIdGenerator.create_next_id(IdPrefix.SEMANTIC_MODEL_SOURCE).str_value


Yeah, I thought it simplified the SQL output with shorter aliases. What do you think?

plypaul · 2024-01-31T17:25:19Z

metricflow/dag/id_prefix.py

@@ -67,3 +67,7 @@ class IdPrefix(Enum):
    EXEC_PLAN_PREFIX = "ep"

    MF_DAG = "mfd"
+
+    TIME_SPINE_SOURCE = "spine"


I changed it in the interest of making the aliases shorter. I thought the snapshot thrash was fine.

plypaul · 2024-01-31T17:26:24Z

metricflow/dag/id_prefix.py

@@ -67,3 +67,7 @@ class IdPrefix(Enum):
    EXEC_PLAN_PREFIX = "ep"

    MF_DAG = "mfd"
+
+    TIME_SPINE_SOURCE = "spine"
+    SEMANTIC_MODEL_SOURCE = "src"


It's in a later commit: https://github.com/dbt-labs/metricflow/pull/1015/files/5d0fc8d7a83e646d55f076fc54906d03ba2670c5..e358b0278e5d6414f4aadfa7c53b485777155cb7#diff-30a568c2be1c7b36f5b505e3ddba538779991dbc2079d052c3cd8cbe62746dd7R428

plypaul · 2024-01-31T17:29:14Z

metricflow/query/group_by_item/resolution_dag/dag.py

-            dag_id=DagId.from_str(
-                IdGeneratorRegistry.for_class(self.__class__).create_id(IdPrefix.GROUP_BY_ITEM_RESOLUTION_DAG.value)
-            ),
+            dag_id=DagId.from_str(PrefixIdGenerator.create_next_id(IdPrefix.GROUP_BY_ITEM_RESOLUTION_DAG.value)),


Yeah, this is incorrect and has been fixed in a later commit that rolled in the fixes for the type-checker errors.

plypaul · 2024-01-31T18:18:53Z

metricflow/engine/metricflow_engine.py

@@ -376,6 +388,7 @@ def __init__(
    @log_call(module_name=__name__, telemetry_reporter=_telemetry_reporter)
    def query(self, mf_request: MetricFlowQueryRequest) -> MetricFlowQueryResult:  # noqa: D
        logger.info(f"Starting query request:\n{indent(mf_pformat(mf_request))}")
+        logger.info(f"Setting ID generation to start at: {MetricFlowEngine._QUERY_ID_START_VALUE}")


Good catch - I was shuffling some things around and did not complete the change. Please see the added commits.

tlento

Nice, the updated form of this is a lot easier to reason about in terms of the changes I'm seeing.

tlento · 2024-02-09T01:02:29Z

metricflow/dag/id_prefix.py

+    pass
+
+
+class StaticIdPrefix(IdPrefix, Enum, metaclass=EnumMetaClassHelper):


Oh I see.

The other option is to make IdPrefix a protocol, but we've been over the tradeoffs there and I think having it be a superclass makes sense even though enum-type subclasses are a bit weird.
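
For readers unfamiliar with the pattern, here is a sketch of an enum that also subclasses an abstract interface. It assumes `EnumMetaClassHelper` simply combines `ABCMeta` and `EnumMeta`, and that the interface exposes a `str_value` property (a name that appears in an earlier diff); the exact definitions in the PR may differ.

```python
from abc import ABC, ABCMeta, abstractmethod
from enum import Enum, EnumMeta


class IdPrefix(ABC):
    """Interface: anything that can supply a string prefix for an ID."""

    @property
    @abstractmethod
    def str_value(self) -> str:
        raise NotImplementedError


class EnumMetaClassHelper(ABCMeta, EnumMeta):
    """Assumed definition: merges both metaclasses to avoid a metaclass conflict."""

    pass


class StaticIdPrefix(IdPrefix, Enum, metaclass=EnumMetaClassHelper):
    """Enum members that also satisfy the IdPrefix interface."""

    MF_DAG = "mfd"
    EXEC_PLAN_PREFIX = "ep"

    @property
    def str_value(self) -> str:
        return self.value


if __name__ == "__main__":
    prefix = StaticIdPrefix.MF_DAG
    print(prefix.str_value)              # mfd
    print(isinstance(prefix, IdPrefix))  # True
```

Without the combined metaclass, declaring a class that inherits from both an `ABC` subclass and `Enum` fails with a metaclass conflict, which is why the helper exists in this sketch.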

plypaul added the Skip Changelog label Jan 30, 2024

cla-bot bot added the cla:yes label Jan 30, 2024

plypaul marked this pull request as ready for review January 30, 2024 16:37

tlento reviewed Jan 30, 2024

plypaul commented Jan 31, 2024

plypaul force-pushed the plypaul--88--id-generation2 branch 3 times, most recently from b32103a to 5b66b01 Compare January 31, 2024 22:53

plypaul changed the title ~~Add Support for Consistent SQL Query Generation via MetricFlowEngine~~ Add Support for Consistent SQL Query Generation Jan 31, 2024

plypaul force-pushed the plypaul--88--id-generation2 branch from 5b66b01 to 44236bd Compare January 31, 2024 23:51

plypaul removed the Skip Changelog label Jan 31, 2024

plypaul force-pushed the plypaul--88--id-generation2 branch 2 times, most recently from cfe58ae to f51b845 Compare February 6, 2024 22:39

plypaul changed the base branch from main to plypaul--88.2--improve-snapshot-id-consistency February 6, 2024 22:40

plypaul force-pushed the plypaul--88--id-generation2 branch 2 times, most recently from c1a817c to 0f903ff Compare February 8, 2024 21:57

plypaul changed the base branch from plypaul--88.2--improve-snapshot-id-consistency to plypaul--88.2.2--cache-source-node-output February 8, 2024 21:57

plypaul added the Run Tests With Other SQL Engines label (runs the test suite against the SQL engines in our target environment) Feb 8, 2024

plypaul had a problem deploying to DW_INTEGRATION_TESTS February 8, 2024 22:14 — with GitHub Actions Failure

plypaul force-pushed the plypaul--88.2.2--cache-source-node-output branch from 4e569e8 to 5c652ff Compare February 9, 2024 00:02

plypaul force-pushed the plypaul--88--id-generation2 branch from 0f903ff to 05105e8 Compare February 9, 2024 00:04

plypaul added and removed the Run Tests With Other SQL Engines label Feb 9, 2024

plypaul temporarily deployed to DW_INTEGRATION_TESTS February 9, 2024 00:07 — with GitHub Actions Inactive

plypaul temporarily deployed to DW_INTEGRATION_TESTS February 9, 2024 00:21 — with GitHub Actions Inactive

tlento approved these changes Feb 9, 2024

tlento mentioned this pull request Feb 9, 2024

Improve ID Consistency in Fixtures / Snapshots. #1025

Merged

plypaul force-pushed the plypaul--88.2.2--cache-source-node-output branch 3 times, most recently from afe1d19 to 7840539 Compare February 15, 2024 19:28

Base automatically changed from plypaul--88.2.2--cache-source-node-output to main February 15, 2024 19:31

plypaul added 18 commits February 15, 2024 11:35

Move from constants to IdPrefix enum.

3ddac7f

The majority of prefixes used for ID generation were previously listed as constants in a module. For improved encapsulation / structure, this commit moves them to the enum IdPrefix.

Update calls to IdGeneratorRegistry.for_class() / str -> IdPrefix.

5dabed6

Instead of using IdGeneratorRegistry.for_class(), this commit updates those callsites to use PrefixIdGenerator instead. This also moves cases where strings were used to the IdPrefix enum.

Streamline DagId generation.

c861514

There were some cases where the DagId could be passed into the initializer, but it was seldom used. This makes the ID generation more automated and also makes use of IdPrefix instead of strings.

Change SqlQueryPlan() to use an optional plan ID.

1b1ca31

Change GroupByItemResolutionNode.id_prefix_enum() to .id_prefix().

72def15

Add PrefixIdGenerator.reset().

5c34b86

Update patching of ID generators.

775618f

Add repr to sequential ID.

8c62f31

Rename PrefixIdGenerator -> SequentialIdGenerator.

1906f54

Remove id_generation.py since it was replaced by sequential_id.py.

efa31d8

Reset ID generation for MetricFlowEngine queries.

5f20d39

Address comments.

952b648

Make IdPrefix an interface and add StaticIdPrefix, DynamicIdPrefix.

0751515

Revert changes that caused non-number changes in snapshots.

3d33b6a

Revert changes that caused snapshot file names to change.

c600de3

Add test for ID enumeration.

68dacde

Add change log for #1020.

6b8857a

Update snapshots (should only contain number changes).

f4b1f1e

plypaul force-pushed the plypaul--88--id-generation2 branch from 05105e8 to f4b1f1e Compare February 15, 2024 19:38

Rebase fixes.

ed8324d

plypaul merged commit 3486cfc into main Feb 15, 2024
9 checks passed

plypaul deleted the plypaul--88--id-generation2 branch February 15, 2024 19:44
