Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Support for Consistent SQL Query Generation #1015

Merged
merged 19 commits into from
Feb 15, 2024

Conversation

plypaul
Copy link
Contributor

@plypaul plypaul commented Jan 30, 2024

Resolves #1020

Description

This PR includes updates to make the SQL generated by MetricFlowEnigne more consistent. Previously, they were not consistent because there were SQL elements (e.g. sub-query aliases) were generated via a counter that carried state from query to query. This PR adds facilities for resetting the ID generation counters, and also streamlines / simplifies how IDs are handled. Please view by commit.

@cla-bot cla-bot bot added the cla:yes label Jan 30, 2024
@plypaul plypaul marked this pull request as ready for review January 30, 2024 16:37
@tlento
Copy link
Contributor

tlento commented Jan 30, 2024

Nit: I think we should have a changelog entry for this since it was a community request. Also it affects output, however slightly.

Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the most part the changes look reasonable apart from a few things that don't seem right (the enum.value call, the semantic_model.name removal, and the info log with no corresponding action).

Given the extent of the snapshot changes due to renames (several files will simply move to new locations and then get updated) I don't see how we can evaluate the output difference from them.

Do you have concrete examples you can point to/selectively add to this PR that illustrate the expected nature of these changes? It'd be nice to double-check things like "most prefixes aren't changing" and "semantic model names aren't really important" and "numbering is/is not shifting".

@@ -9,9 +9,61 @@ class IdPrefix(Enum):
TODO: Move all ID prefixes here.
"""

DATAFLOW_NODE_AGGREGATE_MEASURES_ID_PREFIX = "am"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit is a thing of beauty.

@@ -423,8 +424,8 @@ def create_sql_source_data_set(self, semantic_model: SemanticModel) -> SemanticM
all_entity_instances: List[EntityInstance] = []

all_select_columns: List[SqlSelectColumn] = []
from_source_alias = IdGeneratorRegistry.for_class(self.__class__).create_id(f"{semantic_model.name}_src")

# from_source_alias = IdGeneratorRegistry.for_class(self.__class__).create_id(f"{semantic_model.name}_src")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

???

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed. For context, when I rewrite lines, I often comment out the old line, write the revised line, then remove the commented line. In this case, I forgot to remove the commented line.

@@ -67,3 +67,7 @@ class IdPrefix(Enum):
EXEC_PLAN_PREFIX = "ep"

MF_DAG = "mfd"

TIME_SPINE_SOURCE = "spine"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make this change? Wouldn't keeping it time_spine_src reduce snapshot thrash considerably?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it in the interest of making the aliases shorter. I thought the snapshot thrash was fine.

dag_id=DagId.from_str(
IdGeneratorRegistry.for_class(self.__class__).create_id(IdPrefix.GROUP_BY_ITEM_RESOLUTION_DAG.value)
),
dag_id=DagId.from_str(PrefixIdGenerator.create_next_id(IdPrefix.GROUP_BY_ITEM_RESOLUTION_DAG.value)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this right, or should it be create_next_id(IdPrefix.GROUP_BY_ITEM_RESOLUTION_DAG) instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is incorrect and has been fixed in a later commit that rolled in the fixes for the type-checker errors.

@@ -67,3 +67,7 @@ class IdPrefix(Enum):
EXEC_PLAN_PREFIX = "ep"

MF_DAG = "mfd"

TIME_SPINE_SOURCE = "spine"
SEMANTIC_MODEL_SOURCE = "src"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. This is really nondescript. It's also not used as a prefix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from_source_alias = IdGeneratorRegistry.for_class(self.__class__).create_id(f"{semantic_model.name}_src")

# from_source_alias = IdGeneratorRegistry.for_class(self.__class__).create_id(f"{semantic_model.name}_src")
from_source_alias = PrefixIdGenerator.create_next_id(IdPrefix.SEMANTIC_MODEL_SOURCE).str_value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a big change since the constant now formats in src instead of {semantic_model.name}_src

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought it simplified the SQL output with shorter aliases. What do you think?

@@ -376,6 +388,7 @@ def __init__(
@log_call(module_name=__name__, telemetry_reporter=_telemetry_reporter)
def query(self, mf_request: MetricFlowQueryRequest) -> MetricFlowQueryResult: # noqa: D
logger.info(f"Starting query request:\n{indent(mf_pformat(mf_request))}")
logger.info(f"Setting ID generation to start at: {MetricFlowEngine._QUERY_ID_START_VALUE}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no reset call anywhere that I can find. Either remove the logging (and the constant, which is otherwise unused) or add the reset, whichever is appropriate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - I was shuffling some things around and did not complete the change. Please see the added commits.

Copy link
Contributor Author

@plypaul plypaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, maybe it's simpler to see if I separated out the snapshot changes. I've added commits that revert all of them except the numbering change and we can take a look at the other changes in a later PR.

@@ -423,8 +424,8 @@ def create_sql_source_data_set(self, semantic_model: SemanticModel) -> SemanticM
all_entity_instances: List[EntityInstance] = []

all_select_columns: List[SqlSelectColumn] = []
from_source_alias = IdGeneratorRegistry.for_class(self.__class__).create_id(f"{semantic_model.name}_src")

# from_source_alias = IdGeneratorRegistry.for_class(self.__class__).create_id(f"{semantic_model.name}_src")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed. For context, when I rewrite lines, I often comment out the old line, write the revised line, then remove the commented line. In this case, I forgot to remove the commented line.

from_source_alias = IdGeneratorRegistry.for_class(self.__class__).create_id(f"{semantic_model.name}_src")

# from_source_alias = IdGeneratorRegistry.for_class(self.__class__).create_id(f"{semantic_model.name}_src")
from_source_alias = PrefixIdGenerator.create_next_id(IdPrefix.SEMANTIC_MODEL_SOURCE).str_value
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought it simplified the SQL output with shorter aliases. What do you think?

@@ -67,3 +67,7 @@ class IdPrefix(Enum):
EXEC_PLAN_PREFIX = "ep"

MF_DAG = "mfd"

TIME_SPINE_SOURCE = "spine"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it in the interest of making the aliases shorter. I thought the snapshot thrash was fine.

@@ -67,3 +67,7 @@ class IdPrefix(Enum):
EXEC_PLAN_PREFIX = "ep"

MF_DAG = "mfd"

TIME_SPINE_SOURCE = "spine"
SEMANTIC_MODEL_SOURCE = "src"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dag_id=DagId.from_str(
IdGeneratorRegistry.for_class(self.__class__).create_id(IdPrefix.GROUP_BY_ITEM_RESOLUTION_DAG.value)
),
dag_id=DagId.from_str(PrefixIdGenerator.create_next_id(IdPrefix.GROUP_BY_ITEM_RESOLUTION_DAG.value)),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is incorrect and has been fixed in a later commit that rolled in the fixes for the type-checker errors.

@@ -376,6 +388,7 @@ def __init__(
@log_call(module_name=__name__, telemetry_reporter=_telemetry_reporter)
def query(self, mf_request: MetricFlowQueryRequest) -> MetricFlowQueryResult: # noqa: D
logger.info(f"Starting query request:\n{indent(mf_pformat(mf_request))}")
logger.info(f"Setting ID generation to start at: {MetricFlowEngine._QUERY_ID_START_VALUE}")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - I was shuffling some things around and did not complete the change. Please see the added commits.

@plypaul plypaul force-pushed the plypaul--88--id-generation2 branch 3 times, most recently from b32103a to 5b66b01 Compare January 31, 2024 22:53
@plypaul plypaul changed the title Add Support for Consistent SQL Query Generation via MetricFlowEngine Add Support for Consistent SQL Query Generation Jan 31, 2024
@plypaul plypaul force-pushed the plypaul--88--id-generation2 branch from 5b66b01 to 44236bd Compare January 31, 2024 23:51
@plypaul plypaul force-pushed the plypaul--88--id-generation2 branch 2 times, most recently from cfe58ae to f51b845 Compare February 6, 2024 22:39
@plypaul plypaul changed the base branch from main to plypaul--88.2--improve-snapshot-id-consistency February 6, 2024 22:40
@plypaul plypaul force-pushed the plypaul--88--id-generation2 branch 2 times, most recently from c1a817c to 0f903ff Compare February 8, 2024 21:57
@plypaul plypaul changed the base branch from plypaul--88.2--improve-snapshot-id-consistency to plypaul--88.2.2--cache-source-node-output February 8, 2024 21:57
@plypaul plypaul added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Feb 8, 2024
@plypaul plypaul had a problem deploying to DW_INTEGRATION_TESTS February 8, 2024 22:14 — with GitHub Actions Failure
@plypaul plypaul had a problem deploying to DW_INTEGRATION_TESTS February 8, 2024 22:14 — with GitHub Actions Failure
@plypaul plypaul had a problem deploying to DW_INTEGRATION_TESTS February 8, 2024 22:14 — with GitHub Actions Failure
@plypaul plypaul had a problem deploying to DW_INTEGRATION_TESTS February 8, 2024 22:14 — with GitHub Actions Failure
@plypaul plypaul force-pushed the plypaul--88.2.2--cache-source-node-output branch from 4e569e8 to 5c652ff Compare February 9, 2024 00:02
@plypaul plypaul force-pushed the plypaul--88--id-generation2 branch from 0f903ff to 05105e8 Compare February 9, 2024 00:04
@plypaul plypaul added Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment and removed Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment labels Feb 9, 2024
@plypaul plypaul temporarily deployed to DW_INTEGRATION_TESTS February 9, 2024 00:07 — with GitHub Actions Inactive
@plypaul plypaul temporarily deployed to DW_INTEGRATION_TESTS February 9, 2024 00:07 — with GitHub Actions Inactive
@plypaul plypaul temporarily deployed to DW_INTEGRATION_TESTS February 9, 2024 00:07 — with GitHub Actions Inactive
@plypaul plypaul temporarily deployed to DW_INTEGRATION_TESTS February 9, 2024 00:21 — with GitHub Actions Inactive
@plypaul plypaul temporarily deployed to DW_INTEGRATION_TESTS February 9, 2024 00:21 — with GitHub Actions Inactive
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, the updated form of this is a lot easier to reason about in terms of the changes I'm seeing.

pass


class StaticIdPrefix(IdPrefix, Enum, metaclass=EnumMetaClassHelper):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see.

The other option is to make IdPrefix a protocol, but we've been over the tradeoffs there and I think having it be a superclass makes sense even though enum-type subclasses are a bit weird.

@plypaul plypaul force-pushed the plypaul--88.2.2--cache-source-node-output branch 3 times, most recently from afe1d19 to 7840539 Compare February 15, 2024 19:28
Base automatically changed from plypaul--88.2.2--cache-source-node-output to main February 15, 2024 19:31
The majority of prefixes used for ID generation were previously listed as
constants in a module. For improved encapsulation / structure, this commit moves
them to the enum IdPrefix.
Instead of using IdGeneratorRegistry.for_class(), this commit updates those
callsites to use PrefixIdGenerator instead. This also moves cases where strings
were used to the IdPrefix enum.
There were some cases where the DagId could be passed into the initializer, but
it was seldom used. This mades the ID generation more automated and also makes
use of IdPrefix instead of strings.
@plypaul plypaul force-pushed the plypaul--88--id-generation2 branch from 05105e8 to f4b1f1e Compare February 15, 2024 19:38
@plypaul plypaul merged commit 3486cfc into main Feb 15, 2024
9 checks passed
@plypaul plypaul deleted the plypaul--88--id-generation2 branch February 15, 2024 19:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[SL-1642] [Feature] Add Support for Consistent SQL Query Generation
2 participants