[SPARK-54403][SQL][Metric View] Add YAML serde infrastructure for metric views #53146
linhongliu-db wants to merge 8 commits into apache:master
Conversation
Add YAML serde infrastructure for metric views

This commit adds the complete serialization/deserialization infrastructure for parsing metric view YAML definitions:
- Add Jackson YAML dependencies to pom.xml
- Implement the canonical model for metric views:
  - Column, Expression (Dimension/Measure), MetricView, Source
  - YAMLVersion validation and exception types
- Implement version-specific serde (v0.1):
  - YAML deserializer/serializer
  - Base classes for extensibility
- Add JSON utilities for metadata serialization
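As a rough illustration of how such version-dispatched deserialization typically works, here is a hypothetical Python sketch. The stdlib `json` module stands in for the actual Scala/Jackson YAML code, and all names are illustrative, not the PR's API:

```python
import json

SUPPORTED_VERSIONS = {"0.1"}  # the PR supports only v0.1 so far

def parse_version(doc: str) -> str:
    # First pass: read only the "version" field and ignore all others.
    version = json.loads(doc).get("version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported metric view version: {version}")
    return version

def from_definition(doc: str) -> dict:
    # Second pass: dispatch to the deserializer for that version.
    deserializers = {"0.1": json.loads}
    return deserializers[parse_version(doc)](doc)

print(from_definition('{"version": "0.1", "source": "sales"}')["source"])  # sales
```

The two-pass shape is what makes the base classes extensible: a new schema version only needs a new entry in the dispatch table.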
pom.xml
```xml
<dependency>
  <groupId>com.fasterxml.jackson.dataformat</groupId>
  <artifactId>jackson-dataformat-yaml</artifactId>
  <version>${fasterxml.jackson.version}</version>
```
All Jackson deps are managed by jackson-bom now; you don't need to declare the version again here.
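A minimal sketch of the suggested change (assuming `jackson-bom` is already imported in the parent's `dependencyManagement`):

```xml
<dependency>
  <groupId>com.fasterxml.jackson.dataformat</groupId>
  <artifactId>jackson-dataformat-yaml</artifactId>
  <!-- no <version> element: it is inherited from the jackson-bom import -->
</dependency>
```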
```scala
object JsonUtils {
  // Singleton ObjectMapper that can be used across the project
  private lazy val mapper: ObjectMapper = {
```
According to the comment, should we try to use this singleton as much as possible in subsequent development?
Yes, I'm surprised there is no such util in the Spark repo already. For now I plan to use it for all the metric view development, but I'm not sure how much other code needs it.
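The `lazy val` singleton pattern being discussed can be sketched language-neutrally. Here is a hypothetical Python analogue, with double-checked locking standing in for Scala's thread-safe `lazy val` and `json.JSONEncoder` standing in for Jackson's `ObjectMapper`:

```python
import json
import threading

class JsonUtils:
    # Singleton encoder shared across the project, initialized on first use.
    _lock = threading.Lock()
    _mapper = None

    @classmethod
    def mapper(cls) -> json.JSONEncoder:
        if cls._mapper is None:
            with cls._lock:
                if cls._mapper is None:  # double-checked locking
                    cls._mapper = json.JSONEncoder(sort_keys=True)
        return cls._mapper

# Every caller gets the same instance:
print(JsonUtils.mapper() is JsonUtils.mapper())  # True
```

Centralizing the mapper avoids repeated (and relatively expensive) mapper construction and keeps serialization settings consistent across call sites.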
(Several review threads on MetricViewCanonical.scala, MetricViewSerDeBase.scala, and MetricViewSerDeV01.scala were marked resolved.)
cc @cloud-fan to review
…rde/MetricViewCanonical.scala
```scala
  }
}

object YamlMapperProvider extends YamlMapperProviderBase
```
do we need to add V01 postfix?
```scala
// Trait representing the capability to validate an object
trait Validatable {
  def validate(): Try[Unit]
```
Why use Try? Can we fail directly?
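The two styles under discussion can be contrasted in a hypothetical sketch (Python for brevity; the PR's Scala code returns `Try[Unit]`, while the reviewer suggests raising directly; the field names are illustrative):

```python
def validate_with_result(column: dict):
    # Try[Unit] style: return the error as a value instead of raising.
    if not column.get("name"):
        return ValueError("column name must not be empty")
    return None  # analogous to Success(())

def validate_fail_fast(column: dict) -> None:
    # Fail-directly style: raise at the first problem.
    if not column.get("name"):
        raise ValueError("column name must not be empty")
```

Failing directly is simpler when every caller would immediately rethrow the error anyway, which is the reviewer's point here.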
```scala
case class Column[T <: Expression](
```
is this type parameter really useful?
```scala
def validate(): Unit
```

```scala
    expression: Expression,
    ordinal: Int) extends Validatable {
  override def validate(): Unit = {
    // No validation needed
```
Then shall we remove `extends Validatable`?
```scala
def getColumnMetadata: ColumnMetadata = {
  val truncatedExpr = expression.expr.take(Constants.MAXIMUM_PROPERTY_SIZE)
```
We can't parse back a truncated expr; shall we just fail here if it's too large?
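The reviewer's suggestion — fail instead of silently truncating — could look roughly like this (a hypothetical Python sketch; the size limit value is an assumption, the real one lives in the PR's `Constants`):

```python
MAXIMUM_PROPERTY_SIZE = 4096  # assumed value; the real constant is in Constants

def column_metadata_expr(expr: str) -> str:
    # A truncated expression can no longer be parsed back, so reject
    # oversized expressions up front instead of storing a prefix.
    if len(expr) > MAXIMUM_PROPERTY_SIZE:
        raise ValueError(
            f"expression too large to store as metadata: "
            f"{len(expr)} > {MAXIMUM_PROPERTY_SIZE}")
    return expr
```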
```scala
// Only parse the "version" field and ignore all others
@JsonIgnoreProperties(ignoreUnknown = true)
private[sql] case class YAMLVersion(version: String) extends Validatable {
```
Does it need to be Validatable? We only create YAMLVersion in MetricViewFactory.fromYAML, which already does validation.
remove all Validatable
cc @cloud-fan updated the PR based on comments, could you please take another look?
```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

private[sql] object JsonUtils {
```
```scala
private[sql] object YAMLVersion {
  private def validYAMLVersions: Set[String] = Set("0.1")

  def apply(version: String): YAMLVersion = {
```
Good catch. The actual version validation is at MetricViewFactory and this code is unused. There is also a test that confirms this: https://github.com/apache/spark/pull/53146/files#diff-23bf5ddc582ff6684f7cc8950a12f4d8e745ff3fb7b0142dd00015e1f159fc8aR144
```scala
def dimensions: Seq[ColumnBase]
def measures: Seq[ColumnBase]

def toCanonical: MetricView = {
```
Can you explain a bit more about how the canonical entities can help with metric view version evolution?
updated the comment
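One way to picture the role of the canonical entities (a hypothetical sketch, not the PR's code): each versioned schema is converted into a single shared in-memory form, so downstream analysis code never depends on a specific YAML version:

```python
def to_canonical_v01(parsed: dict) -> dict:
    # v0.1-specific shape -> canonical shape; a future v0.2 deserializer
    # would provide its own converter targeting the same canonical form,
    # leaving the rest of the pipeline untouched.
    return {
        "source": parsed["source"],
        "dimensions": parsed.get("dimensions", []),
        "measures": parsed.get("measures", []),
    }
```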
…rde/MetricViewCanonical.scala Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
@cloud-fan updated. :-)
thanks, merging to master!
…ution

What changes were proposed in this pull request?
This PR implements the command to create metric views and the analysis rule to resolve a metric view query:
- CREATE metric view
  - Add SQL grammar to support `WITH METRIC` when creating a view
  - Add dollar-quoted string support for YAML definitions
  - Implement CreateMetricViewCommand to analyze the view body
  - Use a table property to indicate that the view is a metric view, since Hive has no dedicated table type
- SELECT metric view
  - Update SessionCatalog to parse metric view definitions on read
  - Add a MetricViewPlanner utility to parse the YAML definition and construct an unresolved plan
  - Add a ResolveMetricView rule to substitute the dimension and measure references with actual expressions

NOTE: This PR depends on #53146. This PR also marks `org.apache.spark.sql.metricview` as an internal package.

Why are the changes needed?
[SPIP: Metrics & semantic modeling in Spark](https://docs.google.com/document/d/1xVTLijvDTJ90lZ_ujwzf9HvBJgWg0mY6cYM44Fcghl0/edit?tab=t.0#heading=h.4iogryr5qznc)

Does this PR introduce _any_ user-facing change?
No

How was this patch tested?
```
build/sbt "hive/testOnly org.apache.spark.sql.execution.SimpleMetricViewSuite"
build/sbt "hive/testOnly org.apache.spark.sql.hive.execution.HiveMetricViewSuite"
```

Was this patch authored or co-authored using generative AI tooling?
No

Closes #53158 from linhongliu-db/metric-view-create-and-select.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
This PR adds the complete serialization/deserialization infrastructure for parsing metric view YAML definitions:
- Add Jackson YAML dependencies to pom.xml
- Implement the canonical model for metric views: Column, Expression (Dimension/Measure), MetricView, Source; YAMLVersion validation and exception types
- Implement version-specific serde (v0.1): YAML deserializer/serializer and base classes for extensibility
- Add JSON utilities for metadata serialization
Why are the changes needed?
SPIP: Metrics & semantic modeling in Spark
Does this PR introduce any user-facing change?
No
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>