
[Feature] Support Spark expression: years #3131

@andygrove

Description


What is the problem the feature request solves?

Note: This issue was generated with AI assistance. The specification details have been extracted from Spark documentation and may need verification.

Comet does not currently support the Spark years function, causing queries using this function to fall back to Spark's JVM execution instead of running natively on DataFusion.

The Years expression is a v2 partition transform that extracts the year component from date/timestamp values for partitioning purposes. It converts temporal data into integer year values, enabling efficient time-based partitioning strategies in Spark SQL tables.

Supporting this expression would allow more Spark workloads to benefit from Comet's native acceleration.

Describe the potential solution

Spark Specification

Syntax:

-- SQL: partition transform in DDL
YEARS(column_name)

// DataFrame API (catalyst expression)
Years(col("date_column"))

Arguments:

  • child (Expression): the input expression, typically a date or timestamp column

Return Type: IntegerType - Returns the year as an integer value.

Supported Data Types:

  • DateType
  • TimestampType
  • TimestampNTZType (timestamp without timezone)
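Spark stores DateType values internally as an i32 count of days since the Unix epoch, so for date inputs the expression reduces to a days-to-year conversion. The sketch below uses Howard Hinnant's civil-from-days algorithm to illustrate the required semantics; the function name is illustrative and this is not Comet's actual code.

```rust
// Spark stores DateType as an i32 count of days since the Unix epoch
// (1970-01-01); year extraction is then a pure days-to-year conversion.
// Sketch of that conversion (Hinnant's civil-from-days algorithm),
// shown for illustration only.
fn year_from_epoch_days(days: i32) -> i32 {
    let z = days as i64 + 719_468; // shift epoch from 1970-01-01 to 0000-03-01
    let era = (if z >= 0 { z } else { z - 146_096 }) / 146_097;
    let doe = z - era * 146_097;                       // day of era [0, 146096]
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let y = yoe + era * 400;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, Mar-1-based
    let mp = (5 * doy + 2) / 153;                      // month index, March = 0
    let m = if mp < 10 { mp + 3 } else { mp - 9 };     // civil month 1..12
    (if m <= 2 { y + 1 } else { y }) as i32
}
```

This proleptic Gregorian conversion matches java.time, which is also the calendar Spark uses for dates since 3.0, so pre-epoch (negative) day counts round correctly.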

Edge Cases:

  • Null handling: Returns null when the input expression is null
  • Invalid dates: Follows Spark's standard date parsing and validation rules
  • Year boundaries: Correctly handles leap years and year transitions
  • Timezone effects: For timestamp inputs, the year extraction respects the session timezone setting
  • Historical dates: Supports dates across the full range supported by Spark's date types
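The timezone edge case is worth spelling out: Spark stores TimestampType as i64 microseconds since the Unix epoch (UTC), so year extraction must first shift into the session timezone and then floor to whole days. A minimal sketch, assuming a fixed offset (real code needs a timezone database to handle DST and historical offsets, and the names here are illustrative):

```rust
// Spark TimestampType is i64 microseconds since the Unix epoch (UTC).
// The fixed tz_offset_micros parameter is an illustrative simplification
// of the session-timezone adjustment.
const MICROS_PER_DAY: i64 = 86_400_000_000;

fn epoch_days_from_micros(micros: i64, tz_offset_micros: i64) -> i32 {
    let local = micros + tz_offset_micros;
    // div_euclid floors, so pre-epoch timestamps land on the correct day
    local.div_euclid(MICROS_PER_DAY) as i32
}
```

The resulting day count then feeds the same days-to-year conversion used for DateType. Null inputs need no special handling at this level: the kernel is only applied to non-null values, so nulls propagate as required.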

Examples:

-- Creating a table partitioned by years (partition transforms require a
-- v2 catalog that supports them, e.g. Iceberg)
CREATE TABLE events (
  id BIGINT,
  event_time TIMESTAMP,
  data STRING
) USING ICEBERG
PARTITIONED BY (YEARS(event_time))

-- Query that benefits from partition pruning
SELECT * FROM events 
WHERE event_time >= '2023-01-01' AND event_time < '2024-01-01'
// DataFrame API usage in partition transforms
import org.apache.spark.sql.functions.{col, years}

// DataFrameWriterV2: create a table partitioned by the years transform.
// Under the hood this produces the catalyst expression
// org.apache.spark.sql.catalyst.expressions.Years that Comet must handle.
df.writeTo("catalog.db.events")
  .partitionedBy(years(col("event_time")))
  .create()

Implementation Approach

See the Comet guide on adding new expressions for detailed instructions.

  1. Scala Serde: Add expression handler in spark/src/main/scala/org/apache/comet/serde/
  2. Register: Add to appropriate map in QueryPlanSerde.scala
  3. Protobuf: Add message type in native/proto/src/proto/expr.proto if needed
  4. Rust: Implement in native/spark-expr/src/ (check if DataFusion has built-in support first)
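If step 3 does turn out to require a new message, it could look something like the sketch below; the message name and field number are assumptions for illustration, not Comet's actual expr.proto schema.

```proto
// Illustrative sketch only; not Comet's actual schema.
message Years {
  Expr child = 1;  // date or timestamp input
}
```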

Additional context

Difficulty: Medium
Spark Expression Class: org.apache.spark.sql.catalyst.expressions.Years

Related:

  • Months - Monthly partition transform
  • Days - Daily partition transform
  • Hours - Hourly partition transform
  • Bucket - Hash-based partition transform
  • PartitionTransformExpression - Base class for partition transforms

This issue was auto-generated from Spark reference documentation.
