[SPARK-50264][SQL][CONNECT] Add missing methods to DataStreamWriter
### What changes were proposed in this pull request?
We missed a few methods (`format`, `partitionBy`, and `clusterBy`) when we introduced the `DataStreamWriter` interface. This PR adds them back.
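
For illustration only, a minimal sketch of one of the restored methods chained on a hypothetical streaming Dataset `events` (names and sink are illustrative, not from this patch):

```scala
// `format` is one of the methods this PR adds back; `partitionBy` and
// `clusterBy` (shown in the diff below) are the others.
events.writeStream
  .format("console")
  .queryName("events-to-console")
  .start()
```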

### Why are the changes needed?
The `DataStreamWriter` interface must expose all user-facing methods.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#48796 from hvanhovell/SPARK-50264.

Authored-by: Herman van Hovell <herman@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
hvanhovell committed Nov 10, 2024
1 parent 655061a · commit 6fd77b1
Showing 1 changed file with 37 additions and 0 deletions.
@@ -93,6 +93,43 @@ abstract class DataStreamWriter[T] extends WriteConfigMethods[DataStreamWriter[T
*/
def queryName(queryName: String): this.type

/**
* Specifies the underlying output data source.
*
* @since 2.0.0
*/
def format(source: String): this.type

/**
* Partitions the output by the given columns on the file system. If specified, the output is
* laid out on the file system similar to Hive's partitioning scheme. As an example, when we
* partition a dataset by year and then month, the directory layout would look like:
*
* <ul> <li> year=2016/month=01/</li> <li> year=2016/month=02/</li> </ul>
*
* Partitioning is one of the most widely used techniques to optimize physical data layout. It
* provides a coarse-grained index for skipping unnecessary data reads when queries have
* predicates on the partitioned columns. In order for partitioning to work well, the number of
* distinct values in each column should typically be less than tens of thousands.
*
* @since 2.0.0
*/
@scala.annotation.varargs
def partitionBy(colNames: String*): this.type
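
// Illustrative sketch, not part of this commit: partitioning a streaming
// Parquet sink by low-cardinality date columns produces the Hive-style
// layout described above (year=2016/month=01/, ...) under the sink path.
// `events` and the paths are hypothetical.
//
//   events.writeStream
//     .format("parquet")
//     .partitionBy("year", "month")
//     .option("checkpointLocation", "/tmp/events-ckpt")
//     .start("/tmp/events")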

/**
* Clusters the output by the given columns. If specified, the output is laid out such that
* records with similar values on the clustering column are grouped together in the same file.
*
* Clustering improves query efficiency by allowing queries with predicates on the clustering
* columns to skip unnecessary data. Unlike partitioning, clustering can be used on very high
* cardinality columns.
*
* @since 4.0.0
*/
@scala.annotation.varargs
def clusterBy(colNames: String*): this.type
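
// Illustrative sketch, not part of this commit: clustering the same sink by
// a high-cardinality column instead; records with similar `userId` values
// are grouped within files rather than split into one directory per value.
// Assumes the chosen sink supports clustered writes.
//
//   events.writeStream
//     .format("parquet")
//     .clusterBy("userId")
//     .option("checkpointLocation", "/tmp/events-ckpt")
//     .start("/tmp/events")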

/**
* Sets the output of the streaming query to be processed using the provided writer object. See
* [[org.apache.spark.sql.ForeachWriter]] for more details on the lifecycle and