[SPARK-15517][SQL][STREAMING] Add support for complete output mode in Structure Streaming #13286

tdas · 2016-05-24T23:02:44Z

What changes were proposed in this pull request?

Currently structured streaming only supports append output mode. This PR adds the following.

- Added support for Complete output mode in the internal state store, analyzer and planner.
- Added public API in Scala and Python for users to specify output mode.
- Added checks for unsupported combinations of output mode and DF operations:
  - Plans with no aggregation should support only Append mode
  - Plans with aggregation should support only Update and Complete modes
  - Default output mode is Append mode (Question: should we change this to automatically set to Complete mode when there is aggregation?)
- Added support for Complete output mode in Memory Sink. Memory Sink internally supports Append, Complete and Update modes, but only Complete and Append are exposed through the public API.
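
The compatibility rules listed above can be sketched in a few lines of plain Python. This is only an illustration of the rule table, not the analyzer code in this PR: `OutputMode`, `UnsupportedOperationError` and `check_output_mode` are hypothetical names introduced here, and the real check lives in `UnsupportedOperationChecker.scala`.

```python
from enum import Enum


class OutputMode(Enum):
    """Hypothetical stand-in for the output modes named in this PR."""
    APPEND = "append"
    COMPLETE = "complete"
    UPDATE = "update"


class UnsupportedOperationError(Exception):
    """Hypothetical error type used only in this sketch."""


def check_output_mode(has_streaming_aggregation: bool, mode: OutputMode) -> None:
    """Raise if the plan shape and the output mode are an unsupported pair."""
    if has_streaming_aggregation:
        # Plans with aggregation: only Update and Complete.
        allowed = {OutputMode.UPDATE, OutputMode.COMPLETE}
    else:
        # Plans with no aggregation: only Append.
        allowed = {OutputMode.APPEND}
    if mode not in allowed:
        kind = "with" if has_streaming_aggregation else "without"
        raise UnsupportedOperationError(
            f"{mode.value} output mode is not supported "
            f"for plans {kind} streaming aggregations"
        )
```

Calling `check_output_mode(True, OutputMode.APPEND)` raises, which also shows why the default of Append is awkward for aggregating queries, as the question in the description points out.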

How was this patch tested?

Unit tests in various test suites

- StreamingAggregationSuite: tests for complete mode
- MemorySinkSuite: tests for checking behavior in Append and Complete modes.
- UnsupportedOperationSuite: tests for checking unsupported combinations of DF ops and output modes
- DataFrameReaderWriterSuite: tests for checking that output mode cannot be called on static DFs
- Python doc test and existing unit tests modified to call write.outputMode.


tdas · 2016-05-24T23:03:47Z

...lyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala

-          if (moreStreamingAggregates.nonEmpty) {
-            throwError("Multiple streaming aggregations are not supported with " +
-              "streaming DataFrames/Datasets")
-          }


This has been moved around to better consolidate all the logic related to output modes and aggregations.

SparkQA · 2016-05-24T23:10:11Z

Test build #59232 has finished for PR 13286 at commit 61af057.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2016-05-24T23:18:47Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingAggregationSuite.scala

+        .agg(count("*"))
+        .as[(Int, Long)]
+
+    intercept[AnalysisException] {


add checks on message

SparkQA · 2016-05-24T23:39:52Z

Test build #59235 has finished for PR 13286 at commit a6e2bb5.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class ListFilesCommand(files: Seq[String] = Seq.empty[String]) extends RunnableCommand
- case class ListJarsCommand(jars: Seq[String] = Seq.empty[String]) extends RunnableCommand

zsxwing · 2016-05-24T23:50:58Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala

+    if (latestBatchId.isEmpty || batchId > latestBatchId.get) {
+      logDebug(s"Committing batch $batchId to $this")
+      outputMode match {
+        case OutputMode.Append | OutputMode.Update =>


nit: Since we don't support OutputMode.Update, could you remove it? I think it would need different logic even if we add it in the future.

I am wondering whether we should support update mode for memory sink.
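
For readers following the memory sink discussion, here is a minimal pure-Python sketch of the batch-commit logic under review. It assumes that Append accumulates rows while Complete replaces the stored rows with the latest full result; `MemorySinkSketch` and `add_batch` are hypothetical names, and Update is left out, in line with the nit above.

```python
from typing import List, Optional


class MemorySinkSketch:
    """Toy in-memory sink; rows are modelled as plain tuples."""

    def __init__(self, output_mode: str) -> None:
        if output_mode not in ("append", "complete"):
            raise ValueError(f"unsupported output mode: {output_mode}")
        self.output_mode = output_mode
        self.latest_batch_id: Optional[int] = None
        self.rows: List[tuple] = []

    def add_batch(self, batch_id: int, batch: List[tuple]) -> bool:
        """Commit a batch; return False if it was skipped as already committed."""
        # Mirrors the guard in the diff: only commit batches newer than the last one.
        if self.latest_batch_id is not None and batch_id <= self.latest_batch_id:
            return False
        if self.output_mode == "append":
            self.rows.extend(batch)   # keep everything seen so far
        else:
            self.rows = list(batch)   # assumed: complete mode replaces the contents
        self.latest_batch_id = batch_id
        return True
```

The batch-id guard makes a replayed batch a no-op, which matters for Append in particular, since re-adding a batch there would duplicate rows.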

zsxwing · 2016-05-24T23:54:18Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/ContinuousQueryManagerSuite.scala

+            query =
+              df.write
+                .format("memory")
+                .option("checkpointLocation", "memory")


nit: "memory" -> metadataRoot

zsxwing · 2016-05-25T00:34:54Z

Finished my first round. Looks pretty good. Just some nits.

zsxwing · 2016-05-25T00:37:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/OutputMode.java

-
-case object Append extends OutputMode
-case object Update extends OutputMode
+public enum OutputMode {


nit: @Experimental

SparkQA · 2016-05-25T00:52:59Z

Test build #59236 has finished for PR 13286 at commit bb0314d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-05-25T01:02:23Z

Test build #59239 has finished for PR 13286 at commit 3a79d41.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-05-25T03:34:28Z

Test build #59253 has finished for PR 13286 at commit 074299c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

marmbrus · 2016-05-27T20:24:19Z

python/pyspark/sql/readwriter.py

+        * `append`:Only the new rows in the streaming DataFrame/Dataset will be written to
+           the sink
+        * `complete`:All the rows in the streaming DataFrame/Dataset will be written to the sink
+           every time these is some updates


each time the trigger fires?

I want to write something that makes sense generally, without understanding trigger and all. As is, since the trigger is optional, one does not need to know about triggers at all to start running stuff in structured streaming.
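
One way to picture the two docstring entries without mentioning triggers is a toy simulation in plain Python. It makes no Spark calls; `append_mode_outputs` and `complete_mode_outputs` are hypothetical helpers, and "batch" here simply means one round of newly arrived data.

```python
from collections import Counter
from typing import Dict, Iterable, List


def append_mode_outputs(batches: Iterable[List[str]]) -> List[List[str]]:
    """Append: each round, the sink receives only the new rows."""
    return [list(batch) for batch in batches]


def complete_mode_outputs(batches: Iterable[List[str]]) -> List[Dict[str, int]]:
    """Complete: each round, the sink receives the whole aggregated result."""
    counts: Counter = Counter()
    outputs = []
    for batch in batches:
        counts.update(batch)          # running count per key
        outputs.append(dict(counts))  # snapshot of the full table
    return outputs
```

With input `[["a", "b"], ["a"]]`, the append version hands over `["a", "b"]` and then `["a"]`, while the complete version hands over the full count table after each round.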

marmbrus · 2016-05-27T21:12:04Z

LGTM with a few comments.

SparkQA · 2016-05-28T00:48:22Z

Test build #59541 has finished for PR 13286 at commit 4784e18.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-05-28T02:17:05Z

Test build #59542 has finished for PR 13286 at commit e951798.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2016-05-28T03:00:12Z

@rxin @marmbrus Are you okay with current OutputMode design? Add unit test for Java compatibility of OutputMode.

marmbrus · 2016-05-31T22:52:42Z

LGTM

… Structure Streaming

## What changes were proposed in this pull request?

Currently structured streaming only supports append output mode. This PR adds the following.

- Added support for Complete output mode in the internal state store, analyzer and planner.
- Added public API in Scala and Python for users to specify output mode
- Added checks for unsupported combinations of output mode and DF operations
  - Plans with no aggregation should support only Append mode
  - Plans with aggregation should support only Update and Complete modes
  - Default output mode is Append mode (**Question: should we change this to automatically set to Complete mode when there is aggregation?**)
- Added support for Complete output mode in Memory Sink. So Memory Sink internally supports append and complete, update. But from public API only Complete and Append output modes are supported.

## How was this patch tested?

Unit tests in various test suites
- StreamingAggregationSuite: tests for complete mode
- MemorySinkSuite: tests for checking behavior in Append and Complete modes.
- UnsupportedOperationSuite: tests for checking unsupported combinations of DF ops and output modes
- DataFrameReaderWriterSuite: tests for checking that output mode cannot be called on static DFs
- Python doc test and existing unit tests modified to call write.outputMode.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13286 from tdas/complete-mode.

(cherry picked from commit 90b1143)
Signed-off-by: Michael Armbrust <michael@databricks.com>

techaddict · 2016-06-02T06:35:05Z

@tdas @marmbrus this is failing dev/lint-java
So we should change Append and Complete to append and complete

[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[41,28] (naming) MethodName: Method name 'Append' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[52,28] (naming) MethodName: Method name 'Complete' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.

I've created a PR to fix this #13464
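
To see why the linter rejects these names, the pattern from the error message can be tried directly. The snippet below is a small Python check using the standard `re` module, not the Checkstyle implementation itself; `violates_method_name_rule` is a hypothetical helper.

```python
import re

# Pattern copied from the lint-java error message above.
METHOD_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9][a-zA-Z0-9_]*$")


def violates_method_name_rule(name: str) -> bool:
    """Return True if the name would be flagged by the MethodName pattern."""
    return METHOD_NAME_PATTERN.match(name) is None
```

`Append` and `Complete` are flagged because the pattern requires a lowercase first character, whereas `append` and `complete` pass.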

tdas · 2016-06-02T09:58:53Z

The method naming with caps was intentional. We need to introduce an exception rule for lint-java in this case.

techaddict · 2016-06-02T10:09:27Z

@tdas updated my PR with exclusions for Append and Complete

srowen · 2016-06-02T16:04:36Z

sql/catalyst/src/main/java/org/apache/spark/sql/OutputMode.java

+   *
+   * @since 2.0.0
+   */
+  public static OutputMode Append() {


See #13464 -- this fails Java lint. Can this be append() as would be conventional in Java? I don't see that it's there to implement some interface

## What changes were proposed in this pull request?

revived #13464

Fix Java Lint errors introduced by #13286 and #13280

Before:
```
Using `mvn` from path: /Users/pichu/Project/spark/build/apache-maven-3.3.9/bin/mvn
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[340,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[341,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[342,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[343,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[41,28] (naming) MethodName: Method name 'Append' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[52,28] (naming) MethodName: Method name 'Complete' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[61,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.PrimitiveType.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[62,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.Type.
```

## How was this patch tested?

ran `dev/lint-java` locally

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13559 from techaddict/minor-3.

(cherry picked from commit f958c1c)
Signed-off-by: Sean Owen <sowen@cloudera.com>


tdas added 5 commits May 23, 2016 16:40

First commit to support complete mode

469d69a

Add public API for output mode and upgraded memory sink to support complete mode

49746f4

Added unit test for MemorySink

2786090

Added unit test to DataFrameReaderWriterSuite

02b10ac

Added python API for output mode

61af057

tdas reviewed May 24, 2016

Merge remote-tracking branch 'apache-github/master' into complete-mode

a6e2bb5

Fixed python style

bb0314d

zsxwing reviewed May 24, 2016

Refactored injection of output mode in StateStoreSaveExec

3a79d41

zsxwing reviewed May 24, 2016

zsxwing reviewed May 25, 2016

Fixed test bug

074299c

marmbrus reviewed May 27, 2016

Addressed comments

4784e18

Fixed RAT

e951798

asfgit closed this in 90b1143 May 31, 2016

techaddict mentioned this pull request Jun 2, 2016

[Minor] Fix Java Lint errors introduced by #13286 and 13280 #13464

Closed

srowen reviewed Jun 2, 2016

techaddict mentioned this pull request Jun 8, 2016

[Minor] Fix Java Lint errors introduced by #13286 and #13280 #13559

Closed
