Conversation

@tdas tdas (Contributor) commented May 24, 2016

What changes were proposed in this pull request?

Currently, structured streaming supports only the append output mode. This PR adds the following.

  • Added support for Complete output mode in the internal state store, analyzer and planner.
  • Added public API in Scala and Python for users to specify output mode
  • Added checks for unsupported combinations of output mode and DF operations
    • Plans with no aggregation should support only Append mode
    • Plans with aggregation should support only Update and Complete modes
    • Default output mode is Append mode (Question: should we change this to automatically set to Complete mode when there is aggregation?)
  • Added support for Complete output mode in Memory Sink. Memory Sink internally supports Append, Complete, and Update, but the public API exposes only the Append and Complete output modes.
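The mode/operation compatibility rules in the bullets above can be sketched as a small standalone check. This is a plain-Python illustration with hypothetical names (`check_output_mode`, `AnalysisError`); the real logic lives in Spark's analyzer checks, not in this form.

```python
# Toy sketch of the output-mode compatibility rules described above.
# Names here (check_output_mode, AnalysisError) are illustrative, not Spark's.

class AnalysisError(Exception):
    pass

def check_output_mode(has_aggregation: bool, mode: str) -> None:
    """Raise if the (aggregation, output mode) combination is unsupported."""
    if not has_aggregation and mode != "append":
        raise AnalysisError(
            f"Output mode '{mode}' is supported only on aggregated streams")
    if has_aggregation and mode not in ("update", "complete"):
        raise AnalysisError(
            "Aggregated streaming queries support only 'update' or 'complete'")

# Non-aggregated plans: only append is allowed.
check_output_mode(has_aggregation=False, mode="append")
# Aggregated plans: complete (or update) is allowed, append is not.
check_output_mode(has_aggregation=True, mode="complete")
```

Note that "append" being the only valid choice for non-aggregated plans is exactly why the default output mode question above matters.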

How was this patch tested?

Unit tests in various test suites

  • StreamingAggregationSuite: tests for complete mode
  • MemorySinkSuite: tests for checking behavior in Append and Complete modes.
  • UnsupportedOperationSuite: tests for checking unsupported combinations of DF ops and output modes
  • DataFrameReaderWriterSuite: tests for checking that output mode cannot be called on static DFs
  • Python doc test and existing unit tests modified to call write.outputMode.

```scala
if (moreStreamingAggregates.nonEmpty) {
  throwError("Multiple streaming aggregations are not supported with " +
    "streaming DataFrames/Datasets")
}
```
Contributor Author

This has been moved around to better consolidate all the logic related to output modes and aggregations.
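The guard quoted above rejects plans with more than one streaming aggregation. A minimal sketch of that check over a toy plan tree (the `Node` class and function names are illustrative, not Spark's Catalyst types):

```python
# Toy plan tree: count Aggregate nodes and reject more than one, mirroring
# the moreStreamingAggregates guard. Node names are illustrative.

class Node:
    def __init__(self, name, *children):
        self.name = name
        self.children = list(children)

def count_aggregates(plan: Node) -> int:
    n = 1 if plan.name == "Aggregate" else 0
    return n + sum(count_aggregates(c) for c in plan.children)

def check_single_aggregation(plan: Node) -> None:
    if count_aggregates(plan) > 1:
        raise ValueError("Multiple streaming aggregations are not supported "
                         "with streaming DataFrames/Datasets")

single = Node("Aggregate", Node("Source"))
check_single_aggregation(single)  # one aggregation: accepted
```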

@SparkQA

SparkQA commented May 24, 2016

Test build #59232 has finished for PR 13286 at commit 61af057.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
  .agg(count("*"))
  .as[(Int, Long)]

intercept[AnalysisException] {
```
Contributor Author

add checks on message
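The "checks on message" suggestion — asserting on the exception's text, not just its type — can be sketched in plain Python. The raising function here is a stand-in, not Spark code:

```python
# Sketch of checking the exception *message*, not just its type, analogous
# to tightening intercept[AnalysisException] above. run_unsupported_query
# is a hypothetical stand-in for a query that fails analysis.

def run_unsupported_query():
    raise ValueError("Complete output mode not supported for this query")

def intercept_with_message(fn, expected_substring: str):
    """Run fn, require it to raise, and require the message to match."""
    try:
        fn()
    except Exception as e:
        assert expected_substring in str(e), f"unexpected message: {e}"
        return e
    raise AssertionError("expected an exception, none was raised")

err = intercept_with_message(run_unsupported_query, "Complete output mode")
```

The payoff is that a test fails loudly if the operation starts raising a *different* error for the wrong reason.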

@SparkQA

SparkQA commented May 24, 2016

Test build #59235 has finished for PR 13286 at commit a6e2bb5.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ListFilesCommand(files: Seq[String] = Seq.empty[String]) extends RunnableCommand
    • case class ListJarsCommand(jars: Seq[String] = Seq.empty[String]) extends RunnableCommand

```scala
if (latestBatchId.isEmpty || batchId > latestBatchId.get) {
  logDebug(s"Committing batch $batchId to $this")
  outputMode match {
    case OutputMode.Append | OutputMode.Update =>
```
Member

nit: Since we don't support OutputMode.Update, could you remove it? I think it will have different logic even if we add it in the future.

Contributor Author

I am wondering whether we should support update mode for memory sink.
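The distinction under discussion — how a memory sink commits a batch in Append versus Complete mode — can be sketched with a toy sink. Class and method names here are illustrative, not Spark's MemorySink API:

```python
# Toy in-memory sink contrasting the two commit behaviors: append keeps
# accumulating batches, complete replaces the entire result each time.

class ToyMemorySink:
    def __init__(self, output_mode: str):
        assert output_mode in ("append", "complete")
        self.output_mode = output_mode
        self.rows = []
        self.latest_batch_id = None

    def add_batch(self, batch_id: int, rows):
        # Ignore replayed batches, as in the latestBatchId check above.
        if self.latest_batch_id is not None and batch_id <= self.latest_batch_id:
            return
        self.latest_batch_id = batch_id
        if self.output_mode == "append":
            self.rows.extend(rows)   # keep old output, add the new rows
        else:
            self.rows = list(rows)   # complete: replace the entire result

append_sink = ToyMemorySink("append")
append_sink.add_batch(0, [1, 2])
append_sink.add_batch(1, [3])        # accumulates: [1, 2, 3]

complete_sink = ToyMemorySink("complete")
complete_sink.add_batch(0, [1, 2])
complete_sink.add_batch(1, [3])      # replaced: [3]
```

Update mode would need yet a third behavior (merge changed rows by key), which is why it is a separate design question here.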

```scala
query =
  df.write
    .format("memory")
    .option("checkpointLocation", "memory")
```
Member

nit: "memory" -> metadataRoot

@zsxwing
Member

zsxwing commented May 25, 2016

Finished my first round. Looks pretty good. Just some nits.


```scala
case object Append extends OutputMode
case object Update extends OutputMode
```

```java
public enum OutputMode {
```
Member

nit: @Experimental

@SparkQA

SparkQA commented May 25, 2016

Test build #59236 has finished for PR 13286 at commit bb0314d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 25, 2016

Test build #59239 has finished for PR 13286 at commit 3a79d41.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 25, 2016

Test build #59253 has finished for PR 13286 at commit 074299c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
* `append`: Only the new rows in the streaming DataFrame/Dataset will be written to
  the sink
* `complete`: All the rows in the streaming DataFrame/Dataset will be written to the sink
  every time there are some updates
```
Contributor

each time the trigger fires?

Contributor Author

I want to write something that makes sense generally, without requiring an understanding of triggers. As is, since the trigger is optional, one does not need to know about triggers at all to start running things in structured streaming.

@marmbrus
Contributor

LGTM with a few comments.

@SparkQA

SparkQA commented May 28, 2016

Test build #59541 has finished for PR 13286 at commit 4784e18.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 28, 2016

Test build #59542 has finished for PR 13286 at commit e951798.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor Author

tdas commented May 28, 2016

@rxin @marmbrus Are you okay with the current OutputMode design? I added a unit test for Java compatibility of OutputMode.

@marmbrus
Contributor

LGTM

asfgit pushed a commit that referenced this pull request May 31, 2016
… Structure Streaming


Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13286 from tdas/complete-mode.

(cherry picked from commit 90b1143)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@asfgit asfgit closed this in 90b1143 May 31, 2016
@techaddict
Contributor

@tdas @marmbrus this is failing dev/lint-java, so we should change Append and Complete to append and complete:

```
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[41,28] (naming) MethodName: Method name 'Append' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[52,28] (naming) MethodName: Method name 'Complete' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
```

I've created a PR to fix this #13464
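The MethodName pattern quoted in the lint errors can be checked directly, which shows why `Append` fails while `append` passes. A quick sketch:

```python
import re

# Checkstyle's MethodName pattern as quoted in the lint errors above.
METHOD_NAME = re.compile(r"^[a-z][a-z0-9][a-zA-Z0-9_]*$")

assert METHOD_NAME.match("append")         # lowercase first letter: passes
assert not METHOD_NAME.match("Append")     # capitalized: fails the lint rule
assert METHOD_NAME.match("complete")
assert not METHOD_NAME.match("Complete")
```

So the fix is either renaming the factory methods or, as the thread settles on, adding an exclusion for these two names.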

@tdas
Contributor Author

tdas commented Jun 2, 2016

The method naming with caps was intentional. We need to introduce an exception rule for lint-java in this case.

@techaddict
Contributor

@tdas updated my PR with exclusions for Append and Complete.

```java
 *
 * @since 2.0.0
 */
public static OutputMode Append() {
```
Member

See #13464 -- this fails Java lint. Can this be append(), as would be conventional in Java? I don't see that it's there to implement some interface.

asfgit pushed a commit that referenced this pull request Jun 8, 2016
## What changes were proposed in this pull request?

revived #13464

Fix Java Lint errors introduced by #13286 and #13280
Before:
```
Using `mvn` from path: /Users/pichu/Project/spark/build/apache-maven-3.3.9/bin/mvn
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[340,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[341,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[342,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[343,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[41,28] (naming) MethodName: Method name 'Append' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[52,28] (naming) MethodName: Method name 'Complete' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[61,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.PrimitiveType.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[62,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.Type.
```

## How was this patch tested?
ran `dev/lint-java` locally

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13559 from techaddict/minor-3.
asfgit pushed a commit that referenced this pull request Jun 8, 2016

(cherry picked from commit f958c1c)
Signed-off-by: Sean Owen <sowen@cloudera.com>
zjffdu pushed a commit to zjffdu/spark that referenced this pull request Jun 10, 2016

