-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-15517][SQL][STREAMING] Add support for complete output mode in Structure Streaming #13286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
469d69a
49746f4
2786090
02b10ac
61af057
a6e2bb5
bb0314d
3a79d41
074299c
58f88b8
1d0d13c
bbd6022
4973621
369e9d5
ab32567
85ce263
4784e18
e951798
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -500,6 +500,26 @@ def mode(self, saveMode): | |
| self._jwrite = self._jwrite.mode(saveMode) | ||
| return self | ||
|
|
||
| @since(2.0) | ||
| def outputMode(self, outputMode): | ||
| """Specifies how data of a streaming DataFrame/Dataset is written to a streaming sink. | ||
|
|
||
| Options include: | ||
|
|
||
| * `append`:Only the new rows in the streaming DataFrame/Dataset will be written to | ||
| the sink | ||
| * `complete`:All the rows in the streaming DataFrame/Dataset will be written to the sink | ||
| every time these is some updates | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. each time the trigger fires?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I want to write something that makes sense generally, without understanding trigger and all. As is, since the trigger is optional, one does not need to know about triggers at all to start running stuff in structured streaming. |
||
|
|
||
| .. note:: Experimental. | ||
|
|
||
| >>> writer = sdf.write.outputMode('append') | ||
| """ | ||
| if not outputMode or type(outputMode) != str or len(outputMode.strip()) == 0: | ||
| raise ValueError('The output mode must be a non-empty string. Got: %s' % outputMode) | ||
| self._jwrite = self._jwrite.outputMode(outputMode) | ||
| return self | ||
|
|
||
| @since(1.4) | ||
| def format(self, source): | ||
| """Specifies the underlying output data source. | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,54 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.spark.sql; | ||
|
|
||
| import org.apache.spark.annotation.Experimental; | ||
|
|
||
| /** | ||
| * :: Experimental :: | ||
| * | ||
| * OutputMode is used to what data will be written to a streaming sink when there is | ||
| * new data available in a streaming DataFrame/Dataset. | ||
| * | ||
| * @since 2.0.0 | ||
| */ | ||
| @Experimental | ||
| public class OutputMode { | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. add docs
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This doesn't have to be in
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure. Just wanted to be extra-cautious for java-safety.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But there are no tests written in java :)
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. one downside of writing this in java is that it doesn't show up in scaladocs
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We also need a trait OutputMode that InternalOutputModes.Append will extend. Should that be a Scala trait or Java interface?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe they are equivalent when there are no implemented methods.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, writing it in Scala does not work because we need define static methods on OutputMode ('Append()', etc.) as well as the OutputMode interface that singleton object Append will extend. To do this in Scala we have to do But this makes the OutputMode static object unusable from Java, as you will have to call So doing this in Java is the cleanest way for it to be usable in both Java and Scala. I am including a JavaOutputModeSuite as well. |
||
|
|
||
| /** | ||
| * OutputMode in which only the new rows in the streaming DataFrame/Dataset will be | ||
| * written to the sink. This output mode can be only be used in queries that do not | ||
| * contain any aggregation. | ||
| * | ||
| * @since 2.0.0 | ||
| */ | ||
| public static OutputMode Append() { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See #13464 -- this fails Java lint. Can this be |
||
| return InternalOutputModes.Append$.MODULE$; | ||
| } | ||
|
|
||
| /** | ||
| * OutputMode in which all the rows in the streaming DataFrame/Dataset will be written | ||
| * to the sink every time these is some updates. This output mode can only be used in queries | ||
| * that contain aggregations. | ||
| * | ||
| * @since 2.0.0 | ||
| */ | ||
| public static OutputMode Complete() { | ||
| return InternalOutputModes.Complete$.MODULE$; | ||
| } | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,45 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.spark.sql | ||
|
|
||
| /** | ||
| * Internal helper class to generate objects representing various [[OutputMode]]s, | ||
| */ | ||
| private[sql] object InternalOutputModes { | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. add docs.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If this is going to be internal we should probably just move it to
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. They are in catalyst. Still move it to execution.streaming?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. oh, mostly I don't think they need to be in
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Moving them to org.apache.spark.streaming in catalyst.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Based on offline discussion with @rxin, will do this move in a later PR once we decide globally which classes should be moved to what package. |
||
|
|
||
| /** | ||
| * OutputMode in which only the new rows in the streaming DataFrame/Dataset will be | ||
| * written to the sink. This output mode can be only be used in queries that do not | ||
| * contain any aggregation. | ||
| */ | ||
| case object Append extends OutputMode | ||
|
|
||
| /** | ||
| * OutputMode in which all the rows in the streaming DataFrame/Dataset will be written | ||
| * to the sink every time these is some updates. This output mode can only be used in queries | ||
| * that contain aggregations. | ||
| */ | ||
| case object Complete extends OutputMode | ||
|
|
||
| /** | ||
| * OutputMode in which only the rows in the streaming DataFrame/Dataset that were updated will be | ||
| * written to the sink every time these is some updates. This output mode can only be used in | ||
| * queries that contain aggregations. | ||
| */ | ||
| case object Update extends OutputMode | ||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add
.. note:: Experimental.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.