Conversation

@tdas (Contributor) commented Mar 28, 2016

What changes were proposed in this pull request?

The goal is to make the state store more flexible by giving it a more key-value-store-like interface of gets and puts. This is to allow the streaming aggregation physical plan to easily do only-gets instead of the only-updates model that the current store is designed for.
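
A minimal sketch of what such a get/put interface could look like is below; the trait name and signatures are illustrative assumptions based on this description, not necessarily the exact API in the patch.

```scala
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

// Illustrative only: a key-value style store surface with explicit gets and puts.
trait KeyValueStateStore {
  /** Return the stored value for `key`, if any. */
  def get(key: UnsafeRow): Option[UnsafeRow]

  /** Insert or overwrite the value for `key`. */
  def put(key: UnsafeRow, value: UnsafeRow): Unit

  /** Commit all updates made so far and return the new store version. */
  def commit(): Long
}
```

With gets exposed, the streaming aggregation operators can read previously stored state for a key without being forced to rewrite it.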

How was this patch tested?

Unit tests.

@tdas force-pushed the state-store-update branch from 3646878 to ccb323f on March 28, 2016 21:09
@tdas (Contributor, Author) commented Mar 28, 2016

@marmbrus @zsxwing

store = StateStore.get(
  storeId, keySchema, valueSchema, storeVersion, storeConf, confBroadcast.value.value)
val inputIter = dataRDD.iterator(partition, ctxt)
storeUpdateFunction(store, inputIter)

@tdas (Contributor, Author) commented on this diff:

The finally block was removed to allow commits to be done lazily.
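
One way to picture the lazy-commit idea: instead of committing in a finally around the update function, attach the commit to exhaustion of the output iterator. The sketch below is illustrative only; the wrapper and its usage are assumptions, not the mechanism actually used in this PR.

```scala
// Illustrative only: defer commit() until the downstream consumer has fully drained
// the iterator produced by storeUpdateFunction, instead of committing in a finally.
def commitOnCompletion[T](iter: Iterator[T], commit: () => Unit): Iterator[T] =
  new Iterator[T] {
    private var committed = false
    override def hasNext: Boolean = {
      val more = iter.hasNext
      if (!more && !committed) { // data exhausted: commit exactly once
        commit()
        committed = true
      }
      more
    }
    override def next(): T = iter.next()
  }

// Hypothetical usage inside compute():
//   commitOnCompletion(storeUpdateFunction(store, inputIter), () => store.commit())
```

Spark's internal `CompletionIterator` utility offers a similar run-on-exhaustion hook.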

@SparkQA commented Mar 28, 2016

Test build #54360 has finished for PR 12013 at commit 3646878.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas (Contributor, Author) commented Mar 28, 2016

Not sure what this MiMa test failure is about. That's not on the latest commit anyway, so I will wait for the tests on the latest commit to complete (build #2703).

/** Commit all the updates that have been made to the store, and return the new version. */
override def commit(): Long = {
-  verify(state == UPDATING, "Cannot commit again after already committed or cancelled")
+  verify(state == UPDATING, "Cannot commit after already committed or cancelled")

@tdas (Contributor, Author) commented on this diff:

nit: cancelled -> aborted

@SparkQA commented Mar 28, 2016

Test build #2703 has finished for PR 12013 at commit ccb323f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 28, 2016

Test build #54361 has finished for PR 12013 at commit ccb323f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

override def put(key: UnsafeRow, value: UnsafeRow): Unit = {
  verify(state == UPDATING, "Cannot remove after already committed or cancelled")

Reviewer (Member) commented on this diff:

nit: remove -> put

@zsxwing (Member) commented Mar 28, 2016

LGTM except one nit

@SparkQA commented Mar 28, 2016

Test build #54369 has finished for PR 12013 at commit fb22c8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus (Contributor) commented:

Can we close this now?

asfgit pushed a commit that referenced this pull request Apr 1, 2016
This PR adds the ability to perform aggregations inside of a `ContinuousQuery`.  In order to implement this feature, the planning of aggregation has been augmented with a new `StatefulAggregationStrategy`.  Unlike batch aggregation, stateful aggregation uses the `StateStore` (introduced in #11645) to persist the results of partial aggregation across different invocations.  The resulting physical plan performs the aggregation using the following progression:
   - Partial Aggregation
   - Shuffle
   - Partial Merge (now there is at most 1 tuple per group)
   - StateStoreRestore (now there is 1 tuple from this batch + optionally one from the previous)
   - Partial Merge (now there is at most 1 tuple per group)
   - StateStoreSave (saves the tuple for the next batch)
   - Complete (output the current result of the aggregation)

The following refactoring was also performed to allow us to plug into existing code:
 - The get/put implementation is taken from #12013
 - The logic for breaking down and de-duping the physical execution of aggregation has been moved into a new pattern `PhysicalAggregation`
 - The `AttributeReference` used to identify the result of an `AggregateFunction` has been moved into the `AggregateExpression` container.  This change moves the reference into the same object as the other intermediate references used in aggregation and eliminates the need to pass around a `Map[(AggregateFunction, Boolean), Attribute]`.  Further clean up (using a different aggregation container for logical/physical plans) is deferred to a followup.
 - Some planning logic is moved from the `SessionState` into the `QueryExecution` to make it easier to override in the streaming case.
 - The ability to write a `StreamTest` that checks only the output of the last batch has been added to simulate the future addition of output modes.

Author: Michael Armbrust <michael@databricks.com>

Closes #12048 from marmbrus/statefulAgg.
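
To make the StateStoreRestore/StateStoreSave steps above concrete, here is a rough sketch of how a save step could fold new partial aggregates into previously stored state through a get/put style store. Everything here (`GetPutStore`, `saveBatch`, `merge`) is a hypothetical stand-in for illustration, not the actual StateStoreSave code, and the real `StateStore` API may differ (for example in how missing keys are reported).

```scala
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

// Minimal stand-in for the store's get/put surface used by this sketch.
trait GetPutStore {
  def get(key: UnsafeRow): Option[UnsafeRow]
  def put(key: UnsafeRow, value: UnsafeRow): Unit
  def commit(): Long
}

// Hypothetical save step: for every (group key, partial aggregate) pair left after
// the second partial merge, fold the new partial into the previously saved state,
// write it back, and commit the new store version.
def saveBatch(
    store: GetPutStore,
    partials: Iterator[(UnsafeRow, UnsafeRow)],
    merge: (UnsafeRow, UnsafeRow) => UnsafeRow): Long = {
  partials.foreach { case (key, partial) =>
    val updated = store.get(key) match {
      case Some(previous) => merge(previous, partial)
      case None           => partial
    }
    store.put(key, updated)
  }
  store.commit()
}
```

The restore step is the mirror image: it only needs `get`, which is exactly the flexibility the gets-and-puts interface from this PR adds over an update-only store.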
@tdas (Contributor, Author) commented Apr 4, 2016

Closing this as #12048 already absorbed this change.

@tdas closed this Apr 4, 2016