
Updating SQL Queries #138

Merged
merged 2 commits into master on Jun 15, 2023

Conversation

@jacksonrnewhouse (Contributor) commented May 26, 2023

This adds updating queries to the SQL frontend. This allows Arroyo to read Debezium sources, write Debezium to sinks, and build new types of pipelines that intelligently compute how a record has changed, if it in fact has. This is a rather large PR, so I'd recommend reviewing it in the following sections:

UpdatingData

This is the central struct for handling updates inside the dataflow. The map() method lets us cleanly apply map functions, potentially eliminating the record entirely if the values are the same. Similarly, filter() applies a predicate to the update and determines if downstream requires an update.
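To make the map()/filter() semantics concrete, here is a minimal self-contained sketch of the shape described above. The enum and method signatures are illustrative assumptions, not the actual arroyo-types definitions; the key behaviors are that map() drops an Update whose old and new values become equal, and filter() can degrade an Update into a plain Retract or Append when the predicate disagrees on its two halves.

```rust
// Illustrative sketch of UpdatingData; the real type in arroyo-types differs in detail.
#[derive(Debug, PartialEq)]
enum UpdatingData<T> {
    Append(T),
    Update { old: T, new: T },
    Retract(T),
}

impl<T: PartialEq> UpdatingData<T> {
    // Apply `f` to the contained value(s); drop the update entirely
    // if the old and new values map to the same result.
    fn map<U: PartialEq>(self, f: impl Fn(T) -> U) -> Option<UpdatingData<U>> {
        match self {
            UpdatingData::Append(t) => Some(UpdatingData::Append(f(t))),
            UpdatingData::Retract(t) => Some(UpdatingData::Retract(f(t))),
            UpdatingData::Update { old, new } => {
                let (old, new) = (f(old), f(new));
                if old == new {
                    None // downstream doesn't need to hear about a no-op update
                } else {
                    Some(UpdatingData::Update { old, new })
                }
            }
        }
    }

    // Apply a predicate; an Update whose halves disagree on the predicate
    // degrades to a Retract (old passed, new doesn't) or an Append (vice versa).
    fn filter(self, pred: impl Fn(&T) -> bool) -> Option<UpdatingData<T>> {
        match self {
            UpdatingData::Append(t) => pred(&t).then(|| UpdatingData::Append(t)),
            UpdatingData::Retract(t) => pred(&t).then(|| UpdatingData::Retract(t)),
            UpdatingData::Update { old, new } => match (pred(&old), pred(&new)) {
                (true, true) => Some(UpdatingData::Update { old, new }),
                (true, false) => Some(UpdatingData::Retract(old)),
                (false, true) => Some(UpdatingData::Append(new)),
                (false, false) => None,
            },
        }
    }
}
```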

We also have DebeziumData, which wraps updates into the Debezium format, similar to what Flink does when format = 'debezium-json'.
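As a rough picture of that envelope, here is an illustrative mapping from update variants to debezium-json, using the standard Debezium op codes ("c" create, "u" update, "d" delete). The local enum and function are hedged sketches, not the actual DebeziumData type or serializer.

```rust
// Local illustrative enum, not Arroyo's actual types.
enum UpdatingData {
    Append(i64),
    Update { old: i64, new: i64 },
    Retract(i64),
}

// Render an update as a Debezium-style envelope with before/after/op fields.
fn to_debezium_json(u: &UpdatingData) -> String {
    match u {
        UpdatingData::Append(v) => format!(r#"{{"before":null,"after":{v},"op":"c"}}"#),
        UpdatingData::Update { old, new } => {
            format!(r#"{{"before":{old},"after":{new},"op":"u"}}"#)
        }
        UpdatingData::Retract(v) => format!(r#"{{"before":{v},"after":null,"op":"d"}}"#),
    }
}
```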

New Operators

Handling Updates requires new operators:

UpdatingAggregateOperator

This operator is for non-windowed aggregates. Every incoming record is aggregated into a single intermediate value according to its key. There are checks for whether either the outgoing value or the internal state has changed, and work is only done when necessary. It supports both Updating and Append inputs, with the SQL layer providing different methods in each case. Currently, SQL only supports aggregates that have compact intermediate forms (everything except count distinct). Because DataFusion currently unrolls single count(distinct field) computations into two non-distinct aggregates, this doesn't happen very often.
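The "only do work when something changed" check can be sketched as follows, using a per-key max over an Append-only input. Names and shapes are illustrative assumptions; the real operator keeps its intermediate state in the KeyTimeMap backend rather than an in-memory HashMap.

```rust
use std::collections::HashMap;

// Illustrative output variants: a first value for a key, or a changed value.
#[derive(Debug, PartialEq)]
enum Out<K, V> {
    Create { key: K, value: V },
    Update { key: K, old: V, new: V },
}

// Fold one record into per-key max state, emitting output only on change.
fn process_max(
    state: &mut HashMap<String, i64>,
    key: &str,
    value: i64,
) -> Option<Out<String, i64>> {
    match state.get(key).copied() {
        None => {
            state.insert(key.to_string(), value);
            Some(Out::Create { key: key.to_string(), value })
        }
        Some(old) if value > old => {
            state.insert(key.to_string(), value);
            Some(Out::Update { key: key.to_string(), old, new: value })
        }
        // Neither the state nor the outgoing value changed: emit nothing.
        Some(_) => None,
    }
}
```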

This is the only new operator that requires state. We reuse the KeyTimeMap backend, where the time is the record timestamp. Similar to Flink, this functions as an expiration, with a default of 24 hours. There is not currently any eviction of stale data within the running processor, although expired data will be compacted away and not restored from a checkpoint.
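The retention check itself is simple; this is a hedged sketch of how an entry's age would be compared against the 24-hour default described above (the function name and signature are assumptions, not Arroyo's API):

```rust
use std::time::{Duration, SystemTime};

// An entry is expired when its record timestamp is older than the retention window.
fn is_expired(record_time: SystemTime, now: SystemTime, retention: Duration) -> bool {
    match now.duration_since(record_time) {
        Ok(age) => age > retention,
        // A record timestamp in the future is never considered expired.
        Err(_) => false,
    }
}
```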

KeyMapUpdatingOperator

This operator is necessary to recompute the key on an UpdatingData input. Because the key could differ between the old and new values, we may need to split an UpdatingData::Update into an UpdatingData::Retract and an UpdatingData::Append and collect both of them.
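The split can be sketched like this (self-contained illustration; the enum and rekey function are assumptions mirroring the description, not the actual operator code):

```rust
#[derive(Debug, PartialEq)]
enum UpdatingData<T> {
    Append(T),
    Update { old: T, new: T },
    Retract(T),
}

// Re-key an update. If old and new map to different keys, the single
// Update must split into a Retract (of the old row, under its old key)
// and an Append (of the new row, under its new key).
fn rekey<T, K: Eq>(
    u: UpdatingData<T>,
    key_fn: impl Fn(&T) -> K,
) -> Vec<UpdatingData<T>> {
    match u {
        UpdatingData::Update { old, new } if key_fn(&old) != key_fn(&new) => {
            vec![UpdatingData::Retract(old), UpdatingData::Append(new)]
        }
        other => vec![other],
    }
}
```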

UpdatingData

This is only an addition to datastream::Operator, as it ends up being compiled to an OptionMapOperator; it applies an optional function on T to UpdatingData via the map() method mentioned above.

JoinWithExpiration

This operator already supported inner joins with expirations. It is now extended to the four main join types, with everything except inner joins emitting an update stream. In particular, the first record on a side that was allowed to be null by the join type will now retract any previously emitted records.
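The null-retraction behavior can be sketched as follows for a left join, simplified to one left row per key and without expiration. All names here are illustrative assumptions, not Arroyo's actual JoinWithExpiration code; the point is that the first matching right record retracts the previously emitted (left, NULL) row.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Out {
    Append((i32, Option<&'static str>)),
    Retract((i32, Option<&'static str>)),
}

#[derive(Default)]
struct LeftJoin {
    // key -> whether this left row has matched at least once
    lefts: HashMap<i32, bool>,
    rights: HashMap<i32, Vec<&'static str>>,
}

impl LeftJoin {
    fn on_left(&mut self, key: i32) -> Vec<Out> {
        match self.rights.get(&key) {
            Some(vs) => {
                self.lefts.insert(key, true);
                vs.iter().map(|v| Out::Append((key, Some(*v)))).collect()
            }
            None => {
                // No match yet: emit a null-padded row, as a left join must.
                self.lefts.insert(key, false);
                vec![Out::Append((key, None))]
            }
        }
    }

    fn on_right(&mut self, key: i32, val: &'static str) -> Vec<Out> {
        self.rights.entry(key).or_default().push(val);
        let mut out = vec![];
        if let Some(matched) = self.lefts.get_mut(&key) {
            if !*matched {
                // First match for this key: retract the null-padded row.
                out.push(Out::Retract((key, None)));
                *matched = true;
            }
            out.push(Out::Append((key, Some(val))));
        }
        out
    }
}
```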

SQL

The majority of the implementation is within arroyo-sql, in particular the portions concerned with SqlOperator and PlanOperators.

Remove PlanType from PlanEdge

In order to convert a PlanNode to a datastream::Operator, the node generally needed to have type information. Introducing updates increased this need, as the operators can be quite different depending on whether the node produces updating data. Having this duplicated on the edges and nodes of the PlanGraph only complicated things, so it was removed. When converting to the Program graph, we just look at the source node's type.

PlanType::UpdatingData

Nodes can now have a return type that is updating data, with an inner PlanType giving more details. These will be converted to UpdatingData types in the datastream::Program, and the compiler is, except for a few specific operators, unaware of UpdatingData.

Sources and Sinks

Sources and sinks can support updating data, currently through the "debezium_json" serialization mode.

Known Issues

Query Limitations

Updating tables only support a subset of queries. In particular, the following are not supported:

  • SQL Window functions (e.g. row_number()) can't have updating data as an input.
  • Joins can't have updating data as inputs.
  • Window aggregates can't have updating data as an input.
  • Updating sources can't have virtual columns or override the watermark or timestamp.
  • COUNT DISTINCT on updating inputs is not supported.

Out of order retractions

For any forward sequence of nodes, a retraction should always occur after the record it is retracting. However, I think there is a sequence of shuffles that could result in a retraction arriving at a downstream node before the record it retracts.

State Performance

The state works but is not highly optimized. In particular there are two main performance inefficiencies:

  1. There is no buffering of the intermediate aggregate representation. This means if you run select count(*) from input then there will be a state entry for every row. Because we only rely on expiration for compaction, recovering from this checkpoint will be slow.
  2. Inefficient max and min implementations when the input is updating. The aggregation logic when your input is itself updating uses the memory representations from the sliding-window aggregator work, with every row behaving as a time bucket. This is fine for most of the aggregates, as they are still fixed size, but max and min are backed by a BTree. Combined with the former point, SELECT max(bid.price) FROM input will write increasingly large records with every incoming record.

Despite these limitations, I still think this is worth trying to merge so we can continue to iterate in this direction.

@jacksonrnewhouse jacksonrnewhouse requested a review from mwylde May 26, 2023 18:08
@jacksonrnewhouse jacksonrnewhouse force-pushed the updating_tables branch 3 times, most recently from cbfed21 to 7097a49 Compare June 9, 2023 17:45
@jacksonrnewhouse jacksonrnewhouse changed the title first pass at updating tables, focused on joins. Updating SQL Queries Jun 9, 2023
@jacksonrnewhouse jacksonrnewhouse force-pushed the updating_tables branch 5 times, most recently from 20e9273 to f58e73b Compare June 9, 2023 21:14
@jacksonrnewhouse jacksonrnewhouse marked this pull request as ready for review June 9, 2023 21:30
Review comments (outdated, resolved): arroyo-types/src/lib.rs, arroyo-datastream/src/lib.rs
@jacksonrnewhouse jacksonrnewhouse enabled auto-merge (squash) June 15, 2023 18:47
@jacksonrnewhouse jacksonrnewhouse merged commit 81e42e0 into master Jun 15, 2023