Add Streaming Specs in PROTOCOL in branch 0.6 (#414)
* Add Streaming Specs

* resolve comments

* add a
linzhou-db authored Oct 4, 2023
1 parent 120a44a commit d863b6e
Showing 1 changed file with 19 additions and 0 deletions.
19 changes: 19 additions & 0 deletions PROTOCOL.md
@@ -34,6 +34,7 @@
- [Per-file Statistics](#per-file-statistics)
- [SQL Expressions for Filtering](#sql-expressions-for-filtering)
- [JSON predicates for Filtering](#json-predicates-for-filtering)
- [Delta Sharing Streaming Specs](#delta-sharing-streaming-specs)
- [Profile File Format](#profile-file-format)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->
@@ -2913,6 +2914,24 @@ Examples
}
```

## Delta Sharing Streaming Specs
Delta Sharing Streaming is supported starting from delta-sharing-spark 0.6.0. Because it is implemented
on top of Spark Structured Streaming, it uses a pull model to consume new data of the shared
table from the Delta Sharing server. In addition to most options supported in Delta streaming,
Delta Sharing streaming has one Spark config and one option of its own.

- Spark config **spark.delta.sharing.streaming.queryTableVersionIntervalSeconds**: DeltaSharingSource
uses the [getTableVersion](#query-table-version) RPC to check whether new data is available
to consume. To reduce the traffic burden on the Delta Sharing server, there is a minimum interval of 30
seconds between two getTableVersion RPCs to the server. If you can accept
less fresh data and want to reduce the traffic to the server further, you can set this
config to a larger value, for example 60s or 120s. An error is thrown if it is set to less than 30 seconds.
- Option **maxVersionsPerRpc**: DeltaSharingSource uses the [QueryTable](#read-data-from-a-table)
RPC to continuously read new data from the Delta Sharing server. If the stream has been paused
for a while on the recipient side, the server might have too much new data to return in a single
response. The default value is 100; a smaller number, such as `.option("maxVersionsPerRpc", 10)`,
is recommended to reduce the traffic load of each RPC. This should not affect data freshness significantly,
assuming the processing time of the Delta Sharing server grows linearly with `maxVersionsPerRpc`.
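
The effect of the two settings above can be illustrated with a small, dependency-free Python sketch. It is not the connector's implementation: the helper names `validate_query_interval` and `plan_version_ranges` are hypothetical, and the assumption that each RPC covers an inclusive, contiguous range of table versions is made here for illustration only. The 30-second minimum and the default of 100 are taken from the text above.

```python
from typing import List, Tuple

# Values stated in the spec text above.
MIN_QUERY_INTERVAL_SECONDS = 30
DEFAULT_MAX_VERSIONS_PER_RPC = 100


def validate_query_interval(seconds: int) -> int:
    """Hypothetical helper: reject intervals below the documented minimum."""
    if seconds < MIN_QUERY_INTERVAL_SECONDS:
        raise ValueError(
            f"interval must be at least {MIN_QUERY_INTERVAL_SECONDS} seconds, got {seconds}"
        )
    return seconds


def plan_version_ranges(
    start_version: int,
    latest_version: int,
    max_versions_per_rpc: int = DEFAULT_MAX_VERSIONS_PER_RPC,
) -> List[Tuple[int, int]]:
    """Hypothetical helper: split [start_version, latest_version] into
    inclusive ranges holding at most max_versions_per_rpc versions each.

    Assumption: one range corresponds to one QueryTable RPC.
    """
    if max_versions_per_rpc < 1:
        raise ValueError("max_versions_per_rpc must be positive")
    ranges = []
    current = start_version
    while current <= latest_version:
        # Cap the range at the latest version known to the client.
        end = min(current + max_versions_per_rpc - 1, latest_version)
        ranges.append((current, end))
        current = end + 1
    return ranges


if __name__ == "__main__":
    validate_query_interval(60)
    # A stream that fell 25 versions behind, read 10 versions per RPC.
    print(plan_version_ranges(0, 24, max_versions_per_rpc=10))
```

Under these assumptions, a smaller `maxVersionsPerRpc` means more RPCs, each of which returns less data, which is the trade-off the option description refers to.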

# Profile File Format
