[MINOR][STREAMING][DOCS] Minor changes on kinesis integration
## What changes were proposed in this pull request?
Some minor changes to the documentation page "Spark Streaming + Kinesis Integration".
Moved "streaming-kinesis-arch.png" before the bullet list instead of in between the bullets.
## How was this patch tested?
Tested manually on my local machine.
Author: Xin Ren <iamshrek@126.com>
Closes #14097 from keypointt/kinesisDoc.
Changed file: `docs/streaming-kinesis-integration.md` (13 additions, 13 deletions)
```diff
@@ -111,7 +111,7 @@ A Kinesis stream can be set up at one of the valid Kinesis endpoints with 1 or m

 - `[checkpoint interval]`: The interval (e.g., Duration(2000) = 2 seconds) at which the Kinesis Client Library saves its position in the stream. For starters, set it to the same as the batch interval of the streaming application.

-- `[initial position]`: Can be either `InitialPositionInStream.TRIM_HORIZON` or `InitialPositionInStream.LATEST` (see Kinesis Checkpointing section and Amazon Kinesis API documentation for more details).
+- `[initial position]`: Can be either `InitialPositionInStream.TRIM_HORIZON` or `InitialPositionInStream.LATEST` (see [`Kinesis Checkpointing`](#kinesis-checkpointing) section and [`Amazon Kinesis API documentation`](http://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html) for more details).

 - `[message handler]`: A function that takes a Kinesis `Record` and outputs generic `T`.

```
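For orientation, here is a minimal Scala sketch (not part of the diff) of how these parameters line up in the page's `KinesisUtils.createStream` call; the app name, stream name, endpoint URL, and region below are placeholder values.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

val conf = new SparkConf().setAppName("KinesisSketch")
val ssc = new StreamingContext(conf, Seconds(2))  // 2-second batch interval

// Placeholder names: replace with your own app/stream/endpoint/region.
val kinesisStream = KinesisUtils.createStream(
  ssc,
  "myKinesisApp",                              // [Kinesis app name]; also names the DynamoDB checkpoint table
  "myKinesisStream",                           // [Kinesis stream name]
  "https://kinesis.us-east-1.amazonaws.com",   // [endpoint URL]
  "us-east-1",                                 // [region name]
  InitialPositionInStream.LATEST,              // [initial position]
  Duration(2000),                              // [checkpoint interval]; same as the batch interval to start
  StorageLevel.MEMORY_AND_DISK_2)              // storage level for received data
```

The `[message handler]` mentioned in the diff corresponds to the `createStream` overload that additionally takes a `Record => T` function as its final argument.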
```diff
@@ -128,14 +128,6 @@ A Kinesis stream can be set up at one of the valid Kinesis endpoints with 1 or m
 Alternatively, you can also download the JAR of the Maven artifact `spark-streaming-kinesis-asl-assembly` from the
 [Maven repository](http://search.maven.org/#search|ga|1|a%3A%22spark-streaming-kinesis-asl-assembly_{{site.SCALA_BINARY_VERSION}}%22%20AND%20v%3A%22{{site.SPARK_VERSION_SHORT}}%22) and add it to `spark-submit` with `--jars`.

-*Points to remember at runtime:*
-
-- Kinesis data processing is ordered per partition and occurs at-least once per message.
-
-- Multiple applications can read from the same Kinesis stream. Kinesis will maintain the application-specific shard and checkpoint info in DynamoDB.
-
-- A single Kinesis stream shard is processed by one input DStream at a time.
-
 <p style="text-align: center;">
   <img src="img/streaming-kinesis-arch.png"
     title="Spark Streaming Kinesis Architecture"
```
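For applications built with sbt, the usual alternative to shipping the assembly JAR via `--jars` is to declare the non-assembly connector as a library dependency; the version string below is an assumption and should match your Spark build.

```scala
// build.sbt: %% appends the Scala binary version to the artifact name.
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.0.0"  // version is illustrative
```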
```diff
@@ -145,6 +137,14 @@ A Kinesis stream can be set up at one of the valid Kinesis endpoints with 1 or m
 <!-- Images are downsized intentionally to improve quality on retina displays -->
 </p>

+*Points to remember at runtime:*
+
+- Kinesis data processing is ordered per partition and occurs at-least once per message.
+
+- Multiple applications can read from the same Kinesis stream. Kinesis will maintain the application-specific shard and checkpoint info in DynamoDB.
+
+- A single Kinesis stream shard is processed by one input DStream at a time.
+
 - A single Kinesis input DStream can read from multiple shards of a Kinesis stream by creating multiple KinesisRecordProcessor threads.

 - Multiple input DStreams running in separate processes/instances can read from a Kinesis stream.
```
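A common pattern implied by these bullets (sketched here, not part of the change) is to create several Kinesis input DStreams, for example one per shard, and union them into a single stream; the shard count of 2 is arbitrary and the names reuse the placeholders from the earlier sketch.

```scala
// Assumes `ssc` and the placeholder names from the earlier sketch.
val numStreams = 2  // e.g. one input DStream per shard; purely illustrative
val kinesisStreams = (0 until numStreams).map { _ =>
  KinesisUtils.createStream(
    ssc, "myKinesisApp", "myKinesisStream",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, Duration(2000),
    StorageLevel.MEMORY_AND_DISK_2)
}
// One unified DStream over the records from all shards.
val unionStream = ssc.union(kinesisStreams)
```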
```diff
@@ -173,7 +173,7 @@ To run the example,

 - Set up Kinesis stream (see earlier section) within AWS. Note the name of the Kinesis stream and the endpoint URL corresponding to the region where the stream was created.

-- Set up the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_KEY with your AWS credentials.
+- Set up the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_KEY` with your AWS credentials.

 - In the Spark root directory, run the example as

```
```diff
@@ -216,6 +216,6 @@ de-aggregate records during consumption.

 - Checkpointing too frequently will cause excess load on the AWS checkpoint storage layer and may lead to AWS throttling. The provided example handles this throttling with a random-backoff-retry strategy.

-- If no Kinesis checkpoint info exists when the input DStream starts, it will start either from the oldest record available (InitialPositionInStream.TRIM_HORIZON) or from the latest tip (InitialPositionInStream.LATEST). This is configurable.
-  - InitialPositionInStream.LATEST could lead to missed records if data is added to the stream while no input DStreams are running (and no checkpoint info is being stored).
-  - InitialPositionInStream.TRIM_HORIZON may lead to duplicate processing of records where the impact is dependent on checkpoint frequency and processing idempotency.
+- If no Kinesis checkpoint info exists when the input DStream starts, it will start either from the oldest record available (`InitialPositionInStream.TRIM_HORIZON`) or from the latest tip (`InitialPositionInStream.LATEST`). This is configurable.
+  - `InitialPositionInStream.LATEST` could lead to missed records if data is added to the stream while no input DStreams are running (and no checkpoint info is being stored).
+  - `InitialPositionInStream.TRIM_HORIZON` may lead to duplicate processing of records where the impact is dependent on checkpoint frequency and processing idempotency.
```
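To illustrate the trade-off described in these bullets, a hedged sketch: the KCL checkpoint interval can be set independently of the batch interval (all values are placeholders, reusing the names from the earlier sketch).

```scala
// Checkpoint every 10 seconds instead of every 2-second batch: lower load on the
// DynamoDB checkpoint table, but after a restart the KCL resumes from the last
// saved position, so up to ~10 seconds of records may be re-processed.
// Downstream processing should therefore be idempotent.
val coarselyCheckpointedStream = KinesisUtils.createStream(
  ssc, "myKinesisApp", "myKinesisStream",
  "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
  InitialPositionInStream.TRIM_HORIZON,   // start from the oldest available record if no checkpoint exists
  Duration(10000),                        // checkpoint interval, deliberately longer than the batch interval
  StorageLevel.MEMORY_AND_DISK_2)
```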