
Commit 05d7151

keypointt authored and tdas committed
[MINOR][STREAMING][DOCS] Minor changes on kinesis integration
## What changes were proposed in this pull request?

Some minor changes for documentation page "Spark Streaming + Kinesis Integration".

Moved "streaming-kinesis-arch.png" before the bullet list, not in between the bullets.

## How was this patch tested?

Tested manually, on my local machine.

Author: Xin Ren <iamshrek@126.com>

Closes #14097 from keypointt/kinesisDoc.
1 parent 9e2c763 commit 05d7151

File tree

1 file changed (+13, -13 lines)


docs/streaming-kinesis-integration.md

Lines changed: 13 additions & 13 deletions
@@ -111,7 +111,7 @@ A Kinesis stream can be set up at one of the valid Kinesis endpoints with 1 or m
 
 - `[checkpoint interval]`: The interval (e.g., Duration(2000) = 2 seconds) at which the Kinesis Client Library saves its position in the stream. For starters, set it to the same as the batch interval of the streaming application.
 
-- `[initial position]`: Can be either `InitialPositionInStream.TRIM_HORIZON` or `InitialPositionInStream.LATEST` (see Kinesis Checkpointing section and Amazon Kinesis API documentation for more details).
+- `[initial position]`: Can be either `InitialPositionInStream.TRIM_HORIZON` or `InitialPositionInStream.LATEST` (see [`Kinesis Checkpointing`](#kinesis-checkpointing) section and [`Amazon Kinesis API documentation`](http://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html) for more details).
 
 - `[message handler]`: A function that takes a Kinesis `Record` and outputs generic `T`.
 
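For readers skimming this hunk, here is a minimal Scala sketch of how these three parameters reach `KinesisUtils.createStream`, the Scala API the changed page documents. The app, stream, endpoint, and region names are placeholders, and the message handler simply decodes each record payload as UTF-8:

```scala
import java.nio.charset.StandardCharsets

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import com.amazonaws.services.kinesis.model.Record
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

val batchInterval = Seconds(2)
val ssc = new StreamingContext(new SparkConf().setAppName("KinesisParamsSketch"), batchInterval)

// [message handler]: a Record => T function; here T = String, decoding the payload as UTF-8.
val handler = (record: Record) => {
  val buf = record.getData
  val bytes = new Array[Byte](buf.remaining())
  buf.get(bytes)
  new String(bytes, StandardCharsets.UTF_8)
}

val stream = KinesisUtils.createStream[String](
  ssc,
  "MyKinesisApp",                             // [Kinesis app name]: also names the DynamoDB table used for checkpoints
  "myStream",                                 // [Kinesis stream name]
  "https://kinesis.us-east-1.amazonaws.com",  // [endpoint URL]
  "us-east-1",                                // [region name]
  InitialPositionInStream.TRIM_HORIZON,       // [initial position]
  batchInterval,                              // [checkpoint interval]: start with the batch interval
  StorageLevel.MEMORY_AND_DISK_2,
  handler)                                    // [message handler]
```

Without the message handler argument, the same call returns a DStream of raw payload byte arrays.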
@@ -128,14 +128,6 @@ A Kinesis stream can be set up at one of the valid Kinesis endpoints with 1 or m
 Alternatively, you can also download the JAR of the Maven artifact `spark-streaming-kinesis-asl-assembly` from the
 [Maven repository](http://search.maven.org/#search|ga|1|a%3A%22spark-streaming-kinesis-asl-assembly_{{site.SCALA_BINARY_VERSION}}%22%20AND%20v%3A%22{{site.SPARK_VERSION_SHORT}}%22) and add it to `spark-submit` with `--jars`.
 
-*Points to remember at runtime:*
-
-- Kinesis data processing is ordered per partition and occurs at-least once per message.
-
-- Multiple applications can read from the same Kinesis stream. Kinesis will maintain the application-specific shard and checkpoint info in DynamoDB.
-
-- A single Kinesis stream shard is processed by one input DStream at a time.
-
 <p style="text-align: center;">
   <img src="img/streaming-kinesis-arch.png"
        title="Spark Streaming Kinesis Architecture"
@@ -145,6 +137,14 @@ A Kinesis stream can be set up at one of the valid Kinesis endpoints with 1 or m
   <!-- Images are downsized intentionally to improve quality on retina displays -->
 </p>
 
+*Points to remember at runtime:*
+
+- Kinesis data processing is ordered per partition and occurs at-least once per message.
+
+- Multiple applications can read from the same Kinesis stream. Kinesis will maintain the application-specific shard and checkpoint info in DynamoDB.
+
+- A single Kinesis stream shard is processed by one input DStream at a time.
+
 - A single Kinesis input DStream can read from multiple shards of a Kinesis stream by creating multiple KinesisRecordProcessor threads.
 
 - Multiple input DStreams running in separate processes/instances can read from a Kinesis stream.
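To make the shard-to-DStream points in this hunk concrete, here is a minimal Scala sketch, modeled loosely on the bundled KinesisWordCountASL example, that creates one input DStream per shard and unions them. The shard count, app, stream, endpoint, and region values are placeholders:

```scala
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

object MultiShardSketch {
  def main(args: Array[String]): Unit = {
    val batchInterval = Seconds(2)
    val ssc = new StreamingContext(new SparkConf().setAppName("MultiShardSketch"), batchInterval)

    // One input DStream (receiver) per shard of the stream; shard count is a placeholder.
    val numShards = 2
    val kinesisStreams = (0 until numShards).map { _ =>
      KinesisUtils.createStream(
        ssc, "MyKinesisApp", "myStream",
        "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
        InitialPositionInStream.LATEST, batchInterval, StorageLevel.MEMORY_AND_DISK_2)
    }

    // Union the per-shard streams into a single DStream[Array[Byte]] before processing.
    val unioned = ssc.union(kinesisStreams)
    unioned.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each of these streams is backed by a receiver that occupies an executor core, so the application needs more cores than receivers to leave room for processing.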
@@ -173,7 +173,7 @@ To run the example,
 
 - Set up Kinesis stream (see earlier section) within AWS. Note the name of the Kinesis stream and the endpoint URL corresponding to the region where the stream was created.
 
-- Set up the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_KEY with your AWS credentials.
+- Set up the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_KEY` with your AWS credentials.
 
 - In the Spark root directory, run the example as
 
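A tiny, hypothetical pre-flight check for the two variables this hunk puts into code formatting; the example reads its AWS credentials from these environment variables, so failing fast here is just a convenience:

```scala
// Hypothetical sanity check (not part of the example): fail fast if the credentials
// the instructions ask for are not visible in the environment.
val required = Seq("AWS_ACCESS_KEY_ID", "AWS_SECRET_KEY")
val missing = required.filterNot(sys.env.contains)
require(missing.isEmpty, s"Missing AWS credential variables: ${missing.mkString(", ")}")
```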
@@ -216,6 +216,6 @@ de-aggregate records during consumption.
 
 - Checkpointing too frequently will cause excess load on the AWS checkpoint storage layer and may lead to AWS throttling. The provided example handles this throttling with a random-backoff-retry strategy.
 
-- If no Kinesis checkpoint info exists when the input DStream starts, it will start either from the oldest record available (InitialPositionInStream.TRIM_HORIZON) or from the latest tip (InitialPositionInStream.LATEST). This is configurable.
-- InitialPositionInStream.LATEST could lead to missed records if data is added to the stream while no input DStreams are running (and no checkpoint info is being stored).
-- InitialPositionInStream.TRIM_HORIZON may lead to duplicate processing of records where the impact is dependent on checkpoint frequency and processing idempotency.
+- If no Kinesis checkpoint info exists when the input DStream starts, it will start either from the oldest record available (`InitialPositionInStream.TRIM_HORIZON`) or from the latest tip (`InitialPositionInStream.LATEST`). This is configurable.
+- `InitialPositionInStream.LATEST` could lead to missed records if data is added to the stream while no input DStreams are running (and no checkpoint info is being stored).
+- `InitialPositionInStream.TRIM_HORIZON` may lead to duplicate processing of records where the impact is dependent on checkpoint frequency and processing idempotency.
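One way to act on the idempotency caveat in the last bullet, as a sketch of my own rather than anything from the Spark docs: key each record by its Kinesis sequence number (unique within a shard), so that replays after a `TRIM_HORIZON` restart collapse in a dedupe step or in an upsert-style sink.

```scala
import com.amazonaws.services.kinesis.model.Record
import org.apache.spark.streaming.dstream.DStream

// Hypothetical event type; the sequence number is assigned by Kinesis and is unique within a shard.
case class Event(sequenceNumber: String, payload: Array[Byte])

def toEvent(record: Record): Event = {
  val buf = record.getData
  val bytes = new Array[Byte](buf.remaining())
  buf.get(bytes)
  Event(record.getSequenceNumber, bytes)
}

// Drop duplicates within each batch; the real safeguard is an idempotent sink
// (for example an upsert keyed on sequenceNumber), since at-least-once delivery
// can replay records across batches after a restart from TRIM_HORIZON.
def dedupePerBatch(events: DStream[Event]): DStream[Event] =
  events.transform(_.keyBy(_.sequenceNumber).reduceByKey((first, _) => first).values)
```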
