@@ -51,8 +51,9 @@ different languages.
5151** Note:** Python API for Spark Streaming has been introduced in Spark 1.2. It has all the DStream
5252transformations and almost all the output operations available in Scala and Java interfaces.
5353However, it has only support for basic sources like text files and text data over sockets.
54- API for creating more sources like Kafka, and Flume will be available in future.
55- Further information about available features in Python API are mentioned throughout this
55+ APIs for additional sources, like Kafka and Flume, will be available in the future.
56+ Further information about available features in the Python API is mentioned throughout this
5657document; look out for the tag
5758<span class =" badge " style =" background-color : grey " >Python API</span >.
5859
@@ -622,7 +623,7 @@ as well as, to run the receiver(s).
622623 a input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will
623624 be used to run the receiver, leaving no thread for processing the received data. Hence, when
624625 running locally, always use "local[ * n* ] " as the master URL where * n* > number of receivers to run
625- (see [ Spark Properties] (configuration.html#spark-properties.html for information on how to set
626+ (see [ Spark Properties] ( configuration.html#spark-properties ) for information on how to set
626627 the master).
627628
628629- Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming
@@ -676,7 +677,7 @@ For more details on streams from sockets, files, and actors,
676677see the API documentations of the relevant functions in
677678[ StreamingContext] ( api/scala/index.html#org.apache.spark.streaming.StreamingContext ) for
678679Scala, [ JavaStreamingContext] ( api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html )
679- for Java, and [ StreamingContext] .
680+ for Java, and [ StreamingContext] ( api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext ) for Python .
680681
681682### Advanced Sources
682683{:.no_toc}
@@ -1517,7 +1518,7 @@ sliding interval of a DStream is good setting to try.
15171518***
15181519
15191520## Deploying Applications
1520- This section discussed the steps to deploy a Spark Streaming applications .
1521+ This section discusses the steps to deploy a Spark Streaming application .
15211522
15221523### Requirements
15231524{:.no_toc}
@@ -1571,7 +1572,7 @@ To run a Spark Streaming applications, you need to have the following.
15711572 feature of write ahead logs. If enabled, all the data received from a receiver gets written into
15721573 a write ahead log in the configuration checkpoint directory. This prevents data loss on driver
15731574 recovery, thus allowing zero data loss guarantees which is discussed in detail in the
1574- [ Fault-tolerant Semantics] ( #fault-tolerant -semantics ) section. Enable this by setting the
1575+ [ Fault-tolerance Semantics] ( #fault-tolerance -semantics ) section. Enable this by setting the
15751576 [ configuration parameter] ( configuration.html#spark-streaming )
15761577 ` spark.streaming.receiver.writeAheadLogs.enable ` to ` true ` .
15771578
@@ -1617,7 +1618,7 @@ receivers are active, number of records received, receiver error, etc.)
16171618and completed batches (batch processing times, queueing delays, etc.). This can be used to
16181619monitor the progress of the streaming application.
16191620
1620- The following two metrics in web UI are particularly important -
1621+ The following two metrics in the web UI are particularly important:
16211622
16221623- * Processing Time* - The time to process each batch of data.
16231624- * Scheduling Delay* - the time a batch waits in a queue for the processing of previous batches
@@ -1710,12 +1711,12 @@ before further processing.
17101711{:.no_toc}
17111712Cluster resources can be under-utilized if the number of parallel tasks used in any stage of the
17121713computation is not high enough. For example, for distributed reduce operations like ` reduceByKey `
1713- and ` reduceByKeyAndWindow ` , the default number of parallel tasks is decided by the
1714- [ config property] ( configuration.html#spark-properties ) ` spark.default.parallelism ` .
1715- You can pass the level of parallelism as an argument (see
1714+ and ` reduceByKeyAndWindow ` , the default number of parallel tasks is controlled by
1715+ the ` spark.default.parallelism ` [ configuration property] ( configuration.html#spark-properties ) . You
1716+ can pass the level of parallelism as an argument (see
17161717[ ` PairDStreamFunctions ` ] ( api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions )
1717- documentation), or set the [ config property ] ( configuration.html# spark-properties )
1718- ` spark.default.parallelism ` to change the default.
1718+ documentation), or set the ` spark.default.parallelism `
1719+ [ configuration property ] ( configuration.html#spark-properties ) to change the default.
17191720
17201721### Data Serialization
17211722{:.no_toc}
@@ -1811,72 +1812,73 @@ consistent batch processing times.
18111812***************************************************************************************************
18121813
18131814# Fault-tolerance Semantics
1814- In this section, we will discuss the behavior of Spark Streaming application in the event
1815- of a node failure . To understand this, let us remember the basic fault-tolerance semantics of
1815+ In this section, we will discuss the behavior of Spark Streaming applications in the event
1816+ of node failures . To understand this, let us remember the basic fault-tolerance semantics of
18161817Spark's RDDs.
18171818
181818191 . An RDD is an immutable, deterministically re-computable, distributed dataset. Each RDD
18191820remembers the lineage of deterministic operations that were used on a fault-tolerant input
18201821dataset to create it.
182118221 . If any partition of an RDD is lost due to a worker node failure, then that partition can be
18221823re-computed from the original fault-tolerant dataset using the lineage of operations.
1823- 1 . Assuming all the RDD transformations are deterministic, the data in the final transformed RDD
1824- will always be the same irrespective of failures in Spark cluster.
1824+ 1 . Assuming that all of the RDD transformations are deterministic, the data in the final transformed
1825+ RDD will always be the same irrespective of failures in the Spark cluster.
18251826
18261827Spark operates on data on fault-tolerant file systems like HDFS or S3. Hence,
1827- all the RDDs generated from the fault-tolerant data are also fault-tolerant. However, this is not
1828+ all of the RDDs generated from the fault-tolerant data are also fault-tolerant. However, this is not
18281829the case for Spark Streaming as the data in most cases is received over the network (except when
1829- ` fileStream ` is used). To achieve the same fault-tolerance properties for all the generated RDDs,
1830+ ` fileStream ` is used). To achieve the same fault-tolerance properties for all of the generated RDDs,
18301831the received data is replicated among multiple Spark executors in worker nodes in the cluster
18311832(default replication factor is 2). This leads to two kinds of data in the
1832- system that needs to recovered in the event of a failure.
1833+ system that need to be recovered in the event of failures:
18331834
183418351 . * Data received and replicated* - This data survives failure of a single worker node as a copy
18351836 of it exists on one of the nodes.
183618371 . * Data received but buffered for replication* - Since this is not replicated,
18371838 the only way to recover that data is to get it again from the source.
18381839
1839- Furthermore, there are two kinds of failures that we should be concerned about.
1840+ Furthermore, there are two kinds of failures that we should be concerned about:
18401841
1841- 1 . * Failure of a Worker Node* - Any of the workers in the cluster can fail,
1842- and all in-memory data on that node will be lost. If there are any receiver running on that
1843- node, all buffered data will be lost.
1842+ 1 . * Failure of a Worker Node* - Any of the worker nodes running executors can fail,
1843+ and all in-memory data on those nodes will be lost. If any receivers were running on failed
1844+ nodes, then their buffered data will be lost.
184418451 . * Failure of the Driver Node* - If the driver node running the Spark Streaming application
1845- fails, then obviously the SparkContext is lost, as well as all executors with their in-memory
1846+ fails, then obviously the SparkContext is lost, and all executors with their in-memory
18461847 data are lost.
18471848
18481849With this basic knowledge, let us understand the fault-tolerance semantics of Spark Streaming.
18491850
18501851## Semantics with files as input source
18511852{:.no_toc}
1852- In this case, since all the input data is already present in a fault-tolerant files system like
1853+ If all of the input data is already present in a fault-tolerant file system like
18531854HDFS, Spark Streaming can always recover from any failure and process all the data. This gives
18541855* exactly-once* semantics, that all the data will be processed exactly once no matter what fails.
18551856
18561857## Semantics with input sources based on receivers
18571858{:.no_toc}
1858- Here we will first discuss the semantics in the context of different types of failures. As we
1859- discussed [ earlier] ( #receiver-reliability ) , there are two kinds of receivers.
1859+ For input sources based on receivers, the fault-tolerance semantics depend on both the failure
1860+ scenario and the type of receiver.
1861+ As we discussed [ earlier] ( #receiver-reliability ) , there are two types of receivers:
18601862
186118631 . * Reliable Receiver* - These receivers acknowledge reliable sources only after ensuring that
18621864 the received data has been replicated. If such a receiver fails,
18631865 the buffered (unreplicated) data does not get acknowledged to the source. If the receiver is
1864- restarted, the source would resend the data, and so no data will be lost due to the failure.
1866+ restarted, the source will resend the data, and therefore no data will be lost due to the failure.
186518671 . * Unreliable Receiver* - Such receivers can lose data when they fail due to worker
18661868 or driver failures.
18671869
18681870Depending on what type of receivers are used we achieve the following semantics.
18691871If a worker node fails, then there is no data loss with reliable receivers. With unreliable
18701872receivers, data received but not replicated can get lost. If the driver node fails,
1871- then besides these losses, all the past data that were received and replicated in memory will be
1873+ then besides these losses, all the past data that was received and replicated in memory will be
18721874lost. This will affect the results of the stateful transformations.
18731875
1874- To avoid this loss of past received data, Spark 1.2 introduces an experimental feature of write
1875- ahead logs, that saves the received data to a fault-tolerant storage. With the [ write ahead logs
1876+ To avoid this loss of past received data, Spark 1.2 introduces an experimental feature of _ write
1877+ ahead logs _ which saves the received data to fault-tolerant storage. With the [ write ahead logs
18761878enabled] ( #deploying-applications ) and reliable receivers, there is zero data loss and
18771879exactly-once semantics.
18781880
1879- The following table summarizes the semantics under failures.
1881+ The following table summarizes the semantics under failures:
18801882
18811883<table class =" table " >
18821884 <tr >
@@ -2006,5 +2008,5 @@ package and renamed for better clarity.
20062008
20072009* More examples in [ Scala] ( {{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming )
20082010 and [ Java] ( {{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples/streaming )
2009- and [ Python] ({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python/streaming)
2011+ and [ Python] ( {{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python/streaming )
20102012* [ Paper] ( http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf ) and [ video] ( http://youtu.be/g171ndOHgJ0 ) describing Spark Streaming.
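
The reliable versus unreliable receiver semantics described in the "Semantics with input sources based on receivers" hunk can be illustrated with a small toy model. This is only a sketch and does not use Spark: `ToySource`, `receive`, and `simulate` are hypothetical names, the block size of 2 is arbitrary, and an unreliable receiver is modeled (as an assumption) as one whose source treats every record as delivered the moment it is received.

```python
from dataclasses import dataclass


@dataclass
class ToySource:
    """Hypothetical replayable source: resends everything not yet acknowledged."""
    records: list
    acked: int = 0  # number of records acknowledged so far

    def unacked(self):
        # Copy of the records the source would (re)send to a receiver.
        return self.records[self.acked:]

    def ack(self, count):
        self.acked += count


def receive(source, replicated, reliable, block_size=2, crash_after=None):
    """Run one receiver lifetime; buffered data is lost if the receiver crashes."""
    buffer = []
    for i, record in enumerate(source.unacked()):
        if crash_after is not None and i == crash_after:
            return  # crash: whatever is still in `buffer` is lost
        buffer.append(record)
        if not reliable:
            source.ack(1)  # assumption: source forgets the record on receipt
        if len(buffer) == block_size:
            replicated.extend(buffer)  # block is now safely replicated
            if reliable:
                source.ack(len(buffer))  # acknowledge only after replication
            buffer = []
    if buffer:  # end of input: flush the last partial block
        replicated.extend(buffer)
        if reliable:
            source.ack(len(buffer))


def simulate(reliable):
    """Crash the receiver after three records, restart it, return replicated data."""
    source = ToySource(records=list("abcde"))
    replicated = []
    receive(source, replicated, reliable, crash_after=3)  # first run crashes
    receive(source, replicated, reliable)                 # restarted receiver
    return replicated


if __name__ == "__main__":
    print("reliable:  ", simulate(True))
    print("unreliable:", simulate(False))
```

In this model the record that was buffered but not yet replicated at crash time is resent to the reliable receiver after restart, while the unreliable receiver never sees it again.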
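
The `local[n]` rule from the hunk at line 622 (use *n* greater than the number of receivers) can be written down as a tiny helper. This is a sketch only: `local_master_url` is a hypothetical function, not part of the Spark API, and the default of one processing thread is an assumption chosen as the minimum that satisfies the rule.

```python
def local_master_url(num_receivers, processing_threads=1):
    """Build a local master URL with more threads than receivers.

    Each receiver occupies one thread, so at least one extra thread
    must remain for processing the received data.
    """
    if num_receivers < 0:
        raise ValueError("num_receivers must be non-negative")
    if processing_threads < 1:
        raise ValueError("at least one thread is needed for processing")
    return "local[{}]".format(num_receivers + processing_threads)


if __name__ == "__main__":
    # One socket receiver plus one processing thread.
    print(local_master_url(1))
```

The resulting string is what would be passed as the master URL when running locally.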