
Conversation


@ghost ghost commented Jan 24, 2017

What changes were proposed in this pull request?

Today, you can start a stream that reads from Kafka. However, given Kafka's configurable retention period, sometimes you might just want to read all of the data that is available now, so we should add a version that works with spark.read as well.
The options should be the same as for the streaming Kafka source, with the following differences:
startingOffsets should default to earliest and should not allow latest (which would always be empty).
endingOffsets should also be allowed and should default to latest. The same JSON format that startingOffsets accepts for assigning specific per-partition offsets should also be accepted.
It would be really good if things like .limit(n) were enough to prevent all the data from being read (this might just work).
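
For illustration, a minimal sketch of what such a batch read could look like (broker address and topic name are placeholders; the options follow the description above):

    val df = spark
      .read
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092") // placeholder broker list
      .option("subscribe", "topic1")                   // placeholder topic
      .option("startingOffsets", "earliest")           // default for batch reads
      .option("endingOffsets", "latest")               // default for batch reads
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")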

How was this patch tested?

KafkaRelationSuite was added for testing batch queries via KafkaUtils.

@@ -1 +1 @@
org.apache.spark.sql.kafka010.KafkaSourceProvider
org.apache.spark.sql.kafka010.KafkaProvider
Member

Hi @tcondie, I just happened to look at this PR. I wonder if this breaks existing code that uses .format("org.apache.spark.sql.kafka010.KafkaSourceProvider"), although almost no users refer to it by that name.

Author

That's true, but the revised provider not only provides a Source but also a Relation, hence the decision to rename it to something more general. It's not clear whether this outweighs the risk you've pointed out. @tdas @zsxwing

Member

The cost of keeping the class name is pretty low. Just discussed with @marmbrus @tdas offline and we agreed to not change the name.


SparkQA commented Jan 24, 2017

Test build #71894 has finished for PR 16686 at commit f8fd34c.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 24, 2017

Test build #71942 has finished for PR 16686 at commit 74d96fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member
@zsxwing zsxwing left a comment

Made one pass. There are two major issues:

  • KafkaRelation may be reused (e.g., df.union(df)) and break CachedKafkaConsumer's assumptions. We can add a flag to not use the cached consumer.
  • Don't change the KafkaSourceProvider name.


def close()

def fetchSpecificStartingOffsets(
Member

nit: could you add comments for these methods?


/**
* The Kafka Consumer must be called in an UninterruptibleThread. This naturally occurs
* in Spark Streaming, but not in Spark SQL, which will use this call to communicate
Member

nit: Spark Streaming -> Structured Streaming

private[kafka010] class UninterruptibleKafkaOffsetReader(kafkaOffsetReader: KafkaOffsetReader)
extends KafkaOffsetReader with Logging {

private class KafkaOffsetReaderThread extends UninterruptibleThread("Kafka Offset Reader") {
Member

This must be a daemon thread.

Actually, you can create the ExecutionContext using the following simple code:

  val kafkaReaderThread = Executors.newSingleThreadExecutor(new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new UninterruptibleThread("Kafka Offset Reader") {
        // run the submitted task on this uninterruptible thread
        override def run(): Unit = r.run()
      }
      t.setDaemon(true)
      t
    }
  })
  val execContext = ExecutionContext.fromExecutorService(kafkaReaderThread)

  // Close
  kafkaReaderThread.shutdownNow()
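
A sketch of how work could then be funneled through that context (assuming the usual scala.concurrent.Future, scala.concurrent.duration.Duration and org.apache.spark.util.ThreadUtils imports; fetchEarliestOffsets stands in for whatever call must run on the Kafka reader thread):

  val future = Future {
    kafkaOffsetReader.fetchEarliestOffsets()  // placeholder for the actual fetch call
  }(execContext)
  ThreadUtils.awaitResult(future, Duration.Inf)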


import KafkaSourceProvider._
// Used to check parameters for different source modes
private sealed trait Mode
Member

nit: could you move these classes and deserClassName to object KafkaProvider?


.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.load()
assert(reader.count() === 21)
Member

You can extend QueryTest rather than SparkFunSuite to use checkAnswer like this:

    var df = spark
      .read
      .format("kafka")
      .option("kafka.bootstrap.servers", testUtils.brokerAddress)
      .option("subscribe", topic)
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()
      .selectExpr("CAST(value AS STRING)")

    checkAnswer(df, (0 to 20).map(_.toString).toDF)

offsetRanges.sortBy(_.topicPartition.toString).mkString(", "))

// Create an RDD that reads from Kafka and get the (key, value) pair as byte arrays.
val rdd = new KafkaSourceRDD(
Member

I found that df.union(df) will just union the same RDD, which breaks the group id assumption: the same CachedKafkaConsumer will be used by two different tasks. For batch queries, caching consumers is not necessary. Could you add a flag to KafkaSourceRDD to not use the cached consumer? It's better to also write a test to cover this case. In addition, this test should use only one partition in order to launch two tasks from different RDDs at the same time: TestSparkSession uses local[2], so it can only run two tasks at the same time.
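
A rough sketch of such a test, assuming a single-partition topic so that the two tasks from the two underlying RDDs run concurrently under local[2] (createDF is a hypothetical helper that builds the batch DataFrame of string values):

    testUtils.sendMessages(topic, (0 to 10).map(_.toString).toArray, Some(0))
    val df = createDF(topic)
    checkAnswer(df.union(df), ((0 to 10) ++ (0 to 10)).map(_.toString).toDF)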

val preferredLoc = if (numExecutors > 0) {
// This allows cached KafkaConsumers in the executors to be re-used to read the same
// partition in every batch.
Some(sortedExecutors(KafkaUtils.floorMod(tp.hashCode, numExecutors)))
Member

You don't need to set the preferred locations after changing to not use the cached consumers.

val untilPartitionOffsets = getPartitionOffsets(endingOffsets)
// Obtain topicPartitions in both from and until partition offset, ignoring
// topic partitions that were added and/or deleted between the two above calls.
val topicPartitions = fromPartitionOffsets.keySet.intersect(untilPartitionOffsets.keySet)
Member

It's better to throw an exception rather than ignoring the deleted partitions.
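
A sketch of the suggested check, assuming fromPartitionOffsets and untilPartitionOffsets are the two maps fetched just above:

    if (fromPartitionOffsets.keySet != untilPartitionOffsets.keySet) {
      throw new IllegalStateException(
        "Kafka returned different topic partitions for starting offsets " +
          s"${fromPartitionOffsets.keySet} and ending offsets ${untilPartitionOffsets.keySet}")
    }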

endingOffsets: KafkaOffsets)
extends BaseRelation with TableScan with Logging {

require(startingOffsets != LatestOffsets,
Member

nit: change it to assert since the parameters have already been validated.


SparkQA commented Jan 25, 2017

Test build #71954 has finished for PR 16686 at commit d31fc81.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member
@zsxwing zsxwing left a comment

Looks good overall except nits.

testUtils.sendMessages(topic, (0 to 10).map(_.toString).toArray, Some(0))

// Ensure local[2] so that two tasks will execute the query on one partition
val testSession = new TestSparkSession(sparkContext)
Member

nit: you don't need to create a TestSparkSession. I meant that this test already uses TestSparkSession, which uses local[2].

.load()
var df = reader.selectExpr("CAST(value AS STRING)")
checkAnswer(df.union(df),
(0 to 10).map(_.toString).toDF.union((0 to 10).map(_.toString).toDF))
Member

nit: ((0 to 10) ++ (0 to 10)).map(_.toString).toDF.

"for starting and ending offsets")
}

val sortedExecutors = KafkaUtils.getSortedExecutorList(sqlContext.sparkContext)
Member

nit: not used any more (these 3 lines)


val kafkaReaderThread = Executors.newSingleThreadExecutor(new ThreadFactory {
override def newThread(r: Runnable): Thread = {
logInfo("NEW UNINTERRUPTIBLE THREAD KAFKA OFFSET")
Member

nit: remove this debug log.

kafkaOffsetReader.fetchNewPartitionEarliestOffsets(newPartitions)
}(execContext)
ThreadUtils.awaitResult(future, Duration.Inf)

Member

nit: empty line

// Create an RDD that reads from Kafka and get the (key, value) pair as byte arrays.
val rdd = new KafkaSourceRDD(
sqlContext.sparkContext, executorKafkaParams, offsetRanges,
pollTimeoutMs, failOnDataLoss, false).map { cr =>
Member

nit: false -> reuseKafkaConsumer = false

// Obtain topicPartitions in both from and until partition offset, ignoring
// topic partitions that were added and/or deleted between the two above calls.
if (fromPartitionOffsets.keySet.size != untilPartitionOffsets.keySet.size) {
throw new IllegalStateException("Kafka return different topic partitions " +
Member

nit: please include fromPartitionOffsets and untilPartitionOffsets in the exception message so that it's easy to debug such a failure.

// Create an RDD that reads from Kafka and get the (key, value) pair as byte arrays.
val rdd = new KafkaSourceRDD(
sc, executorKafkaParams, offsetRanges, pollTimeoutMs, failOnDataLoss).map { cr =>
sc, executorKafkaParams, offsetRanges, pollTimeoutMs, failOnDataLoss, true).map { cr =>
Member

nit: true -> reuseKafkaConsumer = true


SparkQA commented Jan 25, 2017

Test build #71957 has finished for PR 16686 at commit 1db1649.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 25, 2017

Test build #71997 has finished for PR 16686 at commit 3b0d48b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 26, 2017

Test build #72045 has finished for PR 16686 at commit a5b0269.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

checkAnswer(df, (0 to 20).map(_.toString).toDF)

// "latest" should late bind to the current (latest) offset in the reader
testUtils.sendMessages(topic, (21 to 29).map(_.toString).toArray, Some(2))
Member

nit: could you add the following test below this line to make the semantics clear?

    // The same DataFrame instance should return the same result
    checkAnswer(df, (0 to 20).map(_.toString).toDF)

Author

This no longer holds now that we're binding in the executor, right?


SparkQA commented Jan 27, 2017

Test build #72077 has finished for PR 16686 at commit c08c01f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 27, 2017

Test build #72081 has finished for PR 16686 at commit 79d335e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 27, 2017

Test build #72085 has finished for PR 16686 at commit b597cf1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Author

ghost commented Jan 27, 2017

jenkins retest this please

Member
@zsxwing zsxwing left a comment

Could you also add LATEST and EARLIEST to KafkaUtils and replace the magic numbers -1 and -2? Sorry that I didn't bring it up earlier.
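
A sketch of those constants (the values follow the -1/-2 convention already used in this patch; where they should finally live is discussed further below):

    // Used to denote unresolved offset positions
    val LATEST = -1L    // bind to the latest available offset at read time
    val EARLIEST = -2L  // bind to the earliest available offset at read time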

props.put("log.flush.interval.messages", "1")
props.put("replica.socket.timeout.ms", "1500")
props.put("delete.topic.enable", "true")
withBrokerProps.map { p =>
Member

nit: you can change the type of withBrokerProps to Map[String, Object]. Then here you can just use props.putAll(withBrokerProps.asJava).
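
A sketch of that change, assuming the usual JavaConverters import and that props here is the java.util.Properties being built above:

    import scala.collection.JavaConverters._

    // withBrokerProps: Map[String, Object] (a Scala map)
    props.putAll(withBrokerProps.asJava)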


def getEarliestOffsets(topics: Set[String]): Map[TopicPartition, Long] = {
val kc = new KafkaConsumer[String, String](consumerConfiguration)
logInfo("Created consumer to get latest offsets")
Member

nit: please fix the log

kc.seekToBeginning(partitions)
val offsets = partitions.asScala.map(p => p -> kc.position(p)).toMap
kc.close()
logInfo("Closed consumer to get latest offsets")
Member

nit: please fix the log

if (range.fromOffset < 0 || range.untilOffset < 0) {
// Late bind the offset range
val fromOffset = if (range.fromOffset < 0) {
consumer.rawConsumer.seekToBeginning(ju.Arrays.asList(range.topicPartition))
Member

nit: add assert(range.fromOffset == -2) to avoid breaking it in future.

range.fromOffset
}
val untilOffset = if (range.untilOffset < 0) {
consumer.rawConsumer.seekToEnd(ju.Arrays.asList(range.topicPartition))
Member

nit: add assert(range.untilOffset == -1) to avoid breaking it in future.

assert(Thread.currentThread().isInstanceOf[UninterruptibleThread])
// Poll to get the latest assigned partitions
consumer.poll(0)
consumer.assignment().asScala.toSet
Member

nit: please also call pause like this to avoid fetching the real data when reusing the relation.

val partitions = consumer.assignment()
consumer.pause(partitions)
partitions.asScala.toSet


SparkQA commented Jan 31, 2017

Test build #72210 has finished for PR 16686 at commit 2487a72.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 31, 2017

Test build #72211 has finished for PR 16686 at commit 789d3af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member
@zsxwing zsxwing left a comment

LGTM

Contributor
@tdas tdas left a comment

Overall it looks fine, but it needs some work on the code organization. I believe we can reduce the number of classes and LOCs quite a bit.

private val groupId = kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG).asInstanceOf[String]

private var consumer = createConsumer
var rawConsumer = createConsumer
Contributor

Exposing an internal var is generally not a good idea. A better approach is to add the necessary methods (for which you need the consumer) to the class CachedKafkaConsumer.
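
For instance, the offset-range lookup could live on CachedKafkaConsumer itself, along the lines of the getAvailableOffsetRange() mentioned later in this review (a sketch only: AvailableOffsetRange is a hypothetical holder, ju is the java.util alias used in this file, and the seek calls assume the internal consumer is already assigned to topicPartition):

    case class AvailableOffsetRange(earliest: Long, latest: Long)

    /** Return the earliest and latest offsets currently available for this topic partition. */
    def getAvailableOffsetRange(): AvailableOffsetRange = {
      consumer.seekToBeginning(ju.Arrays.asList(topicPartition))
      val earliest = consumer.position(topicPartition)
      consumer.seekToEnd(ju.Arrays.asList(topicPartition))
      val latest = consumer.position(topicPartition)
      AvailableOffsetRange(earliest, latest)
    }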

Contributor

And why was it renamed?

val consumer = CachedKafkaConsumer.getOrCreate(
range.topic, range.partition, executorKafkaParams)
range.topic, range.partition, executorKafkaParams, reuseKafkaConsumer)
if (range.fromOffset < 0 || range.untilOffset < 0) {
Contributor

nit: Does this piece of code that resolves the range need to be inside the NextIterator? This causes a lot of unnecessary nesting. Instead of making the range a var, you can resolve the range above and then create the NextIterator.

Furthermore, why use rawConsumer directly and expose it? Why not use CachedKafkaConsumer.getAvailableOffsetRange()?

Author

Reworked it. Let me know what you think.

partition: Int,
kafkaParams: ju.Map[String, Object]): CachedKafkaConsumer = synchronized {
kafkaParams: ju.Map[String, Object],
reuse: Boolean): CachedKafkaConsumer = synchronized {
Contributor

Does this mean reuse an existing one, or allow reuse in the future?

Author

reuse existing. I changed the name to reuseExistingIfPresent.

import org.apache.spark.util.{ThreadUtils, UninterruptibleThread}


private[kafka010] trait KafkaOffsetReader {
Contributor

scala docs.

Contributor

This trait is a little weird. fetchTopicPartitions() fetches topics and partitions of what?
Clarifying these in the docs would be good.

def fetchTopicPartitions(): Set[TopicPartition]

/**
* Set consumer position to specified offsets, making sure all assignments are set.
Contributor

This doc seems wrong. The name says it should fetch offsets, but the doc says it sets something?

Contributor

And what's the difference between earliest and starting offsets?

* by the Map that is passed to the function.
*/
override def createRelation(
sqlContext: SQLContext,
Contributor

incorrect indentation.

.build()

private def kafkaParamsForExecutors(
specifiedKafkaParams: Map[String, String], uniqueGroupId: String) =
Contributor

incorrect indentation.

Contributor

Also, the convention is to have each param on a separate line.

private val FAIL_ON_DATA_LOSS_OPTION_KEY = "failondataloss"

// Used to check parameters for different source modes
private sealed trait Mode
Contributor

As commented elsewhere, Mode should not be required.

private[kafka010] object KafkaUtils {

// Used to denote unbounded offset positions
val LATEST = -1L
Contributor

Having these constants here does not make sense. It's better to have them in an object KafkaOffsets and put these numbers there.

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.ExecutorCacheTaskLocation

private[kafka010] object KafkaUtils {
Contributor

I don't see the need for this class. LATEST and EARLIEST are better put in object KafkaOffsets (the trait already exists), and the other methods used to be part of KafkaSource and may continue to live there (unless anybody else uses them).


SparkQA commented Feb 3, 2017

Test build #72294 has finished for PR 16686 at commit 5b48fc6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor
@tdas tdas left a comment

The last refactoring looks pretty good. Just a few more nits.

val warningMessage =
s"""
|The current available offset range is [$earliestOffset, $latestOffset).
|The current available offset range is [${range.earliest}, ${range.latest}).
Contributor

nit: offset range is $range

s"""
|The current available offset range is [$earliestOffset, $latestOffset).
| Offset ${offset} is out of range, and records in [$offset, $earliestOffset) will be
|The current available offset range is [${range.earliest}, ${range.latest}).
Contributor

nit: same as above.


private[kafka010] case object EarliestOffsets extends StartingOffsets
/**
* Bind to the earliest offsets in Kafka
Contributor

nit: better docs. This is an object, not a method. Say what the object represents; "Bind to earliest offsets..." reads like docs for a method.

partitionOffsets: Map[TopicPartition, Long]) extends KafkaOffsets

private[kafka010] object KafkaOffsets {
// Used to denote unbounded offset positions
Contributor

nit: "Used to represent unresolved offset limits as longs".
"Unbounded" sounds like it's infinite, or something.



private[kafka010] class KafkaRelation(
override val sqlContext: SQLContext,
Contributor

incorrect indents

}

/**
* Fetch the earliest offsets of partitions.
Contributor

Can you specify which partitions? Maybe "offsets of all partitions to be consumed according to the consumer strategy".

The same goes for the docs of other methods that do not take a specific list of partitions.

}
}

private def runUninterruptibly[T](body: => T): T = {
Contributor

nit: add docs.

}

/**
* Helper function that does multiple retries on the a body of code that returns offsets.
Contributor

nit: on the a body

"log.retention.bytes" -> 1.asInstanceOf[AnyRef], // retain nothing
"log.retention.ms" -> 1.asInstanceOf[AnyRef] // no wait time
)
testUtils = new KafkaTestUtils(withBrokerProps = brokerProps)
Contributor

Why disturb testUtils? Why not assign it to another local var? Then you don't have to tear down and set up all this stuff.
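
A sketch of the suggested alternative, assuming KafkaTestUtils exposes the setup()/teardown() used elsewhere in these suites (the variable name is illustrative):

    val retentionTestUtils = new KafkaTestUtils(withBrokerProps = brokerProps)
    retentionTestUtils.setup()
    try {
      // run the retention-specific assertions against retentionTestUtils
    } finally {
      retentionTestUtils.teardown()
    }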

}
}

private def createDF(topic: String,
Contributor

nit: should be

private def createDF(
    topic: String, 
    withOptions: ...


SparkQA commented Feb 3, 2017

Test build #72316 has finished for PR 16686 at commit 5776009.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 4, 2017

Test build #72333 has finished for PR 16686 at commit aef89bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

tdas commented Feb 7, 2017

LGTM!

Contributor

tdas commented Feb 7, 2017

@zsxwing please merge if you think your concerns were addressed correctly.

Member
@zsxwing zsxwing left a comment

LGTM except for the missing synchronized.

val topicPartition = new TopicPartition(topic, partition)
val key = CacheKey(groupId, topicPartition)

val removedConsumer = cache.remove(key)
Member

nit: please add synchronized.

val topicPartition = new TopicPartition(topic, partition)
val key = CacheKey(groupId, topicPartition)

val consumer = cache.get(key)
Member

nit: please add synchronized
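
A sketch of the two guarded cache accesses (the method names are illustrative; close() is the existing consumer close):

    def removeConsumer(key: CacheKey): Unit = synchronized {
      val removedConsumer = cache.remove(key)
      if (removedConsumer != null) removedConsumer.close()
    }

    def getConsumer(key: CacheKey): CachedKafkaConsumer = synchronized {
      cache.get(key)
    }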


SparkQA commented Feb 7, 2017

Test build #72534 has finished for PR 16686 at commit 4e56f8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 7, 2017

Test build #72536 has finished for PR 16686 at commit 3bc7c4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

zsxwing commented Feb 7, 2017

LGTM. Merging to master and 2.1.

@asfgit asfgit closed this in 8df4444 Feb 7, 2017
asfgit pushed a commit that referenced this pull request Feb 7, 2017

Author: Tyson Condie <tcondie@gmail.com>

Closes #16686 from tcondie/SPARK-18682.

(cherry picked from commit 8df4444)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
Contributor

kayousterhout commented Feb 11, 2017

I have recently noticed a few flaky failures of the KafkaSourceSuite test "subscribing topic by pattern with topic deletions" (e.g., https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72725/testReport/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/). Is it possible those were caused by this PR? (Filed this JIRA: https://issues.apache.org/jira/browse/SPARK-19559)

Contributor

lw-lin commented Feb 12, 2017

Hi @kayousterhout, #16902 is a fix for the flaky KafkaSourceSuite "subscribing topic by pattern with topic deletions" test.

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017

Author: Tyson Condie <tcondie@gmail.com>

Closes apache#16686 from tcondie/SPARK-18682.