address comments

WeichenXu123 · WeichenXu123 · commit 8231cb25c30f · 2018-10-23T09:55:24.000+08:00
diff --git a/docs/ml-datasource.md b/docs/ml-datasource.md
@@ -5,7 +5,7 @@ displayTitle: Data sources
 ---
 
 In this section, we introduce how to use data source in ML to load data.
-Beside some general data sources such as Parquet, CSV, JSON and JDBC, we also provide some specific data source for ML.
+Besides some general data sources such as Parquet, CSV, JSON and JDBC, we also provide some specific data sources for ML.
 
 **Table of Contents**
 
@@ -20,7 +20,7 @@ The schema of the `image` column is:
  - origin: `StringType` (represents the file path of the image)
  - height: `IntegerType` (height of the image)
  - width: `IntegerType` (width of the image)
- - nChannels: `IntegerType` (number of the image channels)
+ - nChannels: `IntegerType` (number of image channels)
  - mode: `IntegerType` (OpenCV-compatible type)
  - data: `BinaryType` (Image bytes in OpenCV-compatible order: row-wise BGR in most cases)
 
@@ -31,7 +31,7 @@ The schema of the `image` column is:
 implements a Spark SQL data source API for loading image data as a DataFrame.
 
 {% highlight scala %}
-scala> val df = spark.read.format("image").load("data/mllib/images/origin/kittens")
+scala> val df = spark.read.format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens")
 df: org.apache.spark.sql.DataFrame = [image: struct<origin: string, height: int ... 4 more fields>]
 
 scala> df.select("image.origin", "image.width", "image.height").show(truncate=false)
@@ -42,7 +42,6 @@ scala> df.select("image.origin", "image.width", "image.height").show(truncate=fa
 |file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
 |file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
 |file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
-|file:///spark/data/mllib/images/origin/kittens/not-image.txt           |-1   |-1    |
 +-----------------------------------------------------------------------+-----+------+
 {% endhighlight %}
 </div>
@@ -52,7 +51,7 @@ scala> df.select("image.origin", "image.width", "image.height").show(truncate=fa
 implements Spark SQL data source API for loading image data as DataFrame.
 
 {% highlight java %}
-Dataset<Row> imagesDF = spark.read().format("image").load("data/mllib/images/origin/kittens");
+Dataset<Row> imagesDF = spark.read().format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens");
 imagesDF.select("image.origin", "image.width", "image.height").show(false);
 /*
 Will output:
@@ -63,7 +62,6 @@ Will output:
 |file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
 |file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
 |file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
-|file:///spark/data/mllib/images/origin/kittens/not-image.txt           |-1   |-1    |
 +-----------------------------------------------------------------------+-----+------+
 */
 {% endhighlight %}
@@ -73,7 +71,7 @@ Will output:
 In PySpark we provide Spark SQL data source API for loading image data as DataFrame.
 
 {% highlight python %}
->>> df = spark.read.format("image").load("data/mllib/images/origin/kittens")
+>>> df = spark.read.format("image").option("dropInvalid", True).load("data/mllib/images/origin/kittens")
 >>> df.select("image.origin", "image.width", "image.height").show(truncate=False)
 +-----------------------------------------------------------------------+-----+------+
 |origin                                                                 |width|height|
@@ -82,7 +80,6 @@ In PySpark we provide Spark SQL data source API for loading image data as DataFr
 |file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
 |file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
 |file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
-|file:///spark/data/mllib/images/origin/kittens/not-image.txt           |-1   |-1    |
 +-----------------------------------------------------------------------+-----+------+
 {% endhighlight %}
 </div>
@@ -98,13 +95,11 @@ In SparkR we provide Spark SQL data source API for loading image data as DataFra
 2            file:///spark/data/mllib/images/origin/kittens/DP802813.jpg
 3 file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg
 4            file:///spark/data/mllib/images/origin/kittens/DP153539.jpg
-5           file:///spark/data/mllib/images/origin/kittens/not-image.txt
   width height
 1   300    311
 2   199    313
 3   300    200
 4   300    296
-5    -1     -1
 
 {% endhighlight %}
 </div>
diff --git a/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala
@@ -27,7 +27,7 @@ package org.apache.spark.ml.source.image
  *  - origin: `StringType` (represents the file path of the image)
  *  - height: `IntegerType` (height of the image)
  *  - width: `IntegerType` (width of the image)
- *  - nChannels: `IntegerType` (number of the image channels)
+ *  - nChannels: `IntegerType` (number of image channels)
  *  - mode: `IntegerType` (OpenCV-compatible type)
  *  - data: `BinaryType` (Image bytes in OpenCV-compatible order: row-wise BGR in most cases)
  *
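
Note on the doc change above: the removed `not-image.txt` rows carried `-1` for both width and height, which is how the old example output marked a file that could not be decoded as an image. With `dropInvalid` set, those rows no longer appear. The effect on the example output can be sketched in plain Python with no Spark dependency. `ImageRow` and `drop_invalid` are hypothetical names used only for illustration, not part of the Spark API, and the validity check here (both dimensions not `-1`) is an assumption based on the output shown in the diff, not on the data source's internals.

```python
from typing import List, NamedTuple


class ImageRow(NamedTuple):
    """Hypothetical stand-in for the three columns selected in the doc examples."""
    origin: str
    width: int
    height: int


def drop_invalid(rows: List[ImageRow]) -> List[ImageRow]:
    # Assumption: an undecodable file is marked with width == -1 and height == -1,
    # as in the "not-image.txt" row removed by this diff.
    return [r for r in rows if r.width != -1 and r.height != -1]


# Rows copied from the example output visible in the diff hunks.
rows = [
    ImageRow("file:///spark/data/mllib/images/origin/kittens/DP802813.jpg", 199, 313),
    ImageRow("file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg", 300, 200),
    ImageRow("file:///spark/data/mllib/images/origin/kittens/DP153539.jpg", 300, 296),
    ImageRow("file:///spark/data/mllib/images/origin/kittens/not-image.txt", -1, -1),
]

valid = drop_invalid(rows)
for r in valid:
    print(r.origin, r.width, r.height)
```

Only the three `.jpg` rows remain in `valid`, which matches the updated example output in `docs/ml-datasource.md`.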
