Commit 8231cb2

address comments
1 parent 6de0c28 commit 8231cb2

2 files changed: 6 additions and 11 deletions

docs/ml-datasource.md

Lines changed: 5 additions & 10 deletions
@@ -5,7 +5,7 @@ displayTitle: Data sources
 ---
 
 In this section, we introduce how to use data source in ML to load data.
-Beside some general data sources such as Parquet, CSV, JSON and JDBC, we also provide some specific data source for ML.
+Beside some general data sources such as Parquet, CSV, JSON and JDBC, we also provide some specific data sources for ML.
 
 **Table of Contents**
 
@@ -20,7 +20,7 @@ The schema of the `image` column is:
 - origin: `StringType` (represents the file path of the image)
 - height: `IntegerType` (height of the image)
 - width: `IntegerType` (width of the image)
-- nChannels: `IntegerType` (number of the image channels)
+- nChannels: `IntegerType` (number of image channels)
 - mode: `IntegerType` (OpenCV-compatible type)
 - data: `BinaryType` (Image bytes in OpenCV-compatible order: row-wise BGR in most cases)
 
@@ -31,7 +31,7 @@ The schema of the `image` column is:
 implements a Spark SQL data source API for loading image data as a DataFrame.
 
 {% highlight scala %}
-scala> val df = spark.read.format("image").load("data/mllib/images/origin/kittens")
+scala> val df = spark.read.format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens")
 df: org.apache.spark.sql.DataFrame = [image: struct<origin: string, height: int ... 4 more fields>]
 
 scala> df.select("image.origin", "image.width", "image.height").show(truncate=false)
@@ -42,7 +42,6 @@ scala> df.select("image.origin", "image.width", "image.height").show(truncate=fa
 |file:///spark/data/mllib/images/origin/kittens/DP802813.jpg |199 |313 |
 |file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300 |200 |
 |file:///spark/data/mllib/images/origin/kittens/DP153539.jpg |300 |296 |
-|file:///spark/data/mllib/images/origin/kittens/not-image.txt |-1 |-1 |
 +-----------------------------------------------------------------------+-----+------+
 {% endhighlight %}
 </div>
@@ -52,7 +51,7 @@ scala> df.select("image.origin", "image.width", "image.height").show(truncate=fa
 implements Spark SQL data source API for loading image data as DataFrame.
 
 {% highlight java %}
-Dataset<Row> imagesDF = spark.read().format("image").load("data/mllib/images/origin/kittens");
+Dataset<Row> imagesDF = spark.read().format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens");
 imageDF.select("image.origin", "image.width", "image.height").show(false);
 /*
 Will output:
@@ -63,7 +62,6 @@ Will output:
 |file:///spark/data/mllib/images/origin/kittens/DP802813.jpg |199 |313 |
 |file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300 |200 |
 |file:///spark/data/mllib/images/origin/kittens/DP153539.jpg |300 |296 |
-|file:///spark/data/mllib/images/origin/kittens/not-image.txt |-1 |-1 |
 +-----------------------------------------------------------------------+-----+------+
 */
 {% endhighlight %}
@@ -73,7 +71,7 @@ Will output:
 In PySpark we provide Spark SQL data source API for loading image data as DataFrame.
 
 {% highlight python %}
->>> df = spark.read.format("image").load("data/mllib/images/origin/kittens")
+>>> df = spark.read.format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens")
 >>> df.select("image.origin", "image.width", "image.height").show(truncate=False)
 +-----------------------------------------------------------------------+-----+------+
 |origin |width|height|
@@ -82,7 +80,6 @@ In PySpark we provide Spark SQL data source API for loading image data as DataFr
 |file:///spark/data/mllib/images/origin/kittens/DP802813.jpg |199 |313 |
 |file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300 |200 |
 |file:///spark/data/mllib/images/origin/kittens/DP153539.jpg |300 |296 |
-|file:///spark/data/mllib/images/origin/kittens/not-image.txt |-1 |-1 |
 +-----------------------------------------------------------------------+-----+------+
 {% endhighlight %}
 </div>
@@ -98,13 +95,11 @@ In SparkR we provide Spark SQL data source API for loading image data as DataFra
 2 file:///spark/data/mllib/images/origin/kittens/DP802813.jpg
 3 file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg
 4 file:///spark/data/mllib/images/origin/kittens/DP153539.jpg
-5 file:///spark/data/mllib/images/origin/kittens/not-image.txt
 width height
 1 300 311
 2 199 313
 3 300 200
 4 300 296
-5 -1 -1
 
 {% endhighlight %}
 </div>

mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ package org.apache.spark.ml.source.image
  * - origin: `StringType` (represents the file path of the image)
  * - height: `IntegerType` (height of the image)
  * - width: `IntegerType` (width of the image)
- * - nChannels: `IntegerType` (number of the image channels)
+ * - nChannels: `IntegerType` (number of image channels)
  * - mode: `IntegerType` (OpenCV-compatible type)
  * - data: `BinaryType` (Image bytes in OpenCV-compatible order: row-wise BGR in most cases)
  *
