@@ -5,7 +5,7 @@ displayTitle: Data sources
 ---

 In this section, we introduce how to use data source in ML to load data.
-Beside some general data sources such as Parquet, CSV, JSON and JDBC, we also provide some specific data source for ML.
+Beside some general data sources such as Parquet, CSV, JSON and JDBC, we also provide some specific data sources for ML.

 **Table of Contents**

@@ -20,7 +20,7 @@ The schema of the `image` column is:
  - origin: `StringType` (represents the file path of the image)
  - height: `IntegerType` (height of the image)
  - width: `IntegerType` (width of the image)
- - nChannels: `IntegerType` (number of the image channels)
+ - nChannels: `IntegerType` (number of image channels)
  - mode: `IntegerType` (OpenCV-compatible type)
  - data: `BinaryType` (Image bytes in OpenCV-compatible order: row-wise BGR in most cases)

@@ -31,7 +31,7 @@ The schema of the `image` column is:
 implements a Spark SQL data source API for loading image data as a DataFrame.

 {% highlight scala %}
-scala> val df = spark.read.format("image").load("data/mllib/images/origin/kittens")
+scala> val df = spark.read.format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens")
 df: org.apache.spark.sql.DataFrame = [image: struct<origin: string, height: int ... 4 more fields>]

 scala> df.select("image.origin", "image.width", "image.height").show(truncate=false)
@@ -42,7 +42,6 @@ scala> df.select("image.origin", "image.width", "image.height").show(truncate=fa
 | file:///spark/data/mllib/images/origin/kittens/DP802813.jpg | 199 | 313 |
 | file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg | 300 | 200 |
 | file:///spark/data/mllib/images/origin/kittens/DP153539.jpg | 300 | 296 |
-| file:///spark/data/mllib/images/origin/kittens/not-image.txt | -1 | -1 |
 +-----------------------------------------------------------------------+-----+------+
 {% endhighlight %}
 </div>
@@ -52,7 +51,7 @@ scala> df.select("image.origin", "image.width", "image.height").show(truncate=fa
 implements Spark SQL data source API for loading image data as DataFrame.

 {% highlight java %}
-Dataset<Row> imagesDF = spark.read().format("image").load("data/mllib/images/origin/kittens");
+Dataset<Row> imagesDF = spark.read().format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens");
 imagesDF.select("image.origin", "image.width", "image.height").show(false);
 /*
 Will output:
@@ -63,7 +62,6 @@ Will output:
 | file:///spark/data/mllib/images/origin/kittens/DP802813.jpg | 199 | 313 |
 | file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg | 300 | 200 |
 | file:///spark/data/mllib/images/origin/kittens/DP153539.jpg | 300 | 296 |
-| file:///spark/data/mllib/images/origin/kittens/not-image.txt | -1 | -1 |
 +-----------------------------------------------------------------------+-----+------+
 */
 {% endhighlight %}
@@ -73,7 +71,7 @@ Will output:
 In PySpark we provide Spark SQL data source API for loading image data as DataFrame.

 {% highlight python %}
->>> df = spark.read.format("image").load("data/mllib/images/origin/kittens")
+>>> df = spark.read.format("image").option("dropInvalid", True).load("data/mllib/images/origin/kittens")
 >>> df.select("image.origin", "image.width", "image.height").show(truncate=False)
 +-----------------------------------------------------------------------+-----+------+
 | origin | width| height|
@@ -82,7 +80,6 @@ In PySpark we provide Spark SQL data source API for loading image data as DataFr
 | file:///spark/data/mllib/images/origin/kittens/DP802813.jpg | 199 | 313 |
 | file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg | 300 | 200 |
 | file:///spark/data/mllib/images/origin/kittens/DP153539.jpg | 300 | 296 |
-| file:///spark/data/mllib/images/origin/kittens/not-image.txt | -1 | -1 |
 +-----------------------------------------------------------------------+-----+------+
 {% endhighlight %}
 </div>
@@ -98,13 +95,11 @@ In SparkR we provide Spark SQL data source API for loading image data as DataFra
 2 file:///spark/data/mllib/images/origin/kittens/DP802813.jpg
 3 file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg
 4 file:///spark/data/mllib/images/origin/kittens/DP153539.jpg
-5 file:///spark/data/mllib/images/origin/kittens/not-image.txt
   width height
 1 300 311
 2 199 313
 3 300 200
 4 300 296
-5 -1 -1

 {% endhighlight %}
 </div>
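
The `image` schema touched by this change describes `data` as image bytes in row-wise BGR order. As a rough illustration of what that layout means for a consumer, here is a minimal plain-Python sketch that regroups such bytes into RGB pixels. The helper name `bgr_rows_to_rgb` is hypothetical and not part of Spark; the sketch assumes 8-bit samples, exactly three channels, and no row padding, which may not hold for every image the data source loads.

```python
def bgr_rows_to_rgb(data, height, width, n_channels=3):
    """Regroup row-wise BGR bytes into a height x width grid of (R, G, B) tuples.

    Hypothetical helper for illustration only; assumes 8-bit, 3-channel data
    with no padding between rows.
    """
    if n_channels != 3:
        raise ValueError("this sketch only handles 3-channel BGR data")
    if len(data) != height * width * n_channels:
        raise ValueError("data length does not match height * width * nChannels")

    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            # Pixels are stored one row after another, three bytes per pixel.
            offset = (y * width + x) * n_channels
            b, g, r = data[offset], data[offset + 1], data[offset + 2]
            row.append((r, g, b))  # swap BGR to RGB
        rows.append(row)
    return rows


# A 2x2 image: blue, green, red, and an arbitrary fourth pixel (all in BGR order).
sample = bytes([255, 0, 0,  0, 255, 0,  0, 0, 255,  10, 20, 30])
pixels = bgr_rows_to_rgb(sample, height=2, width=2)
```

In a real pipeline the `height`, `width`, `nChannels`, and `data` arguments would come from the fields of one `image` struct; the `mode` field should be checked first, since other OpenCV-compatible types use a different number of channels.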