This repository was archived by the owner on Dec 31, 2020. It is now read-only.
Eron Wright edited this page Dec 23, 2015 · 3 revisions


Using the Data Reader

Package

The IRIS reader is defined in the ai.cookie.spark.sql.sources.iris package.

Classes

Class                Description
DefaultSource        The default Spark Data Source (parameters below).
IrisDataFrameReader  A convenience class providing a read function based on the default data source.

Data Source Parameters

Parameter  Description
path       An absolute (or relative) URI to an IRIS data file (reference).
format     An enum value of 'csv' or 'libsvm', indicating the format of the data file.
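
The parameters above can also be passed through Spark's generic DataFrameReader API. The following is a minimal sketch, assuming a spark-shell session (where sqlContext is predefined), the cookie-datasets package on the classpath, and an iris.data file in the working directory:

```scala
// Assumes spark-shell: `sqlContext` is predefined there.
// Assumes iris.data exists in the current working directory.
val df = sqlContext.read
  .format("ai.cookie.spark.sql.sources.iris") // package containing DefaultSource
  .option("format", "csv")                    // or "libsvm"
  .load("iris.data")                          // supplies the `path` parameter

// Print the column names and types of the loaded DataFrame.
df.printSchema()
```

Passing the path to load is equivalent to setting the path option explicitly.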

Schema

Column    Description
label     A nominal label indicating the species of iris plant (0.0 -> setosa, 1.0 -> versicolor, 2.0 -> virginica).
features  A 4-dimensional vector of sepal and petal measurements (see Feature Data).

Feature Data

The feature data is encoded in a Vector for interoperability with Spark ML. The four dimensions are:

  1. sepal length (cm)
  2. sepal width (cm)
  3. petal length (cm)
  4. petal width (cm)
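
To make the column layout concrete, here is a rough sketch in plain Scala of how one CSV line could map onto a label and a feature array in the order listed above. IrisRowSketch is a hypothetical helper, not part of cookie-datasets, and it assumes the CSV file follows the UCI layout (four measurements followed by a species name such as Iris-setosa):

```scala
// Hypothetical helper, NOT the library's implementation: maps one line of a
// UCI-style iris.data CSV file onto the (label, features) layout.
object IrisRowSketch {
  // Label indices as documented in the Schema section.
  val speciesIndex: Map[String, Double] = Map(
    "Iris-setosa"     -> 0.0,
    "Iris-versicolor" -> 1.0,
    "Iris-virginica"  -> 2.0
  )

  // Returns None for blank or malformed lines instead of throwing.
  def parse(line: String): Option[(Double, Array[Double])] = {
    val fields = line.trim.split(",")
    if (fields.length != 5) None
    else for {
      label    <- speciesIndex.get(fields(4))
      // Order: sepal length, sepal width, petal length, petal width (cm).
      features <- scala.util.Try(fields.take(4).map(_.toDouble)).toOption
    } yield (label, features)
  }
}
```

For example, IrisRowSketch.parse("5.1,3.5,1.4,0.2,Iris-setosa") yields label 0.0 with the four measurements in the order above, while a blank line or an unknown species name yields None.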

Metadata

The label and features columns contain relevant metadata, based on the IRIS dataset definition. See the walkthrough for details.

Walkthrough

Here's a quick walkthrough based on examples/iris.script.scala.

Note: the example script is compatible with Mac OS X and Linux only, due to the use of the tar command.

Running the Example

The example script is intended to be used with spark-shell. For help with installing Spark, see the Installation page. This section also assumes you've checked out the cookie-datasets project to a working directory.

From the working directory, launch the example script with spark-shell. The script will download the IRIS dataset to a temporary directory, then load and manipulate a few DataFrames based on it.

$ spark-shell --packages "ai.cookie:cookie-datasets_2.10:0.1.0" -i examples/iris.script.scala

When the script completes, the shell stays running. Press CTRL-D to quit.

Loading a DataFrame

Scala

For Scala code, a convenient read method is provided:

import ai.cookie.spark.sql.sources.iris._
val df = sqlContext.read.iris("iris.data")
df.sample(false, 0.1).show()

This outputs something like the following (the sample is random, so the rows will vary):

df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
+-----+-----------------+
|label|         features|
+-----+-----------------+
|  0.0|[5.8,4.0,1.2,0.2]|
|  0.0|[5.7,4.4,1.5,0.4]|
|  0.0|[5.0,3.4,1.6,0.4]|
|  0.0|[5.2,3.5,1.5,0.2]|
|  0.0|[5.1,3.4,1.5,0.2]|
|  0.0|[5.0,3.5,1.6,0.6]|
|  1.0|[5.7,2.8,4.5,1.3]|
|  1.0|[4.9,2.4,3.3,1.0]|
|  1.0|[6.6,2.9,4.6,1.3]|
|  1.0|[5.4,3.0,4.5,1.5]|
|  1.0|[5.7,2.8,4.1,1.3]|
|  2.0|[7.7,2.6,6.9,2.3]|
|  2.0|[6.9,3.1,5.1,2.3]|
|  2.0|[6.5,3.0,5.2,2.0]|
+-----+-----------------+

SQL

For Spark SQL, use the default data source to register a temporary table:

CREATE TEMPORARY TABLE iris
USING ai.cookie.spark.sql.sources.iris
OPTIONS (path "iris.data", format "csv")
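
Once registered, the table can be queried like any other. A minimal sketch, assuming a spark-shell session (sqlContext predefined) and an iris.data file in the working directory; the query itself is only an illustration:

```scala
// Assumes spark-shell: `sqlContext` is predefined there.
// Register the temporary table using the statement shown above.
sqlContext.sql("""
  CREATE TEMPORARY TABLE iris
  USING ai.cookie.spark.sql.sources.iris
  OPTIONS (path "iris.data", format "csv")
""")

// Count the examples for each species index.
val counts = sqlContext.sql(
  "SELECT label, COUNT(*) AS n FROM iris GROUP BY label ORDER BY label")
counts.show()
```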

Leveraging Metadata

The loaded DataFrame contains useful metadata for interoperability with Spark ML.

Labels

The IRIS dataset labels each example with its species. The labels are encoded as nominal values: numeric indices that refer to a lookup table of species names. Spark ML supports storing that lookup table as schema metadata.

Given an IRIS dataset (loaded as df here), the lookup table may be obtained as follows:

scala> import org.apache.spark.ml.attribute.Attribute
scala> val labelAttrs = Attribute.fromStructField(df.schema("label"))
labelAttrs: org.apache.spark.ml.attribute.Attribute = {"vals":["I. setosa","I. versicolor",
"I. virginica"],"type":"nominal","name":"label"}
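
To use the lookup table directly, the species names can be pulled out of the attribute. This is a sketch, assuming df was loaded as shown in "Loading a DataFrame"; species and speciesOf are illustrative names, not part of the library:

```scala
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}

// Assumes `df` was loaded as shown in "Loading a DataFrame".
val species: Seq[String] = Attribute.fromStructField(df.schema("label")) match {
  case nominal: NominalAttribute => nominal.values.map(_.toSeq).getOrElse(Seq.empty)
  case _                         => Seq.empty // column carries no nominal metadata
}

// Translate a label index back to its species name, if the index is in range.
def speciesOf(label: Double): Option[String] = species.lift(label.toInt)
```

Returning an Option keeps out-of-range indices from throwing, which helps when labels come from a model's predictions rather than from the dataset itself.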

The metadata may be used to render a more readable view of the data:

import org.apache.spark.ml.feature.IndexToString
val i2s = new IndexToString().setInputCol("label").setOutputCol("labelString")
i2s.transform(df).select("labelString", "features").sample(false, 0.1).show

This outputs something like the following (the sample is random, so the rows will vary):

+-------------+-----------------+
|  labelString|         features|
+-------------+-----------------+
|    I. setosa|[4.6,3.1,1.5,0.2]|
|    I. setosa|[4.9,3.1,1.5,0.1]|
|    I. setosa|[4.7,3.2,1.6,0.2]|
|    I. setosa|[4.9,3.1,1.5,0.1]|
|I. versicolor|[7.0,3.2,4.7,1.4]|
|I. versicolor|[6.3,3.3,4.7,1.6]|
|I. versicolor|[6.0,2.9,4.5,1.5]|
|I. versicolor|[6.0,3.4,4.5,1.6]|
| I. virginica|[6.7,2.5,5.8,1.8]|
| I. virginica|[7.7,2.8,6.7,2.0]|
| I. virginica|[6.4,2.8,5.6,2.1]|
| I. virginica|[7.2,3.0,5.8,1.6]|
| I. virginica|[5.9,3.0,5.1,1.8]|
+-------------+-----------------+

Features

The features column carries metadata too: a name for each dimension of the vector.

scala> import org.apache.spark.ml.attribute.AttributeGroup
scala> val featureAttrs = AttributeGroup.fromStructField(df.schema("features")).attributes.get
featureAttrs: Array[org.apache.spark.ml.attribute.Attribute] = Array(
{"type":"numeric","idx":0,"name":"sepal length (cm)"}, {"type":"numeric","idx":1,"name":"sepal width (cm)"},
{"type":"numeric","idx":2,"name":"petal length (cm)"}, {"type":"numeric","idx":3,"name":"petal width (cm)"})
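
As an illustration, the names can be paired with the values of a row. This sketch assumes df was loaded as shown in "Loading a DataFrame" and a Spark 1.x release, where DataFrame vector columns use org.apache.spark.mllib.linalg.Vector; featureNames is an illustrative name:

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup}
import org.apache.spark.mllib.linalg.Vector

// Assumes `df` was loaded as shown in "Loading a DataFrame".
val featureNames: Array[String] = AttributeGroup
  .fromStructField(df.schema("features"))
  .attributes.getOrElse(Array.empty[Attribute]) // empty if no per-dimension metadata
  .map(_.name.getOrElse("unnamed"))

// Pair each measurement in the first row with its name and print it.
val first = df.select("features").head.getAs[Vector](0)
featureNames.zip(first.toArray).foreach { case (name, value) =>
  println(s"$name = $value")
}
```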