IRIS
The IRIS reader is defined in the ai.cookie.spark.sql.sources.iris package.
Class | Description |
---|---|
DefaultSource | The default Spark Data Source (parameters below). |
IrisDataFrameReader | A convenience class providing a read function based on the default data source. |
Parameter | Description |
---|---|
path | An absolute (or relative) URI to an IRIS data file (reference). |
format | An enum value of 'csv' or 'libsvm', indicating the format of the data file. |
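The two parameters can also be passed through Spark's generic DataFrameReader API. The following is a minimal sketch meant for spark-shell with the cookie-datasets package loaded; it assumes the data source resolves from its package name (as in the Spark SQL example on this page) and that a file named iris.data exists in the working directory. The variable name irisDf is illustrative only.

```scala
// Sketch: load the IRIS data through the generic data source API.
// Assumes spark-shell (which provides sqlContext) and a local "iris.data" file.
val irisDf = sqlContext.read
  .format("ai.cookie.spark.sql.sources.iris") // package containing DefaultSource
  .option("format", "csv")                    // 'csv' or 'libsvm', per the table above
  .load("iris.data")                          // the 'path' parameter

irisDf.printSchema()
```

Whether the format option may be omitted depends on the reader's defaults, which this page does not state.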
Column | Description |
---|---|
label | A nominal label indicating a species of iris plant (0.0 -> setosa, 1.0 -> versicolor, 2.0 -> virginica). |
features | A 4-dimensional vector describing various characteristics of the plant. |
The feature data is encoded in a Vector for interoperability with Spark ML. The four dimensions are:
- sepal length (cm)
- sepal width (cm)
- petal length (cm)
- petal width (cm)
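The column layout above can be pictured with a small, Spark-free sketch. It is not the reader's actual implementation: it assumes the raw CSV file follows the usual iris.data layout (four comma-separated measurements followed by a species name such as Iris-setosa), and the names IrisCsvSketch and parseLine are hypothetical.

```scala
// Hypothetical helper illustrating the (label, features) encoding.
// Not part of cookie-datasets.
object IrisCsvSketch {
  // Label encoding from the column table: 0.0 -> setosa, 1.0 -> versicolor, 2.0 -> virginica.
  // The "Iris-..." spellings are an assumption about the raw file.
  val speciesIndex: Map[String, Double] = Map(
    "Iris-setosa"     -> 0.0,
    "Iris-versicolor" -> 1.0,
    "Iris-virginica"  -> 2.0
  )

  // Parse one CSV line into a nominal label and a 4-dimensional feature array
  // ordered as sepal length, sepal width, petal length, petal width.
  def parseLine(line: String): (Double, Array[Double]) = {
    val fields = line.trim.split(",")
    require(fields.length == 5, s"expected 5 fields, got ${fields.length}")
    val features = fields.take(4).map(_.toDouble)
    val label = speciesIndex(fields(4))
    (label, features)
  }
}

val (label, features) = IrisCsvSketch.parseLine("5.1,3.5,1.4,0.2,Iris-setosa")
println(s"$label ${features.mkString("[", ",", "]")}") // prints: 0.0 [5.1,3.5,1.4,0.2]
```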
The label and features columns contain relevant metadata, based on the IRIS dataset definition. See the walkthrough for details.
Here's a quick walkthrough based on examples/iris.script.scala.
Note: the example script is compatible with Mac OS X and Linux only, due to the use of the tar command.
The example script is intended to be used with spark-shell. For help with installing Spark, see the Installation page. This section also assumes you've checked out the cookie-datasets project to a working directory.
From the working directory, launch the example script with spark-shell. The script will download the IRIS dataset to a temporary directory, then load and manipulate a few DataFrames based on it.
$ spark-shell --packages "ai.cookie:cookie-datasets_2.10:0.1.0" -i examples/iris.script.scala
When the script completes, the shell stays running. Press CTRL-D to quit.
For Scala code, a convenient read method is provided:
import ai.cookie.spark.sql.sources.iris._
val df = sqlContext.read.iris("iris.data")
df.sample(false, 0.1).show()
Which outputs:
df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
+-----+-----------------+
|label| features|
+-----+-----------------+
| 0.0|[5.8,4.0,1.2,0.2]|
| 0.0|[5.7,4.4,1.5,0.4]|
| 0.0|[5.0,3.4,1.6,0.4]|
| 0.0|[5.2,3.5,1.5,0.2]|
| 0.0|[5.1,3.4,1.5,0.2]|
| 0.0|[5.0,3.5,1.6,0.6]|
| 1.0|[5.7,2.8,4.5,1.3]|
| 1.0|[4.9,2.4,3.3,1.0]|
| 1.0|[6.6,2.9,4.6,1.3]|
| 1.0|[5.4,3.0,4.5,1.5]|
| 1.0|[5.7,2.8,4.1,1.3]|
| 2.0|[7.7,2.6,6.9,2.3]|
| 2.0|[6.9,3.1,5.1,2.3]|
| 2.0|[6.5,3.0,5.2,2.0]|
+-----+-----------------+
For Spark SQL, the default data source can be used to register a temporary table:
CREATE TEMPORARY TABLE iris
USING ai.cookie.spark.sql.sources.iris
OPTIONS (path "iris.data", format "csv")
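From Scala, the same statement can be issued through sqlContext.sql, after which the table can be queried like any other. This is a sketch for spark-shell, assuming iris.data is in the working directory; the variable name perLabel is illustrative only.

```scala
// Register the temporary table using the data source's package name.
sqlContext.sql("""
  CREATE TEMPORARY TABLE iris
  USING ai.cookie.spark.sql.sources.iris
  OPTIONS (path "iris.data", format "csv")
""")

// Count the examples per label with plain SQL.
val perLabel = sqlContext.sql(
  "SELECT label, COUNT(*) AS n FROM iris GROUP BY label ORDER BY label")
perLabel.show()
```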
The loaded DataFrame contains useful metadata for interoperability with Spark ML.
The IRIS dataset labels each example with its species. The labels are encoded as nominal values, that is, indices that refer to a lookup table. Spark ML provides support for storing the lookup table as schema metadata.
Given an IRIS dataset (loaded as df here), the lookup table may be obtained as follows:
scala> import org.apache.spark.ml.attribute.Attribute
scala> val labelAttrs = Attribute.fromStructField(df.schema("label"))
labelAttrs: org.apache.spark.ml.attribute.Attribute = {"vals":["I. setosa","I. versicolor",
"I. virginica"],"type":"nominal","name":"label"}
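To use the lookup table programmatically, the attribute can be narrowed to a NominalAttribute, whose values field holds the species names. A minimal sketch, assuming df is the DataFrame loaded earlier; the variable name species is illustrative only.

```scala
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}

// Extract the species names from the label column's schema metadata.
// Falls back to an empty array if the column carries no nominal metadata.
val species: Array[String] = Attribute.fromStructField(df.schema("label")) match {
  case nominal: NominalAttribute => nominal.values.getOrElse(Array.empty[String])
  case _                         => Array.empty[String]
}

// A label value is an index into this array, e.g. label 1.0 -> species(1).
species.zipWithIndex.foreach { case (name, idx) => println(s"$idx.0 -> $name") }
```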
The metadata may be used to render a more readable view of the data:
import org.apache.spark.ml.feature.IndexToString
val i2s = new IndexToString().setInputCol("label").setOutputCol("labelString")
i2s.transform(df).select("labelString", "features").sample(false, 0.1).show
Which outputs:
+-------------+-----------------+
| labelString| features|
+-------------+-----------------+
| I. setosa|[4.6,3.1,1.5,0.2]|
| I. setosa|[4.9,3.1,1.5,0.1]|
| I. setosa|[4.7,3.2,1.6,0.2]|
| I. setosa|[4.9,3.1,1.5,0.1]|
|I. versicolor|[7.0,3.2,4.7,1.4]|
|I. versicolor|[6.3,3.3,4.7,1.6]|
|I. versicolor|[6.0,2.9,4.5,1.5]|
|I. versicolor|[6.0,3.4,4.5,1.6]|
| I. virginica|[6.7,2.5,5.8,1.8]|
| I. virginica|[7.7,2.8,6.7,2.0]|
| I. virginica|[6.4,2.8,5.6,2.1]|
| I. virginica|[7.2,3.0,5.8,1.6]|
| I. virginica|[5.9,3.0,5.1,1.8]|
+-------------+-----------------+
The features column contains useful metadata too: a label for each dimension.
scala> import org.apache.spark.ml.attribute.AttributeGroup
scala> val featureAttrs = AttributeGroup.fromStructField(df.schema("features")).attributes.get
featureAttrs: Array[org.apache.spark.ml.attribute.Attribute] = Array(
{"type":"numeric","idx":0,"name":"sepal length (cm)"}, {"type":"numeric","idx":1,"name":"sepal width (cm)"},
{"type":"numeric","idx":2,"name":"petal length (cm)"}, {"type":"numeric","idx":3,"name":"petal width (cm)"})
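One way to put the per-dimension names to work is Spark ML's VectorSlicer, which can select vector dimensions by attribute name. The sketch below assumes df is the DataFrame loaded earlier and a Spark version that includes VectorSlicer; the output column name petalFeatures is illustrative only.

```scala
import org.apache.spark.ml.feature.VectorSlicer

// Keep only the two petal measurements, selected by the names stored
// in the features column's metadata.
val slicer = new VectorSlicer()
  .setInputCol("features")
  .setOutputCol("petalFeatures")
  .setNames(Array("petal length (cm)", "petal width (cm)"))

val petals = slicer.transform(df).select("label", "petalFeatures")
petals.show(5)
```

Selecting by name rather than by index keeps the code readable and relies only on the metadata the reader attaches.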