Skip to content
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 22 additions & 3 deletions docs/ml-features.md
Original file line number Diff line number Diff line change
Expand Up @@ -585,7 +585,11 @@ for more details on the API.
## StringIndexer

`StringIndexer` encodes a string column of labels to a column of label indices.
The indices are in `[0, numLabels)`, ordered by label frequencies, so the most frequent label gets index `0`.
The indices are in `[0, numLabels)`, and four ordering options are supported:
"frequencyDesc": descending order by label frequency (most frequent label assigned 0),
"frequencyAsc": ascending order by label frequency (least frequent label assigned 0),
"alphabetDesc": descending alphabetical order, and "alphabetAsc": ascending alphabetical order
(default = "frequencyDesc").
The unseen labels will be put at index numLabels if user chooses to keep them.
If the input column is numeric, we cast it to string and index the string
values. When downstream pipeline components such as `Estimator` or
Expand Down Expand Up @@ -1593,10 +1597,25 @@ Suppose `a` and `b` are double columns, we use the following simple examples to
* `y ~ a + b + a:b - 1` means model `y ~ w1 * a + w2 * b + w3 * a * b` where `w1, w2, w3` are coefficients.

`RFormula` produces a vector column of features and a double or string column of label.
Like when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles.
If the label column is of type string, it will be first transformed to double with `StringIndexer`.
Like when formulas are used in R for linear regression, numeric columns will be cast to doubles.
As to string input columns, they will first be transformed with [StringIndexer](ml-features.html#stringindexer) using ordering determined by `stringOrderType`,
and the last category after ordering is dropped, then the doubles will be one-hot encoded.

Suppose a string feature column containing values `{'b', 'a', 'b', 'a', 'c', 'b'}`, we set `stringOrderType` to control the encoding:
~~~
stringOrderType | Category mapped to 0 by StringIndexer | Category dropped by RFormula
----------------|---------------------------------------|---------------------------------
'frequencyDesc' | most frequent category ('b') | least frequent category ('c')
'frequencyAsc' | least frequent category ('c') | most frequent category ('b')
'alphabetDesc' | last alphabetical category ('c') | first alphabetical category ('a')
'alphabetAsc' | first alphabetical category ('a') | last alphabetical category ('c')
~~~

If the label column is of type string, it will be first transformed to double with [StringIndexer](ml-features.html#stringindexer) using `frequencyDesc` ordering.
If the label column does not exist in the DataFrame, the output label column will be created from the specified response variable in the formula.

**Note:** The ordering option `stringOrderType` is NOT used for the label column. When the label column is indexed, it uses the default descending frequency ordering in `StringIndexer`.

**Examples**

Assume that we have a DataFrame with the columns `id`, `country`, `hour`, and `clicked`:
Expand Down