
Commit b938ca7

yanboliang authored and shivaram committed
[SPARKR][DOC] SparkR ML user guides update for 2.0
## What changes were proposed in this pull request?

* Update the SparkR ML section to make it consistent with the SparkR API docs.
* Since #13972 added labelling support for the ```include_example``` Jekyll plugin, we can split the single ```ml.R``` example file into multiple line blocks with different labels and include them under the different algorithms/models in the generated HTML page.

## How was this patch tested?

Docs-only update; the generated docs were checked manually.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14011 from yanboliang/r-user-guide-update.

(cherry picked from commit 2ad031b)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
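As a sketch of the labelling mechanism this commit relies on (syntax taken from the diffs in this commit; the surrounding R code is illustrative): a region of the shared example file is delimited with `# $example on:<label>$` / `# $example off:<label>$` comments, and a docs page pulls in just that region by passing the label as the first argument to `include_example`.

```r
# In examples/src/main/r/ml.R -- mark a labelled region:
# $example on:glm$
gaussianGLM <- spark.glm(irisDF, Sepal_Length ~ Sepal_Width + Species, family = "gaussian")
summary(gaussianGLM)
# $example off:glm$
```

A docs page such as `docs/sparkr.md` then renders only that labelled block with `{% include_example glm r/ml.R %}`, so one example file can feed several sections of the generated HTML page.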
1 parent aea33bf commit b938ca7

File tree: 3 files changed (+41, −32 lines)

R/pkg/R/mllib.R

Lines changed: 5 additions & 3 deletions
```diff
@@ -55,8 +55,9 @@ setClass("KMeansModel", representation(jobj = "jobj"))
 
 #' Generalized Linear Models
 #'
-#' Fits generalized linear model against a Spark DataFrame. Users can print, make predictions on the
-#' produced model and save the model to the input path.
+#' Fits generalized linear model against a Spark DataFrame.
+#' Users can call \code{summary} to print a summary of the fitted model, \code{predict} to make
+#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models.
 #'
 #' @param data SparkDataFrame for training.
 #' @param formula A symbolic description of the model to be fitted. Currently only a few formula
@@ -270,7 +271,8 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
 #' K-Means Clustering Model
 #'
 #' Fits a k-means clustering model against a Spark DataFrame, similarly to R's kmeans().
-#' Users can print, make predictions on the produced model and save the model to the input path.
+#' Users can call \code{summary} to print a summary of the fitted model, \code{predict} to make
+#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models.
 #'
 #' @param data SparkDataFrame for training
 #' @param formula A symbolic description of the model to be fitted. Currently only a few formula
```

docs/sparkr.md

Lines changed: 25 additions & 18 deletions
```diff
@@ -355,32 +355,39 @@ head(teenagers)
 
 # Machine Learning
 
-SparkR supports the following Machine Learning algorithms.
+SparkR supports the following machine learning algorithms currently: `Generalized Linear Model`, `Accelerated Failure Time (AFT) Survival Regression Model`, `Naive Bayes Model` and `KMeans Model`.
+Under the hood, SparkR uses MLlib to train the model.
+Users can call `summary` to print a summary of the fitted model, [predict](api/R/predict.html) to make predictions on new data, and [write.ml](api/R/write.ml.html)/[read.ml](api/R/read.ml.html) to save/load fitted models.
+SparkR supports a subset of the available R formula operators for model fitting, including '~', '.', ':', '+', and '-'.
 
-* Generalized Linear Regression Model [spark.glm()](api/R/spark.glm.html)
-* Naive Bayes [spark.naiveBayes()](api/R/spark.naiveBayes.html)
-* KMeans [spark.kmeans()](api/R/spark.kmeans.html)
-* AFT Survival Regression [spark.survreg()](api/R/spark.survreg.html)
+## Algorithms
 
-[Generalized Linear Regression](api/R/spark.glm.html) can be used to train a model from a specified family. Currently the Gaussian, Binomial, Poisson and Gamma families are supported. We support a subset of the available R formula operators for model fitting, including '~', '.', ':', '+', and '-'.
+### Generalized Linear Model
 
-The [summary()](api/R/summary.html) function gives the summary of a model produced by different algorithms listed above.
-It produces the similar result compared with R summary function.
+[spark.glm()](api/R/spark.glm.html) or [glm()](api/R/glm.html) fits generalized linear model against a Spark DataFrame.
+Currently "gaussian", "binomial", "poisson" and "gamma" families are supported.
+{% include_example glm r/ml.R %}
 
-## Model persistence
+### Accelerated Failure Time (AFT) Survival Regression Model
+
+[spark.survreg()](api/R/spark.survreg.html) fits an accelerated failure time (AFT) survival regression model on a SparkDataFrame.
+Note that the formula of [spark.survreg()](api/R/spark.survreg.html) does not support operator '.' currently.
+{% include_example survreg r/ml.R %}
+
+### Naive Bayes Model
 
-* [write.ml](api/R/write.ml.html) allows users to save a fitted model in a given input path
-* [read.ml](api/R/read.ml.html) allows users to read/load the model which was saved using write.ml in a given path
+[spark.naiveBayes()](api/R/spark.naiveBayes.html) fits a Bernoulli naive Bayes model against a SparkDataFrame. Only categorical data is supported.
+{% include_example naiveBayes r/ml.R %}
 
-Model persistence is supported for all Machine Learning algorithms for all families.
+### KMeans Model
 
-The examples below show how to build several models:
-* GLM using the Gaussian and Binomial model families
-* AFT survival regression model
-* Naive Bayes model
-* K-Means model
+[spark.kmeans()](api/R/spark.kmeans.html) fits a k-means clustering model against a Spark DataFrame, similarly to R's kmeans().
+{% include_example kmeans r/ml.R %}
+
+## Model persistence
 
-{% include_example r/ml.R %}
+The following example shows how to save/load a MLlib model by SparkR.
+{% include_example read_write r/ml.R %}
 
 # R Function Name Conflicts
 
```
examples/src/main/r/ml.R

Lines changed: 11 additions & 11 deletions
```diff
@@ -24,9 +24,8 @@ library(SparkR)
 # Initialize SparkSession
 sparkR.session(appName = "SparkR-ML-example")
 
-# $example on$
 ############################ spark.glm and glm ##############################################
-
+# $example on:glm$
 irisDF <- suppressWarnings(createDataFrame(iris))
 # Fit a generalized linear model of family "gaussian" with spark.glm
 gaussianDF <- irisDF
@@ -55,8 +54,9 @@ summary(binomialGLM)
 # Prediction
 binomialPredictions <- predict(binomialGLM, binomialTestDF)
 showDF(binomialPredictions)
-
+# $example off:glm$
 ############################ spark.survreg ##############################################
+# $example on:survreg$
 # Use the ovarian dataset available in R survival package
 library(survival)
 
@@ -72,9 +72,9 @@ summary(aftModel)
 # Prediction
 aftPredictions <- predict(aftModel, aftTestDF)
 showDF(aftPredictions)
-
+# $example off:survreg$
 ############################ spark.naiveBayes ##############################################
-
+# $example on:naiveBayes$
 # Fit a Bernoulli naive Bayes model with spark.naiveBayes
 titanic <- as.data.frame(Titanic)
 titanicDF <- createDataFrame(titanic[titanic$Freq > 0, -5])
@@ -88,9 +88,9 @@ summary(nbModel)
 # Prediction
 nbPredictions <- predict(nbModel, nbTestDF)
 showDF(nbPredictions)
-
+# $example off:naiveBayes$
 ############################ spark.kmeans ##############################################
-
+# $example on:kmeans$
 # Fit a k-means model with spark.kmeans
 irisDF <- suppressWarnings(createDataFrame(iris))
 kmeansDF <- irisDF
@@ -107,9 +107,9 @@ showDF(fitted(kmeansModel))
 # Prediction
 kmeansPredictions <- predict(kmeansModel, kmeansTestDF)
 showDF(kmeansPredictions)
-
+# $example off:kmeans$
 ############################ model read/write ##############################################
-
+# $example on:read_write$
 irisDF <- suppressWarnings(createDataFrame(iris))
 # Fit a generalized linear model of family "gaussian" with spark.glm
 gaussianDF <- irisDF
@@ -120,7 +120,7 @@ gaussianGLM <- spark.glm(gaussianDF, Sepal_Length ~ Sepal_Width + Species, famil
 modelPath <- tempfile(pattern = "ml", fileext = ".tmp")
 write.ml(gaussianGLM, modelPath)
 gaussianGLM2 <- read.ml(modelPath)
-# $example off$
+
 # Check model summary
 summary(gaussianGLM2)
 
@@ -129,7 +129,7 @@ gaussianPredictions <- predict(gaussianGLM2, gaussianTestDF)
 showDF(gaussianPredictions)
 
 unlink(modelPath)
-
+# $example off:read_write$
 ############################ fit models with spark.lapply #####################################
 
 # Perform distributed training of multiple models with spark.lapply
```
