This doc covers the GPU-related Scala API. Seven new classes are introduced:
- CrossValidator
- GpuDataset
- GpuDataReader
- XGBoostClassifier
- XGBoostClassificationModel
- XGBoostRegressor
- XGBoostRegressionModel
The full name is ml.dmlc.xgboost4j.scala.spark.rapids.CrossValidator. It extends Spark's CrossValidator.
- CrossValidator()
Note: Only GPU-related methods are listed below.
- fit(dataset: GpuDataset): Model[_]. This method triggers cross validation for hyperparameter tuning.
- dataset: a GpuDataset used for cross validation
- returns the best Model[_] for the given hyperparameters. The returned model is actually an XGBoostClassificationModel when the estimator is an XGBoostClassifier, or an XGBoostRegressionModel when the estimator is an XGBoostRegressor. Cast it to the matching model type before calling the GPU version of transform(dataset: GpuDataset).
- Note: For the CPU version, you can still call fit(dataset: Dataset[_])
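The cross-validation flow described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the data path, the column names, the parameter values, and the choice of evaluator are all assumptions, and the setters on CrossValidator are the ones inherited from Spark's CrossValidator.

```scala
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassificationModel, XGBoostClassifier}
import ml.dmlc.xgboost4j.scala.spark.rapids.{CrossValidator, GpuDataReader}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.ParamGridBuilder
import org.apache.spark.sql.SparkSession

object CrossValidationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gpu-cv-sketch").getOrCreate()

    // Hypothetical path and column names; replace with your own.
    val trainSet = new GpuDataReader(spark).parquet("/data/train.parquet")
    val featureNames = Seq("f0", "f1", "f2")

    // Illustrative parameter values only.
    val params = Map[String, Any]("tree_method" -> "gpu_hist", "num_round" -> 100)
    val classifier = new XGBoostClassifier(params)
      .setLabelCol("label")
      .setFeaturesCols(featureNames)

    // Grid of hyperparameters to search over.
    val paramGrid = new ParamGridBuilder()
      .addGrid(classifier.maxDepth, Array(3, 8))
      .addGrid(classifier.eta, Array(0.2, 0.6))
      .build()

    // Assumed evaluator; any Spark ML evaluator suited to the task could be used.
    val evaluator = new MulticlassClassificationEvaluator().setLabelCol("label")

    val cv = new CrossValidator()
      .setEstimator(classifier)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)

    // fit returns Model[_]; cast it to reach the GPU transform overload.
    val bestModel = cv.fit(trainSet).asInstanceOf[XGBoostClassificationModel]
    bestModel.transform(trainSet).show(5)

    spark.stop()
  }
}
```

For an XGBoostRegressor estimator, the cast target would be XGBoostRegressionModel instead.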
The full name is ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataset. A GpuDataset is an object that is produced by GpuDataReaders and consumed by XGBoostClassifiers and XGBoostRegressors. No constructors or methods are exposed for this class.
The full name is ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataReader. A GpuDataReader sets options and builds a GpuDataset from data sources. Data loading is lazy: it happens only when the data is actually processed later.
- GpuDataReader(sparkSession: SparkSession)
- sparkSession: spark session for data loading
- format(source: String): GpuDataReader. This method sets data format. Valid values include csv, parquet and orc.
- source: data format to set
- returns the data reader itself
- schema(schema: StructType): GpuDataReader. This method sets data schema.
- schema: data schema in StructType format
- returns the data reader itself
- schema(schemaString: String): GpuDataReader. This method sets data schema.
- schemaString: data schema in DDL-formatted String, e.g., a INT, b STRING, c DOUBLE
- returns the data reader itself
- option(key: String, value: String): GpuDataReader. This method sets an option.
- key: the option key
- value: the option value in string format
- returns the data reader itself
- option(key: String, value: Boolean): GpuDataReader. This method sets an option.
- key: the option key
- value: the Boolean option value
- returns the data reader itself
- option(key: String, value: Long): GpuDataReader. This method sets an option.
- key: the option key
- value: the Long option value
- returns the data reader itself
- option(key: String, value: Double): GpuDataReader. This method sets an option.
- key: the option key
- value: the Double option value
- returns the data reader itself
- options(options: scala.collection.Map[String, String]): GpuDataReader. This method sets options.
- options: the options Map to set
- returns the data reader itself
- options(options: java.util.Map[String, String]): GpuDataReader. This method sets options. It is designed for Java compatibility.
- options: the options Map to set
- returns the data reader itself
- load(): GpuDataset. This method builds a GpuDataset.
- returns a GpuDataset as the result
- load(path: String): GpuDataset. This method builds a GpuDataset.
- path: the data source path
- returns a GpuDataset as the result
- load(paths: String*): GpuDataset. This method builds a GpuDataset.
- paths: the data source paths
- returns a GpuDataset as the result
- csv(path: String): GpuDataset. This method builds a GpuDataset.
- path: the CSV data path
- returns a GpuDataset as the result
- csv(paths: String*): GpuDataset. This method builds a GpuDataset.
- paths: the CSV data paths
- returns a GpuDataset as the result
- parquet(path: String): GpuDataset. This method builds a GpuDataset.
- path: the Parquet data path
- returns a GpuDataset as the result
- parquet(paths: String*): GpuDataset. This method builds a GpuDataset.
- paths: the Parquet data paths
- returns a GpuDataset as the result
- orc(path: String): GpuDataset. This method builds a GpuDataset.
- path: the ORC data path
- returns a GpuDataset as the result
- orc(paths: String*): GpuDataset. This method builds a GpuDataset.
- paths: the ORC data paths
- returns a GpuDataset as the result
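As a rough sketch of how the loading methods above fit together, the snippet below builds GpuDatasets through the generic format/load pair and through the format-specific shorthands. All paths are hypothetical, and treating the two forms as interchangeable is an assumption made by analogy with Spark's DataFrameReader.

```scala
import ml.dmlc.xgboost4j.scala.spark.rapids.{GpuDataReader, GpuDataset}
import org.apache.spark.sql.SparkSession

object LoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gpu-load-sketch").getOrCreate()

    // Generic form: set the format first, then load a single path (hypothetical).
    val viaLoad: GpuDataset =
      new GpuDataReader(spark).format("parquet").load("/data/train.parquet")

    // Shorthand form with several paths passed as varargs (hypothetical paths).
    val viaShorthand: GpuDataset =
      new GpuDataReader(spark).parquet("/data/part-0.parquet", "/data/part-1.parquet")

    // ORC works the same way through its own shorthand.
    val orcSet: GpuDataset = new GpuDataReader(spark).orc("/data/train.orc")

    // Nothing has been read yet: loading is lazy and happens when the
    // datasets are consumed, for example by fit or transform.
    spark.stop()
  }
}
```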
- Common options
- asFloats: A Boolean flag indicating whether to cast all numeric values to floats. Default is true.
- maxRowsPerChunk: An Int specifying the maximum number of rows per chunk. Default is Int.MaxValue.
- Options for CSV
- comment: A single character; lines beginning with this character are skipped. Default is the empty string, which disables comment skipping.
- header: A Boolean flag indicating whether the first line should be used as the column names. Default is false.
- nullValue: The string representation of a null value. Default is the empty string.
- quote: A single character used for escaping quoted values where the separator can be part of the value. Default is the double quote (").
- sep: A single character used as the separator between adjacent values. Default is the comma (,).
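The options listed above might be combined as in the following sketch, which reads a CSV file with a DDL schema string. The path, the column names, and the chosen option values are assumptions for illustration. The maxRowsPerChunk value is passed through the Long overload of option, since no Int overload is listed.

```scala
import ml.dmlc.xgboost4j.scala.spark.rapids.{GpuDataReader, GpuDataset}
import org.apache.spark.sql.SparkSession

object CsvOptionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gpu-csv-sketch").getOrCreate()

    // Each setter returns the reader itself, so the calls can be chained.
    val reader = new GpuDataReader(spark)
      .schema("label DOUBLE, f0 DOUBLE, f1 DOUBLE") // hypothetical columns
      .option("header", true)             // first line holds the column names
      .option("sep", ",")                 // the default, shown for clarity
      .option("asFloats", true)           // cast numeric values to floats
      .option("maxRowsPerChunk", 100000L) // illustrative chunk size

    // Hypothetical path; loading is deferred until the dataset is used.
    val trainSet: GpuDataset = reader.csv("/data/train.csv")

    // The same String-valued options could be passed in one call instead.
    val viaMap: GpuDataset = new GpuDataReader(spark)
      .schema("label DOUBLE, f0 DOUBLE, f1 DOUBLE")
      .options(Map("header" -> "true", "nullValue" -> "NA"))
      .csv("/data/train.csv")

    spark.stop()
  }
}
```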
The full name is ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier. It extends ProbabilisticClassifier[Vector, XGBoostClassifier, XGBoostClassificationModel].
- XGBoostClassifier(xgboostParams: Map[String, Any])
- all standard xgboost parameters are supported
- eval_sets: Map[String, GpuDataset]. This parameter sets the eval sets for training. (For CPU training, the type of parameter eval_sets is Map[String, DataFrame])
Note: Only GPU-related methods are listed below.
- setFeaturesCols(value: Seq[String]): XGBoostClassifier. This method sets the feature columns for training.
- value: a sequence of feature column names to set
- returns the classifier itself
- setEvalSets(evalSets: Map[String, GpuDataset]): XGBoostClassifier. This method sets eval sets for training.
- evalSets: eval sets for training (For CPU training, the type is Map[String, DataFrame])
- returns the classifier itself
- fit(dataset: GpuDataset): XGBoostClassificationModel. This method triggers the training.
- dataset: a GpuDataset to train
- returns the training result as an XGBoostClassificationModel
- Note: For CPU training, you can still call fit(dataset: Dataset[_])
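A minimal training sketch using the methods above is shown here. The paths, column names, eval set name, and parameter values are assumptions. setLabelCol is the setter inherited from Spark's Predictor.

```scala
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassificationModel, XGBoostClassifier}
import ml.dmlc.xgboost4j.scala.spark.rapids.{GpuDataReader, GpuDataset}
import org.apache.spark.sql.SparkSession

object ClassifierTrainSketch {
  def train(spark: SparkSession): XGBoostClassificationModel = {
    // Hypothetical training and evaluation data.
    val trainSet: GpuDataset = new GpuDataReader(spark).parquet("/data/train.parquet")
    val evalSet: GpuDataset = new GpuDataReader(spark).parquet("/data/eval.parquet")

    // Illustrative parameter values only.
    val params = Map[String, Any](
      "tree_method" -> "gpu_hist",
      "objective" -> "binary:logistic",
      "num_round" -> 100
    )

    val classifier = new XGBoostClassifier(params)
      .setLabelCol("label")                 // hypothetical label column
      .setFeaturesCols(Seq("f0", "f1"))     // hypothetical feature columns
      .setEvalSets(Map("eval" -> evalSet))  // GPU eval sets are GpuDatasets

    // The GpuDataset overload of fit triggers GPU training.
    classifier.fit(trainSet)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gpu-train-sketch").getOrCreate()
    val model = train(spark)
    println(s"Trained model uid: ${model.uid}")
    spark.stop()
  }
}
```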
The full name is ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel. It extends ProbabilisticClassificationModel[Vector, XGBoostClassificationModel].
Note: Only GPU-related methods are listed below.
- transform(dataset: GpuDataset): DataFrame. This method predicts results based on the model.
- dataset: a GpuDataset to predict on
- returns a DataFrame with the prediction
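Prediction with a trained model could look like the sketch below. The test path is hypothetical, and the "prediction" column name is assumed from Spark ML's default prediction column rather than taken from this doc.

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel
import ml.dmlc.xgboost4j.scala.spark.rapids.{GpuDataReader, GpuDataset}
import org.apache.spark.sql.{DataFrame, SparkSession}

object ClassifierPredictSketch {
  // Runs GPU prediction for an already trained model.
  def predict(spark: SparkSession, model: XGBoostClassificationModel): DataFrame = {
    // Hypothetical test data with the same feature columns used for training.
    val testSet: GpuDataset = new GpuDataReader(spark).parquet("/data/test.parquet")

    // The GpuDataset overload of transform returns an ordinary DataFrame.
    val predictions = model.transform(testSet)

    // Inspect the output columns before relying on any of them.
    predictions.printSchema()
    predictions.select("prediction").show(10) // assumed default column name
    predictions
  }
}
```

Because the result is a plain DataFrame, it can be passed on to standard Spark operations and evaluators.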
The full name is ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor. It extends Predictor[Vector, XGBoostRegressor, XGBoostRegressionModel].
- XGBoostRegressor(xgboostParams: Map[String, Any])
- all standard xgboost parameters are supported
- eval_sets: Map[String, GpuDataset]. This parameter sets the eval sets for training. (For CPU training, the type of parameter eval_sets is Map[String, DataFrame])
Note: Only GPU-related methods are listed below.
- setFeaturesCols(value: Seq[String]): XGBoostRegressor. This method sets the feature columns for training.
- value: a sequence of feature column names to set
- returns the regressor itself
- setEvalSets(evalSets: Map[String, GpuDataset]): XGBoostRegressor. This method sets eval sets for training.
- evalSets: eval sets for training (For CPU training, the type is Map[String, DataFrame])
- returns the regressor itself
- fit(dataset: GpuDataset): XGBoostRegressionModel. This method triggers the training.
- dataset: a GpuDataset to train
- returns the training result as an XGBoostRegressionModel
- Note: For CPU training, you can still call fit(dataset: Dataset[_])
The full name is ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel. It extends PredictionModel[Vector, XGBoostRegressionModel].
Note: Only GPU-related methods are listed below.
- transform(dataset: GpuDataset): DataFrame. This method predicts results based on the model.
- dataset: a GpuDataset to predict on
- returns a DataFrame with the prediction
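The regression classes follow the same pattern as the classification ones. The sketch below trains an XGBoostRegressor and scores the predictions with Spark's RegressionEvaluator. The paths, column names, and parameter values are assumptions, and the evaluation step further assumes that the output DataFrame keeps the label column alongside a "prediction" column.

```scala
import ml.dmlc.xgboost4j.scala.spark.{XGBoostRegressionModel, XGBoostRegressor}
import ml.dmlc.xgboost4j.scala.spark.rapids.{GpuDataReader, GpuDataset}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.sql.SparkSession

object RegressorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gpu-regressor-sketch").getOrCreate()

    // Hypothetical training and test data.
    val trainSet: GpuDataset = new GpuDataReader(spark).parquet("/data/train.parquet")
    val testSet: GpuDataset = new GpuDataReader(spark).parquet("/data/test.parquet")

    // Illustrative parameter values only.
    val params = Map[String, Any]("tree_method" -> "gpu_hist", "num_round" -> 100)

    val regressor = new XGBoostRegressor(params)
      .setLabelCol("label")             // hypothetical label column
      .setFeaturesCols(Seq("f0", "f1")) // hypothetical feature columns

    // GPU training, then GPU prediction.
    val model: XGBoostRegressionModel = regressor.fit(trainSet)
    val predictions = model.transform(testSet)

    // Assumes "label" and "prediction" are both present in the output.
    val rmse = new RegressionEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("rmse")
      .evaluate(predictions)
    println(s"RMSE on test set: $rmse")

    spark.stop()
  }
}
```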