add new doc #4
@yanboliang I screwed up the other PR; please continue your work on this one and help review my parts.
@yanboliang it's ready for further review.
Looks very good overall, left some minor comments.
```xml
<repository>
<id>GitHub Repo</id>
<name>GitHub Repo</name>
Would `XGBoost4J-Spark Snapshot Repo` or `XGBoost4J-Spark GitHub Repo` be a better name, in case users depend on other GitHub repos?
To make Iris dataset be recognizable to XGBoost, we need to

1. Transform String-typed label, i.e. "class", to Integer-typed label.
Actually, the output of `StringIndexer` is of Double type, so that it suits both classification and regression.
2. Assemble the feature columns as a vector to build XGBoost's internal data representation, DMatrix.
I think the goal of assembling the features into a vector is to make the data fit into the ML pipeline, which is the same for all MLlib algorithms. I think we can hide details such as `DMatrix`, since XGBoost4J-Spark users never use `DMatrix` explicitly.
To convert String-typed label to Integer, we can use Spark's built-in feature transformer StringIndexer.
Integer -> Double
With a newly created StringIndexer instance:

1. we set input column, i.e. the column containing String-typed label
2. we set output column, i.e. the column to contain the Integer-typed label.
Ditto: Integer -> Double.
3. Then we `fit` StringIndex with our input DataFrame, 'rawInput', so that Spark internals can get information like total number of distinct values, etc.

Now we have a StringIndexer which is ready to be applied to our input DataFrame. To execute the transformation logic of StringIndexer, we `transform` the input DataFrame, 'rawInput' and to keep a concise DataFrame,
we drop the column `class` and only keeps the feature columns and the transformed Integer-typed label column (in the last line of the above code snippet).
Ditto: Integer -> Double.
```

Now, we have a DataFrame containing only two columns, "features" which contains vector-represented
"sepal length", "sepal width", "petal length" and "petal width" and "classIndex" which has Integer-typed
Ditto: Integer -> Double.
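For reference on the Integer -> Double comments in this review, here is a minimal plain-Scala sketch of frequency-ordered label indexing that yields Double indices. It has no Spark dependency; `LabelIndexSketch` and `indexLabels` are hypothetical names for illustration only, not part of Spark or XGBoost4J-Spark, and the alphabetical tie-breaking is an assumption about the default ordering rather than something confirmed in this PR.

```scala
// Plain-Scala sketch of frequency-ordered label indexing (no Spark dependency).
// `LabelIndexSketch` and `indexLabels` are hypothetical names, not a library API.
object LabelIndexSketch {
  // Map each distinct label to a Double index: the most frequent label gets 0.0.
  // Assumption: ties in frequency are broken alphabetically.
  def indexLabels(labels: Seq[String]): Map[String, Double] =
    labels
      .groupBy(identity)                                   // label -> all occurrences
      .toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }   // descending count, then name
      .zipWithIndex
      .map { case ((label, _), idx) => label -> idx.toDouble }
      .toMap
}
```

The point of the sketch is only that the index values are Doubles (0.0, 1.0, 2.0), so the doc text should describe the indexed label column as Double-typed rather than Integer-typed.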
val xgbClassifier = new XGBoostClassifier().
setFeaturesCol("features").
setLabelCol("classIndex")
xgbClassifier.setMaxDeltaStep(2)
Should we use `maxDepth` as the example in this context?