add new doc #4

Closed
wants to merge 26 commits into from

CodingCat
Owner

No description provided.

@CodingCat
Owner Author

@yanboliang I screwed up the other PR. Please continue your work on this one and help review my parts.

@CodingCat
Owner Author

@yanboliang it's ready for further review

@yanboliang yanboliang left a comment

Looks very good overall, left some minor comments.

```xml
<repository>
  <id>GitHub Repo</id>
  <name>GitHub Repo</name>

Would "XGBoost4J-Spark Snapshot Repo" or "XGBoost4J-Spark GitHub Repo" be a better name, in case users also depend on other GitHub repos?


To make the Iris dataset recognizable to XGBoost, we need to

1. Transform String-typed label, i.e. "class", to Integer-typed label.

Actually, the output of StringIndexer is Double-typed, so that it works for both classification and regression.
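
For reference, a minimal sketch of how this could be checked, assuming Spark MLlib is on the classpath and a local SparkSession is acceptable. The rows and the object name are made up for illustration; only the `class` and `classIndex` column names come from the doc under review.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DoubleType

// Hypothetical object name, used only for this illustration.
object StringIndexerOutputType {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("string-indexer-output-type")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows standing in for the Iris data.
    val rawInput = Seq(
      (5.1, "Iris-setosa"),
      (7.0, "Iris-versicolor"),
      (6.3, "Iris-virginica")
    ).toDF("sepal length", "class")

    // Fit the indexer on the String-typed label, then apply it.
    val indexed = new StringIndexer()
      .setInputCol("class")
      .setOutputCol("classIndex")
      .fit(rawInput)
      .transform(rawInput)

    // The indexed label column is DoubleType, not IntegerType.
    assert(indexed.schema("classIndex").dataType == DoubleType)
    indexed.printSchema()

    spark.stop()
  }
}
```

Because the schema check is an assertion, the program fails fast if the output column were ever not Double-typed.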


2. Assemble the feature columns as a vector to build XGBoost's internal data representation, DMatrix.

Here I think the goal of assembling the features into a vector is to make the data fit into the ML pipeline, which is the same for all MLlib algorithms. I think we can hide details such as DMatrix, since XGBoost4J-Spark users never use DMatrix explicitly.
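
As a rough sketch of the pipeline-facing step, assuming Spark MLlib on the classpath: the single row and the object name are made up for illustration, while the column names follow the doc under review. DMatrix never appears in user code here.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this illustration.
object AssembleFeatures {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("assemble-features")
      .getOrCreate()
    import spark.implicits._

    // One made-up row with the four Iris feature columns and an indexed label.
    val input = Seq((5.1, 3.5, 1.4, 0.2, 0.0)).toDF(
      "sepal length", "sepal width", "petal length", "petal width", "classIndex")

    // Combine the four numeric columns into a single vector column.
    val assembled = new VectorAssembler()
      .setInputCols(Array("sepal length", "sepal width", "petal length", "petal width"))
      .setOutputCol("features")
      .transform(input)
      .select("features", "classIndex")

    // Only the two columns the estimator needs remain.
    assert(assembled.columns.sameElements(Array("features", "classIndex")))
    assembled.show(truncate = false)

    spark.stop()
  }
}
```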


To convert String-typed label to Integer, we can use Spark's built-in feature transformer StringIndexer.

Integer -> Double

With a newly created StringIndexer instance:

1. we set input column, i.e. the column containing String-typed label
2. we set output column, i.e. the column to contain the Integer-typed label.

Ditto

3. Then we `fit` the StringIndexer with our input DataFrame, 'rawInput', so that Spark internals can collect information such as the total number of distinct values.

Now we have a StringIndexer that is ready to be applied to our input DataFrame. To execute the transformation logic of StringIndexer, we `transform` the input DataFrame, 'rawInput', and, to keep the DataFrame concise,
we drop the column `class` and keep only the feature columns and the transformed Integer-typed label column (in the last line of the above code snippet).

Ditto

```

Now, we have a DataFrame containing only two columns, "features" which contains vector-represented
"sepal length", "sepal width", "petal length" and "petal width" and "classIndex" which has Integer-typed

Ditto

val xgbClassifier = new XGBoostClassifier().
  setFeaturesCol("features").
  setLabelCol("classIndex")
xgbClassifier.setMaxDeltaStep(2)

Do we use maxDepth as the example in this context?
