add new doc #4

Closed

CodingCat wants to merge 26 commits into master from xgboost_spark_doc_new

Owner

CodingCat commented Jul 26, 2018

No description provided.

CodingCat and others added 9 commits

July 25, 2018 21:55


          add back train method but mark as deprecated

abccb2f


          fix scalastyle error

34346c9


          add back train method but mark as deprecated

9a77997


          fix scalastyle error

45f9dba


          add new

f3e4eb4


          update doc

ff84cf2


          finish Gang Scheduling

395b2a9


          more

8c713e5


          intro

dbd5d9f

Owner Author

CodingCat commented Jul 26, 2018

@yanboliang I screwed up the other PR, please continue your work on this one and help review my parts

yanboliang and others added 6 commits

July 27, 2018 22:12


          Add sections: Prediction, Model persistence and ML pipeline.

fdec071


          Add XGBoost4j-Spark MLlib pipeline example

0e3e71d


          Merge pull request #5 from yanboliang/xgboost_spark_doc_new

28ef086

XGBoost4j-Spark new doc


          partial finished version

3a849d9


          finish the doc


          adjust code

c68f8a1

Owner Author

CodingCat commented Jul 29, 2018

@yanboliang it's ready for further review

CodingCat mentioned this pull request

[jvm-packages] PySpark Support Checklist dmlc/xgboost#3370

Closed

5 tasks

yanboliang reviewed


yanboliang left a comment

Looks very good overall, left some minor comments.

jvm-packages/xgboost4j-spark/docs/index.md Outdated

+              ```xml
+              <repository>
+                <id>GitHub Repo</id>
+                <name>GitHub Repo</name>

yanboliang Jul 30, 2018

Would "XGBoost4J-Spark Snapshot Repo" or "XGBoost4J-Spark GitHub Repo" be a better name, in case users depend on other GitHub repos?

jvm-packages/xgboost4j-spark/docs/index.md Outdated


		To make Iris dataset be recognizable to XGBoost, we need to

		1. Transform String-typed label, i.e. "class", to Integer-typed label.

yanboliang Jul 30, 2018

Actually, the output of StringIndexer is Double-typed, so that it works for both classification and regression.
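
To illustrate the point above, here is a minimal sketch (not part of the PR) that indexes a String-typed `class` column and checks the type of the resulting column. The local SparkSession, the toy rows, and the output column name `classIndex` are assumptions chosen for illustration only.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DoubleType

// Assumption: a local SparkSession, purely for demonstration.
val spark = SparkSession.builder().master("local[1]").appName("string-indexer-sketch").getOrCreate()
import spark.implicits._

// Toy stand-in for the Iris labels; not the real dataset.
val rawInput = Seq("Iris-setosa", "Iris-versicolor", "Iris-setosa").toDF("class")

// Fit the indexer on the label column, then apply it to the same DataFrame.
val indexed = new StringIndexer().
  setInputCol("class").
  setOutputCol("classIndex").
  fit(rawInput).
  transform(rawInput)

// The indexed label column is DoubleType, not IntegerType.
assert(indexed.schema("classIndex").dataType == DoubleType)
```

The final assertion is the part that matters for the documentation wording: the indexed label should be described as Double-typed.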

jvm-packages/xgboost4j-spark/docs/index.md Outdated


		1. Transform String-typed label, i.e. "class", to Integer-typed label.

		2. Assemble the feature columns as a vector to build XGBoost's internal data representation, DMatrix.

yanboliang Jul 30, 2018

I think the goal of assembling features into a vector here is to make the data fit into the ML pipeline, which is the same for all MLlib algorithms. We can hide details such as DMatrix, since XGBoost4J-Spark users never use DMatrix explicitly.
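
As a rough sketch of the pipeline-oriented view described above (not taken from the PR), the feature columns can be assembled into a single vector column with Spark's `VectorAssembler`, with no reference to DMatrix. The local SparkSession and the two toy rows are assumptions; the column names follow the ones quoted in the tutorial text.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

// Assumption: a local SparkSession, purely for demonstration.
val spark = SparkSession.builder().master("local[1]").appName("vector-assembler-sketch").getOrCreate()
import spark.implicits._

// Two made-up rows with four feature columns and an already indexed label.
val indexed = Seq(
  (5.1, 3.5, 1.4, 0.2, 0.0),
  (7.0, 3.2, 4.7, 1.4, 1.0)
).toDF("sepal length", "sepal width", "petal length", "petal width", "classIndex")

// Combine the four feature columns into one vector column named "features",
// the input shape that MLlib-style estimators expect.
val assembled = new VectorAssembler().
  setInputCols(Array("sepal length", "sepal width", "petal length", "petal width")).
  setOutputCol("features").
  transform(indexed).
  select("features", "classIndex")

// Each row now carries a 4-dimensional feature vector.
val firstFeatures = assembled.head().getAs[Vector]("features")
assert(firstFeatures.size == 4)
```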

jvm-packages/xgboost4j-spark/docs/index.md Outdated


		2. Assemble the feature columns as a vector to build XGBoost's internal data representation, DMatrix.

		To convert String-typed label to Integer, we can use Spark's built-in feature transformer StringIndexer.

yanboliang Jul 30, 2018

Integer -> Double

jvm-packages/xgboost4j-spark/docs/index.md Outdated

+              With a newly created StringIndexer instance:
+. we set input column, i.e. the column containing String-typed label
+. we set output column, i.e. the column to contain the Integer-typed label.

yanboliang Jul 30, 2018

Ditto

jvm-packages/xgboost4j-spark/docs/index.md Outdated

+. Then we `fit` StringIndex with our input DataFrame, 'rawInput', so that Spark internals can get information like total number of distinct values, etc.
+              Now we have a StringIndexer which is ready to be applied to our input DataFrame. To execute the transformation logic of StringIndexer, we `transform` the input DataFrame, 'rawInput' and to keep a concise DataFrame,
+              we drop the column `class` and only keeps the feature columns and the transformed Integer-typed label column (in the last line of the above code snippet).

yanboliang Jul 30, 2018

Ditto

jvm-packages/xgboost4j-spark/docs/index.md Outdated

+              ```
+              Now, we have a DataFrame containing only two columns, "features" which contains vector-represented
+              "sepal length", "sepal width", "petal length" and "petal width" and "classIndex" which has Integer-typed

yanboliang Jul 30, 2018

Ditto

jvm-packages/xgboost4j-spark/docs/index.md Outdated

+                   val xgbClassifier = new XGBoostClassifier().
+                     setFeaturesCol("features").
+                     setLabelCol("classIndex")
+                   xgbClassifier.setMaxDeltaStep(2)

yanboliang Jul 30, 2018

Should we use maxDepth as the example in this context?
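
For reference, a sketch of how the snippet under review might look if it used `maxDepth` instead of `maxDeltaStep`. This is an illustration rather than the change actually made in the PR; the value `2` is carried over from the original snippet, and the code assumes the xgboost4j-spark artifact is on the classpath.

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Same estimator setup as in the reviewed snippet.
val xgbClassifier = new XGBoostClassifier().
  setFeaturesCol("features").
  setLabelCol("classIndex")

// Set the maximum tree depth through its setter rather than maxDeltaStep.
xgbClassifier.setMaxDepth(2)
```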

Nan Zhu and others added 11 commits

July 30, 2018 14:42


          fix the doc

0a686d7


          use rst

2276a7e


          Convert XGBoost4J-Spark tutorial to reST

9bce7e7


          Bring XGBoost4J up to date

341a470


          add note about using hdfs

c217cf8


          remove duplicate file

17c65aa


          fix descriptions

e469c35


          update doc

69ed50f


          Wrap HDFS/S3 export support as a note

164ac48


          update


          wrap indexing_mode example in code block

9ee52f2

CodingCat force-pushed the master branch from 45f9dba to 3ae733a Compare

August 21, 2018 03:37

CodingCat force-pushed the master branch from f8ed9e8 to 57740ae Compare

October 5, 2018 05:45

CodingCat force-pushed the master branch from 4a4d6b2 to 755af6c Compare

November 15, 2018 05:07

CodingCat force-pushed the master branch from cb3cd2b to 4fe9b92 Compare

November 30, 2018 06:21

CodingCat closed this
