
[WIP][PySpark] Add XGBoost PySpark API support #7709

Closed
wants to merge 21 commits

Conversation

wbo4958
Contributor

@wbo4958 wbo4958 commented Feb 28, 2022

This PR is to close #7578.

The XGBoost PySpark API is a wrapper of xgboost4j-spark and xgboost4j-spark-gpu. It will be packaged into the existing xgboost python-package.

@wbo4958 wbo4958 changed the title [PySpark] Add XGBoost PySpark API support [WIP][PySpark] Add XGBoost PySpark API support Feb 28, 2022
@trivialfis trivialfis marked this pull request as draft February 28, 2022 13:00
@candalfigomoro

@wbo4958 Is there a way to set the Scala tracker instead of the Python tracker?

In Java/Scala I set:

xgbClassifier.set("trackerConf", new TrackerConf(0, "scala"));

@wbo4958
Contributor Author

wbo4958 commented Mar 8, 2022

> @wbo4958 Is there a way to set the Scala tracker instead of the Python tracker?
>
> In Java/Scala I set:
>
> xgbClassifier.set("trackerConf", new TrackerConf(0, "scala"));

Yeah, looks like we need to add support for it in Scala.

@candalfigomoro

candalfigomoro commented Mar 8, 2022

This is a typical snippet I can have in Java:

        XGBoostClassifier xgbClassifier = new XGBoostClassifier();
        xgbClassifier.set("trackerConf", new TrackerConf(0, "scala"));
        xgbClassifier.setNumWorkers(8);
        xgbClassifier.setNthread(1);
        xgbClassifier.setEta(0.1);
        xgbClassifier.setMaxDepth(6);
        xgbClassifier.setNumRound(1000);
        xgbClassifier.setNumEarlyStoppingRounds(10);
        xgbClassifier.setObjective("binary:logistic");
        xgbClassifier.setScalePosWeight(1.0);
        xgbClassifier.setSubsample(0.5);
        xgbClassifier.setColsampleBytree(0.5);
        xgbClassifier.setTreeMethod("hist");
        xgbClassifier.setFeaturesCol("features");
        xgbClassifier.setLabelCol("label");
        xgbClassifier.setEvalMetric("auc");
        xgbClassifier.setMaximizeEvaluationMetrics(true);  // true for "auc"
        xgbClassifier.setEvalSets(new Map.Map1<>("validation", validation));

It looks like many parameters are missing (is there a way to set eval sets for early stopping?)
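As a side note on the parameter gap: a Python port of these Java-style setters would typically expose them as snake_case Params. A minimal, purely illustrative sketch of that name mapping follows — none of these names are claimed to be the merged API:

```python
import re

def setter_to_param(setter_name: str) -> str:
    """Convert a Java-style setter name like 'setNumWorkers' into a
    snake_case parameter name like 'num_workers'.  Purely illustrative;
    not part of the actual xgboost package."""
    assert setter_name.startswith("set")
    camel = setter_name[3:]  # 'setNumWorkers' -> 'NumWorkers'
    # Insert an underscore before each interior capital, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", camel).lower()

# The setters from the Java snippet above, as hypothetical Python params:
params = {setter_to_param(s) for s in [
    "setNumWorkers", "setNthread", "setEta", "setMaxDepth",
    "setNumRound", "setNumEarlyStoppingRounds", "setObjective",
    "setScalePosWeight", "setSubsample", "setColsampleBytree",
    "setTreeMethod", "setEvalMetric",
]}
```

A systematic mapping like this would make it easy to audit which Java/Scala parameters are still missing from the Python side.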

@wbo4958
Contributor Author

wbo4958 commented Mar 8, 2022

@candalfigomoro Yeah, I haven't ported all the Scala params yet; for now I'm focusing on adding integration tests. We can add more params in follow-up PRs.

Member

@trivialfis trivialfis left a comment


Thank you for revising the original PR! This looks much more aligned with Python and Pyspark ml.

Could you please write a tutorial in doc/tutorials/ for newbies like me to get started?

from pyspark import keyword_only
from pyspark.ml.common import inherit_doc

from xgboost.ml.dmlc.param import _XGBoostClassifierBase, _XGBoostClassificationModelBase
Member


I think it's more appropriate to use xgboost.pyspark.PySparkXGBClassifier instead of this java style module.


Member


That's not the pyspark convention. It's not pyspark.ml.apache.org.xxxx.

_spark = _spark__init()


def get_spark_i_know_what_i_am_doing():
Member


So, what's the specific case where we should use this instead of a CPU session? Could you please document it?

Contributor Author


The API was copied from spark-rapids, and I've now changed it according to the needs of XGBoost.

Member


@wbo4958 Okay ... but I'm not sure how that's relevant to the question?

from xgboost.ml.dmlc.param.internal import _XGBoostClassifierBase, _XGBoostClassificationModelBase, \
_XGBoostRegressionModelBase, _XGBoostRegressorBase

__all__ = ['_XGBoostClassifierBase', '_XGBoostClassificationModelBase',
Member


Is this necessary?

pass


@inherit_doc
Member


I'm not entirely sure what the result of combining this decorator with a custom docstring is. Have you checked?
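For context on what the decorator does: pyspark's `inherit_doc` fills in missing docstrings on public members from the parent classes, and leaves members (and the decorated class itself) that already carry a docstring alone. A rough pure-Python sketch of that behavior — not pyspark's actual implementation:

```python
def inherit_doc_sketch(cls):
    """Rough stand-in for pyspark's inherit_doc decorator: copy
    docstrings from parent classes onto public callables of `cls`
    that don't define one."""
    for name, member in vars(cls).items():
        if name.startswith("_") or not callable(member):
            continue
        if member.__doc__ is None:
            for parent in cls.__mro__[1:]:
                parent_member = getattr(parent, name, None)
                if parent_member is not None and parent_member.__doc__:
                    member.__doc__ = parent_member.__doc__
                    break
    return cls


class Base:
    def fit(self):
        """Fit the model."""


@inherit_doc_sketch
class Child(Base):
    """My custom class docstring."""

    def fit(self):  # no docstring here: inherited from Base
        pass
```

Under this reading, a custom class docstring on the decorated class survives untouched; only undocumented public methods pick up their parents' docs.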

"""
Java Regressor for regression tasks.

.. versionadded:: 3.0.0
Member


Please double-check copied code.

@trivialfis trivialfis closed this Aug 9, 2022
@wbo4958 wbo4958 deleted the xgb.pyspark branch April 23, 2024 09:28
Development

Successfully merging this pull request may close these issues.

[FEA] Make XGBoost4j-Spark to support PySpark