PySpark XGBoost integration #8020

Merged
merged 74 commits into from
Jul 13, 2022
Commits
0a15487
init
WeichenXu123 Jun 21, 2022
a04a1d0
fix
WeichenXu123 Jun 26, 2022
d2dbb8d
clean
WeichenXu123 Jun 26, 2022
34827ef
remove external mode
WeichenXu123 Jun 26, 2022
50ebb1f
update doc style
WeichenXu123 Jun 26, 2022
f386ee1
black
WeichenXu123 Jun 26, 2022
5d4122c
update
WeichenXu123 Jun 26, 2022
0424c19
refactor
WeichenXu123 Jul 1, 2022
f8f33bd
update params code
WeichenXu123 Jul 1, 2022
ba4787f
pyspark param alias
WeichenXu123 Jul 1, 2022
9447486
fix
WeichenXu123 Jul 1, 2022
016b1b7
add gpu param check test
WeichenXu123 Jul 1, 2022
72af029
update _repartition_needed
WeichenXu123 Jul 1, 2022
55fa052
merge fit/fit_distributed
WeichenXu123 Jul 1, 2022
7d9c37d
fix base margin support
WeichenXu123 Jul 2, 2022
14392a4
remove dump file code
WeichenXu123 Jul 2, 2022
5bee831
fix verbose param
WeichenXu123 Jul 2, 2022
a75ee88
update _unsupported_xgb_params
WeichenXu123 Jul 2, 2022
d71e7e0
set nthread to be spark.task.cpus
WeichenXu123 Jul 3, 2022
75cfe91
support feature_types and feature_names
WeichenXu123 Jul 3, 2022
7f68346
update _repartition_needed
WeichenXu123 Jul 3, 2022
dfffe8e
support use array as features column
WeichenXu123 Jul 3, 2022
9a87909
gpu mode support oss spark
WeichenXu123 Jul 3, 2022
2aeaee8
update comment
WeichenXu123 Jul 3, 2022
10bf6b2
avoid call pd.Series.to_list
WeichenXu123 Jul 3, 2022
53b1a5b
avoid data concatenation in predict_udf
WeichenXu123 Jul 3, 2022
c907164
update comment
WeichenXu123 Jul 3, 2022
3a92fac
fix
WeichenXu123 Jul 5, 2022
9e6ad55
fix tests
WeichenXu123 Jul 5, 2022
d24512d
forbid camel case param in setParams
WeichenXu123 Jul 5, 2022
b3fa185
rename 2 camel case params
WeichenXu123 Jul 5, 2022
8cee5cb
address comments
WeichenXu123 Jul 5, 2022
18956d1
update doc
WeichenXu123 Jul 5, 2022
d4f048d
remove feature_types pyspark param
WeichenXu123 Jul 5, 2022
9e66d2f
fix-test
WeichenXu123 Jul 5, 2022
c60ccad
update-doc
WeichenXu123 Jul 5, 2022
4f247ab
fix tests
WeichenXu123 Jul 6, 2022
40afa4f
fix lint, refactor
WeichenXu123 Jul 6, 2022
508a36b
fix test
WeichenXu123 Jul 6, 2022
39e2b45
fix test
WeichenXu123 Jul 6, 2022
e657a21
support feature weights
WeichenXu123 Jul 6, 2022
60e2561
fix-ci
WeichenXu123 Jul 6, 2022
70b2da2
update-ci-conda-env
WeichenXu123 Jul 6, 2022
b40bc14
add gpu test
WeichenXu123 Jul 6, 2022
6274aa0
clean test
WeichenXu123 Jul 6, 2022
3207256
ignore mypy
WeichenXu123 Jul 6, 2022
76c1f8d
ignore mypy
WeichenXu123 Jul 6, 2022
035ac68
update doc
WeichenXu123 Jul 6, 2022
a2ead7e
handle missing param
WeichenXu123 Jul 6, 2022
1472432
update doc
WeichenXu123 Jul 6, 2022
7b5afa1
[CI] Install PySpark in Python env
hcho3 Jul 7, 2022
f8aea44
fix spark-xgb-model params
WeichenXu123 Jul 7, 2022
c838437
add spark config for printing full error stack and avoid task retries
WeichenXu123 Jul 7, 2022
fac1c2a
clean test temp dir
WeichenXu123 Jul 7, 2022
62bf8bb
fix import
WeichenXu123 Jul 7, 2022
1b77f23
add pyspark env config in CI
WeichenXu123 Jul 7, 2022
0487223
update discoveryScript config path
WeichenXu123 Jul 7, 2022
37938d8
Skip tests.
trivialfis Jul 7, 2022
024fc9e
Missing dependency.
trivialfis Jul 7, 2022
4bb7404
black.
trivialfis Jul 7, 2022
b92959e
missing dependency.
trivialfis Jul 7, 2022
b7bb690
Rest of them.
trivialfis Jul 7, 2022
b7dbc41
Merge remote-tracking branch 'upstream/master' into xgb-spark-py
hcho3 Jul 7, 2022
8885247
Fix message
hcho3 Jul 7, 2022
22f2103
fix CI config
WeichenXu123 Jul 8, 2022
277984d
setup python-test action jdk
WeichenXu123 Jul 8, 2022
0f9ffa0
update macos ci config
WeichenXu123 Jul 8, 2022
3534c75
fix callback test
WeichenXu123 Jul 8, 2022
197c267
disable spark test on macos
WeichenXu123 Jul 8, 2022
ee78b56
remove disable lint
WeichenXu123 Jul 8, 2022
42d0ce5
address pylint errors
WeichenXu123 Jul 10, 2022
dc02023
address pylint errors
WeichenXu123 Jul 10, 2022
520d754
address pylint issues
WeichenXu123 Jul 10, 2022
5237d81
fix lint ci action
WeichenXu123 Jul 11, 2022
2 changes: 1 addition & 1 deletion .github/workflows/main.yml
@@ -141,7 +141,7 @@ jobs:
     - name: Install Python packages
       run: |
         python -m pip install wheel setuptools
-        python -m pip install pylint cpplint numpy scipy scikit-learn
+        python -m pip install pylint cpplint numpy scipy scikit-learn pyspark pandas cloudpickle
     - name: Run lint
       run: |
         make lint
1 change: 1 addition & 0 deletions .github/workflows/python_tests.yml
@@ -92,6 +92,7 @@ jobs:
   python-tests-on-macos:
     name: Test XGBoost Python package on ${{ matrix.config.os }}
     runs-on: ${{ matrix.config.os }}
+    timeout-minutes: 90
     strategy:
       matrix:
         config:
3 changes: 2 additions & 1 deletion python-package/setup.py
@@ -351,7 +351,8 @@ def run(self) -> None:
             'scikit-learn': ['scikit-learn'],
             'dask': ['dask', 'pandas', 'distributed'],
             'datatable': ['datatable'],
-            'plotting': ['graphviz', 'matplotlib']
+            'plotting': ['graphviz', 'matplotlib'],
+            "pyspark": ["pyspark", "scikit-learn", "cloudpickle"],
         },
         maintainer='Hyunsu Cho',
         maintainer_email='chohyu01@cs.washington.edu',
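
Note on the setup.py change: the new "pyspark" key is an optional-dependency group, so a user who asks for that extra at install time (for example `pip install "xgboost[pyspark]"`) also gets pyspark, scikit-learn and cloudpickle. The sketch below only illustrates how such a mapping expands into a package list. The helper name `packages_for` is hypothetical and is not part of this PR; the real resolution is done by pip.

```python
from typing import Dict, Iterable, List

# Copy of the extras mapping as it appears in the setup.py hunk of this PR.
EXTRAS: Dict[str, List[str]] = {
    "scikit-learn": ["scikit-learn"],
    "dask": ["dask", "pandas", "distributed"],
    "datatable": ["datatable"],
    "plotting": ["graphviz", "matplotlib"],
    "pyspark": ["pyspark", "scikit-learn", "cloudpickle"],
}


def packages_for(requested: Iterable[str]) -> List[str]:
    """Hypothetical helper: expand extras names into a de-duplicated package list."""
    seen: List[str] = []
    for extra in requested:
        if extra not in EXTRAS:
            raise KeyError(f"unknown extra: {extra}")
        for package in EXTRAS[extra]:
            # Preserve first-seen order and skip packages already listed.
            if package not in seen:
                seen.append(package)
    return seen


pyspark_only = packages_for(["pyspark"])
combined = packages_for(["scikit-learn", "pyspark"])
```

Requesting both the scikit-learn and pyspark extras lists scikit-learn once, since both groups name it.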
22 changes: 22 additions & 0 deletions python-package/xgboost/spark/__init__.py
@@ -0,0 +1,22 @@
+# type: ignore
+"""PySpark XGBoost integration interface
+"""
+
+try:
+    import pyspark
+except ImportError as e:
+    raise ImportError("pyspark package needs to be installed to use this module") from e
+
+from .estimator import (
+    SparkXGBClassifier,
+    SparkXGBClassifierModel,
+    SparkXGBRegressor,
+    SparkXGBRegressorModel,
+)
+
+__all__ = [
+    "SparkXGBClassifier",
+    "SparkXGBClassifierModel",
+    "SparkXGBRegressor",
+    "SparkXGBRegressorModel",
+]
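
Note on the import guard in xgboost/spark/__init__.py: the module tries to import pyspark first and re-raises a more descriptive ImportError, chained to the original with `from e`, so the user sees which optional dependency is missing. A minimal, self-contained sketch of the same pattern follows. The helper `require_module` and the module name `surely_not_installed_pkg` are made up for illustration and do not appear in the PR.

```python
import importlib
from types import ModuleType


def require_module(name: str, feature: str) -> ModuleType:
    """Hypothetical helper: import `name` or raise an ImportError naming the dependency."""
    try:
        return importlib.import_module(name)
    except ImportError as e:
        # Chain with `from e` so the original traceback stays attached.
        raise ImportError(
            f"{name} package needs to be installed to use {feature}"
        ) from e


# A module that exists (stdlib json) is returned as usual.
json_mod = require_module("json", "the JSON helpers")

# A missing module yields the descriptive message, with the original error as its cause.
try:
    require_module("surely_not_installed_pkg", "this module")
except ImportError as err:
    message = str(err)
    cause = err.__cause__
```

Keeping the original exception as `__cause__` means the underlying reason for the failed import still shows up in the traceback.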