Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Model builders API update #1320

Merged
merged 57 commits into from
Jul 12, 2023
Merged

Conversation

razdoburdin
Copy link
Contributor

@razdoburdin razdoburdin commented Jun 6, 2023

I want to start discussion about changing the Model builders API.
The main challenge is support of both scikit-learn style estimators and non-scikit style booster objects.
In the presented PR I added a new class GBTDAALModel for python interface in non-scikit style. It allow to hide all worldly syntax from the old API.

In case of using XGBoost, for regression tasks user-side code changes from:
d4p_model = daal4py.get_gbt_model_from_xgboost(booster)
d4p_prediction = daal4py.gbt_regression_prediction().compute(X_test, d4p_model).prediction
to
d4p_model = daal4py.mb.convert_model(booster)
d4p_prediction =d4p_model.predict()

In case of classification problem user-side code changes from:
d4p_model = daal4py.get_gbt_model_from_xgboost(booster)
d4p_prediction=daal4py.gbt_classification_prediction(nClasses=n_classes).compute(X_test, d4p_model).prediction
to
d4p_model = daal4py.mb.convert_model(booster)
d4p_prediction = d4p_model.predict()

For support of scikit-style estimators, I updated GBTDAALClassifier and GBTDAALRegressor classes.
One can use them like this (example for XGBoost, regression task):
from daal4py.sklearn.ensemble import GBTDAALRegressor
reg = xgb.XGBRegressor()
reg.fit(X, y)
d4p_predt = GBTDAALRegressor.convert_model(reg).predict(X)

@razdoburdin razdoburdin changed the title initial Model builders API update Jun 6, 2023
@napetrov
Copy link
Contributor

napetrov commented Jun 6, 2023

the only thing that i don't fully like - CamelCase on .GbtModel. Might be GBTModel would be better?

@razdoburdin
Copy link
Contributor Author

the only thing that i don't fully like - CamelCase on .GbtModel. Might be GBTModel would be better?

fixed

src/gbt_model_builder.pyx Outdated Show resolved Hide resolved
src/gbt_model_builder.pyx Outdated Show resolved Hide resolved
src/gbt_model_builder.pyx Outdated Show resolved Hide resolved
src/gbt_model_builder.pyx Outdated Show resolved Hide resolved
src/gbt_model_builder.pyx Outdated Show resolved Hide resolved
@inteldimitrius
Copy link
Contributor

The rest looks good to me!

@razdoburdin razdoburdin marked this pull request as draft June 9, 2023 13:58
src/gbt_model_builder.pyx Outdated Show resolved Hide resolved
@razdoburdin razdoburdin marked this pull request as ready for review June 12, 2023 12:39
@razdoburdin
Copy link
Contributor Author

Dear all,
I have updated the PR. Currently, the scikit-style estimators are supported. Please see the updated description for details.

@inteldimitrius
Copy link
Contributor

MacOS and Linux CI checks are failing. Is it okay?

@razdoburdin
Copy link
Contributor Author

Thanks for noting this!
I have reproduced the test failure locally. The problem is in NAN test in predict() methods for estimators. As far as NAN are now supported, I removed the NAN test. But the sklearn.check_estimator requires this check. From another hand XGBRegressor supports NAN also and passes this check.

@Alexsandruss, have you any ideas how to deal with it?

@Alexsandruss
Copy link
Contributor

NaN errors are not the only ones:

======================================================================
ERROR: test_gbt_cls_model_create_from_catboost_batch (test_examples.TestExNpyArray)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/vsts/work/1/s/tests/test_examples.py", line 96, in testit
    result = self.call(ex)
  File "/home/vsts/work/1/s/tests/test_examples.py", line 269, in call
    return ex.main(readcsv=np_read_csv)
  File "/home/vsts/work/1/s/examples/daal4py/gbt_cls_model_create_from_catboost_batch.py", line 73, in main
    daal_prediction = daal_predict_algo.compute(X_test, daal_model)
TypeError: Argument 'model' has incorrect type (expected daal4py._daal4py.gbt_classification_model, got tuple)

@razdoburdin razdoburdin marked this pull request as draft June 15, 2023 08:57
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Snake case is used for file names in python

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -27,7 +27,7 @@
from .._utils import getFPType


class GBTDAALBase(BaseEstimator):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implement nan tag dispatching for base class:

def _more_tags(self):
    return {"allow_nan": self.allow_nan_}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implement nan tag dispatching for base class:

def _more_tags(self):
    return {"allow_nan": self.allow_nan_}

done

return pd.read_csv(f, usecols=c, delimiter=',', header=None, dtype=t)


def main(readcsv=pd_read_csv, method='defaultDense'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method is not used. Same for second example.

Suggested change
def main(readcsv=pd_read_csv, method='defaultDense'):
def main(readcsv=pd_read_csv):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method is not used. Same for second example.

done

@razdoburdin
Copy link
Contributor Author

/intelci: run

@napetrov
Copy link
Contributor

@razdoburdin test_model_builders_xgboost/ test_model_builders_lightgbm are failing internally

@napetrov
Copy link
Contributor

/intelci: run

src/gbt_convertors.pyx Outdated Show resolved Hide resolved
def __init__(self):
pass

def predict(self, X, fptype="float"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fptype should be deduced from input data.

razdoburdin and others added 3 commits July 12, 2023 13:49
@napetrov
Copy link
Contributor

/intelci: run

@napetrov napetrov merged commit 12b963a into intel:master Jul 12, 2023
@samir-nasibli
Copy link
Contributor

samir-nasibli commented Jul 12, 2023

@razdoburdin you have changed examples names, removed suffix _batch. Examples runner will ignore this such file names https://github.com/intel/scikit-learn-intelex/blob/master/tests/run_examples.py#L219:L222 .

Comment on lines +160 to +162
req_library['model_builders_lightgbm.py'] = ['lightgbm']
req_library['model_builders_xgboost.py'] = ['xgboost']
req_library['model_builders_catboost.py'] = ['catboost']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename your examples files and update runner script

Suggested change
req_library['model_builders_lightgbm.py'] = ['lightgbm']
req_library['model_builders_xgboost.py'] = ['xgboost']
req_library['model_builders_catboost.py'] = ['catboost']
req_library['model_builders_lightgbm_batch.py'] = ['lightgbm']
req_library['model_builders_xgboost_batch.py'] = ['xgboost']
req_library['model_builders_catboost_batch.py'] = ['catboost']

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Batch is strange and miningless suffix. If runner will not pick up them, then launcher should be fixed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's decide how we can group-by this kind of examples then.
All examples scikit-learn-intelex/examples/ are all grouped logically into 4 groups: spmd.py, streaming.py, stream.py, batch.py.
If none of them is suitable, then:

  • we will come up with a new suffix for such examples.
  • update whole examples runner logic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have just made a brief look in the run_examples.py. May we just add some sort of default behavior in case the filename doesn't end on *spmd.py, *streaming.py, *stream.py or *batch.py?
As far as I see we could switch to this default behavior for all the *_batch.py and *_stream.py files. In my mind, it will make naming of examples more clear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@razdoburdin i like that idea more, than using suffixes.

@samir-nasibli - those are logical only if you look on this from some testing perspective. From end user perspective this looks strange.

I would go with base examples without batch because it make no sense to specify this. On stream/streaming - looks like this should have be single group. And for spmd - let's leave it as is

@Alexsandruss, @KulikovNikita - your thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have just made a brief look in the run_examples.py. May we just add some sort of default behavior in case the filename doesn't end on *spmd.py, *streaming.py, *stream.py or *batch.py?

Sure we can. That means that we need update examples runner logic little bit (second option).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants