ENH Persistence: Add support for LightGBM and XGBoost #244
Conversation
Partially solves skops-dev#226

Description

LightGBM already worked out of the box, so for this, only tests were added. For XGBoost, extra code was added to support it.

Implementation

Dependencies were extended to include xgboost and lightgbm for testing, so that they should be covered by CI. If those libraries are not installed, the tests are gracefully skipped.

For XGBoost, the xgb-specific persistence format was chosen, which can be used through estimator.save_model and estimator.load_model. I could not verify (yet) that this format is 100% secure. However, since it's a binary json file (.ubj), it should be as secure as json. More information here: https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html

The actual implementation for xgboost mirrors the one for numpy arrays, which also stores its data in an external file. However, I did not apply any caching here, as it seems it wouldn't be useful.

I decided to create a new test file, test_external.py, for the new tests, instead of cramming more into test_persist.py. Since some functions from test_persist.py were required here, I moved them to a new _utils.py module.

Interestingly, I found some potential bugs where some values returned by get_params may differ for xgboost models after loading. This is independent of skops. Those edge cases are documented but were excluded from tests, as it appears that they should be inconsequential.
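For illustration, the user-facing round trip this PR enables would look roughly like this (a minimal sketch, assuming xgboost is installed and that skops.io's dumps/loads and the trusted flag behave as elsewhere in the library):

    import xgboost
    from skops.io import dumps, loads

    est = xgboost.XGBClassifier(n_estimators=2)
    est.fit([[0, 1], [2, 3]], [0, 1])

    dumped = dumps(est)                   # skops format, no pickle involved
    loaded = loads(dumped, trusted=True)  # trusted=True to allow the xgboost types
    print(loaded.predict([[0, 1]]))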
1.7 not supported for Python 3.7.
Fails when no GPU is available, and I couldn't find a way to detect from Python whether a GPU is available (short of trying and catching the error; a probe along those lines is sketched below).
Apparently, on Windows, np.int32 is used, not np.int64.
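One possible probe (an assumption on my part, not something this PR implements) is to attempt a tiny GPU fit and treat the resulting XGBoostError as "no GPU available":

    import xgboost

    def gpu_available() -> bool:
        X, y = [[0, 1], [2, 3]], [0, 1]
        try:
            # raises XGBoostError on CPU-only builds or machines without a GPU
            xgboost.XGBClassifier(tree_method="gpu_hist", n_estimators=1).fit(X, y)
            return True
        except xgboost.core.XGBoostError:
            return False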
I removed the xgboost bug (?) docstring, since I couldn't find how to exclude it from running with doctest. Replicating it here so that it doesn't get lost. Ideally, someone with more experience in xgboost could comment on whether this really is a bug or not. If so, we could submit it to the xgboost issue tracker.

>>> import xgboost
>>> print(xgboost.__version__)
1.7.1
>>> estimator = xgboost.XGBClassifier(tree_method=None)
>>> X, y = [[0, 1], [2, 3]], [0, 1]
>>> estimator.fit(X, y)
XGBClassifier(...)
>>> print(estimator.tree_method)
None
>>> print(estimator.get_params()["tree_method"])
exact
>>> # after save/load roundtrip, tree_method's get_params value changes from exact to auto
>>> import tempfile
>>> tmp_file = f"{tempfile.mkdtemp()}.ubj"
>>> estimator.save_model(tmp_file)
>>> estimator.load_model(tmp_file)
>>> print(estimator.tree_method)
None
>>> print(estimator.get_params()["tree_method"])
auto
>>> # a similar thing happens with gpu_id
>>> estimator = xgboost.XGBClassifier(tree_method='gpu_hist', booster='gbtree')
>>> estimator.fit(X, y)
XGBClassifier(...)
>>> print(estimator.gpu_id)
None
>>> print(estimator.get_params()["gpu_id"])
0
>>> # after save/load roundtrip, gpu_id's get_params value changes from 0 to -1
>>> estimator.save_model(tmp_file)
>>> # for gpu_id, the estimator needs to be re-initialized for the effect to occur
>>> estimator = xgboost.XGBClassifier()
>>> estimator.load_model(tmp_file)
>>> print(estimator.gpu_id)
None
>>> print(estimator.get_params()["gpu_id"])
-1
@@ -0,0 +1,145 @@
import warnings
Everything in here is unmodified, just moved around.
CI is green, so ready for review @skops-dev/maintainers. I would take a look at catboost separately to keep the PR smaller. I included lightgbm here since it didn't require any code changes.
Could you maybe explain why going through the __dict__ wouldn't work for xgboost? I suppose because the attributes are not C extension types, and rather pure C types?
I would like us to understand what save/load of xgboost does, before adding this. We should document that somewhere.
docs/persistence.rst (Outdated)

    Apart from this core, we plan to support machine learning libraries commonly
    used be the community. So far, those are:

    - `LightGBM <lightgbm.readthedocs.io/>`_ (scikit-learn API)
shouldn't we add xgboost here as a result of this PR?
will do
skops/io/_xgboost.py (Outdated)

    # list of constructors is hard-coded for higher security
    constructors = {
        "XGBClassifier": xgboost.XGBClassifier,
        "XGBRegressor": xgboost.XGBRegressor,
        "XGBRFClassifier": xgboost.XGBRFClassifier,
        "XGBRFRegressor": xgboost.XGBRFRegressor,
        "XGBRanker": xgboost.XGBRanker,
    }
why can't we use the type saved in __class__ here?
My thinking was that it's more secure, but we could also store the class itself and use that here.
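To illustrate the trade-off, the allow-list approach amounts to something like this (a standalone sketch; the construct helper is hypothetical, not skops code):

    import xgboost

    # hard-coded allow-list: only these names can ever be instantiated
    constructors = {
        "XGBClassifier": xgboost.XGBClassifier,
        "XGBRegressor": xgboost.XGBRegressor,
    }

    def construct(class_name: str):
        try:
            return constructors[class_name]()
        except KeyError:
            raise TypeError(f"unknown or untrusted xgboost type: {class_name}")

    est = construct("XGBClassifier")  # fine
    # construct("os.system")          # raises TypeError instead of importing anything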
skops/io/_xgboost.py (Outdated)

    try:
        # xgboost doesn't allow to save to a memory buffer, so take roundtrip
        # through temp file
        tmp_file = f"{tempfile.mkdtemp()}.ubj"
can we use a context manager to make sure it's cleaned up after the fact?
I'll take a look.
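For reference, a context-manager version could look like this (a sketch, not the final code; the tiny fit is just to have a saveable model):

    import os
    import tempfile
    import xgboost

    estimator = xgboost.XGBClassifier(n_estimators=1)
    estimator.fit([[0, 1], [2, 3]], [0, 1])

    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_file = os.path.join(tmp_dir, "model.ubj")
        estimator.save_model(tmp_file)
        with open(tmp_file, "rb") as f:
            content = f.read()
    # the temp directory and file are removed when the with-block exits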
skops/io/_xgboost.py (Outdated)

    if self.type == "xgboost-format":  # fitted
        # load_model works with bytearray, so no temp file necessary here
        instance.load_model(bytearray(self.children["content"].getvalue()))
I really don't like this. I can't easily figure out what this actually does, and if it's anywhere near being safe.
This is the equivalent of what we do with numpy arrays (in fact, the whole approach is analogous):

    class NdArrayNode(Node):
        def __init__(...):
            ...
            if self.type == "numpy":
                self.children = {
                    "content": io.BytesIO(load_context.src.read(state["file"]))
                }

        def _construct(self):
            if self.type == "numpy":
                content = np.load(self.children["content"], allow_pickle=False)
                ...

The difference is that, unfortunately, load_model cannot work with bytes; instead, the content first needs to be converted to a bytearray.

I could also load the model from a temp file, but I thought this approach would be better.
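The bytes vs. bytearray point can be demonstrated outside of skops (a standalone sketch; the BytesIO stands in for what would be read back from the zip archive):

    import io
    import os
    import tempfile
    import xgboost

    est = xgboost.XGBClassifier(n_estimators=1)
    est.fit([[0, 1], [2, 3]], [0, 1])

    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "model.ubj")
        est.save_model(path)
        with open(path, "rb") as f:
            content = io.BytesIO(f.read())

    loaded = xgboost.XGBClassifier()
    # a bytearray works; passing plain bytes raises a TypeError
    loaded.load_model(bytearray(content.getvalue()))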
> Could you maybe explain why going through the __dict__ wouldn't work for xgboost? I suppose because the attributes are not C extension types, and rather pure C types?

So without special treatment, xgboost goes through ObjectNode. Since it has __getstate__ and __setstate__, those are being used. After we call the latter, there is some kind of validation going on, which leads us to the C++ part of the code base, which I honestly didn't even want to inspect.

My thinking is that trying to make that approach work is too error prone: we might get something to run by tinkering with the attributes, but we couldn't be sure that it would work with other parameters. However, the save_model way is officially recommended, is (presumably) well tested, and is stable across versions, so it is more robust for users than if we were directly storing attributes.

> I would like us to understand what save/load of xgboost does, before adding this. We should document that somewhere.

Regarding security, I couldn't find anything being mentioned explicitly, but the general advantages are explained in the linked document. As I said, since it's json format, I don't see how that would not be safe.
So looking at this again, I don't understand why you think using save_model is safer. It seems the issue we have is that we don't support bytes.
My interpretation of the xgboost documentation was that their custom format is the most portable. Say they add a private attribute from version X to version Y; it seems like their format would take care of correctly dealing with the missing attribute, whereas something like pickle would fail. What I cannot say with certainty is whether the output of their save_model is completely safe to load.

I think this problem touches a more general topic when dealing with 3rd party packages: Should we try to implement methods as generically as possible, or should we make use of custom serialization formats if they are offered by a package? In this case, should we use the generic approach* or save_model?

*we should probably support bytes either way
Added support for bytes and bytearray, which is sufficient to dump and load xgboost. The xgboost-specific code was commented out for now. To serialize bytes in json, base64 encoding was used.
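For context, the base64 round trip boils down to something like this (a sketch with hypothetical helper names, not the exact skops code):

    import base64
    import json

    def bytes_get_state(obj) -> dict:
        return {
            "__class__": type(obj).__name__,
            "__loader__": "BytesNode",
            "content": base64.b64encode(obj).decode("ascii"),
        }

    def bytes_from_state(state: dict):
        raw = base64.b64decode(state["content"].encode("ascii"))
        return bytearray(raw) if state["__class__"] == "bytearray" else raw

    state = bytes_get_state(b"\x00\x01binary payload")
    roundtripped = bytes_from_state(json.loads(json.dumps(state)))
    assert roundtripped == b"\x00\x01binary payload"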
@adrinjalali I tested your suggestion to add support for bytes and bytearray, and it turned out to be sufficient for xgboost.
I agree. But I think for now we can do with only supporting bytes/bytearray, and figure out supporting custom formats of 3rd party libraries later. One thing I wonder: should we store bytes in a binary file instead? I feel like it'd make save/load much faster, and keep the schema json file cleaner.
Okay, then let's start with that.

It'll probably depend. In this case, maybe, but the additional I/O is probably a big overhead if we have multiple small bytes objects.
I wanted to suggest having a threshold for deciding whether they go in the schema or in a file, but then I realized another thing to think of: if we allow binary data in the json schema, then people can create json files with arbitrarily large binary blobs embedded in them.
I'm not against your proposal, I just want to understand what the concern is. Is it only about readability? You could argue that even if there is a big blob in the json, it's only one line, so it doesn't distract much (it could slow down some editors though).
Slowing down an editor is one thing, loading a massive json into memory is another. I'm not sure how people could exploit this, but I kinda feel like keeping them in separate files is cleaner. Imagine people having images as bytes, and we'd be keeping them in the json file all together. Besides, I don't think there are many cases where we'd encounter small byte arrays. So my guess is that in most cases those byte arrays are large rather than small, and therefore having all of them in files would speed things up on average.
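Something along these lines, presumably (a rough sketch; helper names and archive layout are assumptions, not skops' actual implementation):

    import uuid
    import zipfile

    def save_bytes_external(zf: zipfile.ZipFile, content: bytes) -> dict:
        file_name = f"{uuid.uuid4().hex}.bin"
        zf.writestr(file_name, content)  # raw bytes stored next to the schema
        return {"__loader__": "BytesNode", "file": file_name}

    def load_bytes_external(zf: zipfile.ZipFile, state: dict) -> bytes:
        return zf.read(state["file"])

    with zipfile.ZipFile("example.skops", "w") as zf:
        state = save_bytes_external(zf, b"a large binary blob")

    with zipfile.ZipFile("example.skops") as zf:
        assert load_bytes_external(zf, state) == b"a large binary blob"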
@adrinjalali I made the changes as you suggested, please review again. For good measure, I checked if adding the bytes support is sufficient for making catboost work, and indeed it is, so I added tests for it as well.
Pretty cool 🔥
skops/io/_general.py (Outdated)

        trusted: bool | Sequence[str] = False,
    ) -> None:
        super().__init__(state, load_context, trusted)
        self.trusted = self._get_trusted(trusted, [])
shouldn't we trust bytes and bytearray?
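If so, presumably the fix is just to include them in the defaults, something like this (a one-line sketch against the diff above; the exact representation of trusted types in skops may differ):

    # bytes and bytearray carry no code, so they can arguably be trusted by default
    self.trusted = self._get_trusted(trusted, [bytes, bytearray])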
skops/io/tests/test_external.py (Outdated)

    assert_params_equal(estimator.get_params(), loaded.get_params())


    def check_method_outputs(estimator, X, trusted):
isn't this common between these tests and scikit-learn ones? Would be nice to avoid duplication as much as we can here.
It's very close. For my understanding, is the suggestion to put this method into tests/_utils.py and use it in test_persist.py too?
yes.
skops/io/tests/test_external.py (Outdated)

    These discrepancies occur regardless of skops, so they're a problem in
    xgboost itself. We assume that this has no practical consequences and thus
    avoid testing these cases.
is there an upstream issue for this on the xgboost side? If not, it'd be nice to open one and link to it here.
As I commented, I could open an issue, but ideally someone who knows xgboost well could comment on whether it's really a bug or not. If there is no one around, I can just open one anyway.
yeah I think the devs on the xgboost repo can decide if it's a bug which they wanna fix or not.
- Refactor method output testing to be a utility function
- Add bytes and bytearray to trusted types
@adrinjalali I addressed your comments, please review again. Regarding the issue of forgetting to add bytes/bytearray to trusted: it's a bit of a sign that our test coverage could be improved, right? As is, those oversights were not caught with the standard set of tests we have.
tests failing @BenjaminBossan
@adrinjalali sorry, I made a last-minute change and rolled it back, which introduced the bug. Tests are green now.
Otherwise happy to merge.
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
EDITED

Resolves #226

Description

LightGBM already worked out of the box, so for this, only tests were added. For XGBoost and CatBoost, extra code was added to support them.

Implementation

Dependencies were extended to include xgboost, lightgbm, and catboost for testing, so that they should be covered by CI. If those libraries are not installed, the tests are gracefully skipped.

For xgboost, the xgboost-specific persistence format was chosen, which can be used through estimator.save_model and estimator.load_model. I could not verify (yet) that this format is 100% secure. However, since it's a binary json file (.ubj), it should be as secure as json. More information here: https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html

The actual implementation for xgboost mirrors the one for numpy arrays, which also stores its data in an external file. However, I did not apply any caching here, as it seems it wouldn't be useful.

I decided to create a new test file, test_external.py, for the new tests, instead of cramming more into test_persist.py. Since some functions from test_persist.py were required here, I moved them to a new _utils.py module.

Interestingly, I found some potential bugs where some values returned by get_params() may differ for xgboost models after loading. This is independent of skops. Those edge cases are documented but were excluded from tests, as it appears that they should be inconsequential. We may want to report them.