
XGBRegressor converter is maybe broken #371

Closed
victornoel opened this issue Feb 26, 2020 · 6 comments

Comments

@victornoel

It seems there are problems with the XGBRegressor converter.

I've opened an issue at sklearn-onnx but they redirected me to here: onnx/sklearn-onnx#321

Here is a repro: https://gist.github.com/victornoel/06d6231f6276719ddba53cb381dfd468

I cast an int column to str because in my original dataset we have categories made of numbers, so I thought this could be related…

As we can see, the differences are huge: the predictions should be between 0 and 1, yet the differences can be as large as 1 in my tests!

>>> import test
>>> test.test()
[0.9019426  0.91082716 0.91868186 0.94926846 1.0898147 ]
min(Y)-max(Y): 0 1
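For anyone reproducing this, the gap is easy to quantify once you have the two prediction arrays (from the scikit-learn pipeline and from onnxruntime). The helper name and the toy values below are mine, not from the gist; this is just a minimal sketch of the comparison:

```python
import numpy as np

def max_prediction_gap(skl_preds, onnx_preds):
    """Return the largest absolute difference between two prediction arrays."""
    skl_preds = np.asarray(skl_preds, dtype=np.float64).ravel()
    onnx_preds = np.asarray(onnx_preds, dtype=np.float64).ravel()
    return float(np.max(np.abs(skl_preds - onnx_preds)))

# Toy illustration with made-up values of the same magnitude as the
# mismatch reported above (predictions that should agree but do not).
skl = np.array([0.0, 0.125, 0.25])
onnx = np.array([0.875, 1.0, 0.75])
print(max_prediction_gap(skl, onnx))  # 0.875
```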
@victornoel victornoel changed the title XGBRegressor is maybe broken XGBRegressor converter is maybe broken Feb 26, 2020
@victornoel
Author

For the record, I tested the latest git versions of sklearn-onnx, onnxconverter-common, and onnxmltools, and the problem is still there…

@xadupre
Collaborator

xadupre commented Mar 3, 2020

I was able to replicate. I'll have a look tomorrow.

@xadupre
Collaborator

xadupre commented Mar 4, 2020

One issue comes from OneHotEncoder. By default, scikit-learn produces sparse output, but sparse tensors are not yet supported in onnxruntime even though they are defined in ONNX. Because onnxruntime uses a dense representation, missing values differ: scikit-learn sees nan, while the dense matrices in onnxruntime contain 0. I suggest using OneHotEncoder(sparse=False) for the time being, until sparse tensors are supported in onnxruntime and fully supported in ONNX.
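The root of the mismatch can be illustrated with scipy alone: entries that a sparse matrix simply does not store come back as 0.0 once the matrix is densified, so any consumer that distinguishes "missing" from "zero" (as XGBoost does) sees a different input. A minimal sketch (note that in later scikit-learn releases the OneHotEncoder keyword was renamed from sparse to sparse_output):

```python
import numpy as np
from scipy import sparse

# A one-hot-style matrix where only the "hot" entries are stored:
# row 0 is hot in column 2, row 1 is hot in column 0.
hot = sparse.csr_matrix(([1.0, 1.0], ([0, 1], [2, 0])), shape=(2, 3))

print(hot.nnz)         # 2 stored entries; the other 4 are *absent*, not 0
dense = hot.toarray()  # densifying fills every absent entry with 0.0
print(dense)
```

A sparse-aware consumer can treat the four absent entries as missing, while after densification they are indistinguishable from genuine zeros, which is consistent with the nan-vs-0 behaviour described above.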

@victornoel
Author

victornoel commented Mar 4, 2020

@xadupre thank you for the investigation, using sparse=False does indeed fix the problem for me!

I have some questions:

  • would it make sense for sklearn-onnx to fail for a OneHotEncoder that was built with sparse=True, to avoid any problems until this is supported?
  • what are the drawbacks of sparse? It feels like training a model takes much longer (about 10x) with sparse=False
  • is there an issue to track supporting sparse in onnxruntime?
  • you say "one issue", are there more? 😄

@xadupre
Collaborator

xadupre commented Mar 4, 2020

I'm hesitating. sklearn-onnx is tested with onnxruntime, but that does not mean onnxruntime is the only runtime that should run ONNX graphs, so failing for OneHotEncoder is not my favourite option. For the OneHotEncoder, there is usually no drawback to sparse: it saves a lot of memory and is faster, since many features are null and are not part of the computation. So, in this case, sparse is definitely better. There is an issue on onnx (onnx/onnx#2008), but none in onnxruntime. Feel free to add one. I don't have a better option right now.
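The memory argument is easy to quantify with a back-of-the-envelope sketch (the sizes below are my own illustration, not from the thread): for n rows one-hot-encoded into k categories, a dense float64 array stores n*k values, while a CSR matrix stores roughly one value plus one index per row.

```python
import numpy as np
from scipy import sparse

n, k = 100_000, 1_000  # rows, number of categories
rows = np.arange(n)
cols = np.random.default_rng(0).integers(0, k, size=n)
onehot = sparse.csr_matrix((np.ones(n), (rows, cols)), shape=(n, k))

dense_bytes = n * k * 8                      # float64 dense array, not allocated
sparse_bytes = (onehot.data.nbytes
                + onehot.indices.nbytes
                + onehot.indptr.nbytes)      # actual CSR storage
print(dense_bytes // sparse_bytes)           # dense is hundreds of times larger
```

This is why sparse output is the scikit-learn default, and why dropping to dense only makes sense as a workaround while the runtime lacks sparse support.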

@victornoel
Author

@xadupre thank you, I created an issue there: microsoft/onnxruntime#3144

Let me close the current issue then.
