
Ran into an issue with numerical precision and thresholds #4060

Closed
slundberg opened this issue Jan 17, 2019 · 9 comments

@slundberg
Contributor

slundberg commented Jan 17, 2019

For a (not-sharable) dataset I was working on, XGBoost chose thresholds that were within machine epsilon of actual training-set feature values (less than np.finfo(np.float32).eps away). This led to the JSON dump of the model being wrong, since the conversion to base-10 destroyed the information needed to route samples down the correct path. This meant the only way to read a tree that would behave the same as XGBoost was to use the raw memory dump.

I am posting this issue so this danger is noted for others, and in case it is worth investigating why XGBoost chooses such unstable thresholds. It seems like asking for trouble when exporting models or running them on different machines, since an epsilon-sized change in a feature value can cause a meaningful change in the model output (a sample now goes down a different path).
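
As a rough illustration of the failure mode (with made-up numbers, not the actual dataset), here is a small Python sketch showing how a float32 threshold sitting one ULP below a training value can route a sample differently after a short decimal round trip:

import numpy as np

# Hypothetical split threshold: the float32 immediately below 85.0.
threshold = np.nextafter(np.float32(85.0), np.float32(0.0))  # 84.99999237...
print(np.float32(85.0) - threshold)      # ~7.63e-06
print(np.finfo(np.float32).eps)          # ~1.19e-07 (relative machine epsilon)

# Dump the threshold with too few decimal digits and parse it back.
dumped = "%.6g" % threshold              # '85'
recovered = np.float32(float(dumped))    # 85.0 -- no longer the same float32

# A raw float64 feature value lying between the two thresholds is routed
# differently by the original model and by the re-parsed one.
raw = 84.999995
print(np.float32(raw) < threshold)       # False -> right child in the booster
print(np.float32(raw) < recovered)       # True  -> left child after the round trip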

@hcho3
Collaborator

hcho3 commented Jan 17, 2019

Would adding more decimal digits to the dump help? I do agree that the choice of thresholds is quite weird, though. It would be nice if we could reproduce the issue with a public dataset.

@slundberg
Contributor Author

It might, but since base 10 and base 2 don't line up exactly, it would be tricky to ensure it was always right. Finding an example where the JSON is wrong would take a bit more work, but below is a simple example that shows the kind of very tight thresholds XGBoost creates (note it depends on the latest benchmark branch of shap to read the values without using the JSON output):

import xgboost
import shap
import numpy as np

# Train a single shallow tree on the adult census dataset.
X, y = shap.datasets.adult()

model = xgboost.XGBClassifier(n_estimators=1, max_depth=4, random_state=0)
model.fit(X, y)

explainer = shap.TreeExplainer(model)

# Compare every training feature value against every split threshold
# and report the ones that are suspiciously close.
Xv = X.values
for i in range(explainer.model.thresholds.shape[0]):
    for j in range(explainer.model.num_nodes[i]):
        for k in range(Xv.shape[0]):
            v = Xv[k, explainer.model.features[i, j]]
            t = explainer.model.thresholds[i, j]
            if abs(v - t) < 1e-5:
                print(
                    "The value of feature",
                    explainer.model.features[i, j],
                    "for sample", k, "is", v, "and a split threshold for that feature is",
                    t, ". That's a difference of", v - t, "."
                )

output:

The value of feature 0 for sample 8381 is 85.0 and a split threshold for that feature is 84.99999 . That's a difference of 7.62939453125e-06 .
The value of feature 0 for sample 20463 is 85.0 and a split threshold for that feature is 84.99999 . That's a difference of 7.62939453125e-06 .
The value of feature 0 for sample 32459 is 85.0 and a split threshold for that feature is 84.99999 . That's a difference of 7.62939453125e-06 .

@hcho3
Collaborator

hcho3 commented Jan 18, 2019

@slundberg FYI, you can find the number of decimal digits required to distinguish all floating-point representations with std::numeric_limits<float>::max_digits10.

As for the weird choice of thresholds, I'll have to come back to it and dive deeper.
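
For reference, a quick Python sketch of the same point with a hypothetical threshold (9 significant decimal digits, the float equivalent of max_digits10, round-trips a float32 exactly; 6 digits does not):

import numpy as np

t = np.nextafter(np.float32(85.0), np.float32(0.0))   # hypothetical threshold, 84.99999237...
print("%.9g" % t)                                      # '84.9999924'
print(np.float32(float("%.9g" % t)) == t)              # True: exact round trip
print(np.float32(float("%.6g" % t)) == t)              # False: lands on 85.0 instead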

@slundberg
Contributor Author

@hcho3 Interesting, since there are already 10 digits printed, yet I saw a mismatch in the JSON dump. It could probably be sorted out, but I think the choice of such unstable thresholds is the more important issue.

@hcho3
Collaborator

hcho3 commented Jan 18, 2019

@slundberg Does SHAP explainer use the JSON dump or parse the binary model file directly?

@slundberg
Contributor Author

The main version of the explainer is built into XGBoost, so we use that when possible. Some features, like explaining the loss of the model, relied on the JSON dump, but as of v0.28 SHAP parses the binary model directly to avoid the problems in this issue.

@jjdelvalle

This issue seems related to mine: if you load the binary model it works, but if you parse the JSON dump you get the wrong thresholds. Hopefully the new implementation will store the values accurately.

@ras44
Contributor

ras44 commented May 2, 2019

@slundberg I ran into a similar issue and I believe the solution is to convert the thresholds to floats after parsing the JSON.

The decimal representation shown in the JSON is guaranteed to reproduce the float value, but it must be converted to a float before doing any further calculations. In this case, the conversion to float would yield 85 (this is what the xgboost binary uses internally with thresholds stored as floats).

In fact, any values loaded from the JSON (split conditions, leaf values, etc.) that are internally treated as floats should be converted to floats and handled with float operations. For example, if you were to parse the JSON to recreate a binary logistic model and then reproduce its predictions by computing the sigmoid as xgboost does, you would want float exponentiation rather than double exponentiation, as well as float-converted weights.

An example of this treatment is here: #3960 (comment)
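
A minimal sketch of that point with made-up numbers (not taken from this issue's model): compare a JSON-parsed threshold against a feature value in double precision versus casting both to float32 first, and use float32 operations when reproducing a binary:logistic prediction.

import numpy as np

threshold_from_json = 84.99999237060547   # parsed from JSON as a Python float (double)
feature_value = 84.99999                  # raw double from the input data

# Comparing in double precision routes the sample left...
print(feature_value < threshold_from_json)                          # True
# ...while casting both to float32 first (as the xgboost binary does) routes it right.
print(np.float32(feature_value) < np.float32(threshold_from_json))  # False

# Likewise, use float32 operations when turning a parsed margin into a probability.
margin = np.float32(-0.1234)
prob = np.float32(1.0) / (np.float32(1.0) + np.exp(-margin))        # float32 sigmoid
print(prob, prob.dtype)                                             # ... float32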

@trivialfis
Member

Closed by #7545. One can now export the model directly in UBJSON by calling save_raw.
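
For reference, a minimal sketch of that export path (assuming XGBoost 1.6 or later, where save_raw takes a raw_format argument and load_model accepts the returned buffer):

import numpy as np
import xgboost

# Tiny throwaway model just to have something to serialize.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)
booster = xgboost.train({"objective": "binary:logistic"},
                        xgboost.DMatrix(X, label=y), num_boost_round=5)

# Export as UBJSON bytes: thresholds keep their exact binary float values,
# so there is no decimal round trip to worry about.
raw_ubj = booster.save_raw(raw_format="ubj")

# Reload from the in-memory buffer.
restored = xgboost.Booster()
restored.load_model(raw_ubj)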
