
Ran into an issue with numerical precision and thresholds #4060

Closed
slundberg opened this issue Jan 17, 2019 · 9 comments

@slundberg
Contributor

slundberg commented Jan 17, 2019

For a (not-sharable) dataset I was working on, XGBoost chose thresholds that were within machine epsilon of actual training-set feature values (less than np.finfo(np.float32).eps away). This led to the JSON dump of the model being wrong, since the conversion to base-10 destroyed the information needed to route samples down the correct path. This meant the only way to read a tree that would behave the same as XGBoost was to use the raw memory dump.

I am posting this issue so this danger is noted for others, and in case it is worth investigating why XGBoost chooses such unstable thresholds. It seems like asking for trouble when exporting models or running them on different machines, since an epsilon-sized change in a feature value can cause a meaningful change in the model output (a sample now goes down a different path).
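
As a rough illustration of the failure mode (with made-up numbers, not the actual dataset), here is a small Python sketch showing how a float32 threshold sitting one ULP below a training value can route a sample differently after a short decimal round trip:

import numpy as np

# Hypothetical split threshold: the float32 immediately below 85.0.
threshold = np.nextafter(np.float32(85.0), np.float32(0.0))  # 84.99999237...
print(np.float32(85.0) - threshold)      # ~7.63e-06
print(np.finfo(np.float32).eps)          # ~1.19e-07 (relative machine epsilon)

# Dump the threshold with too few decimal digits and parse it back.
dumped = "%.6g" % threshold              # '85'
recovered = np.float32(float(dumped))    # 85.0 -- no longer the same float32

# A raw float64 feature value lying between the two thresholds is routed
# differently by the original model and by the re-parsed one.
raw = 84.999995
print(np.float32(raw) < threshold)       # False -> right child in the booster
print(np.float32(raw) < recovered)       # True  -> left child after the round trip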

@hcho3
Collaborator

hcho3 commented Jan 17, 2019

Would adding more decimal digits to the dump help? I do agree that the choice of thresholds is quite weird, though. It would be nice if we could reproduce the issue with a public dataset.

@slundberg
Contributor Author

It might, but since base 10 and base 2 don't line up exactly, it would be tricky to ensure it was always right. Finding an example where the JSON is wrong would take a bit more work, but below is a simple example that shows the kind of very tight thresholds XGBoost creates (note it depends on the latest benchmark branch of shap to read the values without using the JSON output):

import xgboost
import shap
import numpy as np

# Train a single shallow tree on the adult census dataset.
X, y = shap.datasets.adult()

model = xgboost.XGBClassifier(n_estimators=1, max_depth=4, random_state=0)
model.fit(X, y)

explainer = shap.TreeExplainer(model)

# Compare every training feature value against every split threshold
# and report the ones that are suspiciously close.
Xv = X.values
for i in range(explainer.model.thresholds.shape[0]):
    for j in range(explainer.model.num_nodes[i]):
        for k in range(Xv.shape[0]):
            v = Xv[k, explainer.model.features[i, j]]
            t = explainer.model.thresholds[i, j]
            if abs(v - t) < 1e-5:
                print(
                    "The value of feature",
                    explainer.model.features[i, j],
                    "for sample", k, "is", v, "and a split threshold for that feature is",
                    t, ". That's a difference of", v - t, "."
                )

output:

The value of feature 0 for sample 8381 is 85.0 and a split threshold for that feature is 84.99999 . That's a difference of 7.62939453125e-06 .
The value of feature 0 for sample 20463 is 85.0 and a split threshold for that feature is 84.99999 . That's a difference of 7.62939453125e-06 .
The value of feature 0 for sample 32459 is 85.0 and a split threshold for that feature is 84.99999 . That's a difference of 7.62939453125e-06 .

@hcho3
Collaborator

hcho3 commented Jan 18, 2019

@slundberg FYI, you can find the number of decimal digits required to distinguish all floating-point representations with std::numeric_limits<float>::max_digits10.

As for the weird choice of thresholds, I'll have to come back to it and dive deeper.
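
For reference, a quick Python sketch of the same point with a hypothetical threshold (9 significant decimal digits, the float equivalent of max_digits10, round-trips a float32 exactly; 6 digits does not):

import numpy as np

t = np.nextafter(np.float32(85.0), np.float32(0.0))   # hypothetical threshold, 84.99999237...
print("%.9g" % t)                                      # '84.9999924'
print(np.float32(float("%.9g" % t)) == t)              # True: exact round trip
print(np.float32(float("%.6g" % t)) == t)              # False: lands on 85.0 instead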

@slundberg
Contributor Author

@hcho3 Interesting, since there are already 10 digits printed, yet I saw a mismatch in the JSON dump. It could probably be sorted out, but I think the choice of such unstable thresholds is the more important issue.

@hcho3
Collaborator

hcho3 commented Jan 18, 2019

@slundberg Does SHAP explainer use the JSON dump or parse the binary model file directly?

@slundberg
Contributor Author

The main version of the explainer is built into XGBoost, so we use that when possible. Some features, like explaining the loss of the model, relied on the JSON dump, but as of v0.28 SHAP parses the binary model directly to avoid the problems in this issue.

@jjdelvalle

This issue seems related to mine: if you load the binary model it works, but if you parse the JSON dump you get the wrong thresholds. Hopefully the new implementation will store the values accurately.

@ras44
Contributor

ras44 commented May 2, 2019

@slundberg I ran into a similar issue and I believe the solution is to convert the thresholds to floats after parsing the JSON.

The decimal representation shown in the JSON is guaranteed to reproduce the float value, but it must be converted to a float before doing any further calculations. In this case, the conversion to float would yield 85 (this is what the xgboost binary uses internally with thresholds stored as floats).

In fact, any values loaded from the JSON (split conditions, leaf values, etc.) that are internally treated as floats should be converted to floats and handled with float operations. For example, if you were to parse the JSON to recreate a binary logistic model and then reproduce its predictions by computing the sigmoid as xgboost does, you would want float exponentiation rather than double exponentiation, as well as float-converted weights.

An example of this treatment is here: #3960 (comment)
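
A minimal sketch of that point with made-up numbers (not taken from this issue's model): compare a JSON-parsed threshold against a feature value in double precision versus casting both to float32 first, and use float32 operations when reproducing a binary:logistic prediction.

import numpy as np

threshold_from_json = 84.99999237060547   # parsed from JSON as a Python float (double)
feature_value = 84.99999                  # raw double from the input data

# Comparing in double precision routes the sample left...
print(feature_value < threshold_from_json)                          # True
# ...while casting both to float32 first (as the xgboost binary does) routes it right.
print(np.float32(feature_value) < np.float32(threshold_from_json))  # False

# Likewise, use float32 operations when turning a parsed margin into a probability.
margin = np.float32(-0.1234)
prob = np.float32(1.0) / (np.float32(1.0) + np.exp(-margin))        # float32 sigmoid
print(prob, prob.dtype)                                             # ... float32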

@trivialfis
Member

Closed by #7545. One can now export the model directly in UBJSON by calling save_raw.
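
For reference, a minimal sketch of that export path (assuming XGBoost 1.6 or later, where save_raw takes a raw_format argument and load_model accepts the returned buffer):

import numpy as np
import xgboost

# Tiny throwaway model just to have something to serialize.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)
booster = xgboost.train({"objective": "binary:logistic"},
                        xgboost.DMatrix(X, label=y), num_boost_round=5)

# Export as UBJSON bytes: thresholds keep their exact binary float values,
# so there is no decimal round trip to worry about.
raw_ubj = booster.save_raw(raw_format="ubj")

# Reload from the in-memory buffer.
restored = xgboost.Booster()
restored.load_model(raw_ubj)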
