Ran into an issue with numerical precision and thresholds #4060

For a (not-shareable) dataset I was working on, XGBoost chose thresholds that were within machine epsilon of actual training-set feature values (less than np.finfo(np.float32).eps away). This led to the JSON dump of the model being wrong, since the conversion to base 10 destroyed the information needed to route samples down the correct path. As a result, the only way to read a tree that behaved the same as XGBoost was to use the raw memory dump.

I am posting this issue so this danger is noted for others, and in case it is worth investigating why XGBoost chooses such unstable thresholds. It seems like asking for trouble when exporting models or running them on different machines, since an epsilon change in a feature value can cause a meaningful change in the model output (the sample goes down a different path).
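A minimal sketch of the failure mode, with made-up numbers (assuming NumPy): a threshold one float32 ulp above a feature value routes the sample left inside XGBoost, but a decimal dump with too few digits collapses the threshold onto the value and flips the route:

```python
import numpy as np

v = np.float32(0.5)                   # hypothetical feature value
t = np.nextafter(v, np.float32(1.0))  # split threshold one ulp above it

print(v < t)  # True: inside XGBoost the sample goes left

t_dumped = float(f"{float(t):.6g}")  # a lossy base-10 dump prints "0.5"
print(v < t_dumped)  # False: the reloaded tree sends the sample right
```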
Comments
Would adding more decimal digits in the dump help? I do agree, though, that the choice of thresholds is quite weird. It would be nice if we could reproduce the issue with a public dataset.
It might, but since base 10 and base 2 don't line up exactly, it would be tricky to ensure it was always right. Finding an example where the JSON is wrong would take a bit more work, but below is a simple example that shows the kind of very tight thresholds XGBoost creates, by printing every feature value that falls within 1e-5 of a split threshold (note it depends on the latest benchmark branch of shap to read the values without using the JSON output):

```python
import xgboost
import shap
import numpy as np

X, y = shap.datasets.adult()
model = xgboost.XGBClassifier(n_estimators=1, max_depth=4, random_state=0)
model.fit(X, y)
explainer = shap.TreeExplainer(model)

# Find feature values that sit very close to a split threshold
Xv = X.values
for i in range(explainer.model.thresholds.shape[0]):
    for j in range(explainer.model.num_nodes[i]):
        for k in range(Xv.shape[0]):
            v = Xv[k, explainer.model.features[i, j]]
            t = explainer.model.thresholds[i, j]
            if abs(v - t) < 1e-5:
                print(
                    "The value of feature", explainer.model.features[i, j],
                    "for sample", k, "is", v,
                    "and a split threshold for that feature is", t,
                    ". That's a difference of", v - t, ".",
                )
```
@slundberg FYI, you can find out the number of decimal digits required to differentiate all floating-point representations. As for the weird choice of thresholds, I'll have to come back to it and dive deeper.
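For illustration, a minimal sketch of that round-trip property (assuming NumPy): np.format_float_positional with unique=True prints the shortest decimal string that parses back to exactly the same float32, which never takes more than 9 significant digits:

```python
import numpy as np

t = np.float32(1.0) / np.float32(3.0)  # an arbitrary float32 threshold

s = np.format_float_positional(t, unique=True)  # shortest round-trippable decimal
print(s)                   # 0.33333334
assert np.float32(s) == t  # parsing it back as float32 recovers the exact value
assert np.float64(s) != t  # but read as a double, it is a different number
```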
@hcho3 Interesting, since there are already 10 digits printed, yet I saw a mismatch in the JSON dump. That could probably be sorted out, but I think the choice of such unstable thresholds is the more important issue.
@slundberg Does the SHAP explainer use the JSON dump, or does it parse the binary model file directly?
The main version is built into XGBoost, so we use that when possible. Some features, like explaining the loss of the model, relied on the JSON dump, but as of v0.28 SHAP parses the binary directly to avoid the problems described in this issue.
This issue seems related to mine, in that loading the binary works but parsing the JSON dump yields the wrong thresholds. Hopefully the new implementation will store the values accurately.
@slundberg I ran into a similar issue, and I believe the solution is to convert the thresholds to single-precision floats after parsing the JSON. The decimal representation shown in the JSON is guaranteed to reproduce the float value, but it must be converted to a float before doing any further calculations. In this case, the conversion to float would yield 85 (this is what the xgboost binary uses internally, with thresholds stored as floats). In fact, any values loaded from JSON (split conditions, leaf values, etc.) that are internally treated as floats should be converted to floats and operated on with float arithmetic. For example, if you were to parse the JSON to recreate a binary logistic model and then recreate the predictions by calculating the sigmoid function as xgboost does, you would want to use float exponentiation instead of double exponentiation, as well as float-converted weights. An example of this treatment is here: #3960 (comment)
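A minimal sketch of that discipline, using hypothetical numbers (the threshold string, feature value, and margin below are made up for illustration):

```python
import numpy as np

t_json = "85.00000762939453"  # hypothetical split condition from a JSON dump
x = 85.000005                 # hypothetical raw feature value

print(x < float(t_json))                   # True:  double comparison routes left
print(np.float32(x) < np.float32(t_json))  # False: float32 rounds x up onto the
                                           # threshold, matching XGBoost's routing

# The same treatment applies when reproducing a binary:logistic prediction:
margin = np.float32(-1.5)  # hypothetical sum of float32 leaf values
pred = np.float32(1) / (np.float32(1) + np.exp(-margin))  # stays float32 throughout
```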
Closed by #7545. One can now directly export the model in UBJ.
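If it helps, a hedged sketch of that export, assuming the current Booster.save_model API, which infers the format from the file extension:

```python
import xgboost

# Assuming `model` is a fitted XGBClassifier (e.g. from the snippet above);
# the .ubj extension selects UBJSON, which round-trips float32 values exactly.
booster = model.get_booster()
booster.save_model("model.ubj")

reloaded = xgboost.Booster()
reloaded.load_model("model.ubj")
```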