Predictions differ #8

Closed
Hoeze opened this issue May 19, 2023 · 5 comments
Labels
question Further information is requested

Comments

@Hoeze

Hoeze commented May 19, 2023

  • ebm2onnx version: 1.3.0
  • Python version: 3.8
  • Operating System: Arch Linux

Description

I would like to convert an interpret v0.2.7 model to ONNX to preserve it for future use.
However, the predictions I get from the ONNX model differ strongly from those of the original model:

>>> original_result
array([[0.98610258, 0.01389742],
       [0.99524099, 0.00475901],
       [0.99398961, 0.00601039],
       [0.99739259, 0.00260741]])
>>> onnx_result
[array([[0.9861026 , 0.01389742],
       [0.96940696, 0.03059301],
       [0.9602685 , 0.03973148],
       [0.98411477, 0.01588521]], dtype=float32)]
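
To make the mismatch explicit, here is a minimal, dependency-free sketch that lists the rows where the two outputs disagree. The values are copied from the output above. The `mismatched_rows` helper and the `TOLERANCE` threshold are assumptions for illustration (the ONNX output is float32, so differences around 1e-7 are expected and should not count as errors).

```python
# Values copied from the outputs above.
original_result = [
    [0.98610258, 0.01389742],
    [0.99524099, 0.00475901],
    [0.99398961, 0.00601039],
    [0.99739259, 0.00260741],
]
onnx_result = [
    [0.9861026, 0.01389742],
    [0.96940696, 0.03059301],
    [0.9602685, 0.03973148],
    [0.98411477, 0.01588521],
]

# Assumed threshold: loose enough to absorb float32 rounding noise.
TOLERANCE = 1e-6


def mismatched_rows(expected, actual, tol):
    """Return (row index, max absolute difference) for rows differing by more than tol."""
    rows = []
    for i, (exp_row, act_row) in enumerate(zip(expected, actual)):
        max_diff = max(abs(e - a) for e, a in zip(exp_row, act_row))
        if max_diff > tol:
            rows.append((i, max_diff))
    return rows


for index, diff in mismatched_rows(original_result, onnx_result, TOLERANCE):
    print(f"row {index}: max abs difference {diff:.6f}")
```

Only the first row agrees within the tolerance; the other three differ by roughly 0.01 to 0.03, which is far more than rounding noise.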

What I Did

My conversion script:

#!/usr/bin/env python3
# requires "interpret==0.2.7" "interpret_core==0.2.7" "ebm2onnx==1.3"
import onnx
import onnxruntime
import ebm2onnx

import pickle
import json

import numpy as np
import pandas as pd

with open("AbSplice_DNA.pkl", "rb") as fd:
    absplice_dna_model = pickle.load(fd)

print(json.dumps(dict(zip(absplice_dna_model.feature_names, absplice_dna_model.feature_types)), indent=2))

test_df = pd.read_parquet("test.parquet")

onnx_model = ebm2onnx.to_onnx(
    absplice_dna_model,
    ebm2onnx.get_dtype_from_pandas(test_df),
    predict_proba=True
)
onnx.save_model(onnx_model, 'ebm_model.onnx')
session = onnxruntime.InferenceSession('ebm_model.onnx')

original_result = absplice_dna_model.predict_proba(test_df)
print(original_result)
onnx_result = session.run(None, {k: np.asarray(v) for k, v in test_df.items()})
print(onnx_result)

Further, you can find all necessary files to reproduce my issue in the attached zip file:
onnx_test.zip

Any help would be highly appreciated!

MainRo added the "question" label (Further information is requested) on May 31, 2023
@MainRo
Collaborator

MainRo commented May 31, 2023

Did you mean 3.1.0 instead of 1.3.0 for the ebm2onnx version?
OK, I see in the script that it is indeed 1.3.0.

Do you have the result of "ebm2onnx.get_dtype_from_pandas(test_df)" or can you share the test set?

@Hoeze
Author

Hoeze commented May 31, 2023

Hi @MainRo, thanks for looking into it!
Yes, please see above: the reproducible example, including the test set, is in the onnx_test.zip file.

@MainRo
Collaborator

MainRo commented Jun 27, 2023

I started analyzing the issue but have not yet found where it comes from. The scores associated with each term do not seem correct in the converted model.
Can you try to:

  • retrain with the same environment but disable the interactions
  • retrain with the latest version of interpretml (0.4.2) and ebm2onnx (3.1.1)

@MainRo
Collaborator

MainRo commented Jun 27, 2023

@Hoeze forget my previous comment.

Can you check the type of the splice_site_is_expressed column when training the model? Especially, check that it is declared as an int and not a float.

This is a categorical column, and the values in the dataframe are 0 or 1. The type of the column in the parquet file is integer, but I see that internally, EBM treats the values as floats before doing the categorical encoding.

The difference comes from this feature, most likely because the value is converted to a float at some point.

When I change the type of this column to string and update the internal types of the EBM model, I get similar values from interpret and ONNX.

@Hoeze
Author

Hoeze commented Aug 9, 2023

Thanks a lot @MainRo, now this makes a lot of sense.
"splice_site_is_expressed" in the EBM gets converted from int -> float -> string -> int...
E.g. splice_site_is_expressed == 1 (int) -> 1.0 (float) -> "1.0" (string) -> 0 (int) 🤦
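
A minimal sketch of why this chain breaks the lookup, in plain Python. This is not the actual interpret or ebm2onnx code: the `category_to_bin` table, the bin indices, `UNKNOWN_BIN`, and the `encode` helper are all hypothetical. It only assumes that the categories were recorded as strings derived from floats, and that unseen categories fall back to a default bin.

```python
# Hypothetical category table: keys are strings derived from float values,
# as would happen if 0 and 1 were cast to 0.0 and 1.0 before encoding.
category_to_bin = {"0.0": 1, "1.0": 2}

# Hypothetical fallback bin for categories missing from the table.
UNKNOWN_BIN = 0


def encode(value):
    """Look up a value's bin by its string form, falling back to UNKNOWN_BIN."""
    return category_to_bin.get(str(value), UNKNOWN_BIN)


print(str(1), "->", encode(1))                # int: key "1" is not in the table
print(str(float(1)), "->", encode(float(1)))  # float: key "1.0" matches
```

With an integer input the string key is "1", which is absent from the table, so the value silently lands in the fallback bin. Feeding the same value as a float (or as the exact string stored in the model) produces the matching key.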

I manually fixed this in the onnx models using onnx-modifier.

@Hoeze Hoeze closed this as completed Aug 9, 2023