Add option locale to CountVectorizer, TfIdfVectorizer converters #1020

Merged 16 commits on Aug 30, 2023
6 changes: 6 additions & 0 deletions .azure-pipelines/linux-conda-CI.yml
@@ -200,6 +200,12 @@ jobs:
environmentName: 'py$(python.version)'
packageSpecs: 'python=$(python.version)'

- script: |
    sudo apt-get install -y language-pack-en
    sudo locale-gen en_US.UTF-8
    sudo update-locale LANG=en_US.UTF-8
  displayName: 'Install packages'

- script: |
    test '$(python.version)' == '3.7' && apt-get install protobuf-compiler libprotoc-dev
    conda config --set always_yes yes --set changeps1 no
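The CI step above generates the `en_US.UTF-8` locale so that locale-dependent tests can run. A minimal sketch, using only Python's standard `locale` module, of how a test could check that a locale is installed before relying on it. `locale_available` is a hypothetical helper and is not part of this PR.

```python
import locale


def locale_available(name: str) -> bool:
    """Return True if `name` can be set as LC_CTYPE on this machine."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        # Raised when the locale is unknown or not generated on this system.
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # always restore


has_c_locale = locale_available("C")  # the "C" locale always exists
```

A test could call `locale_available("en_US.UTF-8")` and skip itself when the result is `False`.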
1 change: 1 addition & 0 deletions .gitignore
@@ -68,3 +68,4 @@ docs/tutorial/*.onnx
docs/tutorial/*.jpg
docs/tutorial/*.png
docs/tutorial/*.dot
docs/tutorial/catboost_info
8 changes: 8 additions & 0 deletions CHANGELOGS.md
@@ -0,0 +1,8 @@
Change Logs
===========

1.16.0
++++++

* add option 'language' to converters of CountVectorizer, TfIdfVectorizer
[#1020](https://github.com/onnx/sklearn-onnx/pull/1020)
36 changes: 34 additions & 2 deletions README.md
@@ -2,9 +2,9 @@

<p align="center"><img width="50%" src="docs/logo_main.png" /></p>

[![Build Status Linux](https://dev.azure.com/onnxmltools/sklearn-onnx/_apis/build/status%2Fonnx.sklearn-onnx.linux.CI?branchName=refs%2Fpull%2F1009%2Fmerge)](https://dev.azure.com/onnxmltools/sklearn-onnx/_build/latest?definitionId=21&branchName=refs%2Fpull%2F1009%2Fmerge)
[![Build Status](https://dev.azure.com/onnxmltools/sklearn-onnx/_apis/build/status%2Fonnx.sklearn-onnx.linux.CI?branchName=refs%2Fpull%2F1020%2Fmerge)](https://dev.azure.com/onnxmltools/sklearn-onnx/_build/latest?definitionId=21&branchName=refs%2Fpull%2F1020%2Fmerge)

[![Build Status Windows](https://dev.azure.com/onnxmltools/sklearn-onnx/_apis/build/status%2Fonnx.sklearn-onnx.win.CI?branchName=refs%2Fpull%2F1009%2Fmerge)](https://dev.azure.com/onnxmltools/sklearn-onnx/_build/latest?definitionId=22&branchName=refs%2Fpull%2F1009%2Fmerge)
[![Build Status](https://dev.azure.com/onnxmltools/sklearn-onnx/_apis/build/status%2Fonnx.sklearn-onnx.win.CI?branchName=refs%2Fpull%2F1020%2Fmerge)](https://dev.azure.com/onnxmltools/sklearn-onnx/_build/latest?definitionId=22&branchName=refs%2Fpull%2F1020%2Fmerge)

[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

@@ -33,6 +33,38 @@ Or you can install from the source with the latest changes.
pip install git+https://github.com/onnx/sklearn-onnx.git
```

## Getting started

```python
# Train a model.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target
X = X.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y)
clr = RandomForestClassifier()
clr.fit(X_train, y_train)

# Convert into ONNX format.
from skl2onnx import to_onnx

onx = to_onnx(clr, X[:1])
with open("rf_iris.onnx", "wb") as f:
    f.write(onx.SerializeToString())

# Compute the prediction with onnxruntime.
import onnxruntime as rt

sess = rt.InferenceSession("rf_iris.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
label_name = sess.get_outputs()[0].name
pred_onx = sess.run([label_name], {input_name: X_test.astype(np.float32)})[0]
```

## Contribute
We welcome contributions in the form of feedback, ideas, or code.

1 change: 0 additions & 1 deletion docs/api_summary.rst
@@ -45,7 +45,6 @@ it is possible to enable logging:
    import logging
    logger = logging.getLogger('skl2onnx')
    logger.setLevel(logging.DEBUG)
    logging.basicConfig(level=logging.DEBUG)

Example :ref:`l-example-logging` illustrates what it looks like.
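
With the `logging.basicConfig` call dropped from this snippet, DEBUG records from the `skl2onnx` logger are only displayed if the application has configured a handler that shows them. A minimal sketch, using only the standard `logging` module, attaches a handler to the named logger and leaves the root logger untouched. The handler, the format string, and the sample message are illustrative assumptions and do not come from skl2onnx.

```python
import io
import logging

logger = logging.getLogger("skl2onnx")
logger.setLevel(logging.DEBUG)

# Attach a handler to this logger only; the root logger is not reconfigured.
stream = io.StringIO()  # stand-in for sys.stderr so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(name)s:%(levelname)s:%(message)s"))
logger.addHandler(handler)

logger.debug("illustrative message")  # hypothetical record, not emitted by skl2onnx
captured = stream.getvalue()

logger.removeHandler(handler)  # leave the logger as it was found
```

In a real script the handler would usually write to `sys.stderr` instead of an in-memory buffer.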

15 changes: 12 additions & 3 deletions docs/conf.py
@@ -5,6 +5,7 @@

import os
import sys
import logging
import warnings
import skl2onnx

@@ -72,16 +73,14 @@

linkcode_resolve = make_linkcode_resolve(
    "skl2onnx",
    "https://github.com/onnx/skl2onnx/blob/{revision}/" "{package}/{path}#L{lineno}",
    "https://github.com/onnx/skl2onnx/blob/{revision}/{package}/{path}#L{lineno}",
)

intersphinx_mapping = {
    "joblib": ("https://joblib.readthedocs.io/en/latest/", None),
    "python": ("https://docs.python.org/{.major}".format(sys.version_info), None),
    "matplotlib": ("https://matplotlib.org/", None),
    "mlinsights": ("http://www.xavierdupre.fr/app/mlinsights/helpsphinx/", None),
    "numpy": ("https://docs.scipy.org/doc/numpy/", None),
    "pyquickhelper": ("http://www.xavierdupre.fr/app/pyquickhelper/helpsphinx/", None),
    "onnxruntime": ("https://onnxruntime.ai/docs/api/python/", None),
    "pandas": ("https://pandas.pydata.org/pandas-docs/stable/", None),
    "scipy": ("https://docs.scipy.org/doc/scipy/reference", None),
@@ -144,4 +143,14 @@
def setup(app):
    # Placeholder to initialize the folder before
    # generating the documentation.
    logger = logging.getLogger("skl2onnx")
    logger.setLevel(logging.WARNING)
    logger = logging.getLogger("matplotlib.font_manager")
    logger.setLevel(logging.WARNING)
    logger = logging.getLogger("matplotlib.ticker")
    logger.setLevel(logging.WARNING)
    logger = logging.getLogger("PIL.PngImagePlugin")
    logger.setLevel(logging.WARNING)
    logger = logging.getLogger("graphviz._tools")
    logger.setLevel(logging.WARNING)
    return app
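
The repeated `getLogger`/`setLevel` pairs in `setup` could also be written as a loop over the logger names. This is a sketch of an alternative, not what the PR commits; `NOISY_LOGGERS` and `quiet_loggers` are hypothetical names.

```python
import logging

# Logger names taken from the setup() function in this diff.
NOISY_LOGGERS = (
    "skl2onnx",
    "matplotlib.font_manager",
    "matplotlib.ticker",
    "PIL.PngImagePlugin",
    "graphviz._tools",
)


def quiet_loggers(names=NOISY_LOGGERS, level=logging.WARNING):
    """Set every named logger to `level` so lower-severity records are dropped."""
    for name in names:
        logging.getLogger(name).setLevel(level)


quiet_loggers()
```

Adding another noisy library then only requires one more entry in the tuple.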
2 changes: 1 addition & 1 deletion docs/examples/plot_convert_model.py
@@ -69,7 +69,7 @@
with open("logreg_iris.onnx", "wb") as f:
    f.write(onx.SerializeToString())

sess = rt.InferenceSession("logreg_iris.onnx")
sess = rt.InferenceSession("logreg_iris.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
label_name = sess.get_outputs()[0].name
pred_onx = sess.run([label_name], {input_name: X_test.astype(numpy.float32)})[0]
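
Every example touched by this PR now passes `providers=["CPUExecutionProvider"]` to `InferenceSession`. Where a GPU build of onnxruntime may be installed, one option is to build the list from what the installed build reports. A minimal sketch, assuming `onnxruntime.get_available_providers()` supplies the `available` list; `select_providers` is a hypothetical helper and belongs neither to this PR nor to onnxruntime.

```python
def select_providers(
    available,
    preferred=("CUDAExecutionProvider", "CPUExecutionProvider"),
):
    """Keep the preferred providers that are available, in preference order.

    Falls back to the CPU provider when nothing in `preferred` is available.
    """
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]


# Hypothetical usage with onnxruntime installed:
#   import onnxruntime as rt
#   providers = select_providers(rt.get_available_providers())
#   sess = rt.InferenceSession("logreg_iris.onnx", providers=providers)
cpu_only = select_providers(["CPUExecutionProvider"])
```

Hard-coding the CPU provider, as the examples do, keeps the documentation output reproducible on any machine.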
2 changes: 1 addition & 1 deletion docs/examples/plot_convert_syntax.py
@@ -31,7 +31,7 @@


def predict_with_onnxruntime(onx, X):
    sess = InferenceSession(onx.SerializeToString())
    sess = InferenceSession(onx.SerializeToString(), providers=["CPUExecutionProvider"])
    input_name = sess.get_inputs()[0].name
    res = sess.run(None, {input_name: X.astype(np.float32)})
    return res[0]
10 changes: 7 additions & 3 deletions docs/examples/plot_convert_zipmap.py
@@ -48,7 +48,7 @@
# Let's confirm the output type of the probabilities
# is a list of dictionaries with onnxruntime.

sess = rt.InferenceSession(onx.SerializeToString())
sess = rt.InferenceSession(onx.SerializeToString(), providers=["CPUExecutionProvider"])
res = sess.run(None, {"float_input": X_test.astype(numpy.float32)})
print(res[1][:2])
print("probabilities type:", type(res[1]))
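
As the comment in this hunk notes, the probabilities come back as a list of dictionaries, one per sample, mapping a class label to its probability. A minimal sketch in plain Python of turning that structure into one dense row per sample. `zipmap_to_rows` and the sample values are hypothetical and are not taken from the example's real output.

```python
def zipmap_to_rows(probas, classes=None):
    """Convert a list of {label: probability} dicts into a list of rows.

    Columns follow `classes`; by default, the sorted labels of the first dict.
    """
    if classes is None:
        classes = sorted(probas[0]) if probas else []
    return [[row[c] for c in classes] for row in probas]


# Illustrative values only, shaped like a list of per-sample dictionaries.
sample = [{0: 0.1, 1: 0.7, 2: 0.2}, {0: 0.8, 1: 0.1, 2: 0.1}]
rows = zipmap_to_rows(sample)
```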
@@ -66,7 +66,9 @@
    clr, initial_types=initial_type, options=options, target_opset=12
)

sess2 = rt.InferenceSession(onx2.SerializeToString())
sess2 = rt.InferenceSession(
    onx2.SerializeToString(), providers=["CPUExecutionProvider"]
)
res2 = sess2.run(None, {"float_input": X_test.astype(numpy.float32)})
print(res2[1][:2])
print("probabilities type:", type(res2[1]))
@@ -85,7 +87,9 @@
    clr, initial_types=initial_type, options=options, target_opset=12
)

sess3 = rt.InferenceSession(onx3.SerializeToString())
sess3 = rt.InferenceSession(
    onx3.SerializeToString(), providers=["CPUExecutionProvider"]
)
res3 = sess3.run(None, {"float_input": X_test.astype(numpy.float32)})
for i, out in enumerate(sess3.get_outputs()):
print(
2 changes: 1 addition & 1 deletion docs/examples/plot_custom_model.py
@@ -410,7 +410,7 @@ def predictable_tsne_converter(scope, operator, container):
##########################
# Predictions with onnxruntime.

sess = rt.InferenceSession("predictable_tsne.onnx")
sess = rt.InferenceSession("predictable_tsne.onnx", providers=["CPUExecutionProvider"])

pred_onx = sess.run(None, {"input": X_test[:1].astype(numpy.float32)})
print("transform", pred_onx[0])
4 changes: 3 additions & 1 deletion docs/examples/plot_custom_parser.py
@@ -259,7 +259,9 @@ def validator_classifier_parser(scope, model, inputs, custom_parsers=None):

X32 = X_test[:5].astype(np.float32)

sess = rt.InferenceSession(model_onnx.SerializeToString())
sess = rt.InferenceSession(
    model_onnx.SerializeToString(), providers=["CPUExecutionProvider"]
)
results = sess.run(None, {"X": X32})

print("--labels--")
4 changes: 3 additions & 1 deletion docs/examples/plot_custom_parser_alternative.py
@@ -236,7 +236,9 @@ def validator_classifier_parser(scope, model, inputs, custom_parsers=None):

X32 = X_test[:5].astype(np.float32)

sess = rt.InferenceSession(model_onnx.SerializeToString())
sess = rt.InferenceSession(
    model_onnx.SerializeToString(), providers=["CPUExecutionProvider"]
)
results = sess.run(None, {"X": X32})

print("--labels--")
2 changes: 1 addition & 1 deletion docs/examples/plot_errors_onnxruntime.py
@@ -41,7 +41,7 @@
)

example2 = "logreg_iris.onnx"
sess = rt.InferenceSession(example2)
sess = rt.InferenceSession(example2, providers=["CPUExecutionProvider"])

input_name = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name
12 changes: 8 additions & 4 deletions docs/examples/plot_gpr.py
@@ -51,7 +51,7 @@
initial_type = [("X", FloatTensorType([None, X_train.shape[1]]))]
onx = convert_sklearn(gpr, initial_types=initial_type, target_opset=12)

sess = rt.InferenceSession(onx.SerializeToString())
sess = rt.InferenceSession(onx.SerializeToString(), providers=["CPUExecutionProvider"])
try:
    pred_onx = sess.run(None, {"X": X_test.astype(numpy.float32)})[0]
except RuntimeError as e:
@@ -74,7 +74,7 @@
initial_type = [("X", FloatTensorType([None, None]))]
onx = convert_sklearn(gpr, initial_types=initial_type, target_opset=12)

sess = rt.InferenceSession(onx.SerializeToString())
sess = rt.InferenceSession(onx.SerializeToString(), providers=["CPUExecutionProvider"])
pred_onx = sess.run(None, {"X": X_test.astype(numpy.float32)})[0]

pred_skl = gpr.predict(X_test)
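
Once both `pred_onx` and `pred_skl` are available, a usual next step is to measure how far apart they are. A minimal sketch in plain Python of a maximum absolute difference. `max_abs_diff` and the demo numbers are hypothetical; real predictions would be flattened arrays, for example `pred_onx.ravel()`.

```python
def max_abs_diff(expected, got):
    """Largest absolute element-wise difference between two equal-length sequences."""
    if len(expected) != len(got):
        raise ValueError("sequences must have the same length")
    return max((abs(float(e) - float(g)) for e, g in zip(expected, got)), default=0.0)


# Made-up numbers standing in for scikit-learn and onnxruntime predictions.
pred_skl_demo = [1.0, 2.5, -3.0]
pred_onx_demo = [1.0, 2.25, -3.5]
diff = max_abs_diff(pred_skl_demo, pred_onx_demo)
```

A large value for a float32 model is a hint to try the double-precision conversion shown in the later hunks of this file.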
@@ -111,7 +111,9 @@
initial_type = [("X", DoubleTensorType([None, None]))]
onx64 = convert_sklearn(gpr, initial_types=initial_type, target_opset=12)

sess64 = rt.InferenceSession(onx64.SerializeToString())
sess64 = rt.InferenceSession(
    onx64.SerializeToString(), providers=["CPUExecutionProvider"]
)
pred_onx64 = sess64.run(None, {"X": X_test})[0]

print(pred_onx64[0, :10])
@@ -169,7 +171,9 @@
    gpr, initial_types=initial_type, options=options, target_opset=12
)

sess64_std = rt.InferenceSession(onx64_std.SerializeToString())
sess64_std = rt.InferenceSession(
    onx64_std.SerializeToString(), providers=["CPUExecutionProvider"]
)
pred_onx64_std = sess64_std.run(None, {"X": X_test[:5]})

pprint.pprint(pred_onx64_std)
10 changes: 7 additions & 3 deletions docs/examples/plot_intermediate_outputs.py
@@ -195,7 +195,7 @@ def convert_dataframe_schema(df, drop=None):
################################
# We are ready to run *onnxruntime*.

sess = rt.InferenceSession("pipeline_titanic.onnx")
sess = rt.InferenceSession("pipeline_titanic.onnx", providers=["CPUExecutionProvider"])
pred_onx = sess.run(None, inputs)
print("predict", pred_onx[0][:5])
print("predict_proba", pred_onx[1][:1])
@@ -228,7 +228,9 @@ def convert_dataframe_schema(df, drop=None):
################################
# Let's compute the numerical features.

sess = rt.InferenceSession("pipeline_titanic_numerical.onnx")
sess = rt.InferenceSession(
    "pipeline_titanic_numerical.onnx", providers=["CPUExecutionProvider"]
)
numX = sess.run(None, inputs)
print("numerical features", numX[0][:1])

@@ -238,7 +240,9 @@ def convert_dataframe_schema(df, drop=None):
print(model_onnx)
text_onnx = select_model_inputs_outputs(model_onnx, "variable2")
save_onnx_model(text_onnx, "pipeline_titanic_textual.onnx")
sess = rt.InferenceSession("pipeline_titanic_textual.onnx")
sess = rt.InferenceSession(
    "pipeline_titanic_textual.onnx", providers=["CPUExecutionProvider"]
)
numT = sess.run(None, inputs)
print("textual features", numT[0][:1])

8 changes: 6 additions & 2 deletions docs/examples/plot_investigate_pipeline.py
@@ -55,7 +55,9 @@
initial_types = [("input", FloatTensorType((None, X_digits.shape[1])))]
model_onnx = convert_sklearn(pipe, initial_types=initial_types, target_opset=12)

sess = rt.InferenceSession(model_onnx.SerializeToString())
sess = rt.InferenceSession(
    model_onnx.SerializeToString(), providers=["CPUExecutionProvider"]
)
print("skl predict_proba")
print(pipe.predict_proba(X_digits[:2]))
onx_pred = sess.run(None, {"input": X_digits[:2].astype(np.float32)})[1]
@@ -82,7 +84,9 @@

for i, step in enumerate(steps):
    onnx_step = step["onnx_step"]
    sess = rt.InferenceSession(onnx_step.SerializeToString())
    sess = rt.InferenceSession(
        onnx_step.SerializeToString(), providers=["CPUExecutionProvider"]
    )
    onnx_outputs = sess.run(None, {"input": X_digits[:2].astype(np.float32)})
    skl_outputs = step["model"]._debug.outputs
    print("step 1", type(step["model"]))
5 changes: 2 additions & 3 deletions docs/examples/plot_logging.py
@@ -46,7 +46,7 @@
onx = convert_sklearn(clr, initial_types=initial_type, target_opset=12)


sess = rt.InferenceSession(onx.SerializeToString())
sess = rt.InferenceSession(onx.SerializeToString(), providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
label_name = sess.get_outputs()[0].name
pred_onx = sess.run([label_name], {input_name: X_test.astype(numpy.float32)})[0]
@@ -74,18 +74,17 @@

logger = logging.getLogger("skl2onnx")
logger.setLevel(logging.DEBUG)
logging.basicConfig(level=logging.DEBUG)

convert_sklearn(clr, initial_types=initial_type, target_opset=12)

###########################
# And to disable it.

logger.setLevel(logging.INFO)
logging.basicConfig(level=logging.INFO)

convert_sklearn(clr, initial_types=initial_type, target_opset=12)

logger.setLevel(logging.WARNING)
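
The example raises the logger level to DEBUG for one conversion and then resets it by hand. A sketch of a context manager that restores the previous level automatically, using only the standard library. `temporary_level` is a hypothetical helper and is not part of skl2onnx or of this PR.

```python
import logging
from contextlib import contextmanager


@contextmanager
def temporary_level(name, level):
    """Set the level of logger `name` inside the block, then restore it."""
    logger = logging.getLogger(name)
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        logger.setLevel(previous)  # restored even if the block raises


logging.getLogger("skl2onnx").setLevel(logging.WARNING)
with temporary_level("skl2onnx", logging.DEBUG) as log:
    level_inside = log.level  # a conversion call would go here
level_after = logging.getLogger("skl2onnx").level
```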

#################################
# **Versions used for this example**
2 changes: 1 addition & 1 deletion docs/examples/plot_metadata.py
@@ -39,7 +39,7 @@
#############################
# With *ONNX Runtime*:

sess = InferenceSession(example)
sess = InferenceSession(example, providers=["CPUExecutionProvider"])
meta = sess.get_modelmeta()

print("custom_metadata_map={}".format(meta.custom_metadata_map))
4 changes: 3 additions & 1 deletion docs/examples/plot_nmf.py
@@ -113,7 +113,9 @@ def nmf_to_onnx(W, H, op_version=12):
########################################
# Let's compute prediction with it.

sess = InferenceSession(model_onnx.SerializeToString())
sess = InferenceSession(
    model_onnx.SerializeToString(), providers=["CPUExecutionProvider"]
)


def predict_onnx(sess, row_indices, col_indices):
4 changes: 3 additions & 1 deletion docs/examples/plot_onnx_operators.py
@@ -153,7 +153,9 @@
def predict_with_onnxruntime(model_def, *inputs):
    import onnxruntime as ort

    sess = ort.InferenceSession(model_def.SerializeToString())
    sess = ort.InferenceSession(
        model_def.SerializeToString(), providers=["CPUExecutionProvider"]
    )
    names = [i.name for i in sess.get_inputs()]
    dinputs = {name: input for name, input in zip(names, inputs)}
    res = sess.run(None, dinputs)
4 changes: 3 additions & 1 deletion docs/examples/plot_pipeline_lightgbm.py
@@ -112,7 +112,9 @@
# Predictions with onnxruntime.

try:
    sess = rt.InferenceSession("pipeline_lightgbm.onnx")
    sess = rt.InferenceSession(
        "pipeline_lightgbm.onnx", providers=["CPUExecutionProvider"]
    )
except OrtFail as e:
    print(e)
    print("The converter requires onnxmltools>=1.7.0")
2 changes: 1 addition & 1 deletion docs/examples/plot_pipeline_xgboost.py
@@ -128,7 +128,7 @@
##########################
# Predictions with onnxruntime.

sess = rt.InferenceSession("pipeline_xgboost.onnx")
sess = rt.InferenceSession("pipeline_xgboost.onnx", providers=["CPUExecutionProvider"])
pred_onx = sess.run(None, {"input": X[:5].astype(numpy.float32)})
print("predict", pred_onx[0])
print("predict_proba", pred_onx[1][:1])
2 changes: 1 addition & 1 deletion docs/examples/plot_tfidfvectorizer.py
@@ -197,7 +197,7 @@ def transform(self, posts):
##########################
# Predictions with onnxruntime.

sess = rt.InferenceSession("pipeline_tfidf.onnx")
sess = rt.InferenceSession("pipeline_tfidf.onnx", providers=["CPUExecutionProvider"])
print("---", train_data[0])
inputs = {"input": train_data[:1]}
pred_onx = sess.run(None, inputs)