Add PMMLCountVectorizer and PMMLTfIdfVectorizer classes #74

Closed
nejatb opened this issue Jan 8, 2018 · 2 comments


nejatb commented Jan 8, 2018

Hi Villu,

Following issue #68, we have been attempting to add support for a normalized TfidfVectorizer, since our TfidfVectorizer was part of the pipeline. The way I went about doing this is rather hacky:
I created a Python class called NormalizedFeatureExtraction, in which I copied the logic of scikit-learn's TfidfVectorizer, but enforced normalization by removing the norm parameter from the __init__ method and removing self.norm, normalizing the vectors wherever needed instead. However, to satisfy the Java side, I still return None for

    @property
    def norm(self):
        return None

and I kept your JPMML TfidfVectorizer implementation as its corresponding Java side.
I wanted to ask if there is a better way to do this. Also, when making this change on the Python side, do I have to make any changes on the Java side as well? Is there a reason why the JPMML TfidfVectorizer throws an exception when norm is not None? Could adding normalization the way I have, without any change to the Java side, potentially cause any logic problems?
Many thanks for your clarifications.
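For reference, a minimal sketch of the subclass approach described above (the class name follows the description; the exact normalization details are assumptions, not the actual code):

```python
# Hypothetical sketch of the approach described above: a TfidfVectorizer
# subclass that always applies L2 normalization itself, while reporting
# norm = None to the converter. Names and details are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

class NormalizedFeatureExtraction(TfidfVectorizer):

    def __init__(self, **kwargs):
        kwargs.pop("norm", None) # Forbid an external norm choice
        super(NormalizedFeatureExtraction, self).__init__(norm = None, **kwargs)

    @property
    def norm(self):
        # Seen by the converter, which rejects norm != None
        return None

    @norm.setter
    def norm(self, value):
        pass # Ignored; normalization is enforced in the transform methods

    def fit_transform(self, raw_documents, y = None):
        X = super(NormalizedFeatureExtraction, self).fit_transform(raw_documents, y)
        return normalize(X, norm = "l2")

    def transform(self, raw_documents):
        X = super(NormalizedFeatureExtraction, self).transform(raw_documents)
        return normalize(X, norm = "l2")
```

Because the converter only sees norm as None, the generated PMML would not contain the normalization step, which is exactly the mismatch risk raised below.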

vruusmann (Member) commented Jan 8, 2018

Is there a reason why the JPMML TfidfVectorizer throws an exception when norm is not None?

This exception is thrown to inform you that normalization in general, and the "norm" attribute in particular, is not supported at the moment. If the converter simply ignored this fact, then the PMML representation of the Scikit-Learn pipeline wouldn't be exact/correct, and you'd be getting mismatching predictions later on.

I remember that I have discussed the background of this "not supported" decision someplace else before. In brief, Scikit-Learn calculates term frequencies all in one go, whereas (J)PMML calculates them one by one, as they are needed (kind of lazy evaluation approach). It would be terribly inefficient to perform normalization over the complete vocabulary - (J)PMML would need to invoke term frequency calculation both for "active" and "inactive" terms. In most cases, the majority of terms fall into the "inactive" category.
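A toy illustration of the cost difference (the vocabulary, idf values and document are invented for the example; this is not JPMML code):

```python
import math

# Toy model of the two evaluation strategies. "Lazy" evaluation only ever
# computes tf-idf for terms that occur in the document; normalization
# forces evaluation over the complete vocabulary.
vocabulary = ["cat", "dog", "fish", "bird", "horse"]
idf = {"cat": 1.0, "dog": 1.5, "fish": 2.0, "bird": 2.0, "horse": 2.5}

def tfidf(term, tokens):
    return tokens.count(term) * idf[term]

tokens = "cat dog cat".split()

# Lazy, per-term evaluation: only "active" terms are computed
active_terms = set(tokens) & set(vocabulary)
lazy_evaluations = len(active_terms) # 2 evaluations

# Normalized evaluation: the L2 norm depends on every vocabulary term,
# so "inactive" (zero-frequency) terms must be evaluated too
values = [tfidf(term, tokens) for term in vocabulary]
norm = math.sqrt(sum(value ** 2 for value in values))
eager_evaluations = len(vocabulary) # 5 evaluations

print(lazy_evaluations, eager_evaluations, norm)
```

With a realistic vocabulary of tens of thousands of terms and documents containing only a handful of them, the gap between the two counts becomes dramatic.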

You should consider rearranging your Scikit-Learn pipeline to perform normalization after you've identified which terms are "active":

pipeline = Pipeline([
  ("generate terms", TfidfVectorizer(norm = None)), # All terms; norm = None, because normalization is not supported here
  ("select terms", SelectKBest(k = 500)), # Identify/select 500 "active" terms
  ("normalize terms", Normalizer()), # Perform normalization over the subset of 500 "active" terms
  ("classifier", RandomForestClassifier())
])

Would the explicit normalization using the sklearn.preprocessing.Normalizer transformation work for you?

Some estimator types contain "internal" normalization logic, so the explicit normalization step might be unnecessary/redundant.
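As a sanity check, the rearranged pipeline can be run end-to-end on a toy corpus (the data, the k value and the score function are invented for the example):

```python
# Runnable sketch of the rearranged pipeline on a tiny invented corpus.
# chi2 is used as the SelectKBest score function because tf-idf features
# are non-negative; k is shrunk to fit the toy vocabulary.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import Normalizer
from sklearn.ensemble import RandomForestClassifier

corpus = ["good movie", "bad movie", "good plot great cast", "bad plot awful cast"]
labels = [1, 0, 1, 0]

pipeline = Pipeline([
    ("generate terms", TfidfVectorizer(norm = None)), # norm = None keeps the vectorizer convertible
    ("select terms", SelectKBest(chi2, k = 4)),
    ("normalize terms", Normalizer(norm = "l2")),
    ("classifier", RandomForestClassifier(n_estimators = 10, random_state = 13))
])
pipeline.fit(corpus, labels)
print(pipeline.predict(["great movie"]))
```

The normalization is now an explicit, separately convertible pipeline step that operates only on the selected "active" terms.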

Also, when making this change on the Python side, do I have to make any changes on the Java side as well?

The Java side must reproduce the prediction logic of the Scikit-Learn/Python side.

In other words, if some Python class attribute is significant from the prediction logic perspective, then the converter must look it up, and generate appropriate PMML markup for it.

Could adding normalization the way I have, without any change to the Java side, potentially cause any logic problems?

When in doubt, compare Scikit-Learn predictions with (J)PMML predictions over a representative data set.

@vruusmann vruusmann changed the title Adding Normalization to TfidfVectorizer Add PMMLCountVectorizer and PMMLTfIdfVectorizer classes Jan 8, 2018
@vruusmann (Member)

I refactored this issue to address a more generic usability concern.

Namely, it's rather tricky to get the parameterization of Scikit-Learn's CountVectorizer and TfidfVectorizer classes correct - there are a number of constructor parameters (e.g. norm, tokenizer) that only accept fixed values.

The solution would be to introduce PMMLCountVectorizer and PMMLTfidfVectorizer subclasses, respectively, that automatically fill in "fixed value" arguments, and only query the end user for "variable value" arguments.
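A possible shape for such subclasses (a sketch only; which arguments are "fixed" versus "variable" is an assumption here, not the final API, and only a few illustrative variable arguments are exposed):

```python
# Hypothetical sketch of the proposed subclasses: the constructor pins
# "fixed value" arguments (here analyzer and binary; plus norm = None for
# the TF-IDF variant) and only asks the end user for "variable value" ones.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

class PMMLCountVectorizer(CountVectorizer):

    def __init__(self, ngram_range = (1, 1), stop_words = None, max_features = None):
        super(PMMLCountVectorizer, self).__init__(ngram_range = ngram_range,
            stop_words = stop_words, max_features = max_features,
            analyzer = "word", binary = False) # Fixed values

class PMMLTfidfVectorizer(TfidfVectorizer):

    def __init__(self, ngram_range = (1, 1), stop_words = None, max_features = None):
        super(PMMLTfidfVectorizer, self).__init__(ngram_range = ngram_range,
            stop_words = stop_words, max_features = max_features,
            analyzer = "word", binary = False, norm = None) # Fixed values
```

A side benefit of this layout is that get_params()/clone() only see the exposed "variable value" arguments, so the fixed values cannot be accidentally overridden via set_params.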
