
Add a built-in configuration dictionary for machine learning with text data #507

Open

rhiever opened this issue Jun 24, 2017 · 21 comments

@rhiever
Contributor

rhiever commented Jun 24, 2017

We can support machine learning with text data in TPOT by adding the CountVectorizer and TfidfVectorizer to a separate built-in configuration dictionary. I don't think we would need to change any of the other operators.

Unfortunately, without a pipeline grammar we can't force those vectorizers to always be at the beginning of every pipeline, but I suppose for text classification problems all of the pipelines that don't have one of the vectorizers will go "extinct."

Thoughts?

cc @weixuanfu2016 @teaearlgraycold

@rhiever rhiever changed the title Add a built-in configuration dictionary for text classification Add a built-in configuration dictionary for machine learning with text data Jun 24, 2017
@danthedaniel
Contributor

danthedaniel commented Jun 24, 2017

How would the text be passed to fit/transform? Right now we only handle float values for features, but both of these operators expect blocks of text.

Without any major changes to TPOT you'd only be able to use either of these operators in an external pipeline:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from tpot import TPOTClassifier

my_text = "...".split("\n")  # one document per line

pipeline = make_pipeline(
    CountVectorizer(),
    TPOTClassifier()
)
pipeline.fit(my_text, ...)

@weixuanfu
Contributor

Maybe we need a new text classification mode, rather than just a config file, for handling text input with the two vectorizers.

@ziarahman

@weixuanfu2016, I like the idea of a 'new text classification mode'.

@rhiever, thank you for opening this up for discussion. You mentioned CountVectorizer & TfidfVectorizer. What about HashingVectorizer?

@rhiever
Contributor Author

rhiever commented Jun 26, 2017

@teaearlgraycold, taking your example, it could work like this:

my_text = "...".split("\n")
class_labels = [1, 0, 0, 1, 1, ..., 0, 0]
# Assuming len(my_text) == len(class_labels)

my_tpot = TPOTClassifier(..., config='TPOT text', ...)
my_tpot.fit(my_text, class_labels)

We would indeed need to change the dataset validation procedure when config='TPOT text'. I wonder how we could get it to support a mix of text and non-text data, though.

@ziarahman, I haven't used the HashingVectorizer. Does it take input data in the same format as the other two vectorizers?

@weixuanfu
Contributor

How about fit(features (non-text data), text, target)? CountVectorizer and TfidfVectorizer could be applied to the text data only, with the result then stacked into the pipeline as below:

pipeline = make_pipeline(
    # StackingText is a hypothetical operator that would vectorize the text
    # input and horizontally stack the result with the numeric feature matrix
    StackingText(CountVectorizer(input_text), input_matrix),
    TPOTClassifier()
)
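
StackingText does not exist in scikit-learn; a rough sketch of the same stacking idea using scikit-learn's FeatureUnion, assuming X is a NumPy array whose first column holds the documents (the ColumnSelector helper is hypothetical):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.feature_extraction.text import CountVectorizer

class ColumnSelector(BaseEstimator, TransformerMixin):
    # hypothetical helper: pick the given column(s) out of a 2D array
    def __init__(self, cols, dtype=None):
        self.cols = cols
        self.dtype = dtype
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        out = X[:, self.cols]
        return out if self.dtype is None else out.astype(self.dtype)

# vectorize the text column and pass the numeric columns through alongside it
stacked_features = FeatureUnion([
    ('text', make_pipeline(ColumnSelector(0), CountVectorizer())),
    ('numeric', ColumnSelector([1, 2], dtype=float)),
])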

@danthedaniel
Contributor

danthedaniel commented Jun 26, 2017 via email

@weixuanfu
Contributor

Agreed. Maybe we need fit_params in fit() to organize the weight, groups, and text_input parameters, as scikit-learn does.

@weixuanfu
Contributor

fit_params could be a dictionary holding these parameters.
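
A minimal sketch of what that could look like from the user's side (variable names are placeholders; the text_input key is hypothetical):

fit_params = {
    'sample_weight': weights,  # per-sample weights, as in scikit-learn
    'groups': groups,          # group labels for grouped cross-validation
    'text_input': my_text,     # hypothetical key for the raw text data
}
my_tpot.fit(features, target, **fit_params)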

@ziarahman

@rhiever, as far as I know, yes, HashingVectorizer takes input data in the same format as the other two vectorizers. While the Tfidf and Count vectorizers are more popular for text classification with ML, the Hashing vectorizer is mostly used when the text corpus is large.

@danthedaniel
Contributor

danthedaniel commented Jun 27, 2017

I think the best solution here is to decompose TPOTBase.fit() into a series of smaller functions, then implement a separate fit function for each of TPOTClassifier, TPOTRegressor, and a new TPOTTextClassifier - each using the small functions to reduce code duplication. That way there doesn't need to be a ton of code like:

if self._text:
    # foo
else:
    # bar

Might need to do something similar for _evaluate_individual().
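
A minimal sketch of that decomposition (the helper names below are hypothetical, not TPOT's actual internals):

class TPOTBase:
    def fit(self, features, target):
        features, target = self._check_dataset(features, target)
        self._setup_optimization()
        return self._run_optimization(features, target)

class TPOTTextClassifier(TPOTBase):
    def _check_dataset(self, features, target):
        # text-aware validation instead of coercing every feature to float
        ...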

@rhiever
Contributor Author

rhiever commented Jun 27, 2017

Is text classification such a fundamentally different problem type that it requires a new TPOT class? Once the text is converted to a bag-of-[words, ngrams, etc.] representation, we're working with a regular feature matrix again. Maybe it'll be a sparse matrix, but it's still a feature matrix.

Here's an example of using sklearn pipelines to CountVectorize (etc) specific columns in a dataset: link

Although that approach seems to rely on the data being passed as a dictionary, I think we could have TPOT recognize which columns are text and apply the vectorizers specifically to those columns. Maybe via wrapped versions of CountVectorizer etc., as in the sketch below?
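
A minimal sketch of the wrapped-vectorizer idea using scikit-learn's newer ColumnTransformer API, assuming a pandas DataFrame with a column named 'text' (both assumptions for illustration):

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# vectorize only the 'text' column; pass every other column through untouched
wrapped_vectorizer = ColumnTransformer(
    [('vect', CountVectorizer(), 'text')],
    remainder='passthrough'
)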

@rhiever
Contributor Author

rhiever commented Jun 27, 2017

This sklearn issue seems relevant to our conversations here: scikit-learn/scikit-learn#9012

It may point to a better way to accomplish what we want within the sklearn Pipeline architecture.

@davidfox-ap

Is there a workaround for this we can use in the interim?

I tried simply adding the TPOT classifier to a pipeline:

pipeline_optimizer = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', TPOTClassifier(generations=5, population_size=20, verbosity=2))
])

But this returns the error "ufunc 'isnan' not supported for the input types", which I saw addressed elsewhere in the issues list. It seems to be an issue with numpy. I saw a suggestion to use toarray(), so I ended up with something like this:

vectorizer = CountVectorizer()
tfidf_transformer = TfidfTransformer()
trainVectorized = vectorizer.fit_transform(X_train)
testVectorized = vectorizer.transform(X_test)  # transform only: reuse the vocabulary fitted on the training set
finalTrain = tfidf_transformer.fit_transform(trainVectorized).toarray()
finalTest = tfidf_transformer.transform(testVectorized).toarray()  # likewise, no refitting on the test set

But this seems to produce input data that TPOT doesn't care for; I get a different error message: 'Input data is not in a valid format.' I'm sure I'm doing something wrong.

Any other ideas what I might try to get text classification working today (though I look forward to the tighter integration proposed here)?

@weixuanfu
Contributor

To solve this issue, we are working on the related issue #529, which adds a grammar configuration to support text classification.

@weixuanfu
Contributor

Sorry, it should be #523.

@weixuanfu weixuanfu reopened this Jul 26, 2017
@weixuanfu
Contributor

Oops, I closed this by mistake. My cellphone screen is too small for typing. Sorry.

@chananshgong

I think we should keep the automatic TPOT spirit when dealing with categorical and text features, and assume the input data is a mixture of numerical/categorical and text features. We should automatically infer each column's type and handle it appropriately while trying different pipelines; see the sketch below. In other words, whether or not a column is textual should be transparent to the user. What do you think?
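
A minimal sketch of that column-type inference, assuming a pandas DataFrame X (the cardinality cutoff is an arbitrary illustrative heuristic):

# object-dtype columns are candidates for text; everything else is numeric
object_columns = [c for c in X.columns if X[c].dtype == object]
numeric_columns = [c for c in X.columns if c not in object_columns]

# crude heuristic: low-cardinality object columns are categorical, the rest are free text
categorical_columns = [c for c in object_columns if X[c].nunique() < 20]
text_columns = [c for c in object_columns if c not in categorical_columns]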

@8bit-pixies
Contributor

8bit-pixies commented Jul 10, 2019

So, putting all the comments together: is the solution to do something like this?

import copy

import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from tpot import TPOTClassifier
from tpot.config import classifier_config_dict_light

X = pd.DataFrame({"DESCRIPTION": np.random.choice(["hello world", "foo bar"], 200), "DESCRIPTION2": np.random.choice(["hello world", "foo bar"], 200),
                  'NUMERIC': np.random.choice([0, 1], 200), "NUMERIC2": np.random.choice([0, 1], 200)})
y = np.random.choice([0, 1], 200)

class IdentityTransformer(TransformerMixin, BaseEstimator):
    """Pass-through transformer for the non-text columns."""
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X):
        return X

class TfidfTransformer(TransformerMixin):
    """Apply TfidfVectorizer to each text column and keep the listed non-text columns."""
    def __init__(self, text_columns, keep_columns=[], **kwargs):
        self.text_columns = text_columns if type(text_columns) is list else [text_columns]
        self.keep_columns = keep_columns if type(keep_columns) is list else [keep_columns]

        column_list = []
        for idx, text in enumerate(self.text_columns):
            column_list.append(('text' + str(idx), TfidfVectorizer(**kwargs), text))

        if len(self.keep_columns) > 0:
            column_list.append(('other', IdentityTransformer(), self.keep_columns))

        self.column_transformer = ColumnTransformer(column_list)

    def fit(self, X, y=None):
        self.column_transformer.fit(X, y)
        return self

    def transform(self, X):
        return self.column_transformer.transform(X)

# using TPOT config
config = copy.deepcopy(classifier_config_dict_light)
config["__main__.TfidfTransformer"] = {
        "text_columns": [["DESCRIPTION", "DESCRIPTION2"]],
        "keep_columns": [["NUMERIC", "NUMERIC2"]]
    }

tpot = TPOTClassifier(config_dict=config, verbosity=2, generations=5, population_size=2, early_stop=2, max_time_mins=2,
                     template='TfidfTransformer-Selector-Transformer-Classifier')
tpot.fit(X, y)
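
Note that the template string above pins the custom TfidfTransformer step to the front of every generated pipeline, which also addresses the ordering concern from the opening comment: no pipeline grammar is needed to force the vectorizer to come first.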

@8bit-pixies
Contributor

I've had a go at this here: https://github.com/chappers/tpot/tree/feat/text_preprocess

It's not quite working well enough yet to open a PR, but I'm keen for feedback on the interface. Here is a short snippet using the Iris dataset:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# Iris features plus an added random text column
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

X_train_df = pd.DataFrame(X_train, columns=["num1", "num2", "num3", "num4"])
X_train_df['text'] = np.random.choice(["hello", "world", "foo", "bar world", "bar hello"], X_train.shape[0])

tpot2 = TPOTClassifier(generations=1, population_size=5, verbosity=2, template="PCA-LogisticRegression", 
                      preprocess_config_dict = {
                          'numeric_columns': ["num2"]
                      })
tpot2.fit(X_train_df, y_train)

tpot2 = TPOTClassifier(generations=1, population_size=5, verbosity=2, template="PCA-LogisticRegression", 
                      preprocess_config_dict = {
                          'numeric_columns': ["num2", "num3", "num4"]
                      })
tpot2.fit(X_train_df, y_train)

tpot2 = TPOTClassifier(generations=1, population_size=5, verbosity=2, template="PCA-LogisticRegression", 
                      preprocess_config_dict = {
                          'numeric_columns': ["num2"],
                          'text_columns': ['text']
                      })
tpot2.fit(X_train_df, y_train)

And the output looks like this:

Generation 1 - Current best internal CV score: 0.5908385093167702

Best pipeline: LogisticRegression(PCA(PreprocessTransformer(input_matrix, numeric_columns=['num2']), iterated_power=2, svd_solver=randomized), C=20.0, dual=True, penalty=l2)

Generation 1 - Current best internal CV score: 0.9556935817805383

Best pipeline: LogisticRegression(PCA(PreprocessTransformer(input_matrix, numeric_columns=['num2', 'num3', 'num4']), iterated_power=9, svd_solver=randomized), C=20.0, dual=False, penalty=l1)

Generation 1 - Current best internal CV score: 0.5096273291925466

Essentially, the approach is for the user to provide some metadata (numeric columns, text columns, categorical columns), which is then injected into the setup (I can imagine several reasons why we shouldn't do exactly what I've done in my code; we can deal with that later).

You can see from the above that it appears to work: a user can selectively choose which columns (including text) are used in their pipeline. Through the templates, we can also force TPOT to optimise a variety of vectorizers if we wish.

@v2thegreat

Any updates on this?

@manasomali

Maybe do a GridSearchCV over the text feature extraction, together with the TPOT setting config_dict='TPOT sparse'. Is that a viable solution?
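
TPOT does ship a built-in 'TPOT sparse' configuration, so a minimal version of that workaround could vectorize the text up front and hand the sparse matrix to TPOT (variable names are placeholders; the GridSearchCV over vectorizer settings is left out):

from sklearn.feature_extraction.text import TfidfVectorizer
from tpot import TPOTClassifier

X_sparse = TfidfVectorizer().fit_transform(documents)  # scipy sparse matrix
tpot = TPOTClassifier(config_dict='TPOT sparse', generations=5,
                      population_size=20, verbosity=2)
tpot.fit(X_sparse, labels)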
