
Add a built-in configuration dictionary for machine learning with text data #507

Open

rhiever opened this issue Jun 24, 2017 · 21 comments

@rhiever
Contributor

rhiever commented Jun 24, 2017

We can support machine learning with text data in TPOT by adding the CountVectorizer and TfidfVectorizer to a separate built-in configuration dictionary. I don't think we would need to change any of the other operators.

Unfortunately, without a pipeline grammar we can't force those vectorizers to always be at the beginning of every pipeline, but I suppose for text classification problems all of the pipelines that don't have one of the vectorizers will go "extinct."

Thoughts?

cc @weixuanfu2016 @teaearlgraycold

@rhiever rhiever changed the title Add a built-in configuration dictionary for text classification Add a built-in configuration dictionary for machine learning with text data Jun 24, 2017
@danthedaniel
Contributor

danthedaniel commented Jun 24, 2017

How would the text be passed to fit/transform? Right now we only handle float values for features, but both of these operators expect blocks of text.

Without any major changes to TPOT you'd only be able to use either of these operators in an external pipeline:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from tpot import TPOTClassifier

my_text = "...".split("\n")  # one document per line

pipeline = make_pipeline(
    CountVectorizer(),
    TPOTClassifier()
)
pipeline.fit(my_text, ...)

@weixuanfu
Contributor

Maybe we need a new text classification mode, rather than just a config file, for handling text input with the two vectorizers.

@ziarahman

@weixuanfu2016, I like the idea of a 'new text classification mode'.

@rhiever, thank you for opening this up for discussion. You mentioned CountVectorizer & TfidfVectorizer. What about HashingVectorizer?

@rhiever
Contributor Author

rhiever commented Jun 26, 2017

@teaearlgraycold, taking your example, it could work like this:

my_text = "...".split("\n")
class_labels = [1, 0, 0, 1, 1, ..., 0, 0]
# Assuming len(my_text) == len(class_labels)

my_tpot = TPOTClassifier(..., config='TPOT text', ...)
my_tpot.fit(my_text, class_labels)

We would indeed need to change the dataset validation procedure when config='TPOT text'. I wonder how we could get it to support a mix of text and non-text data, though.

@ziarahman, I haven't used the HashingVectorizer. Does it take input data in the same format as the other two vectorizers?

@weixuanfu
Contributor

How about fit(features (non-text data), text, target)? CountVectorizer and TfidfVectorizer could be applied to the text data only, with the result then stacked into the pipeline as below:

pipeline = make_pipeline(
    # StackingText is a hypothetical operator that would vectorize the text
    # input and horizontally stack the result with the numeric feature matrix
    StackingText(CountVectorizer(input_text), input_matrix),
    TPOTClassifier()
)
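
StackingText does not exist in scikit-learn; a rough sketch of the same stacking idea using scikit-learn's FeatureUnion, assuming X is a NumPy array whose first column holds the documents (the ColumnSelector helper is hypothetical):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.feature_extraction.text import CountVectorizer

class ColumnSelector(BaseEstimator, TransformerMixin):
    # hypothetical helper: pick the given column(s) out of a 2D array
    def __init__(self, cols, dtype=None):
        self.cols = cols
        self.dtype = dtype
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        out = X[:, self.cols]
        return out if self.dtype is None else out.astype(self.dtype)

# vectorize the text column and pass the numeric columns through alongside it
stacked_features = FeatureUnion([
    ('text', make_pipeline(ColumnSelector(0), CountVectorizer())),
    ('numeric', ColumnSelector([1, 2], dtype=float)),
])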

@danthedaniel
Contributor

danthedaniel commented Jun 26, 2017 via email

@weixuanfu
Contributor

Agreed. Maybe we need fit_params in fit() to organize the weight, groups, and text_input parameters, as scikit-learn does.

@weixuanfu
Contributor

fit_params could be a dictionary holding these parameters.
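
A minimal sketch of what that could look like from the user's side (variable names are placeholders; the text_input key is hypothetical):

fit_params = {
    'sample_weight': weights,  # per-sample weights, as in scikit-learn
    'groups': groups,          # group labels for grouped cross-validation
    'text_input': my_text,     # hypothetical key for the raw text data
}
my_tpot.fit(features, target, **fit_params)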

@ziarahman

@rhiever, as far as I know, yes, HashingVectorizer takes input data in the same format as the other two vectorizers. While the Tfidf and Count vectorizers are more popular for text classification with ML, the Hashing vectorizer is mostly used when the text corpus is large.

@danthedaniel
Contributor

danthedaniel commented Jun 27, 2017

I think the best solution here is to decompose TPOTBase.fit() into a series of smaller functions, then implement a separate fit function for each of TPOTClassifier, TPOTRegressor, and a new TPOTTextClassifier - each using the small functions to reduce code duplication. That way there doesn't need to be a ton of code like:

if self._text:
    # foo
else:
    # bar

Might need to do something similar for _evaluate_individual().
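
A minimal sketch of that decomposition (the helper names below are hypothetical, not TPOT's actual internals):

class TPOTBase:
    def fit(self, features, target):
        features, target = self._check_dataset(features, target)
        self._setup_optimization()
        return self._run_optimization(features, target)

class TPOTTextClassifier(TPOTBase):
    def _check_dataset(self, features, target):
        # text-aware validation instead of coercing every feature to float
        ...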

@rhiever
Contributor Author

rhiever commented Jun 27, 2017

Is text classification such a fundamentally different problem type that it requires a new TPOT class? Once the text is converted to a bag-of-[words, ngrams, etc.] representation, we're working with a regular feature matrix again. Maybe it'll be a sparse matrix, but it's still a feature matrix.

Here's an example of using sklearn pipelines to CountVectorize (etc) specific columns in a dataset: link

Although that approach seems to rely on the data being passed as a dictionary, I think we could have TPOT recognize which columns are text and apply the vectorizers specifically to those columns. Maybe via wrapped versions of CountVectorizer etc., as in the sketch below?
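
A minimal sketch of the wrapped-vectorizer idea using scikit-learn's newer ColumnTransformer API, assuming a pandas DataFrame with a column named 'text' (both assumptions for illustration):

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# vectorize only the 'text' column; pass every other column through untouched
wrapped_vectorizer = ColumnTransformer(
    [('vect', CountVectorizer(), 'text')],
    remainder='passthrough'
)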

@rhiever
Contributor Author

rhiever commented Jun 27, 2017

This sklearn issue seems relevant to our conversations here: scikit-learn/scikit-learn#9012

It may point to a better way to accomplish what we want within the sklearn Pipeline architecture.

@davidfox-ap

Is there a workaround for this we can use in the interim?

I tried simply adding the TPOT classifier to a pipeline:

pipeline_optimizer = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', TPOTClassifier(generations=5, population_size=20, verbosity=2))
])

But this returns the error "ufunc 'isnan' not supported for the input types", which I saw addressed elsewhere in the issues list. It seems to be an issue with numpy. I saw a suggestion to use toarray(), so I ended up with something like this:

vectorizer = CountVectorizer()
tfidf_transformer = TfidfTransformer()
trainVectorized = vectorizer.fit_transform(X_train)
testVectorized = vectorizer.transform(X_test)  # transform only: reuse the vocabulary fitted on the training set
finalTrain = tfidf_transformer.fit_transform(trainVectorized).toarray()
finalTest = tfidf_transformer.transform(testVectorized).toarray()  # likewise, no refitting on the test set

But this seems to produce input data that TPOT doesn't care for; I get a different error message: 'Input data is not in a valid format.' I'm sure I'm doing something wrong.

Any other ideas what I might try to get text classification working today (though I look forward to the tighter integration proposed here)?

@weixuanfu
Contributor

To solve this issue, we are working on the related issue #529, which adds a grammar configuration to support text classification.

@weixuanfu
Contributor

Sorry, it should be #523.

@weixuanfu weixuanfu reopened this Jul 26, 2017
@weixuanfu
Contributor

Oops, I closed this by mistake. My cellphone screen is too small for typing. Sorry.

@chananshgong

I think we should keep the automatic TPOT spirit when dealing with categorical and text features, and assume the input data is a mixture of numerical/categorical and text features. We should automatically infer each column's type and handle it appropriately while trying different pipelines; see the sketch below. In other words, whether or not a column is textual should be transparent to the user. What do you think?
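
A minimal sketch of that column-type inference, assuming a pandas DataFrame X (the cardinality cutoff is an arbitrary illustrative heuristic):

# object-dtype columns are candidates for text; everything else is numeric
object_columns = [c for c in X.columns if X[c].dtype == object]
numeric_columns = [c for c in X.columns if c not in object_columns]

# crude heuristic: low-cardinality object columns are categorical, the rest are free text
categorical_columns = [c for c in object_columns if X[c].nunique() < 20]
text_columns = [c for c in object_columns if c not in categorical_columns]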

@8bit-pixies
Contributor

8bit-pixies commented Jul 10, 2019

So, putting all the comments together: is the solution to do something like this?

import copy

import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from tpot import TPOTClassifier
from tpot.config import classifier_config_dict_light

X = pd.DataFrame({"DESCRIPTION": np.random.choice(["hello world", "foo bar"], 200), "DESCRIPTION2": np.random.choice(["hello world", "foo bar"], 200),
                  'NUMERIC': np.random.choice([0, 1], 200), "NUMERIC2": np.random.choice([0, 1], 200)})
y = np.random.choice([0, 1], 200)

class IdentityTransformer(TransformerMixin, BaseEstimator):
    """Pass-through transformer for the non-text columns."""
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X):
        return X

class TfidfTransformer(TransformerMixin):
    """Apply TfidfVectorizer to each text column and keep the listed non-text columns."""
    def __init__(self, text_columns, keep_columns=[], **kwargs):
        self.text_columns = text_columns if type(text_columns) is list else [text_columns]
        self.keep_columns = keep_columns if type(keep_columns) is list else [keep_columns]

        column_list = []
        for idx, text in enumerate(self.text_columns):
            column_list.append(('text' + str(idx), TfidfVectorizer(**kwargs), text))

        if len(self.keep_columns) > 0:
            column_list.append(('other', IdentityTransformer(), self.keep_columns))

        self.column_transformer = ColumnTransformer(column_list)

    def fit(self, X, y=None):
        self.column_transformer.fit(X, y)
        return self

    def transform(self, X):
        return self.column_transformer.transform(X)

# using TPOT config
config = copy.deepcopy(classifier_config_dict_light)
config["__main__.TfidfTransformer"] = {
        "text_columns": [["DESCRIPTION", "DESCRIPTION2"]],
        "keep_columns": [["NUMERIC", "NUMERIC2"]]
    }

tpot = TPOTClassifier(config_dict=config, verbosity=2, generations=5, population_size=2, early_stop=2, max_time_mins=2,
                     template='TfidfTransformer-Selector-Transformer-Classifier')
tpot.fit(X, y)
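
Note that the template string above pins the custom TfidfTransformer step to the front of every generated pipeline, which also addresses the ordering concern from the opening comment: no pipeline grammar is needed to force the vectorizer to come first.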

@8bit-pixies
Contributor

I've had a go at this here: https://github.com/chappers/tpot/tree/feat/text_preprocess

It's not quite working well enough yet to open a PR, but I'm keen for feedback on the interface. Here is a short snippet using the Iris dataset:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# Iris features plus an added random text column
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

X_train_df = pd.DataFrame(X_train, columns=["num1", "num2", "num3", "num4"])
X_train_df['text'] = np.random.choice(["hello", "world", "foo", "bar world", "bar hello"], X_train.shape[0])

tpot2 = TPOTClassifier(generations=1, population_size=5, verbosity=2, template="PCA-LogisticRegression", 
                      preprocess_config_dict = {
                          'numeric_columns': ["num2"]
                      })
tpot2.fit(X_train_df, y_train)

tpot2 = TPOTClassifier(generations=1, population_size=5, verbosity=2, template="PCA-LogisticRegression", 
                      preprocess_config_dict = {
                          'numeric_columns': ["num2", "num3", "num4"]
                      })
tpot2.fit(X_train_df, y_train)

tpot2 = TPOTClassifier(generations=1, population_size=5, verbosity=2, template="PCA-LogisticRegression", 
                      preprocess_config_dict = {
                          'numeric_columns': ["num2"],
                          'text_columns': ['text']
                      })
tpot2.fit(X_train_df, y_train)

And the output looks like this:

Generation 1 - Current best internal CV score: 0.5908385093167702

Best pipeline: LogisticRegression(PCA(PreprocessTransformer(input_matrix, numeric_columns=['num2']), iterated_power=2, svd_solver=randomized), C=20.0, dual=True, penalty=l2)

Generation 1 - Current best internal CV score: 0.9556935817805383

Best pipeline: LogisticRegression(PCA(PreprocessTransformer(input_matrix, numeric_columns=['num2', 'num3', 'num4']), iterated_power=9, svd_solver=randomized), C=20.0, dual=False, penalty=l1)

Generation 1 - Current best internal CV score: 0.5096273291925466

Essentially, the approach is for the user to provide some metadata (numeric columns, text columns, categorical columns), which is then injected into the setup (I can imagine several reasons why we shouldn't do exactly what I've done in my code; we can deal with that later).

You can see from the above that it appears to work: a user can selectively choose which columns (including text) are used in their pipeline. Through the templates, we can also force TPOT to optimise a variety of vectorizers if we wish.

@v2thegreat

Any updates on this?

@manasomali

Maybe do a GridSearchCV over the text feature extraction, together with the TPOT setting config_dict='TPOT sparse'. Is that a viable solution?
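
TPOT does ship a built-in 'TPOT sparse' configuration, so a minimal version of that workaround could vectorize the text up front and hand the sparse matrix to TPOT (variable names are placeholders; the GridSearchCV over vectorizer settings is left out):

from sklearn.feature_extraction.text import TfidfVectorizer
from tpot import TPOTClassifier

X_sparse = TfidfVectorizer().fit_transform(documents)  # scipy sparse matrix
tpot = TPOTClassifier(config_dict='TPOT sparse', generations=5,
                      population_size=20, verbosity=2)
tpot.fit(X_sparse, labels)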
