# Add a built-in configuration dictionary for machine learning with text data #507

We can support machine learning with text data in TPOT by adding the CountVectorizer and TfidfVectorizer to a separate built-in configuration dictionary. I don't think we would need to change any of the other operators.

Unfortunately, without a pipeline grammar we can't force those vectorizers to always be at the beginning of every pipeline, but I suppose for text classification problems all of the pipelines that don't have one of the vectorizers will go "extinct."

Thoughts?

cc @weixuanfu2016 @teaearlgraycold
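For concreteness, a hypothetical entry for such a configuration dictionary might look like the sketch below, following the key/value format of TPOT's existing built-in config dicts; the hyperparameter ranges are illustrative guesses, not a vetted search space:

```python
# Hypothetical "TPOT text" configuration fragment, in the same format as
# TPOT's built-in configuration dictionaries (operator path -> hyperparameters).
# The ranges below are illustrative, not a vetted search space.
tpot_text_config = {
    'sklearn.feature_extraction.text.CountVectorizer': {
        'ngram_range': [(1, 1), (1, 2)],
        'max_df': [0.75, 0.9, 1.0],
        'min_df': [1, 2, 5],
    },
    'sklearn.feature_extraction.text.TfidfVectorizer': {
        'ngram_range': [(1, 1), (1, 2)],
        'use_idf': [True, False],
        'sublinear_tf': [True, False],
    },
}

# Hypothetical usage: TPOTClassifier(config_dict=tpot_text_config, ...)
```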
---
How would the text be passed to `fit`? Without any major changes to TPOT you'd only be able to use either of these operators in an external pipeline:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from tpot import TPOTClassifier

my_text = "...".split("\n")

pipeline = make_pipeline(
    CountVectorizer(),
    TPOTClassifier()
)
pipeline.fit(my_text, ...)
```
---
Maybe we need a new text classification mode, rather than a config file for transforming text input by the two operators.
---
@weixuanfu2016, I like the idea of a new text classification mode. @rhiever, thank you for opening this up for discussion. You mentioned CountVectorizer & TfidfVectorizer. What about HashingVectorizer?
---
@teaearlgraycold, taking your example, it could work like this:

```python
my_text = "...".split("\n")
class_labels = [1, 0, 0, 1, 1, ..., 0, 0]
# Assuming len(my_text) == len(class_labels)

my_tpot = TPOTClassifier(..., config='TPOT text', ...)
my_tpot.fit(my_text, class_labels)
```

We would indeed need to change the dataset validation procedure when `config='TPOT text'`.

@ziarahman, I haven't used the HashingVectorizer. Does it take input data in the same format as the other two vectorizers?
---
how about
---
That'd work, but would make for an even messier `fit` function.
---
Agree. Maybe we need a new TPOT class for this.
---
@rhiever, as far as I know, yes, HashingVectorizer takes input data the same way as the other two vectorizers. Even though the TfidfVectorizer and CountVectorizer are more popular for text classification with ML, the HashingVectorizer is mostly used when the text corpus is large.
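For reference, a minimal sketch (with a made-up toy corpus) checking that all three vectorizers accept the same raw-text input format:

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

# Toy corpus: one document per list element -- the same raw-text
# input format works for all three vectorizers.
corpus = ["hello world", "foo bar", "hello foo bar"]

for vectorizer in (CountVectorizer(), TfidfVectorizer(), HashingVectorizer(n_features=2**10)):
    X = vectorizer.fit_transform(corpus)  # each returns a scipy sparse matrix
    print(type(vectorizer).__name__, X.shape)
```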
---
I think that the best solution here is to decompose `fit`:

```python
if self._text:
    # foo
else:
    # bar
```

Might need to do something similar for the other methods.
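A minimal, standalone sketch of that decomposition; the class, flag, and method names are illustrative and do not reflect TPOT's actual internals:

```python
from sklearn.feature_extraction.text import CountVectorizer


class TextAwareClassifier:
    """Illustrative only: split fit() into a text branch and a numeric branch."""

    def __init__(self, text=False):
        self._text = text  # hypothetical flag, e.g. set when config='TPOT text'

    def fit(self, X, y):
        # Dispatch once up front so each branch stays small,
        # instead of one increasingly messy fit().
        if self._text:
            return self._fit_text(X, y)
        return self._fit_numeric(X, y)

    def _fit_text(self, X, y):
        # Vectorize raw documents, then reuse the numeric path
        # on the resulting sparse feature matrix.
        self._vectorizer = CountVectorizer()
        return self._fit_numeric(self._vectorizer.fit_transform(X), y)

    def _fit_numeric(self, X, y):
        # Existing path: validate and fit on a regular feature matrix.
        return self
```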
---
Is text classification such a fundamentally different problem type that it requires a new TPOT class? Once the text is converted to a bag-of-[words, ngrams, etc.] representation, we're working with a regular feature matrix again. Maybe it'll be a sparse matrix, but it's still a feature matrix.

Here's an example of using sklearn pipelines to CountVectorize (etc.) specific columns in a dataset: link. Although that seems to rely on the data being passed as a dictionary, I think we could have TPOT recognize which columns are text and apply the vectorizers specifically to those columns. Maybe via wrapped versions of CountVectorizer etc.?
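Along those lines, a minimal sketch (all column names made up) using sklearn's newer ColumnTransformer to vectorize only the text column of a DataFrame while passing the numeric columns through untouched:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy DataFrame with one text column and two numeric columns.
df = pd.DataFrame({
    "text": ["hello world", "foo bar", "hello foo", "bar world"],
    "num1": [0.1, 0.4, 0.3, 0.9],
    "num2": [1, 0, 1, 0],
})
y = [0, 1, 0, 1]

# Passing a single column name (not a list) gives the vectorizer the
# 1-D input it expects; all other columns are passed through untouched.
preprocess = ColumnTransformer(
    transformers=[("vect", CountVectorizer(), "text")],
    remainder="passthrough",
)
model = Pipeline([("pre", preprocess), ("clf", LogisticRegression())])
model.fit(df, y)
```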
---
This sklearn issue seems relevant to our conversations here: scikit-learn/scikit-learn#9012. It may point to a better way to accomplish what we want within the sklearn Pipeline architecture.
---
Is there a workaround for this we can use in the interim? I tried simply adding the TPOT classifier to a pipeline:

```python
pipeline_optimizer = pipeline = Pipeline([('vect', CountVectorizer()), ...
```

But this returns the error `ufunc 'isnan' not supported for the input types`, which I saw addressed elsewhere in the issues list; it seems to be an issue with numpy. I saw a suggestion to use `toarray()`, so I ended up with something like this:

```python
vectorizer = CountVectorizer()
...
```

But this seems to produce input data that TPOT doesn't care for; I get a different error message: `Input data is not in a valid format.` I'm sure I'm doing something wrong. Any other ideas what I might try to get text classification working today (though I look forward to the tighter integration proposed here)?
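One interim workaround, sketched under the assumption that TPOT's input validation wants a dense numeric array: vectorize outside of TPOT, densify with `toarray()`, and fit TPOT on the resulting matrix (the float cast is a guess at satisfying the "valid format" check):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from tpot import TPOTClassifier

docs = ["hello world", "foo bar", "hello foo", "bar world"] * 25
labels = np.array([0, 1, 0, 1] * 25)

# Vectorize up front, then densify; the float cast is a guess at
# what TPOT's input validation expects.
X = CountVectorizer().fit_transform(docs).toarray().astype(np.float64)

tpot = TPOTClassifier(generations=2, population_size=10, verbosity=2)
tpot.fit(X, labels)
```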
---
To solve this issue, we are working on the related issue #529, which adds a grammar configuration to support text classification.

---
Sorry, it should be #523.

---
Oops, I just closed this by mistake. Cellphone screen is too small for typing. Sorry.
---
I think we should keep the automatic TPOT spirit when dealing with categorical and text features, and assume the input data is a mixture of numerical, categorical, and text features. We should automatically infer each column's type and treat it properly while trying different pipelines. In other words, whether or not a column is textual should be transparent to the user. What do you think?
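A sketch of the kind of inference that implies; the thresholds and heuristics here are illustrative, not a proposal for TPOT's actual rules:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def infer_column_types(df: pd.DataFrame, max_categories: int = 20):
    """Split columns into numeric / categorical / text using simple heuristics."""
    numeric, categorical, text = [], [], []
    for col in df.columns:
        if is_numeric_dtype(df[col]):
            numeric.append(col)
        elif df[col].nunique() <= max_categories:
            # Few distinct values: likely a categorical feature.
            categorical.append(col)
        else:
            # Many distinct string values: treat as free text.
            text.append(col)
    return numeric, categorical, text
```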
---
So, putting all the comments together, is the solution to do something like this?

```python
import copy

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from tpot import TPOTClassifier
from tpot.config import classifier_config_dict_light

X = pd.DataFrame({
    "DESCRIPTION": np.random.choice(["hello world", "foo bar"], 200),
    "DESCRIPTION2": np.random.choice(["hello world", "foo bar"], 200),
    "NUMERIC": np.random.choice([0, 1], 200),
    "NUMERIC2": np.random.choice([0, 1], 200),
})
y = np.random.choice([0, 1], 200)


class IdentityTransformer(TransformerMixin, BaseEstimator):
    """Pass the selected columns through unchanged."""

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X


class TfidfTransformer(TransformerMixin, BaseEstimator):
    """Apply a TfidfVectorizer to each text column; keep the other columns."""

    def __init__(self, text_columns, keep_columns=[], **kwargs):
        self.text_columns = text_columns if type(text_columns) is list else [text_columns]
        self.keep_columns = keep_columns if type(keep_columns) is list else [keep_columns]
        # One vectorizer per text column, plus a passthrough for the rest.
        column_list = []
        for idx, text in enumerate(self.text_columns):
            column_list.append(('text' + str(idx), TfidfVectorizer(**kwargs), text))
        if len(self.keep_columns) > 0:
            column_list.append(('other', IdentityTransformer(), self.keep_columns))
        self.column_transformer = ColumnTransformer(column_list)

    def fit(self, X, y=None):
        self.column_transformer.fit(X, y)
        return self

    def transform(self, X):
        return self.column_transformer.transform(X)


# Register the custom transformer in a copy of the built-in light config,
# then force it to the front of every pipeline via the template.
config = copy.deepcopy(classifier_config_dict_light)
config["__main__.TfidfTransformer"] = {
    "text_columns": [["DESCRIPTION", "DESCRIPTION2"]],
    "keep_columns": [["NUMERIC", "NUMERIC2"]],
}

tpot = TPOTClassifier(config_dict=config, verbosity=2, generations=5,
                      population_size=2, early_stop=2, max_time_mins=2,
                      template='TfidfTransformer-Selector-Transformer-Classifier')
tpot.fit(X, y)
```
---
I've had a go at this here: https://github.com/chappers/tpot/tree/feat/text_preprocess

```python
# (X_train, y_train are a numeric example dataset defined elsewhere.)
X_train_df = pd.DataFrame(X_train, columns=["num1", "num2", "num3", "num4"])
X_train_df['text'] = np.random.choice(["hello", "world", "foo", "bar world", "bar hello"], X_train.shape[0])

tpot2 = TPOTClassifier(generations=1, population_size=5, verbosity=2,
                       template="PCA-LogisticRegression",
                       preprocess_config_dict={
                           'numeric_columns': ["num2"]
                       })
tpot2.fit(X_train_df, y_train)

tpot2 = TPOTClassifier(generations=1, population_size=5, verbosity=2,
                       template="PCA-LogisticRegression",
                       preprocess_config_dict={
                           'numeric_columns': ["num2", "num3", "num4"]
                       })
tpot2.fit(X_train_df, y_train)

tpot2 = TPOTClassifier(generations=1, population_size=5, verbosity=2,
                       template="PCA-LogisticRegression",
                       preprocess_config_dict={
                           'numeric_columns': ["num2"],
                           'text_columns': ['text']
                       })
tpot2.fit(X_train_df, y_train)
```

And the output looks like this:

Essentially, the approach is for a user to provide some metadata (numeric columns, text columns, categorical columns), which is then injected into the setup (I can imagine several reasons why we shouldn't do what I've done in my code; we can deal with that later). You can see from the above that it appears to work: a user can selectively choose which columns are used in their pipeline (including text). In this pipeline, and through the templates, we can also force TPOT to optimise a variety of vectorizers if we wish.
---
Any updates on this?
---
Maybe do a GridSearchCV over the text feature extraction together with the TPOT setting config_dict='TPOT sparse'. Is that a viable solution?
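Something along those lines does seem workable today, since the built-in 'TPOT sparse' config restricts TPOT to operators that accept the sparse matrix a vectorizer produces. A minimal sketch (one could additionally grid-search the vectorizer's hyperparameters first, as suggested):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from tpot import TPOTClassifier

docs = ["hello world", "foo bar", "hello foo", "bar world"] * 25
labels = [0, 1, 0, 1] * 25

# Vectorize up front; 'TPOT sparse' keeps only operators that can
# consume the sparse matrix directly, so no .toarray() is needed.
X = TfidfVectorizer().fit_transform(docs)

tpot = TPOTClassifier(config_dict="TPOT sparse", generations=2,
                      population_size=10, verbosity=2)
tpot.fit(X, labels)
```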