Replies: 3 comments 4 replies
-
I see the issue and agree. We would have to extract that information from the model, so that the output would be the same at least for a given model. Would that work for your use case?
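One way to picture this, as a minimal sketch and not the library's actual implementation: if the full tag set is known up front, every document can get an entry for each tag, with 0.0 for tags that do not occur. The `UPOS_TAGS` list below is the Universal Dependencies coarse-grained tag set, used as an assumed stand-in for whatever labels a given model reports, and `fixed_pos_proportions` is a hypothetical helper name.

```python
from collections import Counter

# Universal Dependencies coarse-grained POS tags, used here as an assumed
# stand-in for the label set that would be read from the model.
UPOS_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]


def fixed_pos_proportions(doc_tags, tagset=UPOS_TAGS):
    """Return one pos_prop_<TAG> entry per tag in `tagset`, in tagset order.

    Tags that never occur in `doc_tags` get 0.0 instead of being omitted.
    """
    counts = Counter(doc_tags)
    total = len(doc_tags)
    return {
        f"pos_prop_{tag}": (counts[tag] / total if total else 0.0)
        for tag in tagset
    }


# Two documents with different tags still produce identical keys and order.
a = fixed_pos_proportions(["DET", "NOUN", "VERB", "PUNCT"])
b = fixed_pos_proportions(["PRON", "VERB", "ADJ", "NOUN", "PUNCT"])
```

Because the keys depend only on the tag set and not on the text, stacking these dictionaries into a DataFrame would give the same columns for any input processed with the same tag set.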
-
This is a great idea, thanks for the suggestion.
-
Thank you for the quick replies, @KennethEnevoldsen and @HLasse. A great use case would be to use the TextDescriptives features directly in a scikit-learn pipeline for NLP classification or clustering; a pandas DataFrame with a consistent number and order of columns for each model would make that possible. I read your publications, such as https://arxiv.org/abs/2301.06916, and saw that this package has already been applied successfully to such tasks. Given that these features are also human-interpretable, I can see it being useful in a variety of domains.
-
Hi,
This looks like a very interesting and useful package.
I have a feature suggestion/request. In the pos_proportions component, a pos_prop_POSTAG entry is not added to the output if no token was assigned that tag. For feature generation in machine learning, however, it would be great to have the option of a consistent output with all features in the same column order in a df (even if some POSTAG columns are 0.0).
For instance, when using td.extract_metrics(text=df["message"] .... ) on a set of training texts to get a df of features, one might not encounter a certain POSTAG. Later, after training a model on that feature df, one might run td.extract_metrics on another df of texts to get features for prediction, and encounter a new POSTAG there. The two extracted dfs would not have the same columns, and it would be up to the developer to write code to reconcile them.
I think such an option for consistent output would therefore be a great addition.
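In the meantime, a possible workaround is to record the columns seen at training time and reindex later frames to match. The sketch below uses plain pandas; the two small frames are made-up stand-ins for the output of `td.extract_metrics`, and `align_to_training` is a hypothetical helper name, not part of TextDescriptives.

```python
import pandas as pd

# Made-up stand-ins for two feature frames; real ones would come from
# td.extract_metrics on the training and prediction texts.
train_df = pd.DataFrame(
    {
        "pos_prop_NOUN": [0.4, 0.3],
        "pos_prop_VERB": [0.2, 0.3],
        "pos_prop_ADJ": [0.1, 0.0],
    }
)
new_df = pd.DataFrame(
    {
        "pos_prop_VERB": [0.25],
        "pos_prop_INTJ": [0.10],  # tag never seen in the training texts
        "pos_prop_NOUN": [0.35],
    }
)


def align_to_training(df, train_columns):
    """Keep only the training columns, in training order.

    Columns missing from `df` are added and filled with 0.0.
    """
    return df.reindex(columns=train_columns, fill_value=0.0)


train_columns = list(train_df.columns)
aligned = align_to_training(new_df, train_columns)
```

Note that this drops columns that were not present at training time (here `pos_prop_INTJ`), which seems reasonable for prediction since a model fitted on the training frame has no use for them.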