pos_prop_POSTAG: "Does not create a key if no tokens in the document fit the POSTAG." #221

toltoxgh · 2023-04-25T22:07:53Z

toltoxgh
Apr 25, 2023

Hi,

This looks like a really interesting and useful package.

I have a feature suggestion/request. In the pos_proportions component, a pos_prop_POSTAG key is not added to the output if no tokens were assigned that tag. For generating features for machine learning, however, it would be great to have the option of a consistent output, with all features in the same column order in a df (even if some POSTAG columns are 0.0).

For instance, when using td.extract_metrics(text=df["message"] .... ) on a set of training texts to get a df of features, one might not encounter a certain POSTAG. Later, after training a model on that feature df, one might want to run td.extract_metrics on another df of texts to get features for prediction, and encounter a new POSTAG there. The two extracted dfs would then not have the same columns, and it would be up to the developer to write code to reconcile them.

So I think an option that guarantees the same columns in the same order for every call would be very useful.
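Until such an option exists, the mismatch can be patched on the user side. A minimal sketch, assuming the features come back as pandas DataFrames; the two frames below are hand-made stand-ins for td.extract_metrics output, and their column names and values are made up for illustration:

```python
import pandas as pd

# Hand-made stand-ins for two td.extract_metrics(...) results (values are made up).
train_features = pd.DataFrame(
    {
        "pos_prop_NOUN": [0.5, 0.4],
        "pos_prop_VERB": [0.3, 0.4],
        "pos_prop_ADJ": [0.2, 0.2],
    }
)
predict_features = pd.DataFrame(
    {"pos_prop_VERB": [0.3], "pos_prop_INTJ": [0.2], "pos_prop_NOUN": [0.5]}
)

# Force the prediction frame into the training layout: columns seen only at
# prediction time are dropped, columns missing at prediction time become 0.0,
# and the column order follows the training frame.
aligned = predict_features.reindex(columns=train_features.columns, fill_value=0.0)

print(aligned)
```

This keeps the model input stable, but it silently discards tags the training data never contained (pos_prop_INTJ here), which is one reason a fixed output on the library side is preferable.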

KennethEnevoldsen · 2023-04-26T01:37:38Z

KennethEnevoldsen
Apr 26, 2023
Collaborator

I see the issue and agree. We would have to extract the set of possible POS tags from the model, so that at least for a given model the output would always be the same. Would that work for your use case?
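The idea of deriving the keys from a known tag inventory can be sketched roughly as follows. This is an illustration only, not the package's implementation: the inventory is hard-coded to the Universal POS tags as an assumption (in practice it would come from the loaded model), and pos_proportions here is a hypothetical helper, not the component of the same name.

```python
from collections import Counter

# Assumption: a fixed tag inventory (the 17 Universal POS tags), hard-coded
# for illustration. A real implementation would read it from the model.
UPOS_TAGS = (
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
)


def pos_proportions(tags):
    """Return one pos_prop_<TAG> entry per inventory tag, 0.0 if the tag is unseen."""
    counts = Counter(tags)
    total = len(tags)
    return {
        f"pos_prop_{tag}": (counts[tag] / total if total else 0.0)
        for tag in UPOS_TAGS
    }


props = pos_proportions(["NOUN", "VERB", "NOUN", "PUNCT"])
print(props["pos_prop_NOUN"], props["pos_prop_INTJ"])
```

Because the keys are generated from the inventory rather than from the tags observed in a document, every document yields the same keys in the same order, including an empty one.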

0 replies

HLasse · 2023-04-26T09:46:27Z

HLasse
Apr 26, 2023
Maintainer

This is a great idea, thanks for the suggestion.
I've added the functionality in #222. Should be merged and ready to use as soon as CI passes and Kenneth has given it a look :)

1 reply

KennethEnevoldsen Apr 26, 2023
Collaborator

Wonderful!

toltoxgh · 2023-04-26T21:48:57Z

toltoxgh
Apr 26, 2023
Author

Thank you for the quick replies @KennethEnevoldsen and @HLasse

A great use case would be using the TextDescriptives features directly in a scikit-learn pipeline for an NLP classification or clustering task. Being able to generate a pandas dataframe with a consistent number and order of columns for each model would make that possible.

I read your publications, such as https://arxiv.org/abs/2301.06916, and saw that this package has already been applied successfully to such a task. Given that these features are also human-interpretable, I can see it being useful in a variety of domains.

3 replies

KennethEnevoldsen Apr 27, 2023
Collaborator

Glad to hear that you have found our package promising! We do too.

> A great use case would be using the TextDescriptives features directly in a scikit-learn pipeline for an NLP classification or clustering task. Being able to generate a pandas dataframe with a consistent number and order of columns for each model would make that possible.

This should now be the case after @HLasse's PR #222, but if you notice anything off, please let us know.

Let us know if you end up using the package!

HLasse Apr 27, 2023
Maintainer

Yep, that should be pretty simple to do now! As long as you make the same function calls, the order and number of columns will always be consistent. I hope you find it useful - any other feedback is appreciated 🚀

@KennethEnevoldsen, we might want to consider making a simple tutorial/functionality for using TextDescriptives in a sklearn pipeline. It seems you need to do a little bit of wrapping (basically creating a class with .fit and .transform methods and wrapping the extraction code in .transform - see e.g. here), but otherwise it looks very simple.

EDIT: I procrastinated and went ahead and did it 🤷‍♂️ check out #224
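For readers who want a feel for the wrapping described above, here is a minimal sketch. It is not the code from #224: DescriptiveFeatures and fake_extract are hypothetical names, and fake_extract is a toy stand-in for the real extraction call so that the example runs without a spaCy model.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


def fake_extract(texts):
    """Hypothetical stand-in for the real extraction call; returns toy features."""
    return pd.DataFrame(
        {
            "n_chars": [len(t) for t in texts],
            "n_tokens": [len(t.split()) for t in texts],
        }
    )


class DescriptiveFeatures(BaseEstimator, TransformerMixin):
    """Wraps a text -> DataFrame extraction function as a sklearn transformer."""

    def __init__(self, extract_fn=fake_extract):
        self.extract_fn = extract_fn

    def fit(self, X, y=None):
        # Nothing is learned from the data; fit exists to satisfy the sklearn API.
        return self

    def transform(self, X):
        # All of the work happens here: texts in, feature DataFrame out.
        return self.extract_fn(list(X))


features = DescriptiveFeatures().fit_transform(
    ["a short text", "another, slightly longer text"]
)
print(features)
```

To use real features, one could pass a function that wraps td.extract_metrics as extract_fn; the transformer can then be placed in front of a classifier in a sklearn Pipeline, provided the extraction returns only numeric columns.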

KennethEnevoldsen Apr 27, 2023
Collaborator

Great! Looks good. I added a minor thing, but assuming tests pass it will automerge.
