Support "Pandas Series Representation" #43
Labels
enhancement
New feature or request
Comments
This was referenced Jul 10, 2020
Closed
Useful code:
|
Useful code for testing the functions:
|
Hi Jonny, we will take care of this issue now. 😃
This is one of the most interesting (future) aspects of Texthero: the ability to represent any text dataset with ease, even a very large one.
Motivation
One of the big limitations of the current version of Texthero is that the output of the `tfidf` function, or of any other "representation" function, is not particularly interpretable. The user does not even know which tf-idf weight is associated with which word/token.
The solution is to return a MultiIndex Pandas Series where the first level represents the document and the second level represents the word. See this example below:
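As a minimal sketch of what such a MultiIndex Series could look like: the two documents, the weights, and the level names `document` and `word` are made up for illustration and are not output from Texthero.

```python
import pandas as pd

# Hypothetical tf-idf weights for two tiny documents (values are made up).
weights = {
    (0, "football"): 0.71,
    (0, "match"): 0.71,
    (1, "match"): 0.45,
    (1, "tennis"): 0.89,
}

# First index level identifies the document, second level the word.
index = pd.MultiIndex.from_tuples(list(weights.keys()), names=["document", "word"])
representation = pd.Series(list(weights.values()), index=index, name="tfidf")

print(representation)

# The weight of a given word in a given document is a plain index lookup.
print(representation.loc[(1, "tennis")])

# All weighted words of document 0, indexed by word.
print(representation.loc[0])
```

Each value is stored next to the document and word it belongs to, so the weight of a given token can be read off directly.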
The advantage of this approach is that:
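The drawback is that this Pandas Series cannot be added directly as a column of the original Pandas DataFrame, because its (document, word) MultiIndex does not line up with the DataFrame's one-row-per-document index.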
We refer to this MultiIndex Series, where the first level is the document and the second level is the term, as the "Pandas Series Representation" (a better name is welcome!).
Texthero 2.0
Starting from Texthero 2.0, all (?) "representation" functions will return such a Pandas Representation Series. The `pca`/`nmf` functions will accept a Pandas Representation Series as input and will (probably) return a flat representation, as it no longer makes sense to have a second level called "pca-component-1".
From Pandas Representation Series to Pandas Series
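The flattening step discussed in this section could look roughly like the sketch below. The name `to_flat_series` comes from this proposal, but the body, the zero-filling of missing terms, the alphabetical term order, and the sample data are all assumptions for illustration, not Texthero's implementation.

```python
import pandas as pd


def to_flat_series(representation: pd.Series) -> pd.Series:
    """Collapse a (document, word) MultiIndex Series into one list per document.

    Illustrative sketch only: words missing from a document get weight 0.0,
    and vector positions follow the alphabetical order of the words.
    """
    # Pivot the words into columns; sort columns so vector positions are stable.
    matrix = representation.unstack(fill_value=0.0).sort_index(axis=1)
    # One dense list of weights per document, indexed like the documents.
    return pd.Series(matrix.values.tolist(), index=matrix.index)


index = pd.MultiIndex.from_tuples(
    [(0, "football"), (0, "match"), (1, "match"), (1, "tennis")],
    names=["document", "word"],
)
representation = pd.Series([0.71, 0.71, 0.45, 0.89], index=index)

df = pd.DataFrame({"text": ["football match", "tennis match"]})
# The flat Series has one entry per document, so it aligns with df's index.
df["tfidf"] = to_flat_series(representation)
print(df)
```

The flat form trades interpretability for convenience: the per-document vectors fit in a DataFrame column, but the word labels are no longer attached to the individual weights.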
A function `to_flat_series`, or something similar, will transform the Pandas Representation Series into a flat Pandas Series (like the current output of `tfidf`). This will make it possible to add the Series to the initial df as a column.
From Pandas Representation Series to a document-term matrix
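A minimal sketch of this conversion, using made-up weights and assumed level names `document` and `word`. In pandas, the Series method that pivots an index level into columns is `unstack`; filling absent terms with 0.0 rather than NaN is a choice made here for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(0, "football"), (0, "match"), (1, "match"), (1, "tennis")],
    names=["document", "word"],
)
representation = pd.Series([0.71, 0.71, 0.45, 0.89], index=index)

# Move the inner "word" level into columns: one row per document,
# one column per term. Terms absent from a document are filled with 0.0.
document_term = representation.unstack(level="word", fill_value=0.0)

print(document_term)
```

For a large vocabulary the resulting DataFrame is dense and mostly zeros, so the MultiIndex Series is the more compact of the two forms.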
Just by calling `.unstack()` on the Pandas Representation Series, it will be possible to convert it to a Pandas DataFrame where every row is a document and every column is a term. Nice, right? We will need to explain clearly how to deal with a MultiIndex (the basics are not particularly hard).
Interested in helping out?
Most of the code has already been written. If you are interested in helping out with these important changes, leave a comment. We will be glad to have you on board!
Your opinion
Your opinion matters; let us know your thoughts!