I used YAKE to extract some keywords from text as follows:
import spacy
import textacy
import textacy.extract.keyterms as ke
en_spacy_model = spacy.load("en_core_web_lg")  # large English language model
text = "In the text mining tasks, textual representation should be not only efficient but also interpretable, " \
"as this enables an understanding of the operational logic underlying the data mining models. Traditional text " \
"vectorization methods such as TF-IDF and bag-of-words are effective and characterized by intuitive " \
"interpretability, but suffer from the «curse of dimensionality», and they are unable to capture the meanings " \
"of words. On the other hand, modern distributed methods effectively capture the hidden semantics, " \
"but they are computationally intensive, time-consuming, and uninterpretable. This article proposes a new text " \
"vectorization method called Bag of weighted Concepts BoWC that presents a document according to the concepts’ " \
"information it contains. The proposed method creates concepts by clustering word vectors (i.e. word " \
"embedding) then uses the frequencies of these concept clusters to represent document vectors. To enrich the " \
"resulted document representation, a new modified weighting function is proposed for weighting concepts based " \
"on statistics extracted from word embedding information. The generated vectors are characterized by " \
"interpretability, low dimensionality, high accuracy, and low computational costs when used in data mining " \
"tasks. The proposed method has been tested on five different benchmark datasets in two data mining tasks; " \
"document clustering and classification, and compared with several baselines, including Bag-of-words, TF-IDF, " \
"Averaged GloVe, Bag-of-Concepts, and VLAC. The results indicate that BoWC outperforms most baselines and " \
"gives 7% better accuracy on average "
doc = textacy.make_spacy_doc(text.lower(), en_spacy_model)
yake_kw = ke.yake(doc, ngrams=(1,2,3,4), normalize=None, include_pos=("NOUN", "PROPN", "ADJ"), window_size=4, topn=20)
# Print the keywords from the YAKE algorithm, as implemented in textacy.
print("Yake output: ")
for e in yake_kw:
    print(e[0], "\t", e[1])  # ascending order, as lower scores mean higher importance
The output is as follows:
Yake output:
concepts 0.34785604529774267
mining 0.3573874481263331
bag 0.36656842484621355
document 0.3840296285062857
words 0.4131660220894072
text 0.4151558602661391
tasks 0.42079776707616295
data 0.42684394140294357
method 0.43918219586137885
vectors 0.4438910610888909
idf 0.5368745934076988
information 0.5389345352012269
vectorization 0.5405218742053667
dimensionality 0.5408938012298964
tf 0.5419066125993368
interpretability 0.5432608453679435
representation 0.5509415208495251
clustering 0.5571034040321834
accuracy 0.5585834907670935
new 0.5590334612780952
Although n-grams of up to 4 words are specified via ngrams=(1, 2, 3, 4), the algorithm seems to favour single words as the top keywords. Is this behaviour expected?
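As a workaround, I can either pass ngrams=(2, 3, 4) so unigram candidates are excluded from candidate generation, or post-filter the returned (term, score) pairs myself. A minimal sketch of the latter (the yake_kw list here is an illustrative stand-in for the real output, and the multi-word scores in it are made up, not real YAKE values):

```python
# Stand-in for the output of ke.yake(...): a list of (term, score) pairs.
# The scores for the multi-word terms below are illustrative only.
yake_kw = [
    ("concepts", 0.3479),
    ("data mining", 0.4100),
    ("mining", 0.3574),
    ("text vectorization method", 0.5200),
]

# Keep only multi-word terms (those containing a space).
multiword_kw = [(term, score) for term, score in yake_kw if " " in term]

# Lower YAKE scores mean higher importance, so sort ascending.
for term, score in sorted(multiword_kw, key=lambda pair: pair[1]):
    print(term, "\t", score)
```

This at least surfaces the multi-word candidates, though it does not change how their scores compare to the unigrams'.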