A solution for the synonym problem in word frequency algorithms
This library contains the official implementation of the synonym-augmented frequency algorithm presented in "A solution for the synonym problem in word frequency algorithms", along with a GUI wrapper and text preprocessing utilities.
The package requires Python 3.7.3 and can be installed from PyPI with the following command:
pip install asolut
Additionally, the NLTK stopwords, averaged_perceptron_tagger and wordnet resources are needed.
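The three resources can be fetched with NLTK's downloader. The following is a minimal sketch, assuming NLTK is already installed. The helper name `download_nltk_resources` and its injectable `download` argument are illustrative and not part of asolut.

```python
# Resource identifiers listed above, as accepted by nltk.download().
NLTK_RESOURCES = ["stopwords", "averaged_perceptron_tagger", "wordnet"]


def download_nltk_resources(download=None):
    """Download each required resource; return {resource name: success flag}."""
    if download is None:
        # Imported lazily so the helper can be exercised without NLTK installed.
        import nltk
        download = nltk.download
    return {name: bool(download(name)) for name in NLTK_RESOURCES}
```

Calling `download_nltk_resources()` once before the first use of the package should be enough, since NLTK caches downloaded resources locally.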
asolut.preprocessing(texts, pos=None, chrsplt="\s|\\\\|/",
keepstopwords=False, mode="normal", chng=True)
Performs basic text preprocessing on a given string. Preprocessing includes tokenization, Part of Speech filtering, stopword removal, special character handling and lemmatization.
texts: str
The text to be preprocessed. Can be any valid string.
pos: [str, ...], default=None
The parts of speech that should be included in the output. Any word corresponding to a PoS not contained in the list will be discarded. List items must be valid Penn Treebank PoS tags. The actual default value of the parameter, assigned later in the function, is the following list:
["JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS", "RB", "RBR", "RBS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"] (adjectives, adverbs, verbs and nouns).
chrsplt: str, default="\s|\\\\|/"
Regular expression pattern that defines the characters on which the text should be split. Must be a valid RegEx pattern.
keepstopwords: bool, default=False
Specifies whether stop words should be kept (True) or discarded (False).
mode: {"none", "normal", "extended", "full", "custom (custom pattern)"}, default="normal"
Defines which special characters contained in words should be removed. Can be "none", "normal", "extended", "full", or any valid RegEx pattern preceded by the characters "custom " (e.g. "custom a|b"). The predefined RegEx patterns are as follows:
"none": no RegEx pattern (keeps words unchanged)
"normal": ^\W+|\W+$
"extended": ^[^\w°؋฿₿¢₡₵$₫֏€ƒ₲₾₴₭₺₼₥₦₱£﷼៛ރ₽₨௹₹৲૱₪₸₮₩¥₳₠₢₯₣₤₶ℳ₧₰₷©™®]+|[^\w°؋฿₿¢₡₵$₫֏€ƒ₲₾₴₭₺₼₥₦₱£﷼៛ރ₽₨௹₹৲૱₪₸₮₩¥₳₠₢₯₣₤₶ℳ₧₰₷©™®]+$
"full": \W
chng: bool, default=True
Specifies whether words should be lemmatized (True) or not (False).
textlist: [str, ...]
The pre-processed text as a list of tokens.
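To make the chrsplt and mode patterns concrete, the sketch below applies them with Python's re module. It assumes that the default chrsplt is read as an ordinary Python string literal, that splitting uses re.split with empty pieces dropped, and that special characters are removed per token with re.sub. These are assumptions about the internals, not confirmed behavior. `split_text` and `strip_chars` are hypothetical helpers, not asolut functions, and the "extended" pattern is left out for brevity.

```python
import re

# The default chrsplt, "\s|\\\\|/", written as a raw string: whitespace,
# a literal backslash, or a forward slash.
DEFAULT_SPLIT = r"\s|\\|/"

# Predefined mode patterns from the parameter list ("extended" omitted here).
MODE_PATTERNS = {
    "none": None,
    "normal": r"^\W+|\W+$",
    "full": r"\W",
}


def split_text(text, chrsplt=DEFAULT_SPLIT):
    """Split text on the pattern and drop empty pieces (an assumption)."""
    return [piece for piece in re.split(chrsplt, text) if piece]


def strip_chars(token, mode="normal"):
    """Remove special characters from one token according to mode."""
    if mode.startswith("custom "):
        # Everything after the "custom " prefix is the user's own pattern.
        pattern = mode[len("custom "):]
    else:
        pattern = MODE_PATTERNS[mode]
    return token if pattern is None else re.sub(pattern, "", token)


print(split_text("red/green blue\\yellow"))    # ['red', 'green', 'blue', 'yellow']
print(strip_chars("(well-known)!", "normal"))  # well-known
print(strip_chars("(well-known)!", "full"))    # wellknown
print(strip_chars("abacus", "custom a|b"))     # cus
```

Under these assumptions, "normal" only trims non-word characters at the edges of a token, while "full" also removes them inside the token.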
asolut.freqs(textlist, sortedby="sum", returntype="plot", figtitle="plot", numb=None)
Calculates the frequencies of words by taking into account their synonyms.
textlist: [str, ...]
A list of tokens. Preferably, word-level tokens.
sortedby: {"frequencies", "synonym frequencies", "sum"}, default="sum"
Specifies the type of frequency the output should be ordered by (descending).
"frequencies": Standard word frequencies.
"synonym frequencies": Solely word synonym frequencies.
"sum": The sum of both synonym and word frequencies.
returntype: {"plot", "data", "both"}, default="plot"
Specifies the output of the function.
"plot": Creates and saves an interactive html horizontal stacked bar chart. Returns None.
"data": Returns the resulting information as a pandas.DataFrame object.
"both": Creates and saves the interactive html barplot and returns the information as a pandas.DataFrame object.
If a plot is chosen to be generated, it is of the following format:
figtitle: str, default="plot"
If a plot was chosen to be created, this parameter specifies the filename under which it will be saved.
numb: int, default=None
Specifies the number of bars depicted in the barplot. The value of numb is given by this function:
where n_unique is the number of unique words after pre-processing and numb_input is the user input for the numb parameter. The input must be a positive integer.
data: pandas.DataFrame or None
The DataFrame containing the calculated counts. It is of the following format:
Words | Counts | Synonym Counts | List of synonyms |
---|---|---|---|
headphone | 1 | 3 | [earphone, earpiece] |
flower | 1 | 0 | [] |
earphone | 2 | 2 | [earpiece, headphone] |
earpiece | 1 | 3 | [earphone, headphone] |
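The counts in this table can be reproduced with a small sketch. It uses a hand-written synonym map in place of the library's synonym lookup, returns plain dictionaries rather than a pandas.DataFrame, and is not the official implementation. `SYNONYMS`, `synonym_frequencies` and `sort_rows` are hypothetical names chosen for illustration.

```python
from collections import Counter

# Hypothetical synonym map standing in for a real synonym lookup.
SYNONYMS = {
    "headphone": {"earphone", "earpiece"},
    "earphone": {"headphone", "earpiece"},
    "earpiece": {"headphone", "earphone"},
}


def synonym_frequencies(tokens, synonyms=SYNONYMS):
    """Return one row per unique token, in order of first appearance."""
    counts = Counter(tokens)
    rows = []
    for word, count in counts.items():
        # Only synonyms that actually occur in the text are counted.
        present = sorted(s for s in synonyms.get(word, ()) if s in counts)
        rows.append({
            "Words": word,
            "Counts": count,
            "Synonym Counts": sum(counts[s] for s in present),
            "List of synonyms": present,
        })
    return rows


def sort_rows(rows, sortedby="sum"):
    """Order rows in descending order by the chosen frequency type."""
    keys = {
        "frequencies": lambda r: r["Counts"],
        "synonym frequencies": lambda r: r["Synonym Counts"],
        "sum": lambda r: r["Counts"] + r["Synonym Counts"],
    }
    return sorted(rows, key=keys[sortedby], reverse=True)


tokens = ["headphone", "flower", "earphone", "earphone", "earpiece"]
rows = synonym_frequencies(tokens)
for row in rows:
    print(row)
```

For this token list, the rows carry the same counts and synonym lists as the table: each word's synonym count is the sum of the counts of its synonyms that appear in the text.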
asolut.gui()
Displays a graphical user interface that wraps the functions above, making the tool accessible to non-developers. It can only generate the horizontal stacked bar chart.