Text Analytics
Text Analytics empowers businesses with ‘Social Listening’ capabilities. It allows businesses to tune into structured and unstructured data across emails, text messages, and customer reviews to narrow down on positive and negative topics. With the data boom of the past decade, the manual approach to text analytics has proven ineffective and unproductive. AutoBrewML covers the top characteristics of Text Analytics-
Basic feature extraction from text adds information about the text, the type of text, patterns in the text, and any special characteristics, like the following (a minimal sketch follows this list)-
- Get word count
- Character (alphabet/number/special character) count
- Average word length
- Special characters count
- Upper case words count
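A minimal sketch of how these counts can be computed, assuming a pandas DataFrame with a 'text' column (the column name and the sample sentences are illustrative assumptions):
import pandas as pd

df = pd.DataFrame({'text': ["The Quick brown fox!!", "AutoBrewML adds 3 NEW tools"]})

df['word_count'] = df['text'].apply(lambda s: len(s.split()))                                          # word count
df['char_count'] = df['text'].str.len()                                                                # alphabets/numbers/special characters
df['avg_word_len'] = df['text'].apply(lambda s: sum(len(w) for w in s.split()) / len(s.split()))       # average word length
df['special_chars'] = df['text'].apply(lambda s: sum(not c.isalnum() and not c.isspace() for c in s))  # special character count
df['upper_words'] = df['text'].apply(lambda s: sum(w.isupper() for w in s.split()))                    # upper case word count
print(df)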
We also need to perform some standardization operations on the raw text so that it is machine readable and uniform across the corpus, like the following (see the sketch after this list)-
- Converting text to uniform lower case
- Removing punctuation
- Stop-word removal (commonly occurring English words like is/a/am/are/the that add no extra information to the text)
- Removal of the most frequent words appearing throughout the corpus (their presence is of no use in classifying our text data)
- Rare word removal (because they are so rare, the association between them and other words is dominated by noise)
- Spelling correction (this also helps reduce multiple copies of the same word being treated as different words)
- Stemming (removal of suffixes like “ing”, “ly”, “s”, etc. to get the base word out of different forms of the same word) or Lemmatization (a more effective option than stemming because it converts the word into its root word rather than just stripping the suffix)
- Most frequent n-grams search (a contiguous sequence of n items from a given sample of text)
- Sentiment Analysis to identify the polarity and subjectivity of each piece of text
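An illustrative sketch of these clean-up steps on a single string, using NLTK and TextBlob as one possible choice of libraries (the sample sentence is an assumption):
import string
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob

nltk.download('stopwords'); nltk.download('wordnet')

text = "The runners were running quickly, QUICKLY, towards the finish line!!"

text = text.lower()                                                        # uniform lower case
text = text.translate(str.maketrans('', '', string.punctuation))           # remove punctuation
tokens = [w for w in text.split() if w not in stopwords.words('english')]  # stop-word removal

lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(w) for w in tokens]                         # reduce words to their root form

bigrams = list(nltk.bigrams(tokens))                                       # contiguous n-grams (n=2)
print(Counter(bigrams).most_common(3))                                     # most frequent n-grams

print(TextBlob(" ".join(tokens)).sentiment)                                # polarity and subjectivity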
Text similarity determines how 'close' two pieces of text are, both in surface closeness (lexical similarity) and in meaning (semantic similarity).
-
Term Frequency-Inverse Document Frequency (TF-IDF):
This looks at words that appear in both pieces of text and scores them based on how often they appear. It is a useful tool if you expect the same words to appear in both pieces of text, but some words are more important than others.
Get cosine similarity between two texts-
Step 1. Convert the text into a vector of numbers (using TF-IDF scores)
a) TF = Term Frequency, i.e. the frequency of a word in the given sentence
b) IDF = Inverse Document Frequency, i.e. 1 / (number of times a word appears across all documents). This is important because words like is/am/are/the are present throughout the text and add no value/variability when present in a sentence, so taking the inverse assigns them a lower score. We can ignore the IDF score here because we have already removed the stop words and the most frequent words across the corpus.
c) TF-IDF score = TF score * IDF score
d) The text_to_vector function returns a mapping of { word: frequency }, i.e. the TF score, and thus converts a text into a vector.
Step 2. Calculate the cosine similarity of the two vectors
a) cos_sim(vectA, vectB) = (xa*xb + ya*yb + za*zb) / (sqrt(xa*xa + ya*ya + za*za) * sqrt(xb*xb + yb*yb + zb*zb)), i.e. the dot product of the vectors divided by the product of their magnitudes,
where vectA = (xa, ya, za) and vectB = (xb, yb, zb)
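A small hand-rolled sketch of these two steps; text_to_vector and cos_sim below are illustrative helpers built only from the formulas above (IDF is ignored, as noted):
import math
from collections import Counter

def text_to_vector(text):
    # Step 1: vector of { word: term frequency }
    return Counter(text.lower().split())

def cos_sim(vect_a, vect_b):
    # Step 2: dot product of the two vectors divided by the product of their magnitudes
    common = set(vect_a) & set(vect_b)
    dot = sum(vect_a[w] * vect_b[w] for w in common)
    norm_a = math.sqrt(sum(v * v for v in vect_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vect_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cos_sim(text_to_vector('tomatoes are actually fruit'),
              text_to_vector('fruit and vegetables')))
The same comparison can also be done with scikit-learn's TfidfVectorizer, as below-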
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
search_terms = 'fruit and vegetables'
documents = ['cars drive on the road', 'tomatoes are actually fruit']
doc_vectors = TfidfVectorizer().fit_transform([search_terms] + documents)
cosine_similarities = linear_kernel(doc_vectors[0:1], doc_vectors).flatten()
document_scores = [item.item() for item in cosine_similarities[1:]]
# >>[0.0, 0.190]
# Here, we create the model and ‘fit’ using the text corpus.
# TfidfVectorizer handles the pre-processing using its default tokenizer — this converts strings into lists of single word ‘tokens’. It produces a sparse matrix of document vectors containing the term frequencies.
# We then take the dot product (linear kernel) of the first vector (that contains the search terms) with the documents to determine the similarity. We have to ignore the first similarity result ([1:]) as that is comparing the search terms to themselves.
-
Semantic similarity:
This scores words based on how similar they are, even if they are not exact matches. It borrows techniques from Natural Language Processing (NLP), such as word embeddings. This is useful if the word overlap between texts is limited, such as when you need ‘fruit and vegetables’ to relate to ‘tomatoes’. GloVe, coined from Global Vectors, is a model for distributed word representation. The model is an unsupervised learning algorithm for obtaining vector representations of words. This is achieved by mapping words into a meaningful space where the distance between words is related to their semantic similarity.

import numpy as np
from scipy import spatial
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")  # choose from multiple models https://github.com/RaRe-Technologies/gensim-data

s0 = 'Mark zuckerberg owns the facebook company'
s1 = 'Facebook company ceo is mark zuckerberg'
s2 = 'Microsoft is owned by Bill gates'
s3 = 'How to learn japanese'

def preprocess(s):
    # lower-case and split the sentence into tokens
    return [i.lower() for i in s.split()]

def get_vector(s):
    # sum the GloVe vectors of all tokens to get a single sentence vector
    return np.sum(np.array([model[i] for i in preprocess(s)]), axis=0)
#Semantic similarity scores
print('s0 vs s1 ->',1 - spatial.distance.cosine(get_vector(s0), get_vector(s1)))
print('s0 vs s2 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s2)))
print('s0 vs s3 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s3)))
#>>s0 vs s1 -> 0.965923011302948
#>>s0 vs s2 -> 0.8659112453460693
#>>s0 vs s3 -> 0.5877998471260071
Text summarization is the process of creating a short, coherent, and fluent summary of a longer text document; it involves outlining the text's major points. Auto Summarize Text - extract the most important sentences from the whole text and combine them to form the abstract.
Word importance and sentence significance are measured as shown below-
from collections import defaultdict
from heapq import nlargest
from string import punctuation

from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.tokenize import sent_tokenize, word_tokenize

def summarize(text, n):
    sents = sent_tokenize(text)
    assert n <= len(sents)  # check that the text has at least n sentences
    word_sent = word_tokenize(text.lower())
    _stopwords = set(stopwords.words('english') + list(punctuation))
    word_sent = [word for word in word_sent if word not in _stopwords]
    freq = FreqDist(word_sent)  # word importance = frequency of non-stop-words
    ranking = defaultdict(int)
    for i, sent in enumerate(sents):
        for w in word_tokenize(sent.lower()):
            if w in freq:
                ranking[i] += freq[w]  # sentence significance = sum of its word frequencies
    sents_idx = nlargest(n, ranking, key=ranking.get)  # indices of the n highest-ranked sentences
    summary = " ".join(sents[j] for j in sorted(sents_idx))  # keep the original sentence order
    return summary
In text categorization, the system is fed a pre-built set of text examples and their relevant categories. The machine learning algorithm learns how each text is categorized and creates rules for itself. When new text is presented, it applies these rules to assign the new text to a category.
Steps (a condensed sketch follows this list)-
- Collect all articles which are to be bucketed into n clusters
- Vectorize the articles using a TF-IDF Vectorizer, i.e. the 'Bag of Words' model (a bag of words is a representation of text that describes the occurrence of words within a document; we just keep track of word counts and disregard the grammatical details and the word order)
- Use the transformed vectors to cluster the articles with K-Means clustering
- To validate that the clustering has bagged similar articles together, get the most important words out of the articles in each cluster (use terms other than stop words to get the term frequency); these words should point to a common central theme, i.e. the articles of a cluster run along a common theme as judged by their keywords
- Train a K-Nearest Neighbor model with all the articles and their cluster labels
- Vectorize the article you want to classify
- Use the KNN model to predict its label class
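A condensed sketch of these steps; the articles list, the cluster count, and new_article are illustrative assumptions:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

articles = ["stocks fell as markets reacted to the inflation data",
            "the striker scored twice in the league final",
            "the central bank raised interest rates again",
            "the team lifted the trophy after a penalty shootout"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(articles)                 # TF-IDF / bag-of-words vectors

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Validate the clusters via their most important (highest mean TF-IDF) terms
terms = vectorizer.get_feature_names_out()
for c in range(2):
    rows = np.where(kmeans.labels_ == c)[0]
    mean_tfidf = np.asarray(X[rows].mean(axis=0)).ravel()
    print(c, [terms[i] for i in mean_tfidf.argsort()[-3:][::-1]])

# Train a KNN model on the articles and their cluster labels, then classify a new article
knn = KNeighborsClassifier(n_neighbors=3).fit(X, kmeans.labels_)
new_article = "the goalkeeper saved a late penalty"
print(knn.predict(vectorizer.transform([new_article])))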
To refer to the working samples, please visit-
Twitter & Blog Post Analysis using AutoBrewML Text Analytics Tools
Datasets used-