---
title: 'cntext: a Python tool for text mining'
date: 1 Aug 2022
bibliography: paper.bib
---
cntext is a text analysis package that provides semantic distance and semantic projection based on word embedding models. In addition, cntext provides traditional methods such as word count, readability, document similarity, and sentiment analysis. The cntext repository is available at https://github.com/hidadeng/cntext.
Human society is rich in cognitions, such as perception, thinking, attitude, and emotion. As a carrier of ideas, text not only reflects people's mental activities at the individual level but also reflects collective culture at the organizational and societal level. In the social sciences, a main research path is mining personal mental activities and cultural changes in society through text data [@tausczik2010psychological].
There are two common families of text analysis algorithms: dictionary-based methods and word-embedding-based methods. The cntext library implements both.
With a dictionary-based method, we count the occurrences of a certain type of word in a text based on a given dictionary. For example, using a dictionary of emotional adjectives, such as NRC, we can count the occurrences of different emotional words in the text and thus learn the distribution of emotions in it [@chen2021symbols].
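To make the idea concrete, here is a minimal plain-Python sketch of dictionary-based counting with a made-up toy emotion dictionary; it is not cntext's API, and a real dictionary such as NRC contains thousands of entries per category.

```python
from collections import Counter

# Toy emotion dictionary for illustration only; a real dictionary such as
# NRC contains thousands of entries per category.
emotion_dict = {
    "joy": {"happy", "delighted", "glad"},
    "anger": {"angry", "furious", "annoyed"},
}

def count_emotions(text):
    """Count how many words of each emotion category appear in the text."""
    tokens = [t.strip(".,!?;:") for t in text.lower().split()]
    counts = Counter()
    for token in tokens:
        for emotion, words in emotion_dict.items():
            if token in words:
                counts[emotion] += 1
    return counts

print(count_emotions("She was happy and glad, but he was angry."))
# Counter({'joy': 2, 'anger': 1})
```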
Compared with dictionary-based methods, word-embedding-based methods offer a more expressive word representation and retain richer semantic information, so the scope of research topics is much wider, including social prejudice and stereotypes [@garg2018word], cultural cognition [@kozlowski2019geometry; @kurdi2019relationship], semantic change [@hamilton2016diachronic; @rodman2020timely], and the psychology of individual judgment and decision-making [@bhatia2019predicting]. A large number of such studies have been published in international journals such as Nature, Science, PNAS, Academy of Management Journal, American Sociological Review, and Management Science.
As far as we know, cntext is the only Python package that provides semantic projection. For instance, to recover the similarities in size among nouns in a certain category (e.g., animals), we project their representations onto the line that extends from the word vector of small to the word vector of big; and to order them according to how dangerous they are, we project them onto the line connecting safe and dangerous [@Grand2022SemanticPR].
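To illustrate the projection itself (this is not cntext's internal code), the following numpy sketch projects made-up two-dimensional word vectors onto the axis running from small to big; real embeddings have hundreds of dimensions.

```python
import numpy as np

# Made-up 2-d word vectors; real embedding models use hundreds of dimensions.
vectors = {
    "small": np.array([1.0, 0.0]),
    "big":   np.array([-1.0, 0.0]),
    "mouse": np.array([0.8, 0.3]),
    "whale": np.array([-0.9, 0.2]),
}

# Unit vector pointing from "small" towards "big".
axis = vectors["big"] - vectors["small"]
axis = axis / np.linalg.norm(axis)

# Project each animal onto the size axis; larger values mean "bigger".
for animal in ["mouse", "whale"]:
    print(animal, round(float(np.dot(vectors[animal], axis)), 2))
# mouse -0.8
# whale 0.9
```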
The functional modules of cntext include:
**stats.py**: basic text information
- word count
- readability (a readability sketch follows this list)
- built-in dictionaries
- sentiment analysis
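As one concrete example of a readability measure, the sketch below computes the classic Flesch reading ease score with a rough syllable heuristic; cntext's own readability formulas (in particular for Chinese text) may differ.

```python
import re

def count_syllables(word):
    """Very rough syllable heuristic: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch reading ease score; higher values indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))

print(round(flesch_reading_ease("The cat sat on the mat. It was happy."), 1))
# about 108 for this very simple text
```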
**dictionary.py**: build text models & expand dictionaries (vocabularies)
- expand a dictionary through the SoPmi (mutual information) algorithm
- expand a dictionary through the Word2Vec algorithm (a gensim-based sketch follows this list)
- build a GloVe model
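The following gensim-based sketch shows the general idea of expanding a seed dictionary with Word2Vec nearest neighbours; it is only an illustration with a toy corpus, not cntext's own implementation.

```python
from gensim.models import Word2Vec  # gensim >= 4.0

# Toy corpus; in practice the model is trained on a large domain corpus.
sentences = [
    ["the", "product", "is", "good", "and", "reliable"],
    ["the", "service", "was", "excellent", "and", "friendly"],
    ["a", "bad", "and", "unreliable", "product"],
    ["terrible", "service", "and", "awful", "support"],
] * 50

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=1)

# Expand a small seed dictionary of positive words with its nearest neighbours;
# in practice the candidates are reviewed manually before being added.
seed_words = ["good", "excellent"]
print(model.wv.most_similar(positive=seed_words, topn=5))
```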
**similarity.py**: document similarity (a plain-Python sketch follows this list)
- cosine similarity
- Jaccard similarity
- edit distance
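The sketch below shows cosine and Jaccard similarity on bag-of-words representations in plain Python; it illustrates the general algorithms rather than cntext's exact implementation.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard similarity between the two sets of word types."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b)

doc1 = "the cat sat on the mat".split()
doc2 = "the dog sat on the log".split()
print(round(cosine_similarity(doc1, doc2), 2))   # 0.75
print(round(jaccard_similarity(doc1, doc2), 2))  # 0.43
```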
**mind.py**: extract cognitive information (attitude, bias, etc.) from word embeddings
- tm.sematic_distance
- tm.sematic_projection
For example, gender bias about the engineering occupation can be measured as sematic_distance = distance(man, engineer) - distance(woman, engineer):
```python
>>> import cntext as ct
>>> # Note: this is a word2vec-format model
>>> tm = ct.Text2Mind(w2v_model_path='glove_w2v.6B.100d.txt')
>>> engineer_words = ['program', 'software', 'computer']
>>> man_words = ["man", "he", "him"]
>>> woman_words = ["woman", "she", "her"]
>>> tm.sematic_distance(words=engineer_words,
...                     c_words1=man_words,
...                     c_words2=woman_words)
-0.38
```
Since sematic_distance < 0, we have distance(man, engineer) < distance(woman, engineer). Semantically, engineer is closer to man and farther away from woman, so the corpus implies a gender bias against women with respect to the engineering occupation.
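One plausible way to compute such a group-level distance (the exact formula used by cntext may differ) is to average each word list into a group vector and compare cosine distances, as in the following numpy sketch with made-up vectors.

```python
import numpy as np

def mean_vector(words, vectors):
    """Average the vectors of a word list into one group vector."""
    return np.mean([vectors[w] for w in words], axis=0)

def cosine_distance(u, v):
    return 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up 2-d vectors standing in for real embeddings.
vectors = {
    "man": np.array([1.0, 0.2]), "he": np.array([0.9, 0.1]),
    "woman": np.array([0.2, 1.0]), "she": np.array([0.1, 0.9]),
    "engineer": np.array([0.8, 0.3]),
}

engineer = vectors["engineer"]
man_group = mean_vector(["man", "he"], vectors)
woman_group = mean_vector(["woman", "she"], vectors)

bias = cosine_distance(man_group, engineer) - cosine_distance(woman_group, engineer)
print(round(bias, 2))  # -0.49: negative, i.e. "engineer" is closer to the man group
```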
To explain the semantic projection of a word vector model, consider the animal-size example from a 2022 paper [@Grand2022SemanticPR]. Human cognitive information about animal size is hidden in corpus text. By projecting the vectors of different animal names onto the direction defined by the meanings of LARGE and SMALL (the red line in the figure of that paper), each animal's projection on the size axis is obtained, so animal sizes can be compared by calculation.
sematic_projection calculates the projected length of each word vector on the concept axis. Note that the result reflects the direction of the concept: a value greater than 0 means the word is semantically closer to c_words2.
```python
>>> animal_words = ['mouse', 'cat', 'horse', 'pig', 'whale']
>>> small_words = ["small", "little", "tiny"]
>>> large_words = ["large", "big", "huge"]
>>>
>>> tm.sematic_projection(words=animal_words,
...                       c_words1=small_words,
...                       c_words2=large_words)
[('mouse', -1.68),
 ('cat', -0.92),
 ('pig', -0.46),
 ('whale', -0.24),
 ('horse', 0.4)]
```
Regarding the perception of size, the corpus text implies that mice are small and horses are large.