Coherences
The Palmetto library offers classes that can be combined into thousands of different coherences. The framework for these coherences is described in M. Röder, A. Both, and A. Hinneburg: Exploring the Space of Topic Coherence Measures. In Proceedings of the Eighth International Conference on Web Search and Data Mining, 2015.
For this publication, more than 200,000 coherences were evaluated using the Palmetto library. For the program and the web service, this large number has been reduced to the following six most interesting coherences.
C_A is based on a context window, a pairwise comparison of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and the cosine similarity.
This coherence measure retrieves cooccurrence counts for the given words using a context window with a window size of 5. The counts are used to calculate the NPMI of every top word with every other top word, resulting in a vector for every top word. After that, the cosine similarity between all word pairs is calculated. The coherence is the arithmetic mean of these similarities. (Note that the original publication describes several other coherence measures. We have chosen this one because it was the best of these measures in our evaluation.)
Proposed in N. Aletras and M. Stevenson: Evaluating topic coherence using distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS'13) Long Papers, pages 13-22, 2013.
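The following Python sketch illustrates this calculation. It assumes that the marginal probabilities `p[w]` and the joint probabilities `p2[(w, v)]` have already been estimated from the context-window counts; these names and helper functions are illustrative, not part of the Palmetto API.

```python
import itertools
import math
import numpy as np

def npmi(p_i, p_j, p_ij):
    """NPMI in [-1, 1]; assumes 0 < p_ij < 1 and returns -1
    for word pairs that never cooccur."""
    if p_ij <= 0:
        return -1.0
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def c_a(words, p, p2):
    """Mean cosine similarity between the NPMI vectors of all
    top-word pairs. p2 is assumed symmetric, with p2[(w, w)] == p[w]."""
    vec = {w: np.array([npmi(p[w], p[v], p2[(w, v)]) for v in words])
           for w in words}
    pairs = list(itertools.combinations(words, 2))
    return sum(cosine(vec[a], vec[b]) for a, b in pairs) / len(pairs)
```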
C_V is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and the cosine similarity.
This coherence measure retrieves cooccurrence counts for the given words using a sliding window with a window size of 110. The counts are used to calculate the NPMI of every top word with every other top word, resulting in a set of vectors, one for every top word. The one-set segmentation of the top words leads to the calculation of the similarity between every top word vector and the sum of all top word vectors. The cosine is used as the similarity measure. The coherence is the arithmetic mean of these similarities. (Note that this was the best coherence measure in our evaluation.)
Proposed in M. Röder, A. Both, and A. Hinneburg: Exploring the Space of Topic Coherence Measures. In Proceedings of the Eighth International Conference on Web Search and Data Mining, 2015.
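A sketch of the one-set segmentation, reusing the hypothetical `npmi` and `cosine` helpers and the probability dictionaries from the C_A example above:

```python
def c_v(words, p, p2):
    """Compare every top-word NPMI vector with the vector of the
    whole top-word set (the sum of all top-word vectors)."""
    vec = {w: np.array([npmi(p[w], p[v], p2[(w, v)]) for v in words])
           for w in words}
    total = sum(vec.values())  # vector representing the full top-word set
    return sum(cosine(vec[w], total) for w in words) / len(words)
```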
C_P is based on a sliding window, a one-preceding segmentation of the top words, and the confirmation measure of Fitelson's coherence.
Word cooccurrence counts for the given top words are derived using a sliding window with a window size of 70. For every top word, the confirmation with its preceding top word is calculated using the confirmation measure of Fitelson's coherence. The coherence is the arithmetic mean of the confirmation measure results.
Proposed in M. Röder, A. Both, and A. Hinneburg: Exploring the Space of Topic Coherence Measures. In Proceedings of the Eighth International Conference on Web Search and Data Mining, 2015.
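A sketch under the same assumptions as above (precomputed probability dictionaries `p` and `p2`). Fitelson's confirmation of a word W' by a conditioning word W* is commonly given as (P(W'|W*) - P(W'|¬W*)) / (P(W'|W*) + P(W'|¬W*)):

```python
def fitelson(p_cond_word, p_word, p_joint):
    """Confirmation of `word` by `cond_word`:
    (P(w|c) - P(w|not c)) / (P(w|c) + P(w|not c))."""
    p_w_given_c = p_joint / p_cond_word
    p_w_given_not_c = (p_word - p_joint) / (1.0 - p_cond_word)
    return ((p_w_given_c - p_w_given_not_c)
            / (p_w_given_c + p_w_given_not_c))

def c_p(words, p, p2):
    """words are ordered by rank; each word is confirmed by its
    direct predecessor (one-preceding segmentation)."""
    scores = [fitelson(p[prev], p[curr], p2[(prev, curr)])
              for prev, curr in zip(words, words[1:])]
    return sum(scores) / len(scores)
```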
C_NPMI is an enhanced version of the C_UCI coherence using the normalized pointwise mutual information (NPMI) instead of the pointwise mutual information (PMI).
Proposed in N. Aletras and M. Stevenson: Evaluating topic coherence using distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS'13) Long Papers, pages 13-22, 2013.
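The only change relative to C_UCI is the normalization term. A minimal sketch (the `eps` smoothing constant is illustrative, used only to avoid the logarithm of zero):

```python
import math

def pmi(p_i, p_j, p_ij, eps=1e-12):
    return math.log((p_ij + eps) / (p_i * p_j))

def npmi(p_i, p_j, p_ij, eps=1e-12):
    # Dividing by -log P(wi, wj) bounds the score to [-1, 1].
    return pmi(p_i, p_j, p_ij, eps) / -math.log(p_ij + eps)
```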
C_UCI is a coherence that is based on a sliding window and the pointwise mutual information (PMI) of all word pairs of the given top words.
The word cooccurrence counts are derived using a sliding window with a window size of 10. For every word pair, the PMI is calculated. The arithmetic mean of the PMI values is the result of this coherence. (Note that in the original publication only the sum of these values is calculated.)
Proposed in D. Newman, J. H. Lau, K. Grieser, and T. Baldwin: Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 100-108. Association for Computational Linguistics, 2010.
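A sketch using the same hypothetical probability dictionaries as the examples above:

```python
import itertools
import math

def c_uci(words, p, p2, eps=1e-12):
    """Mean PMI over all top-word pairs; probabilities are assumed
    to be estimated from sliding windows of size 10."""
    pairs = list(itertools.combinations(words, 2))
    return sum(math.log((p2[(a, b)] + eps) / (p[a] * p[b]))
               for a, b in pairs) / len(pairs)
```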
C_UMass is based on document cooccurrence counts, a one-preceding segmentation and a logarithmic conditional probability as the confirmation measure.
The main idea of this coherence is that the occurrence of every top word should be supported by every preceding top word. Thus, the probability of a top word occurring should be higher if a document already contains a higher-ranked top word of the same topic. Therefore, for every top word the logarithm of its conditional probability is calculated, using every other top word that is ranked higher as the condition. The probabilities are derived using document cooccurrence counts. The single conditional probabilities are aggregated using the arithmetic mean. (Note that in the original publication only the sum of these values is calculated.)
Proposed in D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum: Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 262-272. Association for Computational Linguistics, 2011.
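A sketch based on document counts rather than window counts. `doc_count` and `doc_cocount` are hypothetical dictionaries holding the number of documents containing a word or a word pair; the add-one smoothing of the cooccurrence count follows the original publication:

```python
import math

def c_umass(words, doc_count, doc_cocount):
    """words are ordered from highest to lowest rank; every word is
    conditioned on each word ranked above it."""
    scores = []
    for i in range(1, len(words)):
        for j in range(i):
            # log P(w_i | w_j), smoothed so zero cooccurrences stay finite
            scores.append(math.log(
                (doc_cocount[(words[i], words[j])] + 1.0)
                / doc_count[words[j]]))
    return sum(scores) / len(scores)
```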