I can't get "R&D" and "SMEs" in the same word cloud #769

aarre · 2024-06-28T01:10:58Z

Description

I work in a field where "R&D" (research and development) and "SMEs" (small and medium enterprises) are important concepts. If I tokenize the text myself, Word Cloud displays "r&d" correctly but does not display "smes", even though I have verified that it is prominent in my frequency count. If I let Word Cloud tokenize, it displays "smes" correctly but renders "R&D" as "r d" (i.e., with a space where the ampersand should be). Is there anything I can do?

Steps/Code to Reproduce

Example:

    with open(text_path, 'r', encoding='utf-8') as file:
        text = file.read()

    frequencies = collections.Counter()
    for word in text.split(" "):
        frequencies[word] += 1
    frequencies = dict(frequencies)

    # This shows "smes" and "r d" (but no ampersand)
    cloud = wordcloud.WordCloud(width=1920, height=1080,
                                background_color='white',
                                stopwords=stop_words.STOP_WORDS,
                                font_path="./assets/fonts/roboto/Roboto-Regular.ttf").generate(text)

    # This shows "r&d" but not "smes"
    cloud = wordcloud.WordCloud(width=1920, height=1080,
                                background_color='white',
                                stopwords=set(),
                                font_path="./assets/fonts/roboto/Roboto-Regular.ttf").generate_from_frequencies(frequencies)

Expected Results

Either way, I should be able to get both "smes" and "r&d" on the same Word Cloud.

Actual Results

As described above, in one case I get "smes" and "r d", and in the other case, I get "r&d" but no "smes".

Versions

Windows-11-10.0.22631-SP0
Python 3.12.4 (tags/v3.12.4:8e8a4ba, Jun 6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)]
NumPy 1.26.4
matplotlib 3.9.0
wordcloud 1.9.3
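
A possible contributing factor on the hand-counted path: `text.split(" ")` treats "SMEs", "SMEs." and "smes," as different keys, and the `stopwords` argument is documented as ignored by `generate_from_frequencies`, so common words are never filtered out and can crowd other terms out of the cloud. The following is a minimal sketch of a counter that normalizes case, keeps "&" inside a token, and drops stopwords before counting. The pattern, the function name `count_frequencies`, and the sample text are illustrative assumptions, not part of wordcloud.

```python
import collections
import re

# Hypothetical token pattern: a word character followed by any run of
# word characters, apostrophes, or "&" (so "r&d" stays one token).
TOKEN_PATTERN = re.compile(r"\w[\w'&]*")


def count_frequencies(text, stopwords=frozenset()):
    """Count lowercased tokens, keeping "&" inside words and skipping stopwords."""
    counts = collections.Counter()
    for token in TOKEN_PATTERN.findall(text.lower()):
        if token not in stopwords:
            counts[token] += 1
    return dict(counts)


# Illustrative sample text and stopword set (assumptions for this sketch).
sample = "R&D matters to SMEs. SMEs, in turn, fund R&D and the r&d staff."
frequencies = count_frequencies(sample, stopwords={"to", "in", "and", "the"})
print(frequencies)
```

The resulting dict can be passed to `generate_from_frequencies` in place of the one built with `text.split(" ")`.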

aarre · 2024-06-28T13:23:50Z

Here is a fully functional example. The version where I calculate my own frequencies works fine, but it is clear that there is an issue with the version where I send the full text to Word Cloud.

    #!venv/bin/python

    import collections
    import matplotlib.pyplot as plt
    import wordcloud

    if __name__ == "__main__":

        text = " ".join(["sme"] * 10 + ["r&d"] * 10)

        # Count tokens by hand, splitting on single spaces
        frequencies = collections.Counter()
        for word in text.split(" "):
            frequencies[word] += 1
        frequencies = dict(frequencies)

        # Let Word Cloud tokenize the raw text
        cloud = wordcloud.WordCloud().generate(text)
        plt.imshow(cloud)
        plt.tight_layout(pad=0)
        plt.axis('off')
        plt.title('text')
        plt.show()

        # Use the hand-counted frequencies instead
        cloud = wordcloud.WordCloud().generate_from_frequencies(frequencies)
        plt.imshow(cloud)
        plt.tight_layout(pad=0)
        plt.axis('off')
        plt.title('frequencies')
        plt.show()

vvsotnikov · 2024-07-25T06:13:33Z

You can provide a custom regexp:

    #!venv/bin/python
    import collections
    import matplotlib.pyplot as plt
    import wordcloud

    if __name__ == "__main__":

        text = " ".join(["sme"] * 10 + ["r&d"] * 10)

        frequencies = collections.Counter()
        for word in text.split(" "):
            frequencies[word] += 1
        frequencies = dict(frequencies)

        cloud = wordcloud.WordCloud(regexp=r"\b[\w&][\w'&]+").generate(text)  # Custom regexp includes "&" character
        plt.imshow(cloud)
        plt.tight_layout(pad=0)
        plt.axis('off')
        plt.title('text')
        plt.show()

        cloud = wordcloud.WordCloud().generate_from_frequencies(frequencies)
        plt.imshow(cloud)
        plt.tight_layout(pad=0)
        plt.axis('off')
        plt.title('frequencies')
        plt.show()

For reference, here is the default regexp used by wordcloud:

word_cloud/wordcloud/wordcloud.py, lines 582 to 583 in 1072b0e:

    pattern = r"\w[\w']*" if self.min_word_length <= 1 else r"\w[\w']+"
    regexp = self.regexp if self.regexp is not None else pattern
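
To see what the two patterns do to "R&D", they can be compared with plain `re.findall`, without building a cloud. This is a small sketch that assumes the library applies the pattern with `re.findall` and no special flags; the sample string is made up for illustration.

```python
import re

# Default pattern quoted above (the min_word_length <= 1 branch)
DEFAULT_PATTERN = r"\w[\w']*"
# Custom pattern from the suggestion above, which also accepts "&"
CUSTOM_PATTERN = r"\b[\w&][\w'&]+"

sample = "R&D and SMEs"  # illustrative input

default_tokens = re.findall(DEFAULT_PATTERN, sample)
custom_tokens = re.findall(CUSTOM_PATTERN, sample)

print(default_tokens)  # ['R', 'D', 'and', 'SMEs']
print(custom_tokens)   # ['R&D', 'and', 'SMEs']
```

With the default pattern, "&" is not a word character, so "R&D" is split into "R" and "D", which is presumably why the cloud shows "r d". Note that the custom pattern ends in `+`, so it only matches tokens of at least two characters.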
