I can't get "R&D" and "SMEs" in the same word cloud #769

Open
aarre opened this issue Jun 28, 2024 · 2 comments
aarre commented Jun 28, 2024

Description

I work in a field where "R&D" (research and development) and "SMEs" (small and medium enterprises) are important concepts. If I tokenize the text myself and pass the frequencies to Word Cloud, it displays "r&d" correctly but does not display "smes", even though I have verified that "smes" is prominent in my frequency count. If I let Word Cloud tokenize the text, it displays "smes" correctly but renders "R&D" as "r d" (i.e., with a space where the ampersand should be). Is there anything I can do?

Steps/Code to Reproduce

Example:

    with open(text_path, 'r', encoding='utf-8') as file:
        text = file.read()

    frequencies = collections.Counter()
    for word in text.split(" "):
        frequencies[word] += 1
    frequencies = dict(frequencies)

    # This shows "smes" and "r d" (but no ampersand)
    cloud = wordcloud.WordCloud(width=1920, height=1080,
                                background_color='white',
                                stopwords=stop_words.STOP_WORDS,
                                font_path="./assets/fonts/roboto/Roboto-Regular.ttf").generate(text)

    # This shows "r&d" but not "smes"
    cloud = wordcloud.WordCloud(width=1920, height=1080,
                                background_color='white',
                                stopwords=set(),
                                font_path="./assets/fonts/roboto/Roboto-Regular.ttf").generate_from_frequencies(frequencies)

Expected Results

Either way, I should be able to get both "smes" and "r&d" on the same Word Cloud.

Actual Results

As described above, in one case I get "smes" and "r d", and in the other case, I get "r&d" but no "smes".

Versions

Windows-11-10.0.22631-SP0
Python 3.12.4 (tags/v3.12.4:8e8a4ba, Jun 6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)]
NumPy 1.26.4
matplotlib 3.9.0
wordcloud 1.9.3

aarre commented Jun 28, 2024

Here is a fully functional example. The version where I calculate my own frequencies works fine, but it is clear that there is an issue with the version where I send the full text to Word Cloud.

#!venv/bin/python

import collections
import matplotlib.pyplot as plt
import wordcloud

if __name__ == "__main__":

    text = " ".join(["sme"] * 10 + ["r&d"] * 10)

    frequencies = collections.Counter()
    for word in text.split(" "):
        frequencies[word] += 1
    frequencies = dict(frequencies)

    cloud = wordcloud.WordCloud().generate(text)
    plt.imshow(cloud)
    plt.tight_layout(pad=0)
    plt.axis('off')
    plt.title('text')
    plt.show()  # bbox_inches is a savefig option, not a show option

    cloud = wordcloud.WordCloud().generate_from_frequencies(frequencies)
    plt.imshow(cloud)
    plt.tight_layout(pad=0)
    plt.axis('off')
    plt.title('frequencies')
    plt.show()  # bbox_inches is a savefig option, not a show option

[Image: img_text]

[Image: img_frequencies]

@vvsotnikov

You can provide a custom regexp:

#!venv/bin/python
import collections
import matplotlib.pyplot as plt
import wordcloud

if __name__ == "__main__":

    text = " ".join(["sme"] * 10 + ["r&d"] * 10)

    frequencies = collections.Counter()
    for word in text.split(" "):
        frequencies[word] += 1
    frequencies = dict(frequencies)

    cloud = wordcloud.WordCloud(regexp=r"\b[\w&][\w'&]+").generate(text)  # Custom regexp includes "&" character
    plt.imshow(cloud)
    plt.tight_layout(pad=0)
    plt.axis('off')
    plt.title('text')
    plt.show()  # bbox_inches is a savefig option, not a show option

    cloud = wordcloud.WordCloud().generate_from_frequencies(frequencies)
    plt.imshow(cloud)
    plt.tight_layout(pad=0)
    plt.axis('off')
    plt.title('frequencies')
    plt.show()  # bbox_inches is a savefig option, not a show option
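
If you would rather keep computing the frequencies yourself, the same pattern can be reused for your own tokenization. This is only a rough sketch using the standard library: `count_tokens` and the sample sentence are made up for illustration, and lower-casing plus stopword filtering are assumptions about what you want, not something wordcloud requires.

```python
import collections
import re

# Same pattern as in the snippet above: "&" is allowed inside a token.
TOKEN_RE = re.compile(r"\b[\w&][\w'&]+")


def count_tokens(text, stopwords=frozenset()):
    """Count lower-cased tokens, skipping anything listed in `stopwords`."""
    tokens = (token.lower() for token in TOKEN_RE.findall(text))
    return collections.Counter(t for t in tokens if t not in stopwords)


# Hypothetical sample text; punctuation next to a token is not captured.
frequencies = count_tokens("SMEs invest in R&D; R&D helps SMEs.", stopwords={"in"})
print(frequencies)
```

The resulting counts can then be passed on as `dict(frequencies)` to `generate_from_frequencies`, in the same way as in the example above.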

For reference, here's the default regexp used by wordcloud:

pattern = r"\w[\w']*" if self.min_word_length <= 1 else r"\w[\w']+"
regexp = self.regexp if self.regexp is not None else pattern
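
Since `&` is not matched by `\w`, the default pattern splits "r&d" into two separate tokens. A quick way to check this without rendering anything is to run both patterns through `re.findall` (the sample string is just an illustration):

```python
import re

sample = "sme r&d"

default_pattern = r"\w[\w']*"        # default when min_word_length <= 1
custom_pattern = r"\b[\w&][\w'&]+"   # pattern suggested above, allows "&"

default_tokens = re.findall(default_pattern, sample)
custom_tokens = re.findall(custom_pattern, sample)

print(default_tokens)  # ['sme', 'r', 'd']
print(custom_tokens)   # ['sme', 'r&d']
```

As far as I understand, the "r d" in the rendered cloud then comes from the collocation (bigram) step joining the adjacent "r" and "d" tokens, but I have not verified that part. Note also that the custom pattern only matches tokens of at least two characters.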

