RAKE_LANGUAGES empty list in Extract Keywords #1099

Open
gmolledaj opened this issue Feb 21, 2025 · 7 comments

gmolledaj commented Feb 21, 2025

Describe the bug
Inserting the Extract Keywords widget opens an error window.

To Reproduce
Insert the Extract Keywords widget into a workflow.

Expected behavior
The widget opens without an error.

Orange version:
3.38.1

Text add-on version:
1.16.2

Screenshots

Exception: ValueError: Combo box does not contain item 'en'
Module: orangewidget.gui:2398
Widget Name: Extract Keywords
Widget Module: orangecontrib.text.widgets.owkeywords:250
Widget Scheme: <node_properties /> <session_state> <window_groups /> </session_state>
Version: 3.38.1
Environment: Python 3.10.16 on Linux 6.11.0-1011-oem #11-Ubuntu SMP PREEMPT_DYNAMIC Thu Dec 19 11:35:20 UTC 2024 x86_64
Installed Packages: AnyQt==0.2.0, Bottleneck==1.4.2, Brotli==1.1.0, Orange3-Text==1.16.2, Orange3-Timeseries==0.6.3, Orange3==3.38.1, PyQt6-Qt6==6.8.1, PyQt6-WebEngine-Qt6==6.8.1, PyQt6-WebEngine==6.8.0, PyQt6==6.8.0, PyQt6_sip==13.9.1, PyYAML==6.0.2, Pygments==2.19.1, QtPy==2.4.2, SecretStorage==3.3.3, XlsxWriter==3.2.0, anyio==4.8.0, asttokens==3.0.0, attrs==24.3.0, backports.tarfile==1.2.0, baycomp==1.0.3, beautifulsoup4==4.12.3, biopython==1.84, catboost==1.2.7, cattrs==24.1.2, certifi==2024.12.14, cffi==1.17.1, chardet==5.2.0, charset-normalizer==3.4.1, click==8.1.8, comm==0.2.2, commonmark==0.9.1, conllu==6.0.0, contourpy==1.3.1, cryptography==44.0.0, cycler==0.12.1, debugpy==1.8.11, decorator==5.1.1, defusedxml==0.7.1, dictdiffer==0.9.0, docutils==0.21.2, docx2txt==0.8, et_xmlfile==2.0.0, exceptiongroup==1.2.2, executing==2.1.0, fonttools==4.55.3, frozendict==2.4.6, gensim==4.3.3, graphviz==0.20.3, h11==0.14.0, html5lib==1.1, httpcore==1.0.7, httpx==0.28.1, idna==3.10, importlib_metadata==8.5.0, ipykernel==6.29.5, ipython==8.31.0, jaraco.classes==3.4.0, jaraco.context==6.0.1, jaraco.functools==4.1.0, jedi==0.19.2, jeepney==0.8.0, jellyfish==1.1.3, joblib==1.4.2, jupyter_client==8.6.3, jupyter_core==5.7.2, keyring==25.6.0, keyrings.alt==5.0.2, kiwisolver==1.4.8, langdetect==1.0.9, lemmagen3==3.5.1, lxml==5.3.0, matplotlib-inline==0.1.7, matplotlib==3.10.0, more-itertools==10.5.0, multitasking==0.0.11, nest-asyncio==1.6.0, networkx==3.4.2, nltk==3.9.1, numpy==1.26.4, oauthlib==3.2.2, odfpy==1.4.1, openTSNE==1.0.2, openpyxl==3.1.5, orange-canvas-core==0.2.5, orange-widget-base==4.25.1, owlready2==0.47, packaging==24.2, pandas-datareader==0.10.0, pandas==2.2.3, parso==0.8.4, patsy==1.0.1, peewee==3.17.8, pexpect==4.9.0, pillow==11.1.0, pip==24.2, platformdirs==4.3.6, plotly==5.24.1, prompt_toolkit==3.0.48, psutil==6.1.1, ptyprocess==0.7.0, pure_eval==0.2.3, pybind11==2.13.6, pycparser==2.22, pyparsing==3.2.1, pypdf==5.1.0, pyqtgraph==0.13.7, 
python-dateutil==2.9.0.post0, python-louvain==0.16, pytz==2024.2, pyzmq==26.2.0, qasync==0.27.1, qtconsole==5.6.1, regex==2024.11.6, requests-cache==1.2.1, requests-oauthlib==1.3.1, requests==2.32.3, scikit-learn==1.6.1, scipy==1.13.1, segtok==1.5.11, serverfiles==0.3.1, setuptools==75.1.0, shapely==2.0.6, simhash==2.1.2, six==1.17.0, smart-open==7.1.0, sniffio==1.3.1, soupsieve==2.6, stack-data==0.6.3, statsmodels==0.14.4, tabulate==0.9.0, tenacity==9.0.0, threadpoolctl==3.5.0, tornado==6.4.2, tqdm==4.67.1, traitlets==5.14.3, trimesh==4.5.3, tweepy==4.14.0, typing_extensions==4.12.2, tzdata==2024.2, ufal.udpipe==1.3.1.1, url-normalize==1.4.3, urllib3==2.3.0, wcwidth==0.2.13, webencodings==0.5.1, websockets==14.1, wheel==0.44.0, wikipedia==1.4.0, wrapt==1.17.1, xgboost==2.0.3, xlrd==2.0.1, yake==0.4.8, yfinance==0.2.51, zipp==3.21.0
Machine ID: ba9f343d-3664-41fb-8f9d-497705247f36
Stack Trace: Traceback (most recent call last):
  File "/home/gmj/miniconda3/envs/orange3/lib/python3.10/site-packages/orangecanvas/scheme/widgetmanager.py", line 408, in __process_init_queue
    self.ensure_created(node)
  File "/home/gmj/miniconda3/envs/orange3/lib/python3.10/site-packages/orangecanvas/scheme/widgetmanager.py", line 354, in ensure_created
    self.__add_widget_for_node(node)
  File "/home/gmj/miniconda3/envs/orange3/lib/python3.10/site-packages/orangecanvas/scheme/widgetmanager.py", line 247, in __add_widget_for_node
    w = self.create_widget_for_node(node)
  File "/home/gmj/miniconda3/envs/orange3/lib/python3.10/site-packages/orangewidget/workflow/widgetsscheme.py", line 300, in create_widget_for_node
    widget = self.create_widget_instance(node)
  File "/home/gmj/miniconda3/envs/orange3/lib/python3.10/site-packages/orangewidget/workflow/widgetsscheme.py", line 413, in create_widget_instance
    widget.__init__()
  File "/home/gmj/miniconda3/envs/orange3/lib/python3.10/site-packages/orangecontrib/text/widgets/owkeywords.py", line 250, in __init__
    self._setup_gui()
  File "/home/gmj/miniconda3/envs/orange3/lib/python3.10/site-packages/orangecontrib/text/widgets/owkeywords.py", line 262, in _setup_gui
    rake_cb = gui.comboBox(
  File "/home/gmj/miniconda3/envs/orange3/lib/python3.10/site-packages/Orange/widgets/gui.py", line 485, in comboBox
    return gui_comboBox(
  File "/home/gmj/miniconda3/envs/orange3/lib/python3.10/site-packages/orangewidget/gui.py", line 1674, in comboBox
    callfront.action(cindex)
  File "/home/gmj/miniconda3/envs/orange3/lib/python3.10/site-packages/orangewidget/gui.py", line 2398, in action
    raise ValueError("Combo box does not contain item " + repr(value))
ValueError: Combo box does not contain item 'en'
Local Variables: OrderedDict([('self',              <orangewidget.gui.CallFrontComboBoxModel object at 0x7966ae2c8400>),             ('value', 'en')])

Operating system:
Linux Mint 22

Additional context
If I add print("RAKE_LANGUAGES contenido:", RAKE_LANGUAGES) just before the error occurs,
the output is: RAKE_LANGUAGES contenido: set()
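
The traceback and the empty set together suggest that the combo box is built from an empty language list, so the stored default 'en' cannot be selected. Below is a minimal sketch of a defensive fallback in plain Python. `select_language` is a hypothetical helper written for illustration; it is not part of Orange or the Text add-on.

```python
from typing import Iterable, Optional


def select_language(available: Iterable[str], preferred: str = "en") -> Optional[str]:
    """Return `preferred` if it is offered, else the first code alphabetically, else None."""
    options = sorted(available)
    if preferred in options:
        return preferred
    # Fall back instead of raising; the combo box in the traceback raises
    # ValueError when asked to select an item it does not contain.
    return options[0] if options else None


print(select_language(set()))          # None: the empty set from this report
print(select_language({"es", "en"}))   # en
print(select_language({"es", "fr"}))   # es
```

A guard like this would only hide the symptom; the underlying question remains why the language set is empty.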

gmolledaj commented Feb 22, 2025

Same behavior with Orange3 version 3.37.0 and Text 1.16.1 on Windows 10.

Workaround: in .../orange3/lib/python3.10/site-packages/orangecontrib/text/widgets, add at line 263:
RAKE_LANGUAGES = {'en', 'es'}

or, better, add all of these lines:

        try:
            print(RAKE_LANGUAGES)
        except NameError:  # also covers UnboundLocalError, its subclass
            print("The RAKE_LANGUAGES variable is not assigned.")
            RAKE_LANGUAGES = {'en'}  # fall back to English only

This is only a stopgap. Also add the code of any language you plan to work with, provided it is supported.
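
A note on why the try/except form behaves the way it does, assuming the lines are pasted inside the widget method (as the indentation suggests): assigning to a name anywhere in a function makes that name local for the whole function, so the earlier read fails even though a module-level value exists. The sketch below uses the stand-in name `LANGS` rather than the real constant.

```python
LANGS = {"en", "es"}  # stand-in for a module-level constant


def read_then_assign():
    try:
        # LANGS is assigned later in this function, so Python treats it as
        # local here and the read fails before the global is ever consulted.
        return LANGS
    except NameError:  # UnboundLocalError is a subclass of NameError
        LANGS = {"en"}
        return LANGS


def read_only():
    return LANGS  # no assignment in this function, so the global is used


print(read_then_assign())                        # {'en'}
print(sorted(read_only()))                       # ['en', 'es']
print(issubclass(UnboundLocalError, NameError))  # True
```

Under that assumption the except branch always runs, which means the workaround replaces the language set unconditionally rather than only when it is missing.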

gmolledaj commented Feb 22, 2025

I see that Preprocess Text also lists no dictionaries under Filtering - Stopwords.
No nltk_data directory exists in any of the paths printed by:

import nltk
print(nltk.data.path)

I downloaded the dictionaries:

nltk.download('stopwords')

But they are still not detected, even after completely exiting and re-entering the Python environment.
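
To check where the stopword lists actually ended up, one option is to look for a `corpora/stopwords` directory under each search path. The sketch below uses only the standard library; `find_stopwords_dirs` is a hypothetical helper, and the temporary directory merely stands in for an nltk_data folder.

```python
import os
import tempfile
from typing import Iterable, List


def find_stopwords_dirs(search_paths: Iterable[str]) -> List[str]:
    """Return every `<path>/corpora/stopwords` directory that exists."""
    found = []
    for base in search_paths:
        candidate = os.path.join(base, "corpora", "stopwords")
        if os.path.isdir(candidate):
            found.append(candidate)
    return found


# Demonstration with a temporary directory standing in for an nltk_data folder.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "corpora", "stopwords"))
    hits = find_stopwords_dirs([tmp, os.path.join(tmp, "missing")])
    print(len(hits))  # 1
```

With NLTK installed, calling `find_stopwords_dirs(nltk.data.path)` in the same interpreter that runs Orange would show whether the download landed in a directory that is on the search path.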

gmolledaj commented Feb 22, 2025

In Preprocess Text, the list of languages under Filtering - Stopwords did not appear either. On checking, I see that an error occurs in filter.py:

        try:
            return {
                StopwordsFilter.lang_to_iso(file.title())
                for file in os.listdir(stopwords._get_root())
                if file.islower()
            }
        except LookupError:  # when no NLTK data is available
            return set()

The empty set is returned from that except branch.

@gmolledaj

The error in Filtering is caused by Albanian, which does not exist in LANG2ISO.
In language.py, at line 90, add:
+ "sq": "Albanian"

but the error in Extract Keywords persists...
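
The reason a single missing language empties the whole list is that a failed dictionary lookup raises KeyError, which is a subclass of LookupError, so the handler meant for missing NLTK data catches it too. A minimal sketch with stand-in data (the `LANG2ISO` and `STOPWORD_FILES` values here are illustrative, not the add-on's real contents):

```python
# Stand-in mapping that, as described above, lacks an entry for Albanian.
LANG2ISO = {"English": "en", "Spanish": "es"}
STOPWORD_FILES = ["albanian", "english", "spanish", "README"]


def supported_languages():
    try:
        return {LANG2ISO[name.title()] for name in STOPWORD_FILES if name.islower()}
    except LookupError:  # intended for missing NLTK data, but KeyError matches too
        return set()


print(issubclass(KeyError, LookupError))  # True
print(supported_languages())              # set()

LANG2ISO["Albanian"] = "sq"
print(sorted(supported_languages()))      # ['en', 'es', 'sq']
```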

gmolledaj commented Feb 22, 2025

SOLVED:
orangecontrib/text/keywords/__init__.py
line 33

+def get_rake_languages():
+    return StopwordsFilter.supported_languages()

orangecontrib/text/widgets/owkeywords.py
line 25

-    YAKE_LANGUAGES, RAKE_LANGUAGES
+    YAKE_LANGUAGES, RAKE_LANGUAGES, get_rake_languages

line 263

+ RAKE_LANGUAGES = get_rake_languages()
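
One possible explanation for why this helps, not confirmed anywhere in this thread: a module-level constant is computed once at import time, while a function is evaluated each time it is called, so it can see data that only became available later. The sketch below illustrates that difference with stand-in names (`_available`, `get_languages`, `LANGUAGES_AT_IMPORT` are all hypothetical).

```python
_available = set()  # stand-in for data that is not ready at import time


def get_languages():
    """Evaluate on demand, so later changes to the data are seen."""
    return set(_available)


LANGUAGES_AT_IMPORT = get_languages()  # snapshot taken once: empty

_available.update({"en", "es"})  # the data becomes available afterwards

print(LANGUAGES_AT_IMPORT)      # set()
print(sorted(get_languages()))  # ['en', 'es']
```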

gmolledaj commented Feb 22, 2025

diff --git a/orangecontrib/text/keywords/__init__.py b/orangecontrib/text/keywords/__init__.py
index 6b72bad..76032e9 100644
--- a/orangecontrib/text/keywords/__init__.py
+++ b/orangecontrib/text/keywords/__init__.py
@@ -30,6 +30,8 @@ YAKE_LANGUAGES = [
 ]
 # fmt: on
 
+def get_rake_languages():
+    return StopwordsFilter.supported_languages()
 
 def tfidf_keywords(
     corpus: Corpus, progress_callback: Callable = None
diff --git a/orangecontrib/text/widgets/owkeywords.py b/orangecontrib/text/widgets/owkeywords.py
index 92f926e..ae5844d 100644
--- a/orangecontrib/text/widgets/owkeywords.py
+++ b/orangecontrib/text/widgets/owkeywords.py
@@ -22,7 +22,7 @@ from Orange.widgets.widget import Input, Output, OWWidget, Msg
 
 from orangecontrib.text import Corpus
 from orangecontrib.text.keywords import ScoringMethods, AggregationMethods, \
-    YAKE_LANGUAGES, RAKE_LANGUAGES
+    YAKE_LANGUAGES, RAKE_LANGUAGES, get_rake_languages
 from orangecontrib.text.language import LanguageModel
 from orangecontrib.text.preprocess import BaseNormalizer
 from orangecontrib.text.widgets.utils.words import create_words_table, \
@@ -260,6 +260,7 @@ class OWKeywords(OWWidget, ConcurrentWidgetMixin):
             model=LanguageModel(include_none=False, languages=YAKE_LANGUAGES),
             callback=self.__on_yake_lang_changed
         )
+        RAKE_LANGUAGES = get_rake_languages()
         rake_cb = gui.comboBox(
             self.controlArea,
             self,

@gmolledaj

I have added changes so that when a new language appears in NLTK, a notice is printed in the terminal, while only the languages present in ISO2LANG are loaded:

diff --git a/orangecontrib/text/keywords/__init__.py b/orangecontrib/text/keywords/__init__.py
index 6b72bad..76032e9 100644
--- a/orangecontrib/text/keywords/__init__.py
+++ b/orangecontrib/text/keywords/__init__.py
@@ -30,6 +30,8 @@ YAKE_LANGUAGES = [
 ]
 # fmt: on
 
+def get_rake_languages():
+    return StopwordsFilter.supported_languages()
 
 def tfidf_keywords(
     corpus: Corpus, progress_callback: Callable = None
diff --git a/orangecontrib/text/preprocess/filter.py b/orangecontrib/text/preprocess/filter.py
index d7854d9..e4ba898 100644
--- a/orangecontrib/text/preprocess/filter.py
+++ b/orangecontrib/text/preprocess/filter.py
@@ -115,7 +115,11 @@ class StopwordsFilter(BaseTokenFilter, FileWordListMixin):
         -------
         ISO language code for input language
         """
-        return LANG2ISO[StopwordsFilter.NLTK2LANG.get(language, language)]
+        try:
+            return LANG2ISO[StopwordsFilter.NLTK2LANG.get(language, language)]
+        except LookupError:
+            print ('Missing language in ISO2LANG: '+language)
+            return ('None')
 
     @classmethod
     @property
@@ -128,14 +132,13 @@ class StopwordsFilter(BaseTokenFilter, FileWordListMixin):
         -------
         Set of all languages supported by NLTK
         """
-        try:
-            return {
-                StopwordsFilter.lang_to_iso(file.title())
-                for file in os.listdir(stopwords._get_root())
-                if file.islower()
-            }
-        except LookupError:  # when no NLTK data is available
-            return set()
+        languages_list = {
+            StopwordsFilter.lang_to_iso(file.title())
+            for file in os.listdir(stopwords._get_root())
+            if file.islower()
+        }
+        languages_list = {element for element in languages_list if "None" not in element}
+        return languages_list
 
     def _check(self, token):
         return token not in self.__stopwords and token not in self._lexicon
diff --git a/orangecontrib/text/widgets/owkeywords.py b/orangecontrib/text/widgets/owkeywords.py
index 92f926e..ae5844d 100644
--- a/orangecontrib/text/widgets/owkeywords.py
+++ b/orangecontrib/text/widgets/owkeywords.py
@@ -22,7 +22,7 @@ from Orange.widgets.widget import Input, Output, OWWidget, Msg
 
 from orangecontrib.text import Corpus
 from orangecontrib.text.keywords import ScoringMethods, AggregationMethods, \
-    YAKE_LANGUAGES, RAKE_LANGUAGES
+    YAKE_LANGUAGES, RAKE_LANGUAGES, get_rake_languages
 from orangecontrib.text.language import LanguageModel
 from orangecontrib.text.preprocess import BaseNormalizer
 from orangecontrib.text.widgets.utils.words import create_words_table, \
@@ -260,6 +260,7 @@ class OWKeywords(OWWidget, ConcurrentWidgetMixin):
             model=LanguageModel(include_none=False, languages=YAKE_LANGUAGES),
             callback=self.__on_yake_lang_changed
         )
+        RAKE_LANGUAGES = get_rake_languages()
         rake_cb = gui.comboBox(
             self.controlArea,
             self,
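
The patch above marks unknown languages with the string 'None' and then filters by substring. A variant of the same idea, sketched here with stand-in data, returns the `None` object from the lookup and filters with an identity check, which avoids accidentally dropping any code that happens to contain those letters. The mapping and file names are illustrative, and the NLTK2LANG translation step from the real method is omitted for brevity.

```python
from typing import Dict, Iterable, Optional, Set

# Stand-in mapping; the add-on's real LANG2ISO is not reproduced here.
LANG2ISO: Dict[str, str] = {"English": "en", "Spanish": "es"}


def lang_to_iso(language: str) -> Optional[str]:
    """Return the ISO code, or None (the object, not a string) if unknown."""
    return LANG2ISO.get(language)


def supported_languages(file_names: Iterable[str]) -> Set[str]:
    """Map stopword file names to ISO codes, skipping unknown languages."""
    codes = (lang_to_iso(name.title()) for name in file_names if name.islower())
    return {code for code in codes if code is not None}


print(sorted(supported_languages(["english", "spanish", "klingon", "README"])))
# ['en', 'es']
```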
