Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't Load Russian Language Model due to Caching Bug #357

Closed
jonwiggins opened this issue Mar 3, 2022 · 4 comments
Closed

Can't Load Russian Language Model due to Caching Bug #357

jonwiggins opened this issue Mar 3, 2022 · 4 comments
Labels

Comments

@jonwiggins
Copy link
Contributor

jonwiggins commented Mar 3, 2022

steps to reproduce

>>> import textacy
>>> nlp = textacy.load_spacy_lang("ru_core_news_lg")

yields

... python3.7/site-packages/textacy/cache.py", line 38, in _get_size
    size += sum((_get_size(i, seen) for i in obj))
TypeError: 'type' object is not iterable

expected vs. actual behavior

The Russian Language model should be loaded

possible solution?

This is because the library which the model uses for Russian and Ukrainian morphology (pymorphy2: https://github.com/kmike/pymorphy2) has a class that exhibits some odd behavior.

(Pdb) obj
<class 'pymorphy2.analyzer.Parse'>
(Pdb) type(obj)
<class 'type'>
(Pdb) hasattr(obj, "__iter__")
True

The caching logic (https://github.com/chartbeat-labs/textacy/blob/main/src/textacy/cache.py)

 elif hasattr(obj, "__iter__") and not isinstance(obj, (str, bytes, bytearray)):
        size += sum((_get_size(i, seen) for i in obj))

Thinks that obj is iterable, and then fails.

A similar bug has been reported with OrderedDict on the original repo for this caching code (bosswissam/pysize#8) but has not been fixed.

I think that there are three possible fixes.

  1. Change the check for this to be ... and not isinstance(obj, (str, bytes, bytearray, type)). This will fix this specific instance of the bug, but will not fix any possible occurrence of a strange object which has __iter__ but cannot be iterated.
  2. Wrap this operation in a try/except, this will probably have some performance hit.
  3. Change the first part of the condition to isinstance(obj, collections.Iterable).

The python documentation (https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) actually mentions that the third option will not work in all cases, and seems to recommend using a try/except to see if the object can be iterated over. Although I feel like that is a less elegant solution.

@bdewilde If you have feelings towards one of these solutions I'm happy to open a PR to make the change.

environment

  • operating system: MacOS
  • python version: 3.7.5
  • spacy version: 3.2.1
  • installed spacy models: Several
  • textacy version: 0.12
@jonwiggins jonwiggins added the bug label Mar 3, 2022
@bdewilde
Copy link
Collaborator

bdewilde commented Mar 6, 2022

hi @jonwiggins ! thank you tons for finding and digging into this bug. it looks like pymorphy2.analyzer.Parse inherits from collections.namedtuple, and that has the unexpected __iter__/Iterable behavior. i checked, and i think your option 3 stays closest to the original code and should do the trick in this case:

>>> import collections
>>> import pymorphy2
>>> isinstance(pymorphy2.analyzer.Parse, collections.abc.Iterable)
False
>>> hasattr(pymorphy2.analyzer.Parse, "__iter__")
True

i would be delighted to accept your PR with the change. i don't have any immediate plans to cut a new release, but if this is a blocker on cb work, i could probably do a small bugfix release... let me know 👍

@jonwiggins
Copy link
Contributor Author

Sounds good. FWIW, I merged option 2 into the pysize repo: bosswissam/pysize#18

Upon review option 3 isn't as great because it is possible for iterable objects to not have collections.abc.Iterable as their parent class. https://stackoverflow.com/questions/32799980/what-exactly-does-iterable-mean-in-python-why-isnt-my-object-which-implement

It's not a huge blocker but it is something I'd like to get fixed.

@bdewilde
Copy link
Collaborator

bdewilde commented Mar 6, 2022

@jonwiggins ah, nice! let's follow the source reference's (your) example and wrap the call in a try/except, with one minor difference: set the logging message's level to WARNING, since the implications for textacy's usage aren't too serious. thanks in advance!

@bdewilde
Copy link
Collaborator

bdewilde commented Mar 6, 2022

fixed in main, closing!

@bdewilde bdewilde closed this as completed Mar 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants