
OverflowError: value too large to convert to int32_t #1225

Closed
rulai-huajunzeng opened this issue Jul 26, 2017 · 3 comments
Labels
bug Bugs and behaviour differing from documentation

Comments

@rulai-huajunzeng

Quite similar to issue #589, but I have to open a new one since the old one was closed. Steps to reproduce below:

~/my_dir $ pip show spacy
Name: spacy
Version: 1.8.2
Summary: Industrial-strength Natural Language Processing (NLP) with Python and Cython
Home-page: https://spacy.io
Author: Matthew Honnibal
Author-email: matt@explosion.ai
License: MIT
Location: /usr/lib/python2.7/site-packages
Requires: numpy, murmurhash, cymem, preshed, thinc, plac, six, pathlib, ujson, dill, requests, regex, ftfy
~/my_dir $ python
Python 2.7.13 (default, Dec 22 2016, 09:22:15) 
[GCC 6.2.1 20160822] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacy
>>> nlp = spacy.en.English()
>>> nlp.vocab.strings.set_frozen(True)
>>> nlp(u'Whataasdfsdaf')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/spacy/language.py", line 320, in __call__
    doc = self.make_doc(text)
  File "/usr/lib/python2.7/site-packages/spacy/language.py", line 293, in <lambda>
    self.make_doc = lambda text: self.tokenizer(text)
  File "spacy/tokenizer.pyx", line 165, in spacy.tokenizer.Tokenizer.__call__ (spacy/tokenizer.cpp:5486)
  File "spacy/tokenizer.pyx", line 205, in spacy.tokenizer.Tokenizer._tokenize (spacy/tokenizer.cpp:6060)
  File "spacy/tokenizer.pyx", line 279, in spacy.tokenizer.Tokenizer._attach_tokens (spacy/tokenizer.cpp:7129)
  File "spacy/vocab.pyx", line 246, in spacy.vocab.Vocab.get (spacy/vocab.cpp:6986)
  File "spacy/vocab.pyx", line 269, in spacy.vocab.Vocab._new_lexeme (spacy/vocab.cpp:7249)
OverflowError: value too large to convert to int32_t

@honnibal honnibal added the bug Bugs and behaviour differing from documentation label Jul 26, 2017
@honnibal
Member

Thanks for the report! The set_frozen mechanism has been a stop-gap, and I'm not immediately sure what's changed here that's broken it. I'll likely fix the underlying problem for spaCy 2, rather than repairing this. The situation around the streaming data memory growth is much better in spaCy 2, because the integer IDs are now hash values, rather than strings.

@honnibal
Member

Please see #1424

In short: the streaming data memory growth is finally fixed properly in spaCy v2 🎉 . This means the flaky set_frozen functionality could be deleted from the StringStore, resolving this issue.
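The idea behind the v2 fix described above can be sketched as follows. This is a hypothetical illustration of hash-based string IDs, not spaCy's actual StringStore (which uses murmurhash): because each ID is computed from the string itself rather than assigned from a growing counter, unseen tokens never allocate new table entries, so no ID can overflow a fixed-width integer under streaming input.

```python
import hashlib

def string_to_id(s):
    """Map a string to a stable 64-bit integer ID.

    Illustrative sketch only (spaCy v2's real StringStore uses
    murmurhash): deriving the ID from the string's own hash means
    no per-string counter is needed, so novel tokens cannot grow
    an ID table or exceed the ID space the way a sequential int32
    counter could.
    """
    digest = hashlib.blake2b(s.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")

# The same string always maps to the same ID, even across processes.
print(string_to_id("Whataasdfsdaf") == string_to_id("Whataasdfsdaf"))
```

Collisions are possible in principle with any fixed-width hash, but at 64 bits they are vanishingly rare for vocabulary-sized inputs, which is the trade-off this design accepts in exchange for bounded memory.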

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018
2 participants