KeyError when serializing a doc object after adding a new entity label #514

Closed
emsrc opened this issue Oct 3, 2016 · 3 comments
Labels
bug Bugs and behaviour differing from documentation

Comments


emsrc commented Oct 3, 2016

I'm trying to add new entity labels and add new entity spans accordingly. However, this results in a KeyError when using doc.to_bytes(). Minimal code example below:

# python3 + spacy 0.101.0

import spacy

nlp = spacy.load('en')

doc = nlp('This is a sentence about pasta.')

label = 'Food'
nlp.entity.add_label(label)
label_id = nlp.vocab.strings[label]

print(label_id)

doc.ents = [(label_id, 5,6)]

print(doc.ents)

byte_string = doc.to_bytes()

Output:

6832
(pasta,)
Traceback (most recent call last):
  File "/Users/work/Projects/ScienceIE/scienceie17/exps/crf0/minimal.py", line 18, in <module>
    byte_string = doc.to_bytes()
  File "spacy/tokens/doc.pyx", line 418, in spacy.tokens.doc.Doc.to_bytes (spacy/tokens/doc.cpp:10687)
  File "spacy/serialize/packer.pyx", line 110, in spacy.serialize.packer.Packer.pack (spacy/serialize/packer.cpp:5687)
  File "spacy/serialize/huffman.pyx", line 61, in spacy.serialize.huffman.HuffmanCodec.encode (spacy/serialize/huffman.cpp:2535)
KeyError: 6832
honnibal added the bug label Oct 21, 2016
honnibal added a commit that referenced this issue Oct 23, 2016
…d. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness.
@honnibal
Member

Added a fix for this, but the situation's pretty messy. The serializer expects a list of attribute frequencies, so that it can build a Huffman tree. So it wants to know what entity labels are available, and how common they are. Once the Huffman trees are built, they can't be modified without changing the encoding.

The result is that if you serialize some documents, add an entity label, and then serialize some more, the two sets of documents won't be consistently encoded. So uh...don't do that :p.
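
To make the constraint concrete, here is a toy sketch in pure Python. It is not spaCy's actual HuffmanCodec; the label names, frequencies, and helper functions (`build_huffman_codes`, `encode`) are hypothetical and chosen only for illustration. It shows two things: a code table built from a fixed frequency list has no entry for a label added later, and rebuilding the table with the new label changes the codes of the existing labels.

```python
import heapq
from itertools import count


def build_huffman_codes(freqs):
    """Return {symbol: bit string} for a frequency table (toy implementation)."""
    tiebreak = count()  # unique counter so the heap never has to compare dicts
    heap = [(weight, next(tiebreak), {symbol: ""}) for symbol, weight in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single symbol still needs a non-empty code.
        return {symbol: "0" for symbol in heap[0][2]}
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


def encode(codes, symbols):
    # Plain dict lookup: a symbol missing from the table raises KeyError.
    return "".join(codes[s] for s in symbols)


# Hypothetical label frequencies, fixed at the time the codec is built.
codes_before = build_huffman_codes({"PERSON": 50, "ORG": 30, "GPE": 20})

# A label added after the table was built has no code.
try:
    encode(codes_before, ["PERSON", "Food"])
    unknown_label_failed = False
except KeyError:
    unknown_label_failed = True

# Rebuilding with the new label gives the old labels different codes,
# so data encoded with the old table no longer matches the new one.
codes_after = build_huffman_codes({"PERSON": 50, "ORG": 30, "GPE": 20, "Food": 40})
codes_changed = any(codes_before[s] != codes_after[s] for s in codes_before)
```

In this toy, anything encoded with `codes_before` cannot be decoded reliably with `codes_after`, which is the inconsistency described above.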

I suggest trying to add your custom entity labels as soon as possible after loading the pipeline. That's probably the best way to work around the brittleness here, until the underlying design improves. The serializer is probably rather over-engineered.
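
A minimal sketch of that ordering, using only the calls that appear in the original report (`nlp.entity.add_label`, `nlp.vocab.strings`, `doc.ents`, `doc.to_bytes`). The helper name `serialize_with_custom_label` is hypothetical, and whether this ordering alone avoids the KeyError on a given spaCy version (for example 0.101.0 without the fix) is an assumption, not something confirmed in this thread.

```python
def serialize_with_custom_label(nlp, text, label, start, end):
    """Register `label` before any text is processed, then serialize the doc.

    `nlp` is assumed to be a loaded spaCy pipeline exposing the same
    attributes used in the original report.
    """
    # Register the custom label first, before creating or serializing any doc.
    nlp.entity.add_label(label)
    label_id = nlp.vocab.strings[label]

    # Only now process the text and attach the custom entity span.
    doc = nlp(text)
    doc.ents = [(label_id, start, end)]
    return doc.to_bytes()


# Usage, assuming spaCy 0.101.x with the 'en' model installed:
#   import spacy
#   nlp = spacy.load('en')
#   data = serialize_with_custom_label(
#       nlp, 'This is a sentence about pasta.', 'Food', 5, 6)
```

The only difference from the failing example is that the label is added before the first call to `nlp(...)` rather than after it.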

Author

emsrc commented Oct 24, 2016

Got it. Thanks!


lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 9, 2018