KeyError when serializing a doc object after adding a new entity label #514

emsrc · 2016-10-03T13:11:10Z

I'm trying to add new entity labels and add new entity spans accordingly. However, this results in a KeyError when using doc.to_bytes(). Minimal code example below:

# python3 + spacy 0.101.0

import spacy

nlp = spacy.load('en')

doc = nlp('This is a sentence about pasta.')

label = 'Food'
nlp.entity.add_label(label)
label_id = nlp.vocab.strings[label]

print(label_id)

doc.ents = [(label_id, 5, 6)]

print(doc.ents)

byte_string = doc.to_bytes()

Output:

6832
(pasta,)
Traceback (most recent call last):
  File "/Users/work/Projects/ScienceIE/scienceie17/exps/crf0/minimal.py", line 18, in <module>
    byte_string = doc.to_bytes()
  File "spacy/tokens/doc.pyx", line 418, in spacy.tokens.doc.Doc.to_bytes (spacy/tokens/doc.cpp:10687)
  File "spacy/serialize/packer.pyx", line 110, in spacy.serialize.packer.Packer.pack (spacy/serialize/packer.cpp:5687)
  File "spacy/serialize/huffman.pyx", line 61, in spacy.serialize.huffman.HuffmanCodec.encode (spacy/serialize/huffman.cpp:2535)
KeyError: 6832



honnibal · 2016-10-23T15:49:53Z

Added a fix for this, but the situation's pretty messy. The serializer expects a list of attribute frequencies, so that it can build a Huffman tree. So it wants to know what entity labels are available, and how common they are. Once the Huffman trees are built, they can't be modified without changing the encoding.

The result is that if you serialize some documents, add an entity label, and then serialize some more, the two sets of documents won't be consistently encoded. So uh...don't do that :p.

I suggest trying to add your custom entity labels as soon as possible after loading the pipeline. That's probably the best way to work around the brittleness here, until the underlying design improves. The serializer is probably rather over-engineered.
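The Huffman-tree behaviour described above can be illustrated with a small standalone sketch. This is not spaCy's serializer: `build_huffman_codes`, `encode`, and the label names and frequencies are all made up for illustration. It only shows the general effect, assuming a codec built once from a fixed frequency table: a symbol missing from the table raises a `KeyError`, and rebuilding the table with one extra label can change the codes of labels that were already there.

```python
import heapq
from itertools import count


def build_huffman_codes(freqs):
    """Return {symbol: bit string} for a {symbol: frequency} table (toy version)."""
    tiebreak = count()  # unique counter so heap entries never compare the dicts
    heap = [(freq, next(tiebreak), {sym: ""}) for sym, freq in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A lone symbol still needs a one-bit code.
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


def encode(symbols, codes):
    """Concatenate the code of each symbol; unknown symbols raise KeyError."""
    return "".join(codes[sym] for sym in symbols)


# Hypothetical label frequencies, fixed when the codec is first built.
label_freqs = {"PERSON": 50, "ORG": 30, "GPE": 20}
codes_before = build_huffman_codes(label_freqs)

# Encoding a label that was not in the table fails, much like the traceback.
try:
    encode(["GPE", "Food"], codes_before)
    unknown_label_failed = False
except KeyError:
    unknown_label_failed = True

# Rebuilding the table with the new label gives a different tree.
codes_after = build_huffman_codes(dict(label_freqs, Food=1))

# The same sequence of old labels no longer encodes to the same bits.
doc_labels = ["PERSON", "GPE", "ORG"]
bits_before = encode(doc_labels, codes_before)
bits_after = encode(doc_labels, codes_after)

print("unknown label raised KeyError:", unknown_label_failed)
print("same encoding before and after:", bits_before == bits_after)
```

In this toy setup, anything encoded with `codes_before` would have to be decoded with `codes_before` as well, which is why registering every label before the first serialization keeps all documents on a single table.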

emsrc · 2016-10-24T06:33:16Z

Got it. Thanks!

lock · 2018-05-09T07:38:49Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

honnibal added the bug label (Bugs and behaviour differing from documentation) Oct 21, 2016

honnibal added a commit that referenced this issue Oct 23, 2016

Test Issue #514: Serialization fails after adding a new entity label.

4de30a8

honnibal added a commit that referenced this issue Oct 23, 2016

Test Issue #514: Serializer fails when new entity type has been added.

79aa03f

honnibal added a commit that referenced this issue Oct 23, 2016

Fix issue #514 -- serializer fails when new entity type has been added. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness.

3e688e6

honnibal closed this as completed Oct 23, 2016

honnibal mentioned this issue May 7, 2017

💫 Improve annotation serialisation #1045

Closed

lock bot locked as resolved and limited conversation to collaborators May 9, 2018
