Serialization: store_user_data is ignored in DocBin constructor #9190

Closed
nrodnova opened this issue Sep 12, 2021 · 3 comments
Labels
bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects

Comments

@nrodnova
Contributor

How to reproduce the behaviour

When I try to create a DocBin from documents that have Spans in custom attributes, I get an exception, even though I set store_user_data = False. The exception is raised in the DocBin.add() function in _serialize.py (line 111 in the traceback below):

self.user_data.append(srsly.msgpack_dumps(doc.user_data))

The exception is raised because srsly.msgpack_dumps(doc.user_data) cannot serialize Spans. The call is made by mistake even when self.store_user_data is set to False.

The following code reproduces the behavior:

import spacy
from spacy.tokens import Doc
from spacy.tokens import DocBin
nlp = spacy.load('en_core_web_sm')
Doc.set_extension(name = 'test', default = None, force = True)

text = "Hello, world!"
doc = nlp(text)
doc._.test = doc[0].sent # Assign a span to the doc's custom attribute
docs = [doc]

## Create DocBin
attributes = ("ORTH", "TAG", "HEAD", "DEP", "ENT_IOB", "ENT_TYPE", "ENT_ID", "ENT_KB_ID", "LEMMA", "MORPH", "POS", "LOWER", "SENT_START", "SENT_END", "IDX", 'IS_PUNCT', 'LIKE_NUM', 'IS_BRACKET', "IS_LEFT_PUNCT", "IS_RIGHT_PUNCT")
doc_bin = DocBin(attrs = attributes, store_user_data = False, docs = docs)

The error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-5796ac30343d> in <module>
      1 ## Store
      2 attributes = ("ORTH", "TAG", "HEAD", "DEP", "ENT_IOB", "ENT_TYPE", "ENT_ID", "ENT_KB_ID", "LEMMA", "MORPH", "POS", "LOWER", "SENT_START", "SENT_END", "IDX", 'IS_PUNCT', 'LIKE_NUM', 'IS_BRACKET', "IS_LEFT_PUNCT", "IS_RIGHT_PUNCT")
----> 3 doc_bin = DocBin(attrs = attributes, store_user_data = False, docs = docs)
      4 #serialized_docs = doc_bin.to_bytes()

~/python-virtualenv/eon-spacy-3/lib/python3.7/site-packages/spacy/tokens/_serialize.py in __init__(self, attrs, store_user_data, docs)
     78         self.store_user_data = store_user_data
     79         for doc in docs:
---> 80             self.add(doc)
     81 
     82     def __len__(self) -> int:

~/python-virtualenv/eon-spacy-3/lib/python3.7/site-packages/spacy/tokens/_serialize.py in add(self, doc)
    109             self.strings.add(token.ent_kb_id_)
    110         self.cats.append(doc.cats)
--> 111         self.user_data.append(srsly.msgpack_dumps(doc.user_data))
    112         self.span_groups.append(doc.spans.to_bytes())
    113         for key, group in doc.spans.items():

~/python-virtualenv/eon-spacy-3/lib/python3.7/site-packages/srsly/_msgpack_api.py in msgpack_dumps(data)
     12     RETURNS (bytes): The serialized bytes.
     13     """
---> 14     return msgpack.dumps(data, use_bin_type=True)
     15 
     16 

~/python-virtualenv/eon-spacy-3/lib/python3.7/site-packages/srsly/msgpack/__init__.py in packb(o, **kwargs)
     53     Pack an object and return the packed bytes.
     54     """
---> 55     return Packer(**kwargs).pack(o)
     56 
     57 

~/python-virtualenv/eon-spacy-3/lib/python3.7/site-packages/srsly/msgpack/_packer.pyx in srsly.msgpack._packer.Packer.pack()

~/python-virtualenv/eon-spacy-3/lib/python3.7/site-packages/srsly/msgpack/_packer.pyx in srsly.msgpack._packer.Packer.pack()

~/python-virtualenv/eon-spacy-3/lib/python3.7/site-packages/srsly/msgpack/_packer.pyx in srsly.msgpack._packer.Packer.pack()

~/python-virtualenv/eon-spacy-3/lib/python3.7/site-packages/srsly/msgpack/_packer.pyx in srsly.msgpack._packer.Packer._pack()

~/python-virtualenv/eon-spacy-3/lib/python3.7/site-packages/srsly/msgpack/_packer.pyx in srsly.msgpack._packer.Packer._pack()

TypeError: can not serialize 'spacy.tokens.span.Span' object

Your Environment

  • spaCy version: 3.0.6
  • Platform: Darwin-20.3.0-x86_64-i386-64bit
  • Python version: 3.7.9
  • Pipelines: en_core_web_lg (3.0.0), en_core_web_md (3.0.0), en_core_web_sm (3.0.0)
@polm polm added feat / doc Feature: Doc, Span and Token objects bug Bugs and behaviour differing from documentation labels Sep 13, 2021
@polm
Contributor

polm commented Sep 13, 2021

Thanks for the report. I confirmed this is still an issue with 3.1.2. (I also got some separate errors related to the attributes, though I need to look at those more.)

We'll take a look at fixing this.

polm added a commit to polm/spaCy that referenced this issue Sep 16, 2021
svlandeg pushed a commit that referenced this issue Oct 1, 2021
* Don't store user data if told not to (fix #9190)

* Add unit tests for the store_user_data setting
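
The two commit notes above (skip user data when told not to, and test the setting) can be illustrated with a small standalone sketch. This is not spaCy's code: MiniDocBin and FakeSpan are hypothetical stand-ins, docs are plain dicts, and json.dumps stands in for srsly.msgpack_dumps only because it also raises TypeError on objects it cannot serialize. The actual fix is in the referenced commit.

```python
import json
from typing import Any, Dict, Iterable, List


class MiniDocBin:
    """Hypothetical stand-in for a DocBin-like container (not spaCy's code)."""

    def __init__(
        self,
        store_user_data: bool = False,
        docs: Iterable[Dict[str, Any]] = (),
    ) -> None:
        self.store_user_data = store_user_data
        self.texts: List[str] = []
        self.user_data: List[bytes] = []
        for doc in docs:
            self.add(doc)

    def add(self, doc: Dict[str, Any]) -> None:
        self.texts.append(doc["text"])
        # The guard: only serialize user data when the caller asked for it.
        if self.store_user_data:
            self.user_data.append(json.dumps(doc["user_data"]).encode("utf8"))


class FakeSpan:
    """Placeholder for a value the serializer cannot handle."""


doc = {"text": "Hello, world!", "user_data": {"test": FakeSpan()}}

# With the flag off, the unserializable value is never touched.
safe_bin = MiniDocBin(store_user_data=False, docs=[doc])

# With the flag on, serialization is attempted and fails with TypeError.
try:
    MiniDocBin(store_user_data=True, docs=[doc])
    failed = False
except TypeError:
    failed = True
```

A unit test for the setting would then check both branches: the container builds cleanly with the flag off, and the serializer error still surfaces with the flag on.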
@polm
Contributor

polm commented Oct 9, 2021

This should be fixed by #9226. Thanks again for the report!

@polm polm closed this as completed Oct 9, 2021
@github-actions
Contributor

github-actions bot commented Nov 9, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 9, 2021