Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use metaclass to subclass errors to allow better pickling #9593

Merged
merged 1 commit into from
Nov 3, 2021
Merged

Use metaclass to subclass errors to allow better pickling #9593

merged 1 commit into from
Nov 3, 2021

Conversation

BramVanroy
Copy link
Contributor

@BramVanroy BramVanroy commented Nov 2, 2021

When trying to pickle with dill, as happens under the hood when processing a HuggingFace dataset, I found that spaCy couldn't be pickled (related issue: huggingface/datasets#3178). I found that the cause was dill.Pickler that recursively pickled all objects (Pickler(file, recurse=True).dump(obj)). This is preferred in datasets: such iterative pickling allows for fingerprinting, i.e. keeping track of all the processing done to a dataset, deterministically. If the fingerprint is the same as an earlier processing attempt, then a cache can be used and the processing (e.g. tokenizing with spaCy) does not have to be done again. Very useful and great to save time and processing.

The culprit in the spaCy codebase that could not be pickled in this way, was this part

spaCy/spacy/errors.py

Lines 4 to 15 in f1bc655

def add_codes(err_cls):
"""Add error codes to string messages via class attribute names."""
class ErrorsWithCodes(err_cls):
def __getattribute__(self, code):
msg = super(ErrorsWithCodes, self).__getattribute__(code)
if code.startswith("__"): # python system attributes like __class__
return msg
else:
return "[{code}] {msg}".format(code=code, msg=msg)
return ErrorsWithCodes()

The class ErrorsWithCodes cannot be pickled here because it does not have a fixed signature in the global scope (its superclass is not known beforehand). After digging through the Internet (thank you Stack Overflow!) I found that a metaclass is the way to go, as implemented in this PR. In my opinion, this is also more structurally clear than before. Regular subclassing a metaclass instead of dynamically created a class based on a subclass with a decorator. The metaclass defines the magic methods that are required to access the class' attributes, rather than accessing instance attributes.

I modified the error test accordingly, which completes successfully.

After this modification, recursive dill pickling works flawlessly as well.

closes #9584

Types of change

Enhancement

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@BramVanroy BramVanroy changed the title Use metaclass to decorate errors to allow better pickling Use metaclass to subclass errors to allow better pickling Nov 2, 2021
@adrianeboyd adrianeboyd requested a review from ines November 2, 2021 16:02
@svlandeg svlandeg added enhancement Feature requests and improvements feat / serialize Feature: Serialization, saving and loading labels Nov 2, 2021
@honnibal
Copy link
Member

honnibal commented Nov 3, 2021

Thanks, the change makes sense to me, and it's nice to support pickling with dill.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests and improvements feat / serialize Feature: Serialization, saving and loading
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pickling with dill and recurse=True fails
4 participants