
Load exceptions last in Tokenizer.from_bytes #12553

Merged

Conversation

adrianeboyd
Contributor

Description

In `Tokenizer.from_bytes`, the exceptions should be loaded last so that they are only processed once as part of loading the model.

The exceptions are tokenized as phrase-matcher patterns in the background, and this internal tokenization needs to be synced with all the remaining tokenizer settings. If the exceptions are not loaded last, there are speed regressions for `Tokenizer.from_bytes`/`from_disk` vs. `Tokenizer.add_special_case`, because the caches are reloaded more often than necessary during deserialization.
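The effect described above can be sketched with a toy model (this is not spaCy's actual implementation; the class, method names, and counts are illustrative assumptions): if every change to the special-case rules forces a cache rebuild, then rules loaded before the other settings get reprocessed on each subsequent settings change, while rules loaded last are processed only once.

```python
# Hypothetical sketch, NOT spaCy's real Tokenizer: it only models the
# idea that changing special-case rules or settings invalidates an
# internal cache that must then be rebuilt (expensive with many rules).
class ToyTokenizer:
    def __init__(self):
        self.settings = {}
        self.rules = {}
        self.cache_rebuilds = 0  # counts how often the cache is rebuilt

    def _rebuild_cache(self):
        # Stands in for re-tokenizing all special cases against the
        # current settings.
        self.cache_rebuilds += 1

    def set_setting(self, key, value):
        self.settings[key] = value
        if self.rules:  # existing rules must be re-synced
            self._rebuild_cache()

    def add_special_case(self, text, tokens):
        self.rules[text] = tokens
        self._rebuild_cache()


def from_bytes(order):
    """Apply deserialization steps in the given order; return rebuild count."""
    tok = ToyTokenizer()
    for step in order:
        if step == "rules":
            for i in range(3):  # three exceptions, for illustration
                tok.add_special_case(f"ex{i}", [f"ex{i}"])
        else:
            tok.set_setting(step, "...")
    return tok.cache_rebuilds


# Rules loaded first: every later setting triggers another rebuild.
print(from_bytes(["rules", "prefix", "suffix", "infix"]))  # → 6
# Rules loaded last: settings apply to an empty rule set, one pass of rebuilds.
print(from_bytes(["prefix", "suffix", "infix", "rules"]))  # → 3
```

With many exceptions, the gap between the two orderings grows with the number of other settings applied after the rules, which is why the fix simply moves the exceptions to the end of the deserialization sequence.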

Types of change

Bug fix.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

adrianeboyd added the bug (Bugs and behaviour differing from documentation), feat / tokenizer (Feature: Tokenizer), and perf / speed (Performance: speed) labels on Apr 20, 2023
svlandeg (Member) left a comment:

Nice catch!

@svlandeg svlandeg merged commit dc0a1a9 into explosion:master Apr 20, 2023
adrianeboyd added a commit to adrianeboyd/spaCy that referenced this pull request May 12, 2023
Labels

bug (Bugs and behaviour differing from documentation), feat / tokenizer (Feature: Tokenizer), perf / speed (Performance: speed)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding many special cases to Tokenizer greatly degrades startup performance
2 participants