
Load exceptions last in Tokenizer.from_bytes #12553

Merged

Conversation

adrianeboyd
Contributor

Description

In `Tokenizer.from_bytes`, the exceptions should be loaded last so that they are only processed once as part of loading the model.

The exceptions are tokenized as phrase-matcher patterns in the background, and this internal tokenization needs to be synced with all the remaining tokenizer settings. If the exceptions are not loaded last, there are speed regressions for `Tokenizer.from_bytes`/`from_disk` vs. `Tokenizer.add_special_case`, because the caches are reloaded more often than necessary during deserialization.
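The effect described above can be sketched with a toy model (this is not spaCy's actual implementation; the class, method names, and counts are illustrative assumptions): if every change to the special-case rules forces a cache rebuild, then rules loaded before the other settings get reprocessed on each subsequent settings change, while rules loaded last are processed only once.

```python
# Hypothetical sketch, NOT spaCy's real Tokenizer: it only models the
# idea that changing special-case rules or settings invalidates an
# internal cache that must then be rebuilt (expensive with many rules).
class ToyTokenizer:
    def __init__(self):
        self.settings = {}
        self.rules = {}
        self.cache_rebuilds = 0  # counts how often the cache is rebuilt

    def _rebuild_cache(self):
        # Stands in for re-tokenizing all special cases against the
        # current settings.
        self.cache_rebuilds += 1

    def set_setting(self, key, value):
        self.settings[key] = value
        if self.rules:  # existing rules must be re-synced
            self._rebuild_cache()

    def add_special_case(self, text, tokens):
        self.rules[text] = tokens
        self._rebuild_cache()


def from_bytes(order):
    """Apply deserialization steps in the given order; return rebuild count."""
    tok = ToyTokenizer()
    for step in order:
        if step == "rules":
            for i in range(3):  # three exceptions, for illustration
                tok.add_special_case(f"ex{i}", [f"ex{i}"])
        else:
            tok.set_setting(step, "...")
    return tok.cache_rebuilds


# Rules loaded first: every later setting triggers another rebuild.
print(from_bytes(["rules", "prefix", "suffix", "infix"]))  # → 6
# Rules loaded last: settings apply to an empty rule set, one pass of rebuilds.
print(from_bytes(["prefix", "suffix", "infix", "rules"]))  # → 3
```

With many exceptions, the gap between the two orderings grows with the number of other settings applied after the rules, which is why the fix simply moves the exceptions to the end of the deserialization sequence.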

Types of change

Bug fix.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

adrianeboyd added the bug (Bugs and behaviour differing from documentation), feat / tokenizer (Feature: Tokenizer), and perf / speed (Performance: speed) labels on Apr 20, 2023
svlandeg (Member) left a comment:

Nice catch!

@svlandeg svlandeg merged commit dc0a1a9 into explosion:master Apr 20, 2023
adrianeboyd added a commit to adrianeboyd/spaCy that referenced this pull request May 12, 2023
Labels

bug (Bugs and behaviour differing from documentation), feat / tokenizer (Feature: Tokenizer), perf / speed (Performance: speed)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding many special cases to Tokenizer greatly degrades startup performance
2 participants