Load exceptions last in Tokenizer.from_bytes (explosion#12553)
In `Tokenizer.from_bytes`, the exceptions should be loaded last so that
they are only processed once as part of loading the model.

The exceptions are tokenized as phrase matcher patterns in the
background and the internal tokenization needs to be synced with all the
remaining tokenizer settings. If the exceptions are not loaded last,
there are speed regressions for `Tokenizer.from_bytes/disk` vs.
`Tokenizer.add_special_case` as the caches are reloaded more than
necessary during deserialization.
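
The ordering effect described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not spaCy's implementation: `ToyTokenizer`, `set_setting`, `set_rules`, `rule_passes` and `load` are hypothetical names, and the model simply assumes that changing any setting re-processes all exceptions that are already loaded.

```python
class ToyTokenizer:
    """Hypothetical stand-in: setters keep the cached exceptions in sync."""

    def __init__(self):
        self._settings = {}
        self._rules = {}
        self.rule_passes = 0  # number of full passes over the exceptions

    def _process_rules(self):
        # Stand-in for the expensive step: re-tokenizing every exception
        # with the current settings. Nothing to do while no rules are loaded.
        if self._rules:
            self.rule_passes += 1

    def set_setting(self, name, value):
        self._settings[name] = value
        # Assumption: any setting change invalidates the cached exceptions.
        self._process_rules()

    def set_rules(self, rules):
        self._rules = dict(rules)
        self._process_rules()


def load(data, rules_last):
    """Apply the serialized settings, with "rules" either first or last."""
    tok = ToyTokenizer()
    keys = [k for k in data if k != "rules"]
    order = keys + ["rules"] if rules_last else ["rules"] + keys
    for key in order:
        if key == "rules":
            tok.set_rules(data[key])
        else:
            tok.set_setting(key, data[key])
    return tok.rule_passes


data = {
    "rules": {"don't": ["do", "n't"]},
    "token_match": r"\d+",
    "url_match": r"https?://\S+",
    "faster_heuristics": True,
}
print(load(data, rules_last=False))  # rules first: 4 passes
print(load(data, rules_last=True))   # rules last: 1 pass
```

In this toy model, loading the rules first costs one pass per later setting, while loading them last costs a single pass regardless of how many other settings are restored.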
adrianeboyd committed May 12, 2023
1 parent 7bf1db8 commit 357fdd4
Showing 1 changed file with 4 additions and 2 deletions.
6 changes: 4 additions & 2 deletions spacy/tokenizer.pyx
@@ -834,10 +834,12 @@ cdef class Tokenizer:
             self.token_match = re.compile(data["token_match"]).match
         if "url_match" in data and isinstance(data["url_match"], str):
             self.url_match = re.compile(data["url_match"]).match
-        if "rules" in data and isinstance(data["rules"], dict):
-            self.rules = data["rules"]
         if "faster_heuristics" in data:
             self.faster_heuristics = data["faster_heuristics"]
+        # always load rules last so that all other settings are set before the
+        # internal tokenization for the phrase matcher
+        if "rules" in data and isinstance(data["rules"], dict):
+            self.rules = data["rules"]
         return self

