I got the code from here, and when I tried to save the trained tokenizer, it said:
Exception: Custom Normalizer cannot be serialized
How can I resolve this exception?
The custom normalizer is as follows:
```python
from tokenizers import NormalizedString, Regex

class CustomNormalizer:
    def normalize(self, normalized: NormalizedString):
        # Most of these steps can be replaced by a `Sequence` combining built-in
        # normalizers, i.e. Sequence([NFKC(), Replace(Regex(r"\s+"), " "), Lowercase()]),
        # and that should be the preferred way. That being said, here is an example
        # of the kind of thing that can be done here:
        try:
            if normalized is None:
                normalized = NormalizedString("")
            else:
                normalized.nfkc()
                normalized.filter(lambda char: not char.isnumeric())
                normalized.replace(Regex(r"\s+"), " ")
                normalized.lowercase()
        except TypeError as te:
            print("CustomNormalizer TypeError:", te)
            print(normalized)
```
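As the comment inside the class already suggests, these steps can be expressed with built-in normalizers combined in a `Sequence`, and built-in normalizers serialize cleanly, so `tokenizer.save()` works. A sketch of an equivalent (note: `Replace(Regex(r"\d"), "")` only approximates `filter(lambda char: not char.isnumeric())`, since `isnumeric()` also matches characters like fractions):

```python
from tokenizers import Regex, normalizers

# Built-in equivalent of CustomNormalizer: NFKC, drop digits,
# collapse whitespace, lowercase. Every step here is serializable.
serializable_normalizer = normalizers.Sequence([
    normalizers.NFKC(),
    normalizers.Replace(Regex(r"\d"), ""),   # approximates the isnumeric() filter
    normalizers.Replace(Regex(r"\s+"), " "),
    normalizers.Lowercase(),
])

print(serializable_normalizer.normalize_str("Hello  World 42"))
```

Assigning this to `tokenizer.normalizer` instead of `Normalizer.custom(...)` avoids the serialization exception entirely.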
And the custom Tokenizer is as follows:
```python
from tokenizers import Tokenizer, models, trainers
from tokenizers.normalizers import Normalizer

model = models.WordPiece(unk_token="[UNK]")
tokenizer = Tokenizer(model)
tokenizer.normalizer = Normalizer.custom(CustomNormalizer())

# special_tokens, get_training_corpus, and dataset are defined earlier in the script
trainer = trainers.WordPieceTrainer(
    vocab_size=2500,
    special_tokens=special_tokens,
    show_progress=True,
)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer, length=len(dataset))

# Save the Tokenizer result
tokenizer.save('saved.json')  # this line raises the Exception
```
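If the Python-level logic really must stay custom, a common workaround (a sketch, not verified against your exact `tokenizers` version) is to swap in a serializable placeholder normalizer just before saving, then reattach the custom one after loading. The custom logic is never stored in the JSON, so it has to be reattached by hand every time the tokenizer is loaded:

```python
from tokenizers import NormalizedString, Tokenizer, models
from tokenizers import normalizers
from tokenizers.normalizers import Normalizer

# Minimal stand-in for the custom normalizer from the question
class CustomNormalizer:
    def normalize(self, normalized: NormalizedString):
        normalized.lowercase()

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = Normalizer.custom(CustomNormalizer())
# ... train as usual ...

# Swap in any serializable normalizer just for the save
tokenizer.normalizer = normalizers.Lowercase()
tokenizer.save("saved.json")

# After loading, reattach the custom normalizer manually
loaded = Tokenizer.from_file("saved.json")
loaded.normalizer = Normalizer.custom(CustomNormalizer())
```

The trade-off is that anyone consuming `saved.json` without your Python class gets the placeholder behavior, which is why the built-in `Sequence` approach is generally preferable.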