Exception: Custom Normalizer cannot be serialized #1361

Closed
shivanraptor opened this issue Oct 9, 2023 · 1 comment

@shivanraptor

I took the code from here, and when I tried to save the trained tokenizer, it raised:

Exception: Custom Normalizer cannot be serialized

How can I resolve this exception?

The custom normalizer is as follows:

from tokenizers import NormalizedString, Regex


class CustomNormalizer:
    def normalize(self, normalized: NormalizedString):
        # Most of this can be replaced by a `Sequence` combining some provided Normalizers,
        # (i.e. Sequence([NFKC(), Replace(Regex(r"\s+"), " "), Lowercase()]))
        # and that should be the preferred way. That being said, here is an example of the
        # kind of thing that can be done here:
        try:
            if normalized is None:
                normalized = NormalizedString("")
            else:
                normalized.nfkc()
                normalized.filter(lambda char: not char.isnumeric())
                normalized.replace(Regex(r"\s+"), " ")
                normalized.lowercase()
        except TypeError as te:
            print("CustomNormalizer TypeError:", te)
            print(normalized)

And the custom Tokenizer is as follows:

from tokenizers import Tokenizer, models, trainers
from tokenizers.normalizers import Normalizer

model = models.WordPiece(unk_token="[UNK]")
tokenizer = Tokenizer(model)
tokenizer.normalizer = Normalizer.custom(CustomNormalizer())
trainer = trainers.WordPieceTrainer(
    vocab_size=2500,
    special_tokens=special_tokens,
    show_progress=True,
)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer, length=len(dataset))

# Save the tokenizer
tokenizer.save('saved.json')  # this line raises the Exception
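
For context: a normalizer wrapped with Normalizer.custom() is arbitrary Python and cannot be written into the tokenizer JSON, which only stores the library's built-in components, hence the exception on save. A minimal sketch of the workaround hinted at in the comment inside CustomNormalizer above, assuming the built-in normalizers cover what it does (NFKC, dropping digits via Replace, collapsing whitespace, lowercasing):

from tokenizers import Regex, normalizers
from tokenizers.normalizers import NFKC, Replace, Lowercase

# A serializable equivalent built only from provided normalizers.
# Note: this is an approximation; Python's str.isnumeric() also matches
# some non-digit numeric characters that the \d pattern below does not.
tokenizer.normalizer = normalizers.Sequence([
    NFKC(),
    Replace(Regex(r"\d"), ""),    # drop digit characters
    Replace(Regex(r"\s+"), " "),  # collapse whitespace runs
    Lowercase(),
])

tokenizer.save('saved.json')  # serializes cleanly
tokenizer = Tokenizer.from_file('saved.json')  # and loads back without the custom class

If the custom Python logic is genuinely needed, it has to be re-attached with Normalizer.custom(CustomNormalizer()) after every load, since it is never stored in the file.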
@shivanraptor
Author

Similar case to #581
