The tokenizer cannot handle just any character sequence, and the symbols it cannot handle are left out of its output. There are scenarios where a user would like to preserve those symbols in the Document that is produced.
I haven't had a hand in the implementation of the tokenizer, but is it possible to reinsert these symbols as tokens in the doc we return? Lemmata, tags, etc. should probably default to some sort of UNK tag. A sketch of the idea is below.
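A minimal sketch of what I mean, not the actual processors implementation: assuming the tokenizer reports character offsets for the tokens it does produce, any non-whitespace gap between those offsets is a span the tokenizer skipped, and we can wrap it as a token whose lemma/tag default to a placeholder UNK. The field names here are illustrative, not the real Token API.

```python
def reinsert_skipped(text, token_offsets):
    """Re-align tokenizer output against the raw text and wrap skipped spans.

    token_offsets: list of (start, end) character offsets from the tokenizer.
    """
    out, cursor = [], 0
    for start, end in token_offsets:
        # Any non-whitespace text between consecutive tokens was dropped
        # by the tokenizer; re-introduce it with placeholder annotations.
        gap = text[cursor:start].strip()
        if gap:
            out.append({"word": gap, "lemma": "UNK", "tag": "UNK"})
        out.append({"word": text[start:end], "lemma": None, "tag": None})
        cursor = end
    tail = text[cursor:].strip()
    if tail:
        out.append({"word": tail, "lemma": "UNK", "tag": "UNK"})
    return out

# "≈" is not tokenized, so it comes back as an UNK-tagged token.
print(reinsert_skipped("price \u2248 $5", [(0, 5), (8, 10)]))
```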
Related: if processors moves to a transformers backbone for annotation as planned, will the tokenizer be replaced by a wordpiece tokenizer or will predictions be re-mapped to the word-like tokens recognized by the current tokenizer?
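For the second option, remapping is a well-trodden path with fast Hugging Face tokenizers: `word_ids()` gives the subword-to-word alignment, so subword predictions can be collapsed back onto the word-like tokens. A minimal sketch, with stand-in labels rather than a real model (and no claim that this is the planned processors design):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

words = ["Unhandled", "symbols", "survive", "remapping"]
encoding = tokenizer(words, is_split_into_words=True, add_special_tokens=False)
word_ids = encoding.word_ids()  # one entry per subword, pointing at its source word

# Stand-in for one model prediction per subword.
subword_preds = [f"PRED-{i}" for i in range(len(word_ids))]

# Keep the prediction for the first subword of each word, a common convention
# (averaging or max-pooling over a word's subwords are the usual alternatives).
word_level = {}
for i, wid in enumerate(word_ids):
    if wid is not None and wid not in word_level:
        word_level[wid] = subword_preds[i]

print([(w, word_level[i]) for i, w in enumerate(words)])
```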
That said, if processors moves to using transformers, re-introducing tokens after annotation might be confusing when inspecting things like attention weights.