The tokenizer cannot handle just any character sequence, and the symbols it cannot handle are left out of its output. There are scenarios where a user would like to preserve those symbols in the Document that is produced.
I haven't had a hand in the implementation of the tokenizer, but is it possible to reinsert these symbols as tokens in the doc we return? Lemmata, tags, etc. should probably default to some sort of UNK tag. A sketch of the idea is below.
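A minimal sketch of what I mean, not the actual processors implementation: assuming the tokenizer reports character offsets for the tokens it does produce, any non-whitespace gap between those offsets is a span the tokenizer skipped, and we can wrap it as a token whose lemma/tag default to a placeholder UNK. The field names here are illustrative, not the real Token API.

```python
def reinsert_skipped(text, token_offsets):
    """Re-align tokenizer output against the raw text and wrap skipped spans.

    token_offsets: list of (start, end) character offsets from the tokenizer.
    """
    out, cursor = [], 0
    for start, end in token_offsets:
        # Any non-whitespace text between consecutive tokens was dropped
        # by the tokenizer; re-introduce it with placeholder annotations.
        gap = text[cursor:start].strip()
        if gap:
            out.append({"word": gap, "lemma": "UNK", "tag": "UNK"})
        out.append({"word": text[start:end], "lemma": None, "tag": None})
        cursor = end
    tail = text[cursor:].strip()
    if tail:
        out.append({"word": tail, "lemma": "UNK", "tag": "UNK"})
    return out

# "≈" is not tokenized, so it comes back as an UNK-tagged token.
print(reinsert_skipped("price \u2248 $5", [(0, 5), (8, 10)]))
```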
Related: if processors moves to a transformers backbone for annotation as planned, will the tokenizer be replaced by a wordpiece tokenizer or will predictions be re-mapped to the word-like tokens recognized by the current tokenizer?
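For the second option, remapping is a well-trodden path with fast Hugging Face tokenizers: `word_ids()` gives the subword-to-word alignment, so subword predictions can be collapsed back onto the word-like tokens. A minimal sketch, with stand-in labels rather than a real model (and no claim that this is the planned processors design):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

words = ["Unhandled", "symbols", "survive", "remapping"]
encoding = tokenizer(words, is_split_into_words=True, add_special_tokens=False)
word_ids = encoding.word_ids()  # one entry per subword, pointing at its source word

# Stand-in for one model prediction per subword.
subword_preds = [f"PRED-{i}" for i in range(len(word_ids))]

# Keep the prediction for the first subword of each word, a common convention
# (averaging or max-pooling over a word's subwords are the usual alternatives).
word_level = {}
for i, wid in enumerate(word_ids):
    if wid is not None and wid not in word_level:
        word_level[wid] = subword_preds[i]

print([(w, word_level[i]) for i, w in enumerate(words)])
```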
That said, if processors moves to using transformers, re-introducing tokens after annotation might be confusing when inspecting things like attention weights.