New mappings on Encoding #200

Merged
n1t0 merged 20 commits into master from encoding-mappings
Mar 26, 2020

n1t0 (Member) commented Mar 19, 2020

Provides additional mappings on the Encoding so that words can be easily identified after tokenization.

It also exposes an encode_tokenized method on the BaseTokenizer that skips the usual Normalizer and PreTokenizer. This is especially useful for NER-like datasets, where pre-tokenization has already been done and we want to attribute labels to the pre-tokenized words.
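
To illustrate the use case described above, here is a minimal, self-contained sketch of how a token-to-word mapping lets word-level NER labels be attributed to subword tokens. It does not call this library: `toy_subword_split`, `encode_pretokenized`, `align_labels`, and the `"X"` placeholder label are all hypothetical names invented for this example, and the splitter is a stand-in for a real subword model.

```python
from typing import Callable, List, Tuple


def toy_subword_split(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in for a subword model: chunks of at most max_len chars."""
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    # Mark continuation pieces with "##", purely for readability.
    return [p if i == 0 else "##" + p for i, p in enumerate(pieces)]


def encode_pretokenized(
    words: List[str],
    split: Callable[[str], List[str]] = toy_subword_split,
) -> Tuple[List[str], List[int]]:
    """Split already pre-tokenized words and record which word each token came from."""
    tokens: List[str] = []
    word_ids: List[int] = []
    for word_index, word in enumerate(words):
        for piece in split(word):
            tokens.append(piece)
            word_ids.append(word_index)
    return tokens, word_ids


def align_labels(
    word_ids: List[int],
    word_labels: List[str],
    ignore_label: str = "X",
) -> List[str]:
    """Give the word's label to its first token and ignore_label to the rest."""
    aligned: List[str] = []
    previous = None
    for word_index in word_ids:
        aligned.append(word_labels[word_index] if word_index != previous else ignore_label)
        previous = word_index
    return aligned


# Assumed example data: words are already split, one label per word.
words = ["Hugging", "Face", "is", "in", "Brooklyn"]
labels = ["B-ORG", "I-ORG", "O", "O", "B-LOC"]

tokens, word_ids = encode_pretokenized(words)
token_labels = align_labels(word_ids, labels)

for token, word_index, label in zip(tokens, word_ids, token_labels):
    print(f"{token:8} word={word_index} label={label}")
```

Labeling only the first token of each word is one common convention; propagating the word label to every subword is an equally reasonable choice and only changes `align_labels`.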

Still missing:

  • Node bindings
  • Node typings

n1t0 requested a review from mfuntowicz on March 19, 2020 03:10
n1t0 force-pushed the encoding-mappings branch from 8d23093 to b3fd06f on March 19, 2020 03:11
n1t0 force-pushed the encoding-mappings branch 3 times, most recently from 26a9e50 to 4c8ba75, on March 26, 2020 16:03
n1t0 force-pushed the encoding-mappings branch from ea72fba to e9667a7 on March 26, 2020 19:43
n1t0 merged commit 1bdff71 into master on Mar 26, 2020
n1t0 deleted the encoding-mappings branch on March 26, 2020 19:58
3 participants