New mappings on Encoding #200

n1t0 · 2020-03-19T03:10:14Z

Provide some more mappings on the Encoding in order to easily identify words after tokenization.

It also exposes an encode_tokenized method on the BaseTokenizer, which skips the usual Normalizer and PreTokenizer. This is especially useful for NER-like datasets, where pre-tokenization has already been done and labels need to be attributed to the pre-tokenized words.
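To illustrate why a token-to-word mapping helps with NER labels, here is a minimal sketch in plain Python. It does not call the library: `toy_subword_split`, `encode_pretokenized`, and `align_labels` are hypothetical helpers invented for this example, and the fixed-width splitting is only a stand-in for a real subword model. The idea is simply that each token remembers the index of the pre-tokenized word it came from, so word-level labels can be propagated to tokens.

```python
from typing import List, Tuple


def toy_subword_split(word: str, max_len: int = 3) -> List[str]:
    """Hypothetical stand-in for a subword model: fixed-width character chunks."""
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    # Mark continuation pieces with '##' (WordPiece-like convention, for readability only)
    return [p if i == 0 else "##" + p for i, p in enumerate(pieces)]


def encode_pretokenized(words: List[str]) -> Tuple[List[str], List[int]]:
    """Split already pre-tokenized words and record, per token, its source word index."""
    tokens: List[str] = []
    word_ids: List[int] = []
    for idx, word in enumerate(words):
        for piece in toy_subword_split(word):
            tokens.append(piece)
            word_ids.append(idx)
    return tokens, word_ids


def align_labels(word_labels: List[str], word_ids: List[int]) -> List[str]:
    """Give every token the label of the word it belongs to."""
    return [word_labels[i] for i in word_ids]


# Pre-tokenized NER-style input: one label per word
words = ["Hugging", "Face", "is", "in", "Paris"]
labels = ["B-ORG", "I-ORG", "O", "O", "B-LOC"]

tokens, word_ids = encode_pretokenized(words)
token_labels = align_labels(labels, word_ids)

for token, wid, label in zip(tokens, word_ids, token_labels):
    print(f"{token:8} word={wid} label={label}")
```

In this sketch every sub-token inherits its word's label; a real pipeline might instead label only the first sub-token of each word, which the same `word_ids` list also makes straightforward.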

Still missing:

Node bindings
Node typings

n1t0 requested a review from mfuntowicz March 19, 2020 03:10

n1t0 force-pushed the encoding-mappings branch from 8d23093 to b3fd06f Compare March 19, 2020 03:11

mfuntowicz approved these changes Mar 23, 2020

n1t0 force-pushed the encoding-mappings branch 3 times, most recently from 26a9e50 to 4c8ba75 Compare March 26, 2020 16:03

n1t0 and others added 20 commits March 26, 2020 15:42

Rust - Add some more mappings on Encoding

b699fe4

Python - Bind new Encoding's mappings

8de6ef5

Python - Expose encode method on Model

a397a1d

Rust - Fix mappings and add tests

595e799

Node - Expose encode on JsModel and JsEncoding new mappings

f922028

Rust - Update mappings API

e9babc8

Python - Update mappings API

1150751

Node - Update mappings API

2f310f3

Python - Add Model.encode_batch and improve typings

eec74ca

Node - Add Model.encodeBatch and make everything async

f79ae40

Add some new merging capability on Encoding

9ce8955

Expose post_process on the Tokenizer

9bd9e0b

Node - Bindings for Encoding mappings

ce3cf78

Node - Bindings for tokenized encoding

7055281

Forward type_id in encode_tokenized/encode_tokenized_batch python binding.

68405a6

TokenizedSequence / TokenizedSequenceWithOffsets needs to be declared in .py files not only .pyi

39958a2

Python - fix style

14e3ab3

Python - last fixes on Encoding bindings/typings

4341c79

Node - Merge encodings

0408567

Node - tokenizer.postProcess bindings

e9667a7

n1t0 force-pushed the encoding-mappings branch from ea72fba to e9667a7 Compare March 26, 2020 19:43

n1t0 merged commit 1bdff71 into master Mar 26, 2020

n1t0 deleted the encoding-mappings branch March 26, 2020 19:58

dav009 mentioned this pull request Jun 17, 2020

Several problems with named entites predicted with the ner pipeline huggingface/transformers#5077

Closed

4 tasks
