Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Encoding mappings for pairs of sequence #506

Merged
merged 6 commits into from
Nov 6, 2020
Merged

Conversation

n1t0
Copy link
Member

@n1t0 n1t0 commented Nov 4, 2020

Fix #486

Improve the mappings XX_to_XX methods by adding support for pairs of sequence:

  • token_to_XX methods now return a tuple with the sequence index and the expected result for pairs of sequence
  • char_to_XX and word_to_XX now accept a new argument sequence_index to specify the target input sequence
  • Also add a token_to_sequence method that can tell for any token, the associated input sequence.

@n1t0 n1t0 requested review from Narsil and thomwolf November 4, 2020 20:30
Copy link
Collaborator

@Narsil Narsil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The choice I would really think needs to be changed it the variant return type, it's really messy for users I think. I gave an example of something which I feel would both keep backcompat and be cleaner for users.

set_sequence_id being exposed feels odd to me, I think we should probably add a realistic test where this might be used in production to convey why it's there (because destroying sequence sources being irreversible, we need a good reason to expose that IMO). Otherwise we should remove it until a convincing scenario is presented I feel.

bindings/python/py_src/tokenizers/__init__.pyi Outdated Show resolved Hide resolved
tokenizers/src/tokenizer/encoding.rs Show resolved Hide resolved
tokenizers/src/tokenizer/encoding.rs Show resolved Hide resolved
tokenizers/src/processors/bert.rs Show resolved Hide resolved
@n1t0 n1t0 merged commit 57d162b into master Nov 6, 2020
@n1t0 n1t0 deleted the encoding-pair-sequence branch November 6, 2020 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve Encoding mappings for pairs of sequence
2 participants