[Tokenizer] Unify tokenizer _pad #9280

Merged
Changes from 1 commit
Merge remote-tracking branch 'paddlenlp/develop' into dev_20241016_update_tokenizer__pad
DrownFish19 committed Oct 17, 2024
commit a183b40af5f8d2e5ead35f6864e604fb7d812dfb
3 changes: 2 additions & 1 deletion paddlenlp/transformers/chatglm_v2/tokenizer.py
```diff
@@ -257,7 +257,8 @@ def _pad(
     - PaddingStrategy.LONGEST Pad to the longest sequence in the batch
     - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
     - PaddingStrategy.DO_NOT_PAD: Do not pad
-    The tokenizer padding sides are defined in self.padding_side:
+    The tokenizer padding sides are defined in `padding_side` argument:
+
     - 'left': pads on the left of the sequences
     - 'right': pads on the right of the sequences
     pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
```
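The docstring update reflects the unified behaviour: the padding side can now be taken from a `padding_side` argument rather than only from the tokenizer's `padding_side` attribute. A minimal sketch of that idea follows; the helper name `_pad_ids` and its parameters are illustrative, not PaddleNLP's actual `_pad` signature.

```python
# Minimal sketch (not the PaddleNLP implementation): pad a list of token ids,
# taking the side from an explicit `padding_side` argument and falling back to
# a default when the caller does not pass one.
from typing import List, Optional


def _pad_ids(
    ids: List[int],
    max_length: int,
    pad_token_id: int = 0,
    padding_side: Optional[str] = None,
    default_side: str = "right",
) -> List[int]:
    side = padding_side or default_side
    difference = max_length - len(ids)
    if difference <= 0:
        return ids
    if side == "left":
        return [pad_token_id] * difference + ids
    if side == "right":
        return ids + [pad_token_id] * difference
    raise ValueError(f"Invalid padding side: {side!r}")


print(_pad_ids([1, 2, 3], 6, padding_side="left"))   # [0, 0, 0, 1, 2, 3]
print(_pad_ids([1, 2, 3], 6, padding_side="right"))  # [1, 2, 3, 0, 0, 0]
```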
2 changes: 1 addition & 1 deletion paddlenlp/transformers/llama/tokenizer.py
```diff
@@ -15,7 +15,7 @@
 
 import os
 from shutil import copyfile
-from typing import Dict, List, Literal, Optional, Tuple, Union
+from typing import Dict, List, Optional, Tuple, Union
 
 import sentencepiece as spm
 
```
12 changes: 11 additions & 1 deletion tests/transformers/test_tokenizer_common.py
```diff
@@ -28,6 +28,7 @@
 from typing import Any, Dict, List, Tuple
 
 import numpy as np
+from parameterized import parameterized
 
 from paddlenlp.transformers import PretrainedTokenizer
 from paddlenlp.transformers.tokenizer_utils import AddedToken, Trie
```

```diff
@@ -1555,7 +1556,16 @@ def test_padding_with_attn_mask_startend_row_indices(self):
             padded_features["attn_mask_startend_row_indices"][1], np.array([[0, 0, 0, 5, 5, 3]], np.int32)
         )
 
-    def test_encode_plus_with_padding(self):
+    @parameterized.expand([(True,), (False,)])
+    def test_encode_plus_with_padding(self, use_padding_as_call_kwarg: bool):
+        """
+        This test checks that padding works as expected when tokenizing a sequence.
+        Padding is expected to have no effect when the input is a single sequence and
+        the padding-strategy is not `max_length`. Otherwise it pads to the specified max-length
+        using tokenizer classes `padding_side` attribute. Also, we check that passing `padding_side`
+        as call time kwarg works same way as when one sets `tokenizer.padding_side` attribute.
+        """
+
         tokenizers = self.get_tokenizers(do_lower_case=False)
         for tokenizer in tokenizers:
             with self.subTest(f"{tokenizer.__class__.__name__}"):
```
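The parameterized test exercises both ways of choosing the padding side: setting the tokenizer's `padding_side` attribute and passing `padding_side` as a call-time kwarg. Below is a hedged, self-contained sketch of the same pattern; `ToyPadder` is an illustrative stand-in, not a PaddleNLP class.

```python
# Sketch of the test pattern: one test body runs twice via @parameterized.expand,
# once setting the attribute and once passing `padding_side` as a call kwarg.
import unittest

from parameterized import parameterized


class ToyPadder:
    def __init__(self, pad_id: int = 0, padding_side: str = "right"):
        self.pad_id = pad_id
        self.padding_side = padding_side

    def pad(self, ids, max_length, padding_side=None):
        # Call-time kwarg wins; otherwise fall back to the instance attribute.
        side = padding_side if padding_side is not None else self.padding_side
        pad = [self.pad_id] * max(0, max_length - len(ids))
        return pad + ids if side == "left" else ids + pad


class ToyPadderTest(unittest.TestCase):
    @parameterized.expand([(True,), (False,)])
    def test_pad_left(self, use_padding_as_call_kwarg: bool):
        padder = ToyPadder()
        if use_padding_as_call_kwarg:
            padded = padder.pad([1, 2, 3], max_length=5, padding_side="left")
        else:
            padder.padding_side = "left"
            padded = padder.pad([1, 2, 3], max_length=5)
        # Both paths must produce the same left-padded result.
        self.assertEqual(padded, [0, 0, 1, 2, 3])


if __name__ == "__main__":
    unittest.main()
```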