Fix: TokenTokenizer ignores spaces when tokenizing #37

Closed
xiangking opened this issue Mar 24, 2022 · 1 comment · Fixed by #42
xiangking commented Mar 24, 2022

Environment info

Python 3.8.10
ark-nlp 0.0.7

Information

tokenizer.tokenize('森麥康 小米3 M4 M5 5C 5X 5S 5Splus mi 6 6X电源开机音量按键排线侧键 小米5C 开机音量排线')

>>> 
['森',
 '麥',
 '康',
 '小',
 '米',
 '3',
 'm',
 '4',
 'm',
 '5',
 '5',
 'c',
 '5',
 'x',
 '5',
 's',
 '5',
 's',
 'p',
 'l',
 'u',
 's',
 'm',
 'i',
 '6',
 '6',
 'x',
 '电',
 '源',
 '开',
 '机',
 '音',
 '量',
 '按',
 '键',
 '排',
 '线',
 '侧',
 '键',
 '小',
 '米',
 '5',
 'c',
 '开',
 '机',
 '音',
 '量',
 '排',
 '线']
@xiangking xiangking added the bug Something isn't working label Mar 24, 2022
@xiangking xiangking self-assigned this Mar 24, 2022
@xiangking

As a workaround, you can override the class as shown below. The bug will be fixed in the next release.

from ark_nlp.processor.tokenizer.transfomer import TransfomerTokenizer


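class TokenTokenizer(TransfomerTokenizer):
    """
    Transformer text encoder that tokenizes character by character and
    handles ID conversion, padding, and related operations.

    Args:
        vocab: a transformers vocabulary object, vocabulary path, or vocabulary name, used for tokenization and ID conversion
        max_seq_len (:obj:`int`): preset maximum text length
    """  # noqa: ignore flake8

    def tokenize(self, text, **kwargs):
        tokens = []
        for token_ in text:
            tokenized_token_ = self.vocab.tokenize(token_)
            if not tokenized_token_:
                # The vocab tokenizer returns nothing for this character
                # (e.g. a space), so keep the raw character instead.
                tokens.append(token_)
            else:
                tokens.extend(tokenized_token_)

        return tokens

    def sequence_to_ids(self, sequence, **kwargs):
        return self.sentence_to_ids(sequence, **kwargs)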
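The effect of the workaround can be checked without loading a real vocabulary. The sketch below is only an illustration: `ToyVocab` is a hypothetical stand-in for `self.vocab`, assumed to lowercase its input and return an empty list for whitespace (the behaviour visible in the output above), and `tokenize_keep_spaces` mirrors the logic of the overridden `tokenize` method as a plain function.

```python
class ToyVocab:
    """Hypothetical stand-in for self.vocab (not part of ark-nlp).

    Assumed behaviour: lowercases text and returns [] for whitespace.
    """

    def tokenize(self, text):
        text = text.strip().lower()
        return [text] if text else []


def tokenize_keep_spaces(vocab, text):
    """Tokenize per character, keeping characters the vocab drops."""
    tokens = []
    for char in text:
        pieces = vocab.tokenize(char)
        # Fall back to the raw character when the vocab returns nothing.
        tokens.extend(pieces if pieces else [char])
    return tokens


text = "mi 6 6X"
vocab = ToyVocab()

# Behaviour reported in the issue: spaces silently disappear.
dropped = [piece for char in text for piece in vocab.tokenize(char)]
print(dropped)  # ['m', 'i', '6', '6', 'x']

# Workaround: one token per input character, spaces included.
tokens = tokenize_keep_spaces(vocab, text)
print(tokens)  # ['m', 'i', ' ', '6', ' ', '6', 'x']
```

With the fallback in place, the token list has the same length as the input string, so character-level positions stay aligned with the original text.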

xiangking pushed a commit that referenced this issue Mar 26, 2022
This was referenced Mar 26, 2022