Environment info

Python 3.8.10
ark-nlp 0.0.7
Information

tokenizer.tokenize('森麥康 小米3 M4 M5 5C 5X 5S 5Splus mi 6 6X电源开机音量按键排线侧键 小米5C 开机音量排线')
>>> ['森', '麥', '康', '小', '米', '3', 'm', '4', 'm', '5', '5', 'c', '5', 'x', '5', 's', '5', 's', 'p', 'l', 'u', 's', 'm', 'i', '6', '6', 'x', '电', '源', '开', '机', '音', '量', '按', '键', '排', '线', '侧', '键', '小', '米', '5', 'c', '开', '机', '音', '量', '排', '线']

The spaces in the input are dropped from the tokenized output, so the tokens no longer align with the original text.
As a workaround, you can override the class as shown below; the bug will be fixed in the next release.
from ark_nlp.processor.tokenizer.transfomer import TransfomerTokenizer


class TokenTokenizer(TransfomerTokenizer):
    """
    Transformer text encoder that tokenizes, ID-converts and pads text
    character by character

    Args:
        vocab: a transformers vocab object, vocab path or vocab name,
            used for tokenization and ID conversion
        max_seq_len (:obj:`int`): preset maximum text length
    """  # noqa: ignore flake8

    def tokenize(self, text, **kwargs):
        tokens = []
        for token_ in text:
            tokenized_token_ = self.vocab.tokenize(token_)
            if tokenized_token_ == []:
                # Keep characters the vocab cannot tokenize (e.g. spaces)
                # instead of dropping them
                tokens.extend([token_])
            else:
                tokens.extend(tokenized_token_)
        return tokens

    def sequence_to_ids(self, sequence, **kwargs):
        return self.sentence_to_ids(sequence, **kwargs)
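For reference, a minimal usage sketch of the override. The vocab name 'bert-base-chinese' and the max_seq_len value are illustrative assumptions, not part of the original report:

from transformers import BertTokenizer

# Illustrative vocab; any transformers vocab accepted by
# TransfomerTokenizer should behave the same way.
vocab = BertTokenizer.from_pretrained('bert-base-chinese')
tokenizer = TokenTokenizer(vocab, max_seq_len=128)

tokens = tokenizer.tokenize('小米5C 开机音量排线')
# Each character is looked up in the vocab individually, so spaces and
# other characters the vocab returns nothing for are preserved as-is
# rather than silently dropped.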
fix(tokenizer): fix TokenTokenizer failing to handle spaces
72391f8
Closes #37
fix(tokenizer): fix the issue of TokenTokenizer failing to handle spaces
4956c88
xiangking