Something wrong with Chinese? #43
Comments
There is no built-in tokeniser for Chinese, hence this would be happening.
That's my guess. I will look into it and get back.
On Wed, 17 Jan 2018 at 17:10, 苏剑林 wrote:
in python 2.7:
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword(u'北京')
keyword_processor.add_keyword(u'欢迎')
keyword_processor.add_keyword(u'你')
keyword_processor.extract_keywords(u'北京欢迎你')
returns [u'北京', u'你'], missing u'欢迎'?
|
Does it seem ridiculous that a string matching tool must have a tokenizer? |
It has tokenisation for English, not for Chinese.
It's a one-man project, and I only needed to deal with English. Plus, I don't know anything about how Chinese is written.
This tool is mostly built with English in mind. If you want, you can improve it to work with Chinese.
Sorry if this sounds like a stupid decision on my side, but you have to consider that I have already spent 100+ hours on this project for zero pay. I can't spend all my life on this, so I had to make some decisions to simplify it.
|
Oh, sorry, I am not blaming you. As far as I know, many string matching tools work with English letters as the minimal unit. I am confused about why you designed it at the word level. |
I designed it at a character level, but deciding where a word ends and where it does not is a word-level thing.
For example: "hi how are you?" (words end at spaces).
Whereas in ".net is awesome", the word does not end with the '.', it rather starts with it.
So to know where a word ends, I need some idea of word tokenisation.
If this is confusing, let me know and I will try to give a better example.
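A small illustration of that word-boundary behaviour (editor's example, not from the thread): a keyword is reported only when the text stops continuing it with word characters.

from flashtext import KeywordProcessor

kp = KeywordProcessor()
kp.add_keyword('care')
kp.add_keyword('.net')

# 'care' matches only when it ends at a word boundary (space, punctuation, end of text)
print(kp.extract_keywords('take care, use .net'))  # ['care', '.net']
# inside 'careful' the keyword keeps running into word characters, so there is no match
print(kp.extract_keywords('we are careful'))       # []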
|
Maybe you can separate out the tokenizer and allow us to write our own tokenizer? |
I suggest (just a suggestion ^_^) just designing it as a pure AC (Aho-Corasick) automaton, like |
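For reference, here is what a pure character-level Aho-Corasick match looks like on the original example. This is an editor's sketch using the pyahocorasick package (not code supplied by the commenter); it finds all three keywords without any tokenisation.

import ahocorasick  # pip install pyahocorasick

automaton = ahocorasick.Automaton()
for word in [u'北京', u'欢迎', u'你']:
    automaton.add_word(word, word)
automaton.make_automaton()

# iter() yields (end_index, value); recover the start index from the keyword length
matches = [(end - len(word) + 1, word) for end, word in automaton.iter(u'北京欢迎你')]
print(matches)  # [(0, '北京'), (2, '欢迎'), (4, '你')]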
Cool, thanks for the suggestion. I will definitely take it into consideration :) |
@vi3k6i5 Also, there are some issues with Chinese keywords loaded from a file:
UnicodeDecodeError Traceback (most recent call last)
D:\Program Files\Python35\lib\site-packages\flashtext\keyword.py in add_keyword_from_file(self, keyword_file)
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 26: illegal multibyte sequence |
@datalee You should first analyze the problem yourself instead of @-mentioning the author directly. First, you have not provided your keywords.txt, and it would not be very difficult to find the reason. As the traceback indicates, it is probably related to the encoding used when opening the file. |
Yes, I know it must be an encoding issue, but I don't see that add_keyword_from_file has a parameter for setting it, so...
add_keyword_from_file() got an unexpected keyword argument 'encoding' |
There is a pull request for this: #40 |
Fix is added in the master branch: you can now pass an encoding parameter to add_keyword_from_file. |
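With that fix installed, loading a UTF-8 keyword file would look roughly like this (editor's sketch; keywords.txt is a placeholder name for a file with one keyword per line):

from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
# pass the file's encoding explicitly instead of relying on the locale default (gbk on Chinese Windows)
keyword_processor.add_keyword_from_file('keywords.txt', encoding='utf-8')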
@datalee let me know if that solves your problem for loading the file, and post back if there is any other issue. Thanks :) |
keyword_processor.add_keyword('测试')
keywords_found = keyword_processor.extract_keywords('简单测试')
returns ['测试']
keyword_processor.add_keyword('测试')
keywords_found = keyword_processor.extract_keywords('a测试')
returns nothing :-( |
I am really sorry, I can't read Chinese.
I want to help, but I genuinely can't help :( :(
I don't even know how to debug this problem :(
|
The reason is: there is no space between Chinese words. So I removed digits and letters from non_word_boundaries:
self.non_word_boundaries = set(string.digits + string.ascii_letters + '_')
changed to:
self.non_word_boundaries = set('_')
It works well. |
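The same change can be made per instance, without editing the library source, via set_non_word_boundaries. A minimal sketch of the case reported above (the expected output is taken from that report):

from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
# replace the default digits + letters + '_' so a letter like 'a' counts as a word boundary
keyword_processor.set_non_word_boundaries(set('_'))
keyword_processor.add_keyword('测试')
print(keyword_processor.extract_keywords('a测试'))  # ['测试']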
Ok, cool, let me see how I can incorporate that into the main code. If that works better, then we can switch to this approach altogether.
Thanks for the input.
|
@vi3k6i5 I think the best you can do is separate out the tokenizer, whether for English or Chinese. You could allow us to design our own tokenizer and pass it into flashtext. |
Just add sentence segmentation for Chinese, e.g. import jieba (see the sketch below). |
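A rough sketch of that suggestion (editor's code, not the commenter's): segment the sentence with jieba and re-join with spaces so flashtext sees word boundaries; this assumes the keywords line up with jieba's segmentation.

from flashtext import KeywordProcessor
import jieba  # pip install jieba

keyword_processor = KeywordProcessor()
for word in [u'北京', u'欢迎', u'你']:
    keyword_processor.add_keyword(word)

sentence = u'北京欢迎你'
segmented = ' '.join(jieba.cut(sentence))  # typically '北京 欢迎 你'
print(keyword_processor.extract_keywords(segmented))  # expected: ['北京', '欢迎', '你']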
@leepand Hello, I am also a user from China. You only need to modify line 532 of the source code, changing idx = sequence_end_pos to idx = sequence_end_pos - 1, and the output becomes: ['北京', '欢迎', '你'] |
I'm considering using Chinese characters to mimic English words (a space between every character) and it seems to work fine.
keyword_proc = KeywordProcessor()
keyword_proc.add_keyword('北 京')
keyword_proc.add_keyword('欢 迎')
keyword_proc.add_keyword('您')
keywords = keyword_proc.extract_keywords(string, span_info=True)
Output: [('北 京', 0, 3), |
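Filling nothing in from the truncated snippet, a self-contained version of that spacing trick might look like this (editor's sketch; the spaced helper and the mapping back to unspaced keywords are additions):

from flashtext import KeywordProcessor

def spaced(text):
    # put a space between every character so each Chinese character
    # looks like a separate "word" to flashtext's boundary logic
    return ' '.join(text)

keyword_proc = KeywordProcessor()
for kw in ['北京', '欢迎', '您']:
    keyword_proc.add_keyword(spaced(kw))

sentence = '北京欢迎您'
matches = keyword_proc.extract_keywords(spaced(sentence), span_info=True)
print([(kw.replace(' ', ''), start, end) for kw, start, end in matches])
# [('北京', 0, 3), ('欢迎', 4, 7), ('您', 8, 9)] - spans refer to the spaced string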
I tried this method myself and it indeed works. By the way, this is at line 523. |
Any PR to fix this issue? |
Has the problem of recognition failing when Chinese characters are mixed with digits been solved? The method mentioned below does not handle the case where digits and Chinese characters are mixed. |
So many pitfalls; indeed, keywords are not recognized when digits are added. |
You can remove the number characters inside of "non word boundaries". E.g.:
from flashtext import KeywordProcessor
string = '北京3欢迎'
extracter = KeywordProcessor()
extracter.set_non_word_boundaries(set('-'))  # Only keep '-'
extracter.add_keyword('欢迎')
print(extracter.extract_keywords(string))
Output: ['欢迎'] |
Tested under Python 3.6, it works. Thanks! |