something wrong in Chinese? #43

Open · bojone opened this issue Jan 17, 2018 · 30 comments

Comments


bojone commented Jan 17, 2018

in python 2.7:

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword(u'北京')
keyword_processor.add_keyword(u'欢迎')
keyword_processor.add_keyword(u'你')
keyword_processor.extract_keywords(u'北京欢迎你')

returns [u'北京', u'你'], missing u'欢迎'?

vi3k6i5 (Owner) commented Jan 17, 2018 via email

bojone (Author) commented Jan 17, 2018

Doesn't it seem ridiculous that a string-matching tool must have a tokenizer?

vi3k6i5 (Owner) commented Jan 17, 2018 via email

bojone (Author) commented Jan 17, 2018

Oh, sorry, I am not blaming you.

As far as I know, many string-matching tools work with a single letter as the minimal unit. I am confused about why you would design it at the word level.

vi3k6i5 (Owner) commented Jan 17, 2018 via email

bojone (Author) commented Jan 18, 2018

Maybe you could separate out the tokenizer and allow us to write our own tokenizer?

like
https://whoosh.readthedocs.io/en/latest/analysis.html

bojone (Author) commented Jan 18, 2018

I suggest (just a suggestion ^_^) designing it as a pure Aho-Corasick automaton, like
https://github.com/WojciechMula/pyahocorasick/
which would be more useful and more flexible. pyahocorasick is written in C, and I'd like to see a pure Python version.
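
For reference, here is a minimal sketch of what a character-level Aho-Corasick automaton looks like in pure Python, in the spirit of this suggestion; the ACAutomaton class and its method names are hypothetical, not flashtext's or pyahocorasick's API:

from collections import deque

class ACAutomaton(object):
    def __init__(self):
        self.goto = [{}]    # goto[state][char] -> next state
        self.fail = [0]     # failure link per state
        self.out = [[]]     # keywords ending at each state

    def add_keyword(self, word):
        state = 0
        for ch in word:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(word)

    def build(self):
        # BFS from the root to compute failure links
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] += self.out[self.fail[nxt]]

    def extract_keywords(self, text):
        state, found = 0, []
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            found.extend(self.out[state])
        return found

ac = ACAutomaton()
for kw in (u'北京', u'欢迎', u'你'):
    ac.add_keyword(kw)
ac.build()
print(ac.extract_keywords(u'北京欢迎你'))  # [u'北京', u'欢迎', u'你']

Because matching advances character by character, no tokenizer is needed, which is exactly why this approach handles unsegmented Chinese text.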

vi3k6i5 (Owner) commented Jan 18, 2018

Cool, thanks for the suggestion. I will definitely take it into consideration :)

datalee commented Jan 19, 2018

@vi3k6i5 Also, there is an issue with Chinese keywords loaded from a file:

processor.add_keyword_from_file('D:/keywords.txt')

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 processor.add_keyword_from_file('D:/keywords.txt')

D:\Program Files\Python35\lib\site-packages\flashtext\keyword.py in add_keyword_from_file(self, keyword_file)
    313             raise IOError("Invalid file path {}".format(keyword_file))
    314         with open(keyword_file) as f:
--> 315             for line in f:
    316                 if '=>' in line:
    317                     keyword, clean_name = line.split('=>')

UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 26: illegal multibyte sequence

madneal commented Jan 19, 2018

@datalee You should first analyze the problem yourself instead of @-ing the author directly. First, you have not provided keywords.txt, and it would not be very difficult to find the reason. As the traceback indicates, it is likely related to the encoding of the file: since the file may contain Chinese words, you should open it with open('keyword_file', encoding='utf8'). It is important to find the reason by yourself instead of just pasting the error and @-ing the author.
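
For example, one workaround until the library handles this is to read the file yourself with an explicit encoding and add the keywords manually; a sketch using the path from the report above and the documented keyword=>clean_name file layout:

from flashtext import KeywordProcessor

kp = KeywordProcessor()
# Open the file with an explicit encoding instead of the platform
# default ('gbk' on this Windows setup), then add the keywords manually.
with open('D:/keywords.txt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if '=>' in line:
            keyword, clean_name = line.split('=>')
            kp.add_keyword(keyword, clean_name)
        elif line:
            kp.add_keyword(line)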

vi3k6i5 (Owner) commented Jan 19, 2018

@datalee Please provide the file if possible.
As @neal1991 pointed out, there might be an encoding issue.

vi3k6i5 (Owner) commented Jan 19, 2018

Also, the keywords.txt file content should be in the format from the documentation (Documentation Link):

java_2e=>java
java programing=>java
product management=>product management
product management techniques=>product management

or

java_2e
java programing
product management
product management techniques

datalee commented Jan 19, 2018

Yes, I know it must be an encoding issue, but I can't find a parameter on add_keyword_from_file for setting it, so...

add_keyword_from_file() got an unexpected keyword argument 'encoding'

vi3k6i5 (Owner) commented Jan 19, 2018

There is a pull request for this: #40.
Will try and get that pushed out soon.

vi3k6i5 (Owner) commented Jan 19, 2018

The fix has been added to the master branch:

Please do pip install -U git+https://github.com/vi3k6i5/flashtext.git

You can pass an encoding parameter when loading keywords from a file.
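
A minimal usage sketch of the new parameter, applied to the failing call from the report above:

from flashtext import KeywordProcessor

processor = KeywordProcessor()
# The explicit encoding overrides the platform default ('gbk' above)
processor.add_keyword_from_file('D:/keywords.txt', encoding='utf-8')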

vi3k6i5 (Owner) commented Jan 19, 2018

@datalee let me know if that solves your problem for loading the file, and post back if there is any other issue. Thanks :)

jimmydong commented Jan 26, 2018

keyword_processor.add_keyword('测试')
keywords_found = keyword_processor.extract_keywords('简单测试')

returns ['测试']

keyword_processor.add_keyword('测试')
keywords_found = keyword_processor.extract_keywords('3测试')

returns nothing

:-(

vi3k6i5 (Owner) commented Jan 26, 2018 via email

jimmydong commented Jan 26, 2018

The reason is: there is no space between Chinese words.

So, I removed digits and letters from non_word_boundaries:

self.non_word_boundaries = set(string.digits + string.ascii_letters + '_')

changed to:

self.non_word_boundaries = set('_')

It works well.
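
For anyone who prefers not to edit the installed source, the same change can be applied per instance, since non_word_boundaries is a plain attribute of KeywordProcessor; a sketch of the runtime equivalent:

from flashtext import KeywordProcessor

kp = KeywordProcessor()
# Treat only '_' as a non-word-boundary character, so digits and ASCII
# letters no longer glue themselves to adjacent Chinese characters.
kp.non_word_boundaries = set('_')
kp.add_keyword('测试')
print(kp.extract_keywords('3测试'))  # ['测试']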

vi3k6i5 (Owner) commented Jan 26, 2018 via email

bojone (Author) commented Jan 27, 2018

@vi3k6i5 I think the best you can do is separate out the tokenizer, for English and Chinese alike. You could allow us to design our own tokenizer and pass it into flashtext.

leepand commented Apr 12, 2018

Just add word segmentation for Chinese:

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword(u'北京')
keyword_processor.add_keyword(u'欢迎')
keyword_processor.add_keyword(u'你')
keyword_processor.add_keyword(u'测试')

import jieba

def safe_unicode(text):
    """Attempts to convert a string to unicode format."""
    # convert the text to be "safe"
    if isinstance(text, unicode):
        return text
    else:
        return text.decode('utf-8')

for i in keyword_processor.extract_keywords(safe_unicode(' '.join(jieba.lcut('简单测试')))):
    print i
for j in keyword_processor.extract_keywords(safe_unicode(' '.join(jieba.lcut('北京欢迎你')))):
    print j

Output:

测试
北京
欢迎
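
On Python 3 the safe_unicode helper is unnecessary, since str is already unicode; a rough equivalent of the example above:

import jieba
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
for kw in ('北京', '欢迎', '你', '测试'):
    keyword_processor.add_keyword(kw)

# jieba segments the text; joining with spaces gives flashtext the
# word boundaries it expects.
for kw in keyword_processor.extract_keywords(' '.join(jieba.lcut('北京欢迎你'))):
    print(kw)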

wuxiaobo commented

> [quotes the original report: extract_keywords(u'北京欢迎你') returns [u'北京', u'你'], missing u'欢迎']

@leepand Hello, I am also a user from China. You only need to change line 532 of the source code from idx = sequence_end_pos to idx = sequence_end_pos - 1. Code:

if __name__ == '__main__':
    kp = KeywordProcessor()
    kp.add_keyword('北京')
    kp.add_keyword('欢迎')
    kp.add_keyword('你')
    text = '北京欢迎你'
    tl = kp.extract_keywords(text)
    print(tl)

Output: ['北京', '欢迎', '你']

shun-zheng commented Oct 31, 2018

I'm considering inserting spaces between Chinese characters so that they mimic English words, and it seems to work fine. (In Python 3.6)

string = '北 京 欢 迎 您 ! 北 京 欢 迎 您 !'
keyword_proc = KeywordProcessor()
keyword_proc.add_keyword('北 京')
keyword_proc.add_keyword('欢 迎')
keyword_proc.add_keyword('您')
keywords = keyword_proc.extract_keywords(string, span_info=True)

Output:

[('北 京', 0, 3),
('欢 迎', 4, 7),
('您', 8, 9),
('北 京', 12, 15),
('欢 迎', 16, 19),
('您', 20, 21)]
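
If spans relative to the original (unspaced) string are needed, the mapping is simple, because character i of the original text lands at offset 2*i in the space-joined string. A sketch with a hypothetical helper extract_cn:

from flashtext import KeywordProcessor

def extract_cn(kp, text):
    # Character i of the original sits at offset 2*i after ' '.join(text),
    # so spans map back with integer division.
    spaced = ' '.join(text)
    results = []
    for kw, start, end in kp.extract_keywords(spaced, span_info=True):
        results.append((kw.replace(' ', ''), start // 2, end // 2 + 1))
    return results

kp = KeywordProcessor()
kp.add_keyword('北 京')  # keywords must be space-joined the same way
kp.add_keyword('欢 迎')
print(extract_cn(kp, '北京欢迎您'))  # [('北京', 0, 2), ('欢迎', 2, 4)]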

ljhust commented Feb 20, 2019

> [quotes the original report and the fix above: change idx = sequence_end_pos to idx = sequence_end_pos - 1]

I tried this method myself, and it indeed works.

Also, this is at line 523.

leopku commented Mar 12, 2019

Any PR fixing this issue?

Tangzy7 commented Sep 6, 2019

> [quotes jimmydong's report above: extract_keywords('3测试') returns nothing]

Has the problem of keywords not being recognized when Chinese characters are mixed with digits been solved? The method mentioned in the other reply does not handle the mixed digits-and-Chinese case.

hello-lan commented

> [quotes jimmydong's report and the question above]

So many pitfalls; it indeed fails to recognize keywords when digits are added.

sunshichen commented

> [quotes the digits issue discussed above]

You can remove the digit characters from the "non word boundaries" set. E.g.

from flashtext import KeywordProcessor

string = '北京3欢迎'

extracter = KeywordProcessor()
extracter.set_non_word_boundaries(set('-')) # Only keep '-'
extracter.add_keyword('欢迎')
print(extracter.extract_keywords(string))

Output:

['欢迎']

sportzhang commented

> [quotes the original report and the idx = sequence_end_pos - 1 fix above]

Tested under Python 3.6; it works. Thanks!
