
Drop the Trie to cut memory and speed things up; polish code details #187

Merged
merged 2 commits into fxsjy:master on Oct 19, 2014

Conversation

@gumblex (Contributor) commented Oct 18, 2014

For the get_DAG() function, a Trie data structure uses far too much memory, especially in Python. Experiments show the problem can be solved by building a prefix set instead.

The set stores every word together with all of its prefixes, e.g. set(['数', '数据', '数据结', '数据结构']). To find words in a sentence, scan forward character by character: as long as the current fragment is still in the prefix set, keep extending it, and stop once the fragment is no longer a prefix or the end of the sentence is reached (see the sketch below). This adds roughly 40% more entries compared with the original dictionary.
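A minimal sketch of the scheme (the function and variable names here are illustrative, not the PR's actual code):

```python
# Illustrative sketch of the prefix-set lookup described above;
# names are made up for the example and are not the PR's actual code.
def build_prefix_set(words):
    """Store every word plus all of its prefixes."""
    prefixes = set()
    for word in words:
        for i in range(1, len(word) + 1):
            prefixes.add(word[:i])
    return prefixes

def find_words(sentence, words, prefixes):
    """Forward scan: extend the fragment while it is still a known prefix."""
    found = []
    for start in range(len(sentence)):
        for end in range(start + 1, len(sentence) + 1):
            frag = sentence[start:end]
            if frag not in prefixes:
                break              # nothing in the dictionary starts this way
            if frag in words:      # the fragment is a complete word
                found.append(frag)
    return found

words = {'数', '数据', '数据结构'}
prefixes = build_prefix_set(words)   # {'数', '数据', '数据结', '数据结构'}
print(find_words('数据结构', words, prefixes))  # ['数', '数据', '数据结构']
```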

This version passes all tests and produces segmentation results identical to the original. Test setup: a 5.7 MB novel, the default dictionary, 64-bit Ubuntu, Python 2.7.6.
Trie: first load 2.8 s, cached load 1.1 s; memory 277.4 MB, average throughput 724 kB/s
Prefix dictionary: first load 2.1 s, cached load 0.4 s; memory 99.0 MB, average throughput 781 kB/s

This approach fixes the poor space efficiency of a Trie in pure Python.
It also cleans up various code details, follows PEP 8 formatting, and simplifies several conditionals.

Added __main__.py, so segmentation can now be run directly with python -m jieba.

```
usage: python -m jieba [options] filename

Jieba command line interface.

positional arguments:
  filename              input file

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d [DELIM], --delimiter [DELIM]
                        use DELIM instead of ' / ' for word delimiter; use a
                        space if it is used without DELIM
  -a, --cut-all         full pattern cutting
  -n, --no-hmm          don't use the Hidden Markov Model
  -q, --quiet           don't print loading messages to stderr

If no filename specified, use STDIN instead.
```
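For example, to segment a file using ' | ' as the delimiter (the file names below are only an illustration):

```
python -m jieba -d ' | ' novel.txt > segmented.txt
```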

If this is accepted, please update the version number, changelog, and documentation accordingly. A Python 3 port will follow shortly.

@gumblex mentioned this pull request Oct 19, 2014
@fxsjy (Owner) commented Oct 19, 2014

@gumblex, nice work!

fxsjy added a commit that referenced this pull request Oct 19, 2014
Drop the Trie to cut memory and speed things up; polish code details
@fxsjy merged commit 4a93f21 into fxsjy:master Oct 19, 2014
fxsjy added a commit that referenced this pull request Oct 19, 2014
@fxsjy (Owner) commented Oct 19, 2014

@gumblex, I tried it out myself: memory usage is indeed much lower, and speed improved as well.

@kslr commented Oct 20, 2014

The memory reduction is fantastic.

@kevingo commented Oct 21, 2014

It's a good modification.

@xwzhong commented May 28, 2016

nice

@pengcao commented Jan 9, 2018

great

@yzho0907 commented Aug 9, 2018

Python 3.6+ optimized the built-in dict, and a pure-Python trie is essentially built out of dicts. Does a trie perform better on Python 3.6+, or at least better than on Python 2.x?

@chuanfanyoudong commented
Was the earlier heavy memory use caused by each node of the trie itself being a dict, i.e. dicts nested inside dicts?
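For context, a pure-Python trie is commonly built exactly that way, as nested dicts, so every character level allocates another dict object. A toy sketch:

```python
# A pure-Python trie node is itself a dict, so each character level
# allocates another dict -- this nesting is what made the Trie so heavy.
trie = {}
for word in ('数据', '数据结构'):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node[''] = True  # mark end of word
print(trie)
# {'数': {'据': {'': True, '结': {'构': {'': True}}}}}
```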

@sugarac commented Mar 11, 2019

@gumblex Hi, a quick question: why store the prefixes at all? A prefix that is not itself a dictionary word always has a frequency of 0, doesn't it?

@shaheming commented
> @gumblex Hi, a quick question: why store the prefixes at all? A prefix that is not itself a dictionary word always has a frequency of 0, doesn't it?

See https://www.cnblogs.com/zhbzz2007/p/6084196.html for reference.

Here is why, taking the segmentation of 「去北京大学玩」 as an example.
Segmentation builds a DAG over the sentence: it loops over every character and scans forward from it, checking whether a longer fragment can still form a word. Starting from 「北」: 「北」 itself is in the dictionary, OK; 「北京」, OK; then 「北京大」. If 「北京大」 were not kept in the dictionary, the scan would stop there and never reach the full word 「北京大学」.

A set is used here in place of the prefix tree.
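A rough sketch of that forward scan, using a frequency dictionary that keeps pure prefixes at frequency 0 so the scan can continue past them (simplified; the toy dictionary and frequencies below are made up for the example):

```python
# Simplified DAG construction. FREQ maps every word AND every prefix to a
# frequency; pure prefixes are recorded as 0 so the forward scan continues.
FREQ = {'去': 5, '北': 3, '北京': 20, '北京大': 0, '北京大学': 10,
        '大': 4, '大学': 15, '学': 6, '玩': 8, '京': 2}

def get_DAG(sentence):
    DAG = {}
    N = len(sentence)
    for k in range(N):
        tmplist = []
        i = k
        frag = sentence[k]
        while i < N and frag in FREQ:
            if FREQ[frag]:          # non-zero freq: frag is a real word
                tmplist.append(i)
            i += 1
            frag = sentence[k:i + 1]
        if not tmplist:
            tmplist.append(k)       # fall back to the single character
        DAG[k] = tmplist
    return DAG

print(get_DAG('去北京大学玩'))
# {0: [0], 1: [1, 2, 4], 2: [2], 3: [3, 4], 4: [4], 5: [5]}
```

Without the 「北京大」 entry, the while loop would exit after 「北京」 and the edge for 「北京大学」 (index 1 to 4) would never be added to the DAG.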
