This repository has been archived by the owner on Mar 21, 2023. It is now read-only.
A Chinese Word Segmentation(中文分词) routine in pure Ruby
yzhang/rseg
Introduction
========

Rseg is a Chinese Word Segmentation (中文分词) routine in pure Ruby.

The algorithm is based on this article:
http://xiecc.blog.163.com/blog/static/14032200671110224190/

Usage
========

Rseg now supports two modes: inline mode and C/S (client/server) mode.

1. Inline mode

    > require 'rubygems'
    > require 'rseg'
    > Rseg.segment("需要分词的文章")
    ['需要', '分词', '的', '文章']

The first call to Rseg.segment takes about 30 seconds because it has to load the dictionaries; subsequent calls are very fast. You can also call Rseg.load to load the dictionaries manually.

2. C/S mode

    $ rseg_server
    == Sinatra/0.9.4 has taken the stage on 4100

This starts the rseg server on http://localhost:4100. You can access it from your browser or with the rseg command:

    $ rseg '需要分词的文章'
    需要 分词 的 文章

You can also access the server with Rseg.remote_segment:

    $ irb
    > require 'rubygems'
    > require 'rseg'
    > Rseg.remote_segment("需要分词的文章") # This will be very fast
    ['需要', '分词', '的', '文章']

Performance
========

About 5M characters/s on my MacBook (Intel Core 2 Duo 2GHz, 4G memory).

License
========

Rseg includes two built-in dictionaries:

* CC-CEDICT (http://cc-cedict.org/wiki/), under the Creative Commons Attribution-Share Alike 3.0 License (http://creativecommons.org/licenses/by-sa/3.0/)
* Wikipedia Chinese article title list (http://download.wikimedia.org/zhwiki/), under the Creative Commons Attribution-Share Alike 3.0 License (http://creativecommons.org/licenses/by-sa/3.0/)

The code and everything else in Rseg are licensed under the MIT license.

Feedback
========

All feedback is welcome: Yuanyi Zhang (zhangyuanyi#gmail.com)
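
Illustrative sketch
========

To give a feel for what dictionary-based segmentation does with the sample sentence from the Usage section, here is a minimal toy sketch. It is not Rseg's algorithm (which follows the article linked in the Introduction) and it does not use Rseg's API. It uses plain forward maximum matching, a simpler technique, and the class name ToySegmenter, its methods, and the four-word dictionary are all hypothetical.

```ruby
# Toy forward-maximum-matching segmenter.
# NOT Rseg's algorithm or API: the class name, method names, and the
# tiny dictionary are hypothetical and exist only for illustration.
require 'set'

class ToySegmenter
  def initialize(words)
    @dict = Set.new(words)
    # Longest dictionary entry bounds how far ahead we need to look.
    @max_len = words.map(&:length).max || 1
  end

  # Scan left to right, taking the longest dictionary word that starts at
  # the current position; fall back to a single character if none matches.
  def segment(text)
    chars = text.chars
    result = []
    i = 0
    while i < chars.length
      len = [@max_len, chars.length - i].min
      len -= 1 while len > 1 && !@dict.include?(chars[i, len].join)
      result << chars[i, len].join
      i += len
    end
    result
  end
end

segmenter = ToySegmenter.new(%w[需要 分词 的 文章])
p segmenter.segment("需要分词的文章")
# prints ["需要", "分词", "的", "文章"]
```

A real segmenter needs a much larger dictionary (Rseg ships CC-CEDICT and the Wikipedia Chinese title list) and a better strategy for ambiguous spans; greedy matching like this picks the longest word at each position and never backtracks.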