GarthTB / word-freq-counter Public

A blind-segmentation word-frequency counter for Chinese corpora: 1 billion characters in just 1 minute!

Apache-2.0 license

A blind-segmentation word-frequency counter for Chinese corpora

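On my personal computer, counting 2-character words in a Chinese internet corpus of about 1 billion characters, with punctuation excluded, takes roughly 1 minute.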

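The corpus file must be UTF-8 encoded. By default, a character counts as Chinese if its code point lies in the range 4e00-9fff (hexadecimal), i.e. the CJK Unified Ideographs block, U+4E00 to U+9FFF.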
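As a rough illustration of the default range, here is a minimal sketch of how the check might look. The names `is_chinese` and `chinese_runs` are hypothetical and not taken from the project's source, and treating every out-of-range character as a separator between runs is an assumption about how punctuation is excluded.

```rust
/// Returns true if `c` lies in the default range U+4E00..=U+9FFF.
/// (Hypothetical helper; the tool's real function name is not known.)
fn is_chinese(c: char) -> bool {
    ('\u{4e00}'..='\u{9fff}').contains(&c)
}

/// Splits `text` into runs of consecutive in-range characters.
/// Assumption: anything outside the range (punctuation, Latin letters,
/// digits, whitespace) breaks adjacency and is dropped.
fn chinese_runs(text: &str) -> Vec<Vec<char>> {
    text.split(|c: char| !is_chinese(c))
        .filter(|s| !s.is_empty())
        .map(|s| s.chars().collect())
        .collect()
}
```

For example, under these assumptions `chinese_runs("你好，世界abc")` yields two runs, `['你', '好']` and `['世', '界']`, because the full-width comma (U+FF0C) and the Latin letters fall outside the range.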

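How it works:

Each run makes two passes over the corpus. Suppose we are counting n-character words:

Pass 1: count how many times every sequence of n adjacent Chinese characters occurs in the whole corpus.
Pass 2: group every (2n-1) adjacent Chinese characters into a window; each window contains n candidate words, and the window advances with a stride of n. Using the counts from pass 1, pick the candidate with the highest frequency in each window (the one most likely to be a real word).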
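The two passes described above could be sketched roughly as follows. This is an illustrative reconstruction, not the project's actual code: the names `Counts`, `count_ngrams` and `pick_words` are hypothetical, the input is a single in-memory run of Chinese characters rather than a streamed corpus, ties go to the leftmost candidate, and trailing characters that do not fill a complete window are ignored. The last two choices are assumptions the description does not settle.

```rust
use std::collections::HashMap;

/// Maps an n-character sequence to its count. (Hypothetical type alias.)
type Counts = HashMap<Vec<char>, u64>;

/// Pass 1: count every sequence of `n` adjacent characters in `run`.
fn count_ngrams(run: &[char], n: usize) -> Counts {
    assert!(n >= 1, "n must be at least 1");
    let mut counts = Counts::new();
    for gram in run.windows(n) {
        *counts.entry(gram.to_vec()).or_insert(0) += 1;
    }
    counts
}

/// Pass 2: slide a (2n-1)-character window with stride `n`; in each
/// window, credit the candidate with the highest pass-1 count.
fn pick_words(run: &[char], n: usize, first_pass: &Counts) -> Counts {
    assert!(n >= 1, "n must be at least 1");
    let mut words = Counts::new();
    // Assumption: only complete windows are considered.
    for window in run.windows(2 * n - 1).step_by(n) {
        // Start with the leftmost candidate; a later one replaces it
        // only if strictly more frequent (assumed tie-breaking rule).
        let mut best = &window[..n];
        let mut best_freq = first_pass.get(best).copied().unwrap_or(0);
        for cand in window.windows(n).skip(1) {
            let freq = first_pass.get(cand).copied().unwrap_or(0);
            if freq > best_freq {
                best = cand;
                best_freq = freq;
            }
        }
        *words.entry(best.to_vec()).or_insert(0) += 1;
    }
    words
}
```

As a small worked case with n = 2 on the run 我们是我们的: pass 1 counts 我们 twice and 们是, 是我, 们的 once each. Pass 2 then looks at the windows 我们是 and 是我们 (start offsets 0 and 2), and in both the winner is 我们, so the result credits 我们 with 2 and nothing else. The final 的 never fills a complete window under the assumption above.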

Changelog

v0.2.0 - 20250128

Optimization: improved speed and reduced size.

v0.1.0 - 20250128

Initial release!