Stroke information for non-Chinese characters? #3

Open
ShuGao0810 opened this issue Sep 19, 2018 · 2 comments

Comments

@ShuGao0810

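Hi, for a corpus that mixes Chinese and English, how should stroke information for non-Chinese characters be handled?
After word segmentation, some tokens contain English characters, such as “A股” and “CEO”. I simply set the strokes of non-Chinese characters to empty, i.e. in stroke.py I changed char2stroke[c] to char2stroke.get(c, ''). Do you have a better approach?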
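
The fallback described above can be sketched as follows. This is a minimal illustration, not the actual contents of stroke.py: the `word_to_strokes` helper and the two table entries are hypothetical, and only the `char2stroke.get(c, '')` expression comes from the question.

```python
# Hypothetical miniature of the char-to-stroke table. The real table in
# stroke.py is far larger; these codes are illustrative placeholders.
char2stroke = {"股": "3511", "大": "134"}


def word_to_strokes(word, table):
    """Concatenate the stroke codes of a word's characters.

    Characters missing from the table (Latin letters, digits, ...)
    contribute an empty string instead of raising KeyError.
    """
    return "".join(table.get(c, "") for c in word)


print(word_to_strokes("A股", char2stroke))   # only the Chinese character contributes
print(word_to_strokes("CEO", char2stroke))   # empty string: no strokes at all
```

With this fallback, a word made up entirely of non-Chinese characters ends up with an empty stroke sequence, so any code that builds stroke n-grams from it needs to tolerate an empty input.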

@qwfy
Contributor

qwfy commented Sep 19, 2018

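Non-Chinese characters have no strokes by definition. If you really need to include English, you could represent each letter with the five stroke types. That is not much work, since there are only 26 letters. I don't expect it to help much, though: such words are a minority and will probably be discarded as low-frequency words unless the corpus comes from a specialized domain, and strokes assigned this way carry no meaning.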
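
A rough sketch of the letter-to-stroke idea, under loud assumptions: the `letter2stroke` codes below are invented for illustration (the thread does not define any), the digits 1 to 5 stand for the five stroke classes, and `extend_table` is a hypothetical helper rather than anything in the repository.

```python
# Invented stroke codes for a few Latin letters. A full table would need
# all 26 letters, and any such assignment is arbitrary.
letter2stroke = {"A": "341", "C": "5", "E": "2111", "O": "55"}


def extend_table(char2stroke, letter2stroke):
    """Return a copy of the stroke table with Latin letters added.

    Upper and lower case share one code; entries already present in
    the original table are never overwritten.
    """
    merged = dict(char2stroke)
    for letter, code in letter2stroke.items():
        merged.setdefault(letter.upper(), code)
        merged.setdefault(letter.lower(), code)
    return merged


table = extend_table({"股": "3511"}, letter2stroke)
print("".join(table[c] for c in "CEO"))
print("".join(table[c] for c in "A股"))
```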

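Another idea is to normalize the corpus before training: after segmentation, translate the English words into Chinese, and apply the same translation when looking up word vectors at use time.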
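
The normalization idea could look roughly like this. It is only a sketch: the `EN2ZH` glossary, `normalize_token`, and `normalize_tokens` are hypothetical names, and a real glossary would have to be built for the corpus at hand.

```python
# Hypothetical glossary mapping lower-cased English tokens to Chinese.
EN2ZH = {"ceo": "首席执行官", "gdp": "国内生产总值"}


def normalize_token(token, glossary=EN2ZH):
    """Translate a token if the glossary knows it, else keep it as-is."""
    return glossary.get(token.lower(), token)


def normalize_tokens(tokens, glossary=EN2ZH):
    """Apply the same normalization to every token of a segmented sentence."""
    return [normalize_token(t, glossary) for t in tokens]


# Before training: normalize the segmented corpus.
print(normalize_tokens(["公司", "CEO", "宣布"]))

# At lookup time: normalize the query the same way before indexing
# into the trained vectors (a plain dict stands in for them here).
vectors = {"首席执行官": [0.1, 0.2]}
print(vectors.get(normalize_token("CEO")))
```

Using one function for both training and lookup keeps the two sides consistent, which is the point of the suggestion.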

@qwfy
Contributor

qwfy commented Sep 19, 2018

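If you plan to run a variety of experiments, I suggest measuring the speed of the code first and improving the slow parts. It will save a lot of time.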
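
One way to measure where the time goes is Python's standard `cProfile` module. The sketch below is an assumption about how one might do it, not something from the repository: `profile_call` and `build_stroke_ngrams` are hypothetical, and the latter is only a stand-in workload.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def build_stroke_ngrams(strokes, n_min=3, n_max=5):
    """Stand-in workload: collect every stroke n-gram of a sequence."""
    return [
        strokes[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(strokes) - n + 1)
    ]


ngrams, report = profile_call(build_stroke_ngrams, "12345" * 200)
print(len(ngrams))
print(report)
```

The report lists call counts and times per function, which is usually enough to decide which part to optimize first.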
