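Hi, for a corpus that mixes Chinese and English, how should stroke information be handled for non-Chinese characters? After word segmentation, some words contain English characters, such as “A股” and “CEO”. I simply set the strokes of non-Chinese characters to empty, that is, I changed char2stroke[c] in stroke.py to char2stroke.get(c, ''). Do you have a better approach?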
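For readers following along, here is a minimal sketch of the fallback described in the question. Only the expression char2stroke.get(c, '') comes from the post. The table contents, the placeholder stroke code, and the helper name word_to_strokes are assumptions for illustration, and the real stroke.py may be organized differently.

```python
# Hypothetical stand-in for the char2stroke table in stroke.py.
# "12345" is a placeholder code, not real stroke data for the character.
char2stroke = {"股": "12345"}


def word_to_strokes(word, table):
    """Concatenate the stroke codes of each character in a word.

    Characters missing from the table (Latin letters, digits, ...) contribute
    an empty string instead of raising KeyError.
    """
    return "".join(table.get(c, "") for c in word)


print(word_to_strokes("A股", char2stroke))        # only the strokes of 股 remain
print(repr(word_to_strokes("CEO", char2stroke)))  # empty string
```

Note that a purely non-Chinese word such as “CEO” ends up with an empty stroke sequence, so no stroke n-grams can be extracted from it. Whatever consumes this sequence probably needs to handle that case explicitly.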
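By definition, non-Chinese characters have no strokes. If you really need to include English, you could represent each letter using the five stroke types. That is not much work, since there are only twenty-six letters. I doubt it would help much, though: such words are a minority and would probably be discarded as low-frequency words unless the corpus comes from a specialized domain, and strokes assigned this way carry no meaning.
Another idea is to normalize the corpus before training: after word segmentation, translate the English into Chinese, and apply the same translation when looking up word vectors at use time.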
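The normalization idea could look roughly like the sketch below. The glossary EN_TO_ZH, its single entry, and the function normalize_token are hypothetical and not part of the repository. A real setup would need a much larger, domain-specific glossary.

```python
import re

# Hypothetical English-to-Chinese glossary; extend it for your own corpus.
EN_TO_ZH = {"CEO": "首席执行官"}

# Matches a run of ASCII letters inside a segmented token.
LATIN_RUN = re.compile(r"[A-Za-z]+")


def normalize_token(token, glossary=EN_TO_ZH):
    """Replace each run of Latin letters with its Chinese translation.

    Runs that are not in the glossary are left unchanged.
    """
    return LATIN_RUN.sub(
        lambda m: glossary.get(m.group(0).upper(), m.group(0)), token
    )


print(normalize_token("CEO"))  # translated via the glossary
print(normalize_token("A股"))  # "A" is not in the glossary, so unchanged
```

The important part is to call the same function in both places: once on every token of the training corpus, and again on each query word before looking up its vector.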
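If you plan to run a lot of experiments, I suggest measuring the speed of the code first and improving the slow parts. It will save a lot of time.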
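One way to find the slow parts is the standard-library profiler. The sketch below is generic: profile_call and slow_step are hypothetical names, and slow_step is only a stand-in for whichever training or preprocessing function you want to measure.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


def slow_step(n):
    """Hypothetical placeholder for an expensive step in the pipeline."""
    return sum(i * i for i in range(n))


result, report = profile_call(slow_step, 100_000)
print(report)
```

Running this on a small sample of the corpus before a full experiment is usually enough to see which functions dominate the runtime.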