Is it possible to build a customized .gram? #1

Open
TsinamLeung opened this issue May 19, 2020 · 6 comments

@TsinamLeung

At present, there are zh-hant & zh-hans.
I'm wondering if I can add a zh-yue / zh-can grammar model.

@lotem
Owner

lotem commented May 19, 2020

Sorry, I can't help you. I don't have the resources to make a new model.

zh-hant and zh-hans are for Traditional Chinese and Simplified Chinese. They differ in the script, or writing system, and have nothing to do with the phonetic system, whereas "zh-yue" clearly refers to a Chinese dialect.
If you want a model customized for a dialect's colloquial writing instead of the common written form, you'll first need a large amount of training data: textual materials that reflect the colloquial form.

@laubonghaudoi

We have the training data. What's next?

@tanxpyox

tanxpyox commented May 22, 2020

^ Yeah, we have a group of linguists who are currently compiling our curated Cantonese texts into a corpus (also with the help of universities and interest groups). We would like to know how exactly we can compile these texts into a .gram file usable by librime-octagram ourselves, or at least to which specification we must organise our data so that such a compilation is possible.

@lotem
Owner

lotem commented May 23, 2020

Next, you are on your own.
I no longer have the training pipeline that I used to create the model a long time ago.
The training data I used was several gigabytes of Chinese text scraped from the web. The Simplified and Traditional language models were built from the same training data, with just-in-time OpenCC conversion on the input. Then the algorithms were manually fine-tuned against what came out as the training result. It was a huge project for an amateur. The existing implementation is a very fragile balance between the algorithms and the language model data, so I never planned for it to be a repeatable process or a general pipeline for any given training data.
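
For illustration, the just-in-time conversion step might look roughly like the sketch below. It is a minimal example assuming OpenCC's C++ API (`opencc::SimpleConverter`) with its stock `s2t.json` Simplified-to-Traditional config; the corpus file name is a hypothetical placeholder, and the real pipeline's tokenization and counting stages are omitted.

```cpp
// Derive a Traditional-script training stream from a Simplified corpus
// by converting each line on the fly, so one source corpus can feed
// both the zh-hans and zh-hant models.
#include <opencc/opencc.h>

#include <fstream>
#include <iostream>
#include <string>

int main() {
  // "s2t.json" is OpenCC's stock Simplified-to-Traditional config.
  opencc::SimpleConverter to_traditional("s2t.json");
  std::ifstream corpus("corpus_zh_hans.txt");  // hypothetical input file
  std::string line;
  while (std::getline(corpus, line)) {
    std::cout << to_traditional.Convert(line) << '\n';
  }
  return 0;
}
```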

Most probably, the text compiled by your group of linguists isn't the same thing in terms of quantity. You may design your own language model based on the corpus you have.

The place to plug in a language model is to subclass rime::Grammar.
I've only built one, so the interface may not be general enough, but it serves as a starting point.
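
For a concrete starting point, a subclass might look roughly like the sketch below. It assumes rime::Grammar exposes a single virtual `Query(context, word, is_rear)` method returning a score that gets added to candidate weights, which is the interface librime-octagram implements; `CantoneseGrammar` and its `LookUp` helper are hypothetical placeholders.

```cpp
// Hypothetical Grammar subclass backed by a Cantonese corpus model.
#include <rime/common.h>
#include <rime/gear/grammar.h>
#include <rime/ticket.h>

namespace rime {

class CantoneseGrammar : public Grammar {
 public:
  explicit CantoneseGrammar(const Ticket& ticket) {
    // Load your corpus-derived model here, e.g. from a file path
    // configured in the schema. (Assumed constructor shape.)
  }

  // Score `word` as a continuation of `context`; is_rear marks the
  // sentence-final position. The returned value is added to the
  // candidate's weight, so higher means more likely.
  double Query(const string& context,
               const string& word,
               bool is_rear) override {
    return LookUp(context, word);  // hypothetical helper
  }

 private:
  double LookUp(const string& context, const string& word) {
    // Placeholder: consult the loaded model, e.g. an n-gram table.
    return 0.0;
  }
};

// Register the class as the "grammar" component in your plugin's
// module initializer, the same way librime-octagram registers its
// own grammar.

}  // namespace rime
```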

@tanxpyox

> I no longer have the training pipeline that I used to create the model a long time ago.

urgh, we'll sort it out ourselves then, thx.

@faywong

faywong commented Dec 18, 2024

@tanxpyox
Hi, there's a powerful and up-to-date Chinese grammar model available, along with a how-to doc about training the model.

You can refer to it when building your own training pipeline.
