Is it possible to build a customized .gram? #1

Open
TsinamLeung opened this issue May 19, 2020 · 6 comments

@TsinamLeung

At present, there are zh-hant & zh-hans.
I'm wondering if I can add a zh-yue / zh-can grammar model.

@lotem
Owner

lotem commented May 19, 2020

Sorry, I can't help you. I don't have the resources to make a new model.

zh-hant and zh-hans are for Traditional Chinese and Simplified Chinese. They differ in the script, or writing system, and have nothing to do with the phonetic system, whereas "zh-yue" clearly refers to a Chinese dialect.
If you want a model customized for a dialect's colloquial writing instead of the common written form, you'll first need a large amount of training data: textual materials that reflect the colloquial form.

@laubonghaudoi

We have the training data. What's next?

@tanxpyox

tanxpyox commented May 22, 2020

^ Yeah, we have a group of linguists who are currently compiling our curated Cantonese texts into a corpus (also with the help of universities and interest groups). We would like to know how exactly we can compile these texts into a .gram file usable by librime-octagram ourselves, or at least to which specification we must organise our data so that such a compilation is possible.

@lotem
Owner

lotem commented May 23, 2020

Next, you are on your own.
I no longer have the training pipeline that I used to create the model a long time ago.
The training data I used was several gigabytes of Chinese text scraped from the web. The Simplified and Traditional language models were built from the same training data, with just-in-time OpenCC conversion on the input. Then the algorithms were manually fine-tuned against what came out as the training result. It was a huge project for an amateur. The existing implementation is a very fragile balance between the algorithms and the language model data, so I never planned for it to be a repeatable process or a general pipeline for any given training data.
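
For illustration, the just-in-time conversion step might look roughly like the sketch below. It is a minimal example assuming OpenCC's C++ API (`opencc::SimpleConverter`) with its stock `s2t.json` Simplified-to-Traditional config; the corpus file name is a hypothetical placeholder, and the real pipeline's tokenization and counting stages are omitted.

```cpp
// Derive a Traditional-script training stream from a Simplified corpus
// by converting each line on the fly, so one source corpus can feed
// both the zh-hans and zh-hant models.
#include <opencc/opencc.h>

#include <fstream>
#include <iostream>
#include <string>

int main() {
  // "s2t.json" is OpenCC's stock Simplified-to-Traditional config.
  opencc::SimpleConverter to_traditional("s2t.json");
  std::ifstream corpus("corpus_zh_hans.txt");  // hypothetical input file
  std::string line;
  while (std::getline(corpus, line)) {
    std::cout << to_traditional.Convert(line) << '\n';
  }
  return 0;
}
```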

Most probably, the text compiled by your group of linguists isn't the same thing in terms of quantity. You may design your own language model based on the corpus you have.

The place to plug in a language model is to subclass rime::Grammar.
I've only built one, so the interface may not be general enough, but it serves as a starting point.
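
For a concrete starting point, a subclass might look roughly like the sketch below. It assumes rime::Grammar exposes a single virtual `Query(context, word, is_rear)` method returning a score that gets added to candidate weights, which is the interface librime-octagram implements; `CantoneseGrammar` and its `LookUp` helper are hypothetical placeholders.

```cpp
// Hypothetical Grammar subclass backed by a Cantonese corpus model.
#include <rime/common.h>
#include <rime/gear/grammar.h>
#include <rime/ticket.h>

namespace rime {

class CantoneseGrammar : public Grammar {
 public:
  explicit CantoneseGrammar(const Ticket& ticket) {
    // Load your corpus-derived model here, e.g. from a file path
    // configured in the schema. (Assumed constructor shape.)
  }

  // Score `word` as a continuation of `context`; is_rear marks the
  // sentence-final position. The returned value is added to the
  // candidate's weight, so higher means more likely.
  double Query(const string& context,
               const string& word,
               bool is_rear) override {
    return LookUp(context, word);  // hypothetical helper
  }

 private:
  double LookUp(const string& context, const string& word) {
    // Placeholder: consult the loaded model, e.g. an n-gram table.
    return 0.0;
  }
};

// Register the class as the "grammar" component in your plugin's
// module initializer, the same way librime-octagram registers its
// own grammar.

}  // namespace rime
```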

@tanxpyox

> I no longer have the training pipeline that I used to create the model a long time ago.

urgh, we'll sort it out ourselves then, thx.

@faywong

faywong commented Dec 18, 2024

@tanxpyox
Hi, there's a powerful and up-to-date Chinese grammar model available, along with a how-to doc about training the model.

You can refer to it when building your own training pipeline.
