Add doc for mandarin lm. #298

Merged (2 commits) on Sep 21, 2017

@pkuyym (Contributor) commented Sep 19, 2017

fix #297

@pkuyym requested a review from kuke on September 19, 2017 14:04

@kuke (Collaborator) left a comment

Almost LGTM

* English punctuations and chinese punctuations are removed.
* Insert a whitespace character between two tokens.

Please notice that the released language model only contains chinese simplified characters. When preprocessing done we can begin to train the language model. The key training parameters are '-o 5 --prune 0 1 2 4 4'. Please refer above section for the meaning of each parameter. We also convert the arpa file to binary file using default settings.

Collaborator

chinese-->Chinese
When --> After
parameters/parameter --> arguments/argument

Contributor Author

Done.
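
For readers following this thread, the training step it describes can be sketched as below. The arguments `-o 5 --prune 0 1 2 4 4` match option names of KenLM's `lmplz` tool, but the thread itself does not name the toolkit, so that is an assumption here. The function name `build_lm_commands` and all file paths are hypothetical, and the commands are only assembled, not executed.

```python
from typing import List, Tuple


def build_lm_commands(
    corpus_path: str, arpa_path: str, binary_path: str
) -> Tuple[List[str], List[str]]:
    """Assemble (but do not run) the two commands for training and conversion.

    Assumption: the toolkit is KenLM, whose `lmplz` and `build_binary`
    executables are on PATH. The paths are illustrative placeholders.
    """
    train = [
        "lmplz",
        "-o", "5",  # n-gram order: a 5-gram model
        # One pruning threshold per order (1-gram .. 5-gram); in KenLM,
        # n-grams with count <= threshold are dropped, and 0 disables pruning.
        "--prune", "0", "1", "2", "4", "4",
        "--text", corpus_path,  # preprocessed corpus, one sentence per line
        "--arpa", arpa_path,    # ARPA-format output
    ]
    # Convert the ARPA file to a binary file using default settings.
    convert = ["build_binary", arpa_path, binary_path]
    return train, convert


train_cmd, convert_cmd = build_lm_commands("corpus.txt", "lm.arpa", "lm.binary")
print(" ".join(train_cmd))
print(" ".join(convert_cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` once the tools are installed; keeping command construction separate from execution makes the arguments easy to inspect first.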


* The beginning and trailing whitespace characters are removed.
* English punctuations and chinese punctuations are removed.
* Insert a whitespace character between two tokens.

Collaborator

"Insert a whitespace character between two tokens." --> "A whitespace character between two tokens is inserted.", for consistency.

Contributor Author

Done.
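
The three preprocessing steps quoted in this thread can be illustrated with a minimal sketch. It is not the code used for the released model: the function name `preprocess_line` is hypothetical, and treating every character whose Unicode category starts with `P` as punctuation is an assumption about what "English and Chinese punctuation" covers. Internal whitespace is also dropped so that the output has exactly one space between tokens.

```python
import unicodedata


def preprocess_line(line: str) -> str:
    """Turn one raw line into space-separated character tokens."""
    # Step 1: remove leading and trailing whitespace.
    text = line.strip()
    # Step 2: remove punctuation. Unicode categories beginning with "P"
    # include both ASCII marks (",", "!") and full-width or CJK marks.
    # Internal whitespace is skipped too, so step 3 controls all spacing.
    tokens = [
        ch
        for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    ]
    # Step 3: insert a single whitespace character between two tokens.
    return " ".join(tokens)


print(preprocess_line("  你好，世界！ "))
```

Symbols such as currency signs fall outside the `P` categories and would be kept by this sketch; a stricter filter that keeps only Chinese characters may be closer to the released model, which the doc says contains only simplified Chinese characters.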

TODO: any other requirements or tips to add?
#### Mandarin LM

Different from word-based language model, mandarin language model is character-based where each token is a chinese character. We use an internal corpus to train the released mandarin language model. This corpus contains billions of tokens. The preprocessing has small difference from english language model and all steps are:

Collaborator

Different from word-based language model-->Different from English language model
english-->English
chinese-->Chinese
small-->tiny
all steps are-->main steps include

Contributor Author

Done.

@kuke (Collaborator) left a comment

LGTM

@pkuyym merged commit 88edc4c into PaddlePaddle:develop on Sep 21, 2017