train_lm.py usage #65

Open
danpovey opened this issue Sep 3, 2016 · 5 comments

Comments

@danpovey
Owner

danpovey commented Sep 3, 2016

The usage message of train_lm.py (see below) does not agree with what the program actually does. It suggests that the output goes to lm_dir, but the output actually goes to a subdirectory.
I think you should rename lm_dir in the args to work_dir, and the usage message should explain where the actual lm_dir output will be located. There should also be an "epilog" in the usage message with an example usage, preferably a couple of them: one with a vocab and one with num-words specified.
Also, you are using the 'basename' of the wordlist as part of the name of the lm_dir. What if the wordlist has a suffix, like foo.txt? Then foo.txt will become part of that name, which doesn't seem ideal to me. Maybe strip any final suffix.

-------
usage: train_lm.py [-h] [--wordlist WORDLIST] [--num-words NUM_WORDS] [--num-splits NUM_SPLITS] [--warm-start-ratio WARM_START_RATIO]
                   [--min-counts MIN_COUNTS] [--limit-unk-history {true,false}] [--fold-dev-into FOLD_DEV_INTO]
                   [--bypass-metaparameter-optimization BYPASS_METAPARAMETER_OPTIMIZATION] [--verbose {true,false}] [--cleanup {true,false}]
                   [--keep-int-data {true,false}] [--max-memory MAX_MEMORY]
                   text_dir order lm_dir

This script trains an n-gram language model with <order> from <text-dir> and writes out the model to <lm-dir>. The output model dir is in pocolm-
format, user can call format_arpa_lm.py with <lm-dir> to get a ARPA-format model. Pruning a model could be achieve by call prune_lm_dir.py with
<lm-dir>.
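
To make the epilog request concrete, here is a minimal sketch of how it could look with argparse. It assumes the third positional argument has already been renamed to work_dir as proposed; the help strings and the example paths (data/text, data/work, foo.txt) are placeholders, not what the script currently uses, and only two of the options are shown.

```python
import argparse


def build_parser():
    # Sketch only: argument names follow the proposal in this issue,
    # and the example paths in the epilog are hypothetical.
    parser = argparse.ArgumentParser(
        prog="train_lm.py",
        description="Train an n-gram language model of the given order "
                    "from text_dir, using work_dir for intermediate files.",
        epilog=(
            "Example usages:\n"
            "  train_lm.py --wordlist=foo.txt data/text 3 data/work\n"
            "  train_lm.py --num-words=20000 data/text 3 data/work\n"
        ),
        # Keep the line breaks in the epilog instead of re-wrapping them.
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument("--wordlist", type=str, default=None,
                        help="File containing the vocabulary to use.")
    parser.add_argument("--num-words", type=int, default=None,
                        help="Number of words to keep in the vocabulary.")
    parser.add_argument("text_dir", help="Directory containing training text.")
    parser.add_argument("order", type=int, help="Order of the n-gram model.")
    parser.add_argument("work_dir", help="Directory for intermediate files.")
    return parser
```

Calling `build_parser().print_help()` prints the two examples after the argument list, one per line.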
@danpovey
Owner Author

danpovey commented Sep 3, 2016

Also, "cleanuped" should be "cleaned up".

@danpovey
Owner Author

danpovey commented Sep 3, 2016

How about adding a final optional 4th argument called lm_dir (the 3rd argument being 'work_dir'), so the user can specify where they want the final LM to be written? This will make life easier for callers, as they won't have to figure out where pocolm would put their stuff.
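
Roughly what I have in mind, as a sketch: argparse supports an optional trailing positional via nargs="?". The fallback location used when lm_dir is omitted (a "lm" subdirectory of work_dir) is only a placeholder here; the real default would be whatever pocolm derives from the order and wordlist.

```python
import argparse
import os


def parse_dirs(argv):
    parser = argparse.ArgumentParser(prog="train_lm.py")
    parser.add_argument("text_dir")
    parser.add_argument("order", type=int)
    parser.add_argument("work_dir")
    # Optional 4th positional argument: where the final LM should be written.
    parser.add_argument("lm_dir", nargs="?", default=None)
    args = parser.parse_args(argv)
    if args.lm_dir is None:
        # Placeholder default (assumption): the actual script would build
        # this name itself, e.g. from the order and the wordlist.
        args.lm_dir = os.path.join(args.work_dir, "lm")
    return args
```

With this, `parse_dirs(["data/text", "3", "data/work", "data/lm_final"])` uses the directory the caller gave, and leaving off the last argument falls back to the default under work_dir.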

@wantee
Contributor

wantee commented Sep 4, 2016

OK, I will add the work_dir and lm_dir arguments.
Regarding the wordlist name, I know it is not ideal. But I think we should not remove the final suffix if it is meaningful. For example, if we have 3 different wordlists named 'vocab.1', 'vocab.2' and 'vocab.3', they can't be distinguished once we remove the suffix.
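
A small illustration of the collision, using only os.path from the standard library. The helper names and the data/ paths are made up for this example and are not taken from train_lm.py.

```python
import os


def name_keep_suffix(wordlist):
    # Use the basename of the wordlist as-is.
    return os.path.basename(wordlist)


def name_strip_suffix(wordlist):
    # Drop the final suffix, i.e. everything from the last dot onward.
    return os.path.splitext(os.path.basename(wordlist))[0]


wordlists = ["data/foo.txt", "data/vocab.1", "data/vocab.2", "data/vocab.3"]
kept = [name_keep_suffix(w) for w in wordlists]
stripped = [name_strip_suffix(w) for w in wordlists]
print(kept)      # ['foo.txt', 'vocab.1', 'vocab.2', 'vocab.3']
print(stripped)  # ['foo', 'vocab', 'vocab', 'vocab']
```

Stripping gives a cleaner name for foo.txt, but the three vocab files all map to the same name.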

@danpovey
Owner Author

danpovey commented Sep 4, 2016

OK, don't remove the suffix then. If people want control, they can add the lm_dir argument.
Also, you won't have to create a subdirectory 'work' once you add the work_dir argument.
Dan


@wantee
Contributor

wantee commented Sep 4, 2016

Yes, of course.
