
Max memory #41

Open

danpovey opened this issue Aug 6, 2016 · 0 comments
danpovey commented Aug 6, 2016

We need a way to limit how much memory the toolkit will use.
The only operation that uses a lot of memory is 'sort', and I believe this is only called in two places: when generating counts, and in ARPA generation.
In both cases, we can control the memory usage by passing the --buffer-size=X option to 'sort', e.g. --buffer-size=10G.
The tricky thing is that we'd like to be able to pass a --max-memory=X option to the top-level scripts, such as train_lm.py, and have them just do the right thing, bearing in mind that some of the scripts may invoke 'sort' multiple times in parallel in some instances [so the memory requirement needs to be divided appropriately, e.g. changing 100G to 25G]. You can treat any letter at the end of the specification as an arbitrary string. Please don't assume there will always be a letter, since a plain numeric argument should be treated as a number of bytes.
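A minimal sketch of the division step, assuming a hypothetical helper named `divide_memory_spec` (not part of the toolkit): it splits the spec into a leading integer and an opaque trailing suffix, divides the number by the parallelism factor, and re-attaches the suffix unchanged, so '100G' with 4 parallel sorts becomes '25G' and a bare byte count stays a bare byte count.

```python
import re


def divide_memory_spec(spec, num_jobs):
    """Divide a sort-style memory spec (e.g. '100G' or '1048576') by
    num_jobs, preserving any trailing suffix as an arbitrary string.

    A bare number is treated as a byte count; any trailing letters (or
    '%') are kept verbatim for 'sort' to interpret.  Hypothetical
    helper, not an existing toolkit function.
    """
    m = re.match(r'^([0-9]+)([a-zA-Z%]*)$', spec)
    if m is None:
        raise ValueError('bad --max-memory specification: ' + spec)
    amount, suffix = int(m.group(1)), m.group(2)
    # Integer division, clamped so we never hand 'sort' a zero buffer.
    return str(max(1, amount // num_jobs)) + suffix


# e.g. divide_memory_spec('100G', 4) -> '25G'
#      divide_memory_spec('1000000', 4) -> '250000'  (bytes stay bytes)
```

Whether the divisor is the number of parallel 'sort' invocations, the number of datasets under --dump-counts-parallel, or their product would depend on the calling script.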

Please note that the --dump-counts-parallel option changes the number you need to divide the memory specification by, because it causes the different datasets to be processed in parallel.

@keli78, I think this would be something suitable for you to do. Let's have @wantee do the initial phases of review.
