We need a way to limit how much memory the toolkit will use.
The only operation that uses a lot of memory is 'sort', and I believe this is only called in two places: when generating counts, and in ARPA generation.
In both cases, the way we can control it is by using the --buffer-size=X option to 'sort', e.g. --buffer-size=10G.
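For instance, a wrapper in the scripts could thread the value through to 'sort' along these lines (just a sketch; the function name and arguments are made up, not from the toolkit):

```python
import subprocess

def run_sort(input_file, output_file, buffer_size="10G"):
    # Hand the memory cap straight to GNU sort via --buffer-size.
    with open(output_file, "w") as out:
        subprocess.run(["sort", "--buffer-size=" + buffer_size, input_file],
                       stdout=out, check=True)
```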
The tricky thing here is we'd like to be able to pass a --max-memory=X option to the top-level scripts, such as train_lm.py, and have them just do the right thing, bearing in mind that some of the scripts may invoke 'sort' multiple times in parallel in some instances [so the memory requirement needs to be divided appropriately, e.g. changing 100G to 25G]. When parsing the value, you can just treat any trailing letter as an arbitrary string rather than trying to interpret the unit; but please don't assume there will always be a letter, since a plain numeric argument should be treated as a number of bytes. See the sketch below.
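Here's a rough sketch of the division logic I have in mind (the function name is illustrative, assuming we split the spec into a number plus an optional trailing suffix):

```python
import re

def divide_memory_spec(spec, divisor):
    """Divide a sort-style memory spec (e.g. '100G' or '1000000')
    by `divisor`, passing any trailing suffix through untouched."""
    m = re.match(r"^(\d+)(\D*)$", spec)
    if m is None:
        raise ValueError("invalid --max-memory value: " + spec)
    number, suffix = int(m.group(1)), m.group(2)
    # Integer division is fine for this purpose; a bare number means
    # bytes, and any suffix (G, M, %, ...) is kept as an opaque string.
    return str(number // divisor) + suffix
```

So e.g. `divide_memory_spec("100G", 4)` gives `"25G"`.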
Please note that the --dump-counts-parallel option will change the number you need to divide the memory specification by, because it causes the different datasets to be processed in parallel.
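The divisor would then be worked out from the amount of parallelism at the call site, maybe something like this (purely illustrative; the real bookkeeping depends on how the scripts spawn their sorts):

```python
def sort_memory_divisor(sorts_per_dataset, num_datasets, dump_counts_parallel):
    # With --dump-counts-parallel the datasets' sorts run concurrently,
    # so each process gets a smaller share of --max-memory.
    n = sorts_per_dataset * (num_datasets if dump_counts_parallel else 1)
    return max(n, 1)
```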
@keli78, I think this would be something suitable for you to do. Let's have @wantee do the initial phases of review.