We need a way to limit how much memory the toolkit will use.
The only operation that uses a lot of memory is 'sort', and I believe this is only called in two places: when generating counts, and in ARPA generation.
In both cases, the way we can control it is by using the --buffer-size=X option to 'sort', e.g. --buffer-size=10G.
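For instance, a wrapper in the scripts could thread the value through to 'sort' along these lines (just a sketch; the function name and arguments are made up, not from the toolkit):

```python
import subprocess

def run_sort(input_file, output_file, buffer_size="10G"):
    # Hand the memory cap straight to GNU sort via --buffer-size.
    with open(output_file, "w") as out:
        subprocess.run(["sort", "--buffer-size=" + buffer_size, input_file],
                       stdout=out, check=True)
```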
The tricky thing here is we'd like to be able to pass a --max-memory=X option to the top-level scripts, such as train_lm.py, and have them just do the right thing, bearing in mind that some of the scripts may invoke 'sort' multiple times in parallel in some instances [so the memory requirement needs to be divided appropriately, e.g. changing 100G to 25G]. When parsing the value, you can just treat any trailing letter as an arbitrary string rather than trying to interpret the unit; but please don't assume there will always be a letter, since a plain numeric argument should be treated as a number of bytes. See the sketch below.
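Here's a rough sketch of the division logic I have in mind (the function name is illustrative, assuming we split the spec into a number plus an optional trailing suffix):

```python
import re

def divide_memory_spec(spec, divisor):
    """Divide a sort-style memory spec (e.g. '100G' or '1000000')
    by `divisor`, passing any trailing suffix through untouched."""
    m = re.match(r"^(\d+)(\D*)$", spec)
    if m is None:
        raise ValueError("invalid --max-memory value: " + spec)
    number, suffix = int(m.group(1)), m.group(2)
    # Integer division is fine for this purpose; a bare number means
    # bytes, and any suffix (G, M, %, ...) is kept as an opaque string.
    return str(number // divisor) + suffix
```

So e.g. `divide_memory_spec("100G", 4)` gives `"25G"`.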
Please note that the --dump-counts-parallel option will change the number you need to divide the memory specification by, because it causes the different datasets to be processed in parallel.
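The divisor would then be worked out from the amount of parallelism at the call site, maybe something like this (purely illustrative; the real bookkeeping depends on how the scripts spawn their sorts):

```python
def sort_memory_divisor(sorts_per_dataset, num_datasets, dump_counts_parallel):
    # With --dump-counts-parallel the datasets' sorts run concurrently,
    # so each process gets a smaller share of --max-memory.
    n = sorts_per_dataset * (num_datasets if dump_counts_parallel else 1)
    return max(n, 1)
```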
@keli78, I think this would be something suitable for you to do. Let's have @wantee do the initial phases of review.