Model loading is very memory hungry #9
Comments
I've been unhappy with that for a long time. There are mainly two reasons why I haven't changed the model format so far: …
Of course, dealing with 2 is just a matter of making the tagger recognize the format and handle the model file appropriately. It's just that it hasn't been a top priority for me.
Sure. It has only become an issue for me because I'm working with a project that wants to expose a web service based on your tagger on a platform that uses Kubernetes. I need to apply memory limits to the pod definitions, but for this service I have to make the pod request 4GB even though it only needs 1.7GB after the startup phase. For this particular use case I've developed a workaround where I transform the model into a gzipped pickle file, which is quite a bit larger than the original gzipped JSON but loads faster and with virtually no additional memory overhead. However, it occurred to me today that it's actually possible to implement a more efficient streaming load of the current model format using …
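A minimal sketch of that kind of one-off conversion, assuming the model is a single gzipped JSON document; the file names are invented, and the actual workaround may pickle the fully initialised tagger state rather than the raw parsed JSON:

```python
import gzip
import json
import pickle

json_model = "italian.model"              # gzipped JSON model as shipped (invented name)
pickle_model = "italian.model.pickle.gz"  # converted model (invented name)

# Pay the JSON parsing cost once, at conversion time ...
with gzip.open(json_model, "rt", encoding="utf-8") as f:
    model = json.load(f)

# ... and write the parsed structure back out as a gzipped pickle, which can
# later be loaded in one step with pickle.load(), without the intermediate
# copies that JSON parsing produces.
with gzip.open(pickle_model, "wb") as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)
```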
Ah, the …
PR submitted - I've made it use the optimised algorithm on CPython 3.6+ or (any) Python 3.7+, which are the ones where dict iteration order is guaranteed, and fall back to the original algorithm on earlier versions.
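The version gate described above could look roughly like this; the function and loader names are invented, and the PR's actual check may differ:

```python
import platform
import sys

def dict_order_is_guaranteed() -> bool:
    """True if dicts preserve insertion order on this interpreter.

    Insertion order is part of the language specification from Python 3.7
    onwards and an implementation detail of CPython 3.6.
    """
    if sys.version_info >= (3, 7):
        return True
    return (platform.python_implementation() == "CPython"
            and sys.version_info >= (3, 6))

# load_optimised / load_original stand in for the two code paths:
# loader = load_optimised if dict_order_is_guaranteed() else load_original
```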
Thank you! I've updated the README and created a new release.
Taking the spoken Italian model as an example, the process of loading the model into memory (`ASPTagger.load`) causes memory usage of the Python process to briefly rise to nearly 4GB. Once the model is loaded, memory usage drops to a more reasonable 1.7GB and remains there in the steady state.

The format used to store models on disk is gzip-compressed JSON, with the weight numbers stored as base85-encoded strings. This format is rather inefficient to load, since we must, among other things, copy the `vocabulary` list to turn it into a set.

If the feature name/weight pairs were instead serialized together (either as a `{"feature": "base85-weight", ...}` object or as a transposed list-of-2-element-lists), then it would be possible to parse the model file in a single pass in a streaming fashion, eliminating the need to make multiple copies of potentially very large arrays in memory.
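For illustration, one way the proposed `{"feature": "base85-weight", ...}` layout could be consumed in a single pass is with an incremental JSON parser such as the third-party `ijson` package. This is only a sketch: it assumes the pairs live under a top-level "weights" key and that each weight is a single base85-encoded IEEE-754 double, and it is not necessarily how the change was eventually implemented:

```python
import base64
import gzip
import struct

import ijson  # third-party incremental JSON parser (pip install ijson)

weights = {}
with gzip.open("model.json.gz", "rb") as f:  # invented file name
    # kvitems() yields one (feature, value) pair at a time from the object at
    # the given prefix, so the base85 strings never accumulate in memory.
    for feature, b85 in ijson.kvitems(f, "weights"):
        # Assumption: each weight is one little-endian double encoded with
        # base64.b85encode; the real model might pack a per-class vector instead.
        weights[feature] = struct.unpack("<d", base64.b85decode(b85))[0]
```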