Model loading is very memory hungry #9

Closed · ianroberts opened this issue Mar 16, 2021 · 5 comments · Fixed by #10

@ianroberts (Contributor)

Taking the spoken Italian model as an example, the process of loading the model into memory (ASPTagger.load) causes memory usage of the Python process to rise briefly to nearly 4GB. Once the model is loaded, memory usage drops to a more reasonable 1.7GB and remains there in the steady state.

The format used to store models on disk is gzip-compressed JSON, with the weight values stored as base85-encoded strings. This format is rather inefficient to load (see the sketch after this list), since we must

  • load the entire JSON array-of-arrays into memory
  • duplicate the vocabulary list to turn it into a set
  • zip together the parallel lists of feature names and weights, and for each entry base-85 decode the weight and add the pair to a dict
  • then throw away the original lists of vocabulary, features and weights
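
(Editorial sketch for illustration: roughly what that load path looks like, assuming a three-array layout and weights stored as base85-encoded packed doubles; the function and variable names are hypothetical and the real encoding may differ.)

```python
import base64
import gzip
import json
import struct


def load_model_current(path):
    # Step 1: the entire JSON array-of-arrays is parsed into memory at once.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        vocab_list, feature_names, encoded_weights = json.load(f)

    # Step 2: the vocabulary list is duplicated to turn it into a set.
    vocabulary = set(vocab_list)

    # Step 3: the parallel lists are zipped; each weight is base85-decoded
    # and the (feature, weight) pair is added to a dict.
    weights = {}
    for feature, encoded in zip(feature_names, encoded_weights):
        weights[feature] = struct.unpack("<d", base64.b85decode(encoded))[0]

    # Step 4: only now can the original lists be garbage-collected, so peak
    # memory briefly holds the parsed lists, the decoded pairs and the dict.
    return vocabulary, weights
```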

If the feature name/weight pairs were instead serialized together (either as a {"feature":"base85-weight",...} object or as a transposed list-of-2-element-lists) then it would be possible to parse the model file in a single pass in a streaming fashion, eliminating the need to make multiple copies of potentially very large arrays in memory.

@tsproisl (Owner)

I've been unhappy with that for a long time. There are two main reasons why I haven't changed the model format so far:

  1. While it is annoying, it does not seem to be a problem in practice as most people seem to have enough RAM (or do not complain if they haven't).
  2. I don't want to render any existing models useless.

Of course, dealing with 2 is just a matter of making the tagger recognize the format and handle the model file appropriately. It's just that it hasn't been a top priority for me.
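
(As an illustration of that point, a hedged sketch of one way the tagger could recognize the format, assuming the current format's top level is a JSON array and a newer one would be a JSON object; the helper name is hypothetical, and an explicit version field in the new format would work just as well.)

```python
import gzip


def detect_model_format(path):
    # Peek at the first significant character of the decompressed stream:
    # '[' would be the current parallel-list layout, '{' a newer object layout.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        char = f.read(1)
        while char.isspace():
            char = f.read(1)
    return "legacy" if char == "[" else "new"
```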

@ianroberts (Contributor, Author)

Sure. It has only become an issue for me because I'm working on a project that wants to expose a web service based on your tagger on a platform that uses Kubernetes. I need to apply memory limits to the pod definitions, but for this service I have to make the pod request 4GB even though it only needs 1.7GB after the startup phase.

For this particular use case I've developed a workaround where I transform the model into a gzipped pickle file, which is quite a bit larger than the original gzipped JSON but loads faster and with virtually no additional memory overhead. However, it occurred to me today that it's actually possible to implement a more efficient streaming load of the current model format using ijson; I can submit a PR for this if you like?
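
(A minimal illustrative sketch of such an ijson-based streaming load, under the same assumptions as the earlier sketch: a top-level array of parallel arrays [vocabulary, feature names, weights] and base85-encoded packed doubles; names are hypothetical and the actual PR may differ.)

```python
import base64
import gzip
import struct

import ijson


def load_model_streaming(path):
    vocabulary = set()
    feature_names = []
    weights = {}
    weight_index = 0
    with gzip.open(path, "rb") as f:
        sub_array = -1  # which top-level sub-array the parser is inside
        for prefix, event, value in ijson.parse(f):
            if prefix == "item" and event == "start_array":
                sub_array += 1
            elif prefix == "item.item":
                if sub_array == 0:
                    # vocabulary entries go straight into the set
                    vocabulary.add(value)
                elif sub_array == 1:
                    # feature names still have to be buffered once ...
                    feature_names.append(value)
                else:
                    # ... but each weight is decoded and paired up as it
                    # streams past, so the full JSON tree never exists at once
                    weights[feature_names[weight_index]] = struct.unpack(
                        "<d", base64.b85decode(value))[0]
                    weight_index += 1
    return vocabulary, weights
```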

@tsproisl (Owner)

Ah, the ijson solution is nice! A PR would be most welcome. The only thing that needs to be taken into account is that this will produce garbage on Python versions <3.7 that have ijson installed. I see two possible solutions: Either always fall back to the standard parser for these older versions or use a collections.OrderedDict instead of a dict if the version is <3.7.

@ianroberts (Contributor, Author)

ianroberts commented Mar 18, 2021

PR submitted. I've made it use the optimised algorithm on CPython 3.6+ or (any) Python 3.7+, which are the versions where dict insertion order is preserved, and fall back to the original algorithm on earlier versions.
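
(For context, an illustrative check for that condition; the helper name is hypothetical. Insertion order of plain dicts is guaranteed by the language from Python 3.7 onwards and is a reliable implementation detail of CPython 3.6.)

```python
import platform
import sys


def dicts_keep_insertion_order():
    # Guaranteed by the language specification from Python 3.7 onwards;
    # a reliable implementation detail (not a guarantee) on CPython 3.6.
    if sys.version_info >= (3, 7):
        return True
    return (sys.version_info >= (3, 6)
            and platform.python_implementation() == "CPython")
```

The loader could then branch on this flag to pick either the streaming or the original algorithm.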

@tsproisl (Owner)

Thank you! I've updated the README and created a new release.
