Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: memmap support when loading models #225

Open
ogrisel opened this issue Nov 29, 2022 · 6 comments
Open

Feature request: memmap support when loading models #225

ogrisel opened this issue Nov 29, 2022 · 6 comments
Labels
persistence Secure persistence feature

Comments

@ogrisel
Copy link

ogrisel commented Nov 29, 2022

Make it possible to unzip and memmap the numpy arrays in a .skops file to make it possible to load models with their large parameter arrays in shared memory between several Python processes concurrently on a given host.

This would make it possible to:

  • reduce memory needs when handling concurrent requests for a server API running on several Python workers for instance using gunicorn or uwsgi,
  • reduce memory usage when batch processing a large number of prediction tasks on a cluster system with several Python workers per host, for instance with dask, ray, pyspark and co.
  • make it faster to load a model by benefiting from the warm starting effect of a previously memmaped model for instance in a serverless environment, assuming several concurrent "function calls" can share a common working directory on a given host.
@ogrisel ogrisel changed the title Feature request: Feature request: memmap support when loading models Nov 29, 2022
@adrinjalali adrinjalali added the persistence Secure persistence feature label Nov 30, 2022
@adrinjalali
Copy link
Member

Yep, that sounds reasonable. I haven't done this though. Could you please give me a pointer? Maybe the place where you do this in joblib?

@ogrisel
Copy link
Author

ogrisel commented Nov 30, 2022

In joblib this is too complex because we have all the numpy buffers in the main pickle stream. In skops this is simpler, it would be a matter of:

@BenjaminBossan
Copy link
Collaborator

Very good suggestion, thanks.

I have no experience with memmapping, but if someone else takes a stab at this feature, feel free to ignore my questions :)

  1. How would we test that it actually works? I imagine that's not trivial.
  2. AFAICT, the option doesn't exist for scipy sparse matrices, or is there something?
  3. The link points to something unrelated (not a perma link), I guess it was meant to highlight: https://github.com/skops-dev/skops/blob/7242265eb7744b8895f2f0c7cb62f59d0a656eb6/skops/io/_numpy.py#LL79C23-L79C30

@adrinjalali
Copy link
Member

@ogrisel I have been thinking about this, and there's an issue because of which I've been putting this aside.

When it comes to loading from multiple processes, each process would try to unzip the file, we'd need to check if it's already done and use that, which we might be able to figure out a way to do so, but not straightforward I think.

When we unzip the files onto disk, they're gonna stay there and we never would clean them up. In the long run on non-linux machines this is kinda bad (at list on Windows I don't think temp files are ever cleaned up). This is the main issue I'm not sure how to solve.

@ogrisel
Copy link
Author

ogrisel commented Jan 24, 2023

It would be possible to register those folders for automated garbage collection with the loky ressourcetracker.

The standard library also has a ressource tracker in the concurrent futures module but if i remember correctly it does not come with a clean up function for folders.

@ogrisel
Copy link
Author

ogrisel commented Mar 14, 2023

The class I had in mind is https://github.com/joblib/loky/blob/047d80623b7cf2d43500ed56ee0320b3e41c3f81/loky/backend/resource_tracker.py. However this is not considered public API, so maybe it's a bad idea to rely on this for skops.

It also exists in the multiprocessing.resource_tracer module of the standard library, but we would need to register new cleanup function for file and folder resource types which are not part of the default _CLEANUP_FUNCS dict in multiprocessing. Unfortunately, this dict is private and apparently there is no public API to add new functions to that dict as far as I know. So again, this might be risky to rely on this from a maintenance point of view.

Maybe the safest solution is to not try to be too clever and let the user handle the clean-up explicitly by themselves depending on their application lifecycle.

For the original unzipping, maybe the skops could write a special sentinel emtpy file name ready_to_load once the unzipping process has completed. If a concurrent process is told to unzip and load a file to a folder that already exists but does not have the ready sentinel flag, then skops could either:

  • wait and retry later, until the ready_to_load sentinel file appears,
  • or load the .skops file in memory, without attempting to memmap and issue a warning to the user telling them that memmapping failed because the target folder wasn't ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
persistence Secure persistence feature
Projects
None yet
Development

No branches or pull requests

3 participants