Add a guide for multi-model endpoints #986

Closed
deliahu opened this issue Apr 29, 2020 · 2 comments · Fixed by #1081
Labels: docs (Improvements or additions to documentation), enhancement (New feature or request), good first issue (Good for newcomers)
Comments

deliahu (Member) commented Apr 29, 2020

Description

Multi-model endpoints are possible using the Python Predictor, but we don't yet have an example of how to do this.

#619 tracks adding support for a model cache, so that all of the models would not need to fit in memory at the same time.

deliahu added the enhancement and docs labels on Apr 29, 2020
nicmer commented May 4, 2020

I am just wondering whether this is already possible and "only" requires documentation, or whether this issue is proposing the enhancement itself. It would be a really great feature. I was thinking about whether it is possible to implement a multi-model endpoint using lru_cache or something in that direction, but I would certainly be happier if there is already a solution.
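A rough sketch of that lru_cache direction (just an illustration with a placeholder loader and a hypothetical get_model() helper, not an existing Cortex API):

```python
from functools import lru_cache
import pickle

@lru_cache(maxsize=8)  # keep at most 8 models in memory; least recently used is evicted
def get_model(model_path: str):
    # placeholder loader: real code would download the model from S3 and use the
    # framework-specific loader (torch.load, tf.keras.models.load_model, ...)
    with open(model_path, "rb") as f:
        return pickle.load(f)
```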

deliahu (Member, Author) commented May 5, 2020

@nicmer your understanding is correct: this is currently possible, but we have not yet added a guide or a built-in solution for it. We (and some of our users) have successfully deployed multi-model endpoints without any caching, i.e. by simply loading multiple models in __init__() and then selecting the model based on the request body. However, we have not explored what using a cache would look like.
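For illustration, a minimal sketch of that cache-less approach (the framework, model files, and config keys below are made up for the example; this is not code from the Cortex repo):

```python
import torch  # assuming PyTorch models; the same pattern works for any framework

class PythonPredictor:
    def __init__(self, config):
        # load every model up front; they all have to fit in memory at the same time
        # config["models"] is assumed to map model names to local paths, e.g.
        # {"sentiment": "/mnt/model/sentiment.pt", "summarizer": "/mnt/model/summarizer.pt"}
        self.models = {
            name: torch.load(path, map_location="cpu")
            for name, path in config["models"].items()
        }

    def predict(self, payload):
        # the request body picks the model, e.g. {"model_name": "sentiment", "input": [...]}
        model = self.models[payload["model_name"]]
        with torch.no_grad():
            return model(torch.tensor(payload["input"])).tolist()
```

The obvious limitation is that every model has to fit in memory together, which is what a cache would relax.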

I envisioned that we would try it out and add a guide with some sample code as a first step. Then we would have a better sense of how to build it into the product. I'm thinking it might be a library we ship that can be imported and used in the Predictor implementation. For example, in __init__() you could do something like:

self.model_cache = cortex.cache.init("path/to/s3/prefix", max_models=100, cache_to_disk=True, preload=True, ttl=timedelta(hours=24))

And then in predict() you could do

model = self.model_cache.get(model_path)

Another option could be to fold this into the API configuration. I think we'll have a better sense of the best way to build it in once we have a working example.
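To make the shape of that more concrete, here is a minimal sketch of what such a cache might do internally (cortex.cache does not exist yet; only max_models with LRU eviction is shown, and the loader is a placeholder):

```python
import pickle
from collections import OrderedDict

class ModelCache:
    """Keeps at most max_models loaded; evicts the least recently used one."""

    def __init__(self, max_models: int = 100):
        self.max_models = max_models
        self._models = OrderedDict()  # model_path -> loaded model, in LRU order

    def get(self, model_path: str):
        if model_path in self._models:
            self._models.move_to_end(model_path)  # mark as most recently used
            return self._models[model_path]
        model = self._load(model_path)
        self._models[model_path] = model
        if len(self._models) > self.max_models:
            self._models.popitem(last=False)  # evict the least recently used model
        return model

    def _load(self, model_path: str):
        # placeholder loader; real code would download from S3 and call the
        # framework-specific loader (torch.load, tf.saved_model.load, ...)
        with open(model_path, "rb") as f:
            return pickle.load(f)
```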

One of the questions we have not yet researched is how easy it is to unload models from memory. Does it depend on the model framework? Is there anything special we'd have to do on GPU, and is it generalizable, or would we have to rely on the user to provide an unload() implementation?
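For what it's worth, with PyTorch specifically (just one framework, so this may not generalize), unloading seems to come down to dropping every reference and releasing the allocator's cached GPU memory:

```python
import gc
import torch

# assumption: a PyTorch model that was moved to the GPU (stand-in below)
model = torch.nn.Linear(1024, 1024).cuda()

# unloading: drop every reference to the model, collect, then ask PyTorch to
# release its cached GPU blocks back to the driver
del model
gc.collect()
torch.cuda.empty_cache()
```

Other frameworks manage device memory differently, which is part of why a user-provided unload() hook might end up being necessary.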

Let us know if you make any progress on this; we'd love to hear any ideas or tips you have, and to take a look at anything that you think would be useful!
