
Serve a collection of custom models based on LRU #619

Closed
vishalbollu opened this issue Nov 29, 2019 · 1 comment · Fixed by #1428

vishalbollu (Contributor) commented Nov 29, 2019

Description

Add support for serving many different models, where each model handles a subset of the possible inputs (e.g. one model per city). Because each model covers only a subset of input queries, some models may be queried more often than others. Serve the top N most frequently queried models, loading and unloading models based on LRU.

Here are the different use cases that could be handled:

  • Thousands of models or just a few
  • All models fit into memory, or they don't
  • A static list of models, or a dynamic one (e.g. point to an S3 prefix, so that new models can be added after the API is deployed)

Implementation

cron:

  1. update tree
  2. for each model in memory, unload it if it is not in the tree
  3. for each model in memory that tracks the latest version: if a newer version exists and (its timestamp is newer than the oldest timestamp currently in the cache, or the cache has free space), download it and load it into memory (a sketch of this pass follows these steps)
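A minimal sketch of what this cron pass could look like. The tree, cache, and download helpers below are hypothetical names used purely for illustration, not existing Cortex code:

```python
def cron_pass(tree, cache):
    # 1. update the tree from the model_dir S3 prefix
    tree.refresh()

    # 2. unload any in-memory model that no longer appears in the tree
    for model_id in list(cache.model_ids()):
        if model_id not in tree:
            cache.unload(model_id)

    # 3. for models that track "latest": if a newer version exists and either the
    #    cache has free space or the new version is newer than the oldest cached
    #    entry, download it and load it into memory
    for model_id in cache.latest_tracked():
        newest = tree.newest_version(model_id)
        if newest.timestamp > cache.loaded_timestamp(model_id):
            if cache.has_space() or newest.timestamp > cache.oldest_timestamp():
                disk_path = newest.download()    # pull the new version to disk
                cache.load(model_id, disk_path)  # evicts LRU entries if needed
```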

request (a sketch of this path follows the list):

  • if not in tree:
    option 1: error
    option 2: if in S3: update tree; else error
  • if not on disk: download model
  • if not in memory: load into memory
    • if cache is too big, evict based on LRU
  • predict()
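A rough sketch of the request path with an in-memory LRU built on collections.OrderedDict. The tree lookup and the download_if_missing / load_into_memory helpers are assumptions for illustration; only the eviction logic is the point here:

```python
from collections import OrderedDict
from threading import Lock


class LRUModelCache:
    """Illustrative in-memory LRU cache keyed by (model_name, version)."""

    def __init__(self, max_models):
        self.max_models = max_models
        self.models = OrderedDict()
        self.lock = Lock()

    def get(self, key, load_fn):
        with self.lock:
            if key in self.models:
                self.models.move_to_end(key)  # mark as most recently used
                return self.models[key]
            model = load_fn()  # download to disk + load into memory if missing
            self.models[key] = model
            if len(self.models) > self.max_models:
                self.models.popitem(last=False)  # evict the least recently used
            return model


def handle_request(cache, tree, payload, model_name, version="latest"):
    if (model_name, version) not in tree:
        # option 1 from above; option 2 would re-check S3 and update the tree
        raise KeyError(f"{model_name}:{version} not found")
    load = lambda: load_into_memory(download_if_missing(model_name, version))
    model = cache.get((model_name, version), load)
    return model.predict(payload)
```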

python:

  • user defines load_model(self, disk_path):
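For example, a user-side implementation of that hook could be as simple as the following (the predictor class name and the pickle-based model are assumptions for illustration):

```python
import pickle


class PythonPredictor:
    def load_model(self, disk_path):
        # called whenever the cache needs to (re)load this model from disk
        with open(disk_path, "rb") as f:
            return pickle.load(f)
```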

Open Questions

  • Where should the Python cache helper live?
  • How do we unload models from memory? Is anything special needed for GPU?
  • Should models be pre-downloaded and/or pre-loaded during init()?

Config Questions

  • Should the cron interval be configurable?
  • Should the default be a finite cache size, or infinite (i.e. no eviction)?
  • model_dir would be a configurable field holding an S3 path or local path that points to a large pool of models; all required models are pulled from there. The name of a model (its unique identifier) is the name of a directory within the model_dir path, and multiple versions of that model can live inside that directory.
  • The model disk cache size can be >= the in-memory model cache size. A disk_model_cache_size field should exist in the Cortex config, and all cached models must fit on disk/in memory. There should also be a model_cache_size field controlling the number of models that can fit in memory at any point in time (a sketch of these fields follows this list).
  • It should be possible to point to a dynamic list (model_dir) or to a static list of models (models). The static list won't have the model-updating mechanism, and thus no version selection is possible when making predictions.
  • How should we handle OOM issues when making predictions? When a prediction is made, some memory has to be allocated for tensors, and this could exceed the available system memory (RAM or VRAM).
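A hedged sketch of how these fields might appear in an API configuration, using the field names proposed in this issue; the exact spec and the surrounding structure are still undecided here:

```yaml
# hypothetical API configuration illustrating the fields discussed above
- name: city-predictor
  predictor:
    type: python
    path: predictor.py
    model_dir: s3://my-bucket/models/  # pool of models; each subdirectory is one model, containing its versions
    model_cache_size: 50               # max models held in memory at once (LRU-evicted)
    disk_model_cache_size: 200         # max models kept on disk (>= model_cache_size)
```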

Notes

  • LRU memory cache and disk cache
  • volumes are not shared across replicas
  • threads_per_process > 1 is supported for TensorFlow and Python
  • processes_per_replica > 1 is not supported for Python; it may be supported for TensorFlow (if easy)
  • When serving, the requester may request the latest version of a given model or a specific version of it (e.g. v1). If no version is specified, default to the latest.
  • latest has its own timestamp, separate from each version's. When evicting from the cache, the latest timestamp will be associated with the latest model (even though there may not be a timestamp directly associated with that model version).

Additional Context

vishalbollu added the enhancement (New feature or request) and blocked (Blocked on another task or external event) labels, and removed the blocked label, on Nov 29, 2019
RobertLucian (Member) commented May 15, 2020

Grabbing this one. Part of the trick in making this work will be reloading and unloading the model configs for the TensorFlow Predictor on the fly and reliably. I reckon things will be simpler for the ONNX and Python Predictors. This one goes hand in hand with #890.
