Description
Add support for serving many different models, where each model handles a subset of the possible inputs (e.g. city-based models). Because each model is designed for only a subset of input queries, certain models may be queried more often than others. Serve the top N most queried models, loading and unloading models based on LRU.
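As a rough illustration of the eviction policy (not the actual implementation), the in-memory side could be a small LRU cache keyed by model name; load_model and unload_model here are hypothetical callbacks:

```python
from collections import OrderedDict

class ModelLRUCache:
    """Minimal sketch: keep at most `capacity` models in memory, evicting the least recently used."""

    def __init__(self, capacity, load_model, unload_model):
        self.capacity = capacity
        self.load_model = load_model      # hypothetical callback: model name -> model object
        self.unload_model = unload_model  # hypothetical callback: model object -> None
        self.models = OrderedDict()       # model name -> model, ordered by recency of use

    def get(self, name):
        if name in self.models:
            self.models.move_to_end(name)  # mark as most recently used
        else:
            if len(self.models) >= self.capacity:
                _, evicted = self.models.popitem(last=False)  # drop the least recently used model
                self.unload_model(evicted)
            self.models[name] = self.load_model(name)
        return self.models[name]
```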
Here are the different use cases that could be handled:
Thousands of models or just a few
All models fit into memory or not
A static list of models, or a dynamic one (e.g. point to an S3 prefix, so that new models can be added after the API is deployed)
Implementation
cron:
update tree
for each model in memory, unload it if it is no longer in the tree
for each model in memory which has a latest timestamp: if there is a new version && (the timestamp on latest is newer than the oldest timestamp currently in the cache, or the cache has space): download it and load it in memory
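A hedged sketch of that cron pass, assuming hypothetical helpers (update_tree, download) and cache methods (none of these names are the real API):

```python
def cron_pass(cache, s3_client):
    # Sketch only -- update_tree, download, and the cache methods are hypothetical.
    tree = update_tree(s3_client)  # refresh the view of models/versions available in S3

    # Unload any in-memory model that no longer exists upstream.
    for name in list(cache.in_memory()):
        if name not in tree:
            cache.unload(name)

    # Refresh models that track "latest" when a newer version appears.
    for name in cache.tracking_latest():
        latest = tree[name].latest
        if latest.timestamp <= cache.timestamp(name):
            continue  # no new version upstream
        if cache.has_space() or latest.timestamp > cache.oldest_timestamp():
            path = download(s3_client, name, latest.version)
            cache.load(name, latest.version, path)
```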
request:
if not in tree:
option 1: error
option 2: if in S3: update tree; else error
if not on disk: download model
if not in memory: load into memory
if cache is too big, evict based on LRU
predict()
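The request path above could look roughly like the following (a sketch under the assumption of tree/cache helper objects that track what is in S3, on disk, and in memory; the names are illustrative):

```python
class ModelNotFoundError(Exception):
    pass

def handle_request(name, version, payload, tree, disk_cache, mem_cache):
    # Sketch of the per-request flow; the tree/disk_cache/mem_cache helpers are hypothetical.
    if name not in tree:
        # option 2 from above: re-check S3 before erroring out
        tree.refresh_from_s3()
        if name not in tree:
            raise ModelNotFoundError(name)

    if not disk_cache.contains(name, version):
        disk_cache.download(name, version)  # pull the model artifacts onto disk

    if mem_cache.contains(name, version):
        model = mem_cache.get(name, version)  # also bumps LRU recency
    else:
        model = mem_cache.load(name, version, disk_cache.path(name, version))

    mem_cache.evict_lru_if_over_capacity()
    disk_cache.evict_lru_if_over_capacity()

    return model.predict(payload)
```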
python:
user defines load_model(self, disk_path):
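For the Python predictor, the user-defined hook might look something like this (a sketch only: the exact signature and the cache-aware get_model helper are assumptions, and the pickle-based loading is just an example):

```python
import os
import pickle

class PythonPredictor:
    def __init__(self, config):
        self.config = config

    def load_model(self, disk_path):
        # Called whenever a model needs to be brought into memory from its
        # on-disk location; how the artifacts are deserialized is up to the user.
        with open(os.path.join(disk_path, "model.pkl"), "rb") as f:
            return pickle.load(f)

    def predict(self, payload):
        # get_model is a hypothetical cache-aware helper that would trigger
        # download/load/eviction as described in the request flow above.
        model = self.get_model(payload["model_name"], payload.get("model_version", "latest"))
        return model.predict(payload["input"])
```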
Open Questions
Where to put the Python cache helper?
How to unload models from memory? Anything special needed for GPU? (a sketch follows at the end of this section)
pre-download and/or pre-load during init()?
config questions
Should cron interval be configurable?
Should the default have a cache size, or be infinite (i.e. no eviction)?
model_dir would be a configurable field holding an S3 path or local path that points to a big pool of models; all required models are pulled in from there. The name of the model (or its unique identifier) corresponds to a directory within the given model_dir path, within which multiple versions of that model can be found (see the example layout below).
The model disk cache size can be >= the in-memory model cache size. A disk_model_cache_size field should exist in the Cortex config, and all cached models must fit on disk / in memory. There should also be a model_cache_size field that controls the number of models that can be held in memory at any point in time.
It should be possible to point to the dynamic list (model_dir) or to a static list of models (models). The static list won't have the updating mechanism, so no version selection is possible when making predictions.
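For example, a model_dir pool could be laid out like this (the bucket and model names are purely illustrative), with one directory per model and one subdirectory per version:

```
s3://my-bucket/models/          # model_dir (hypothetical path)
├── paris/
│   ├── 1/                      # version 1 artifacts
│   └── 2/                      # version 2 artifacts
└── tokyo/
    └── 1/
```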
How should we handle OOM issues when making predictions? When a prediction is made, some memory has to be allocated for tensors, and this could exceed the available system memory (RAM or VRAM).
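On the unloading question above: for CPU-only models, dropping the last Python reference and letting the garbage collector run is usually enough; for GPU-backed models, the framework's allocator may also need to be told to release VRAM. A minimal sketch, assuming a PyTorch-backed model (the framework choice is an assumption):

```python
import gc

def unload_model(cache, name):
    # Sketch: drop the cached reference and reclaim memory. `cache` is assumed
    # to be a plain dict mapping model names to loaded model objects.
    model = cache.pop(name, None)
    del model
    gc.collect()

    # GPU-specific (assuming PyTorch): return cached allocator blocks to the
    # driver so the VRAM becomes available for other models.
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass
```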
Notes
LRU memory cache and disk cache
volumes are not shared across replicas
threads_per_process > 1 is supported for TensorFlow and Python
processes_per_replica > 1 is not supported on Python, maybe supported on TensorFlow (if easy)
When serving, the requester may decide to use the latest version of a given model or a specific version of it (e.g. v1). If no version is specified, default to the latest.
latest has its own timestamp, separate from each version's. When evicting from the cache, the latest timestamp is associated with whichever model version is currently the latest (even though that version itself may not have a timestamp).
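A sketch of how that version selection could work, with latest keeping its own timestamp (the data shapes and helper names here are assumptions):

```python
import time

# `tree` is assumed to map model name -> list of available version strings,
# and latest_timestamps tracks when "latest" was last resolved per model,
# separately from any per-version timestamps.
latest_timestamps = {}

def resolve_version(tree, name, requested="latest"):
    versions = tree[name]  # e.g. ["1", "2", "5"]
    if requested == "latest":
        latest_timestamps[name] = time.time()  # "latest" carries its own timestamp
        return max(versions, key=int)          # highest numeric version wins
    if requested not in versions:
        raise KeyError(f"model {name} has no version {requested}")
    return requested
```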
Grabbing this one. Part of the trick in making this work will be in reloading and unloading the model configs for the TensorFlow Predictor on-the-fly and reliably. I reckon things will be simpler for the ONNX and Python Predictors. This one goes hand-in-hand with #890.