deploy Tensorflow serving or docker #813
@Arnold1 Given a TensorFlow model, Cortex will serve your model using TensorFlow Serving and make it available as a web service. The documentation for how to deploy TensorFlow models with Cortex can be found here. You can configure Cortex to run on your desired instance type, including spot instances, by specifying your own cluster configuration (see the sketch below). Cortex will automatically spin up more replicas to meet the needs of traffic to your web service. Cortex uses Kubernetes (AWS EKS) under the hood for workload orchestration across multiple EC2 instances, and it can scale up the number of instances in your cluster when necessary. Feel free to provide more information so that we can explore your use case further.
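For reference, here is a rough sketch of what a cluster configuration might look like; the field names below are assumptions based on Cortex documentation from around this era and may differ in your version, so treat this as illustrative only:

```yaml
# cluster.yaml (illustrative sketch; field names are assumptions and may
# differ between Cortex versions -- check the cluster configuration docs)
cluster_name: cortex
region: us-west-2
instance_type: m5.large   # or a GPU instance type
min_instances: 1
max_instances: 5
spot: true                # run workloads on EC2 spot instances
```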
@vishalbollu Thanks for the quick reply. It looks like I have to use Python for TensorFlow Serving. Is there a way to use the docker container instead, and a way to load the model? If not, when will it be available?
@Arnold1 When using the TensorFlow runtime, you can specify the S3 path to your exported TensorFlow model in the `model` field of your API configuration:

```yaml
- name: iris-classifier
  predictor:
    type: tensorflow
    path: predictor.py
    model: s3://cortex-examples/tensorflow/iris-classifier/nn
```

At runtime, there will be two containers (per replica): one is a TensorFlow Serving container which downloads the model from the path you specified and serves it; the other is a Python container which receives the prediction request, does any necessary transformations, and sends the transformed sample to the TensorFlow Serving container by calling the TensorFlow client's `predict()` method. For example:

```python
labels = ["setosa", "versicolor", "virginica"]

class TensorFlowPredictor:
    def __init__(self, tensorflow_client, config):
        self.client = tensorflow_client

    def predict(self, payload):
        prediction = self.client.predict(payload)
        predicted_class_id = int(prediction["class_ids"][0])
        return labels[predicted_class_id]
```

The full example can be seen here. Does this address your question?
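Once the API is deployed, a prediction request can be sent to its endpoint. The sketch below is illustrative: the endpoint URL is a placeholder, and the payload field names (taken from the iris feature names) are assumptions about what this particular example expects:

```python
import requests

# Placeholder endpoint; Cortex reports the real URL after deployment
# (e.g. via `cortex get iris-classifier`, or the equivalent for your version).
endpoint = "https://<load-balancer-endpoint>/iris-classifier"

# Assumed payload shape for the iris example (field names are assumptions).
sample = {
    "sepal_length": 5.2,
    "sepal_width": 3.6,
    "petal_length": 1.4,
    "petal_width": 0.3,
}

response = requests.post(endpoint, json=sample)
print(response.text)  # expected to be one of the labels, e.g. "setosa"
```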
@deliahu OK. My question is: how can I avoid the use of Python? I don't want to use a container with Python.
@Arnold1 It is currently not possible to bypass the Python container; however, the actual model inference is performed by the official TensorFlow Serving docker container. The Python container is just a wrapper, so that pre- and post-inference processing can be handled easily. If you don't require any request processing, you can have a simple implementation of `predict()` (see the sketch below). Does that make sense? Is there a certain use case you are trying to implement which doesn't work well with this design?
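For illustration, a minimal pass-through predictor (following the same `TensorFlowPredictor` interface shown above) could look like this:

```python
class TensorFlowPredictor:
    def __init__(self, tensorflow_client, config):
        # keep a handle to the TensorFlow Serving client provided by Cortex
        self.client = tensorflow_client

    def predict(self, payload):
        # no pre/post processing: forward the payload straight to TF Serving
        return self.client.predict(payload)
```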
@deliahu Yeah, understood. So the Python container is some sort of proxy, right? The TF Serving and Python docker containers come as a pair, so scaling up to handle more traffic requires both to scale up!?
Did you measure how much overhead the forwarding to TF Serving adds? What's the max QPS (queries per second) you have tested this at? Are there any plans to re-write the Python container in C++/Go?
@Arnold1 Your diagram is correct :) Also, yes, a replica is the scalable unit, and a replica contains both the Python and TensorFlow Serving containers, so both containers scale up together. This approach makes autoscaling a bit simpler and easier to configure (we're hoping to release v0.14 soon, which updates our autoscaling to be request-based rather than CPU-based). In addition, keeping both containers on the same pod ensures that the request from the Python container to the TensorFlow Serving container always has low latency. If you are concerned about the TensorFlow Serving container being idle while the request is in the pre/post processing phase, once we release v0.14 (which we hope to do this week), you will be able to control the on-replica parallelism via the API configuration. If you would like to control the parallelism now (i.e. on v0.13), you can do so like this:

```yaml
- name: iris-classifier
  predictor:
    type: tensorflow
    path: predictor.py
    model: s3://cortex-examples/tensorflow/iris-classifier/nn
    config:
      waitress_threads: 4
```

I just did a quick check on the latency between the Python container and the TensorFlow Serving container. TensorFlow Serving doesn't seem to have a health check API (tensorflow/serving#671), so I called an actual API instead (a sketch of that kind of check is below).

The max queries per second will depend on how long a single inference takes and how parallelizable it is (e.g. if it is not CPU/GPU bound, running multiple threads and/or workers will increase the throughput). How long the inference takes depends on the pre/post processing, the model itself, and the resources allocated to the replica (e.g. if it runs on a GPU or CPU, what type of GPU, etc.). For GPU workloads, we have seen the best performance per dollar on certain instance types.

We currently do not have plans to re-write the Python container in C++/Go. Is your motivation for this mostly about reducing latency?
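As a rough sketch of that kind of latency check (not the exact measurement performed above): assuming TensorFlow Serving's REST API is reachable on localhost:8501 and the model is named iris-classifier (both assumptions), one could time repeated calls to the model metadata endpoint:

```python
import time
import requests

# Assumed endpoint: TF Serving's REST API exposes model metadata at
# /v1/models/<model_name>/metadata; the model name here is an assumption.
URL = "http://localhost:8501/v1/models/iris-classifier/metadata"

def measure_latency(n=100):
    requests.get(URL)  # warm up the connection so setup cost isn't counted
    start = time.perf_counter()
    for _ in range(n):
        requests.get(URL)
    elapsed = time.perf_counter() - start
    return (elapsed / n) * 1000  # average milliseconds per request

if __name__ == "__main__":
    print(f"average round-trip latency: {measure_latency():.2f} ms")
```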
Hi, thanks for the detailed reply. The motivation is mostly to optimize for throughput while keeping a reasonable latency. I guess a GPU will also help to optimize for throughput? What about using batching in TensorFlow Serving: won't the Python docker container limit the batching and be the limiting factor for achieving high throughput?
Yes, using a GPU will help with throughput, assuming your model is configured properly to utilize it (some of the TensorFlow APIs provide GPU support out of the box, like the pre-made estimators). If you are referring to TensorFlow Serving's server-side batching feature (code here), we have #152 to add support for this. Still, even without this feature, as long as you run the Python container with enough threads (i.e. `waitress_threads`, as shown above), it should not become the bottleneck for throughput.
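For reference, TensorFlow Serving's server-side batching is enabled by starting the model server with `--enable_batching` and pointing `--batching_parameters_file` at a text-protobuf file; the specific values below are illustrative assumptions, not tuned recommendations:

```
# batching_parameters.txt
# passed to tensorflow_model_server via:
#   --enable_batching --batching_parameters_file=/path/to/batching_parameters.txt
# values below are illustrative assumptions, not recommendations
max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
max_enqueued_batches { value: 100 }
num_batch_threads { value: 4 }
```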
Closing due to inactivity. Feel free to follow up here or on Gitter if you have any additional questions.
I don't see how multiple in-flight requests can be sent through the GPU concurrently using Python. Can you provide an example of that? In theory perhaps, but it certainly can't be done using TensorFlow or PyTorch.
@iborko With our TensorFlow predictor type, there are two containers running: one (we can call it "API server") receives prediction requests and does pre/post processing (this is where your `predict()` implementation runs), and the other is the TensorFlow Serving container, which performs the actual inference.

The API server can be configured to run with multiple threads and/or workers, allowing incoming requests to be processed concurrently (here is the documentation on that, which has changed since I explained how to configure it above). Therefore TF Serving requests can be sent concurrently. This is helpful if there are any preprocessing or postprocessing steps (which happen in the API server), especially if they involve network requests. So this is what I meant by "multiple in-flight requests can be sent through to the GPU concurrently".

Whether multiple requests can be processed by TF Serving concurrently is a different matter. My understanding is that unless server-side batching is enabled, requests will be processed sequentially. We have #152 to add support for this. This means that leveraging concurrency in the API server can have major benefits when there is significant pre/post processing (especially involving network requests), and smaller benefits if the prediction request goes straight through to TF Serving without pre/post processing.

With our Python predictor type, the discussion above pretty much applies the same. In this setup, there is only one container: the API server, which receives the request and calls your `predict()` implementation directly (see the sketch below for an illustration of concurrent in-flight requests).

Does that make sense? Let me know if you still have questions!
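To illustrate the concurrency point, here is a standalone sketch (not Cortex code; the endpoint URL, model name, and payload shape are assumptions). Threads that spend most of their time waiting on the network call to TF Serving can overlap, so several requests are in flight at once even though TF Serving itself may still process them sequentially:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Assumed TF Serving REST predict endpoint and payload shape (illustrative only).
URL = "http://localhost:8501/v1/models/iris-classifier:predict"
PAYLOAD = {"instances": [[5.1, 3.5, 1.4, 0.2]]}

def handle_request(_):
    # Each "API server" thread mostly waits on the network call to TF Serving,
    # so many of these can be in flight concurrently.
    return requests.post(URL, json=PAYLOAD).json()

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(handle_request, range(32)))
print(f"32 requests with 8 threads took {time.perf_counter() - start:.2f}s")
```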
Hello,
How can I deploy a TensorFlow Serving container or a docker container using Cortex? On an EC2 spot instance ... or somewhere else...
Is there a way to spin up more TensorFlow Serving containers in case one cannot process all the traffic?
Thanks