
deploy Tensorflow serving or docker #813

Closed
Arnold1 opened this issue Feb 14, 2020 · 12 comments
Labels: question (Further information is requested)

Comments

Arnold1 commented Feb 14, 2020

Hello,

How can I deploy a TensorFlow Serving container or a Docker container using Cortex? On an EC2 spot instance... or somewhere else...

Is there a way to spin up more TensorFlow Serving containers in case one cannot process all the traffic?

Thanks

Arnold1 added the question label on Feb 14, 2020
vishalbollu (Contributor) commented:

@Arnold1 Given a TensorFlow model, Cortex will serve your model using TensorFlow Serving and make it available as a web service. The documentation for how to deploy TensorFlow models with Cortex can be found here. You can configure Cortex to run on your desired instance type, including spot instances, by specifying your own cluster configuration. Cortex will automatically spin up more replicas to meet the needs of traffic to your web service. Cortex uses Kubernetes (AWS EKS) under the hood to orchestrate workloads across multiple EC2 instances, and it can scale up the number of instances in your cluster when necessary.
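
For reference, here is a minimal sketch of what a cluster configuration that enables spot instances might look like (the exact key names vary by Cortex version, so treat this as an illustration of the file's shape rather than the definitive schema):

# cluster.yaml (illustrative sketch; exact keys vary by Cortex version)
cluster_name: cortex
region: us-west-2
instance_type: m5.large     # the EC2 instance type your workloads run on
min_instances: 1
max_instances: 5
spot: true                  # request EC2 spot instances where possible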

Feel free to provide more information so that we can explore your use case further.

Arnold1 (Author) commented Feb 17, 2020

@vishalbollu thanks for the quick reply. It looks like I have to use Python for TensorFlow Serving. Is there a way to use the Docker container instead, and a way to load the model? If not, when will it be available?

deliahu (Member) commented Feb 17, 2020

@Arnold1 When using the TensorFlow runtime, you can specify the S3 path to your exported TensorFlow model (in the model key within the predictor configuration). Here is an example:

- name: iris-classifier
  predictor:
    type: tensorflow
    path: predictor.py
    model: s3://cortex-examples/tensorflow/iris-classifier/nn

At runtime, there will be two containers (per replica): one is a TensorFlow Serving container which downloads the model from the path you specified and serves it; the other is a Python container which receives the prediction request, does any necessary transformations, and sends the transformed sample to the TensorFlow Serving container by calling tensorflow_client.predict(). Here is an example:

labels = ["setosa", "versicolor", "virginica"]

class TensorFlowPredictor:
    def __init__(self, tensorflow_client, config):
        # tensorflow_client is the client that talks to the TensorFlow Serving container
        self.client = tensorflow_client

    def predict(self, payload):
        # send the sample to the TensorFlow Serving container for inference
        prediction = self.client.predict(payload)
        # post-processing: map the predicted class id to its label
        predicted_class_id = int(prediction["class_ids"][0])
        return labels[predicted_class_id]

The full example can be seen here.

Does this address your question?

Arnold1 (Author) commented Feb 17, 2020

@deliahu OK, my question is: how can I avoid the use of Python? I don't want to use a container with Python.

deliahu (Member) commented Feb 17, 2020

@Arnold1 It is currently not possible to bypass the Python container; however, the actual model inference is performed on the official TensorFlow Serving Docker container (here is the Cortex Dockerfile for the serving container for CPUs and for GPUs). In the code above, self.client.predict(payload) sends the request to the TensorFlow Serving container running on the same Kubernetes pod. Since the Python and TensorFlow Serving containers are on the same Kubernetes pod, the latency of the network request between the two is very small (they are always guaranteed to be on the same instance).

The Python container is just a wrapper, so that pre- and post-inference processing can be handled easily. If you don't require any request processing, you can have a simple implementation for predictor.py like the iris one above (even simpler since that one has a little bit of post-processing).
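
For example, a pass-through predictor.py with no pre/post processing could look roughly like this (a minimal sketch following the TensorFlowPredictor interface shown above):

class TensorFlowPredictor:
    def __init__(self, tensorflow_client, config):
        # keep a handle to the client that talks to the TF Serving container
        self.client = tensorflow_client

    def predict(self, payload):
        # forward the payload to TensorFlow Serving unchanged and return its response
        return self.client.predict(payload)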

Does that make sense? Is there a certain use case you are trying to implement which doesn't work well with this design?

Arnold1 (Author) commented Feb 18, 2020

@deliahu yeah, understood. So that Python container is some sort of proxy, right? The TF Serving and Python containers come as a pair, so scaling up to handle more traffic requires both to scale up!?

client -> Python container (pre-processing) -> TF Serving container (model from S3) -> Python container (post-processing) -> client

Did you measure roughly how much overhead forwarding to TF Serving adds? What's the max QPS (queries per second) you have tested this at?

Are there any plans to rewrite the Python container in C++/Go?

deliahu (Member) commented Feb 18, 2020

@Arnold1 your diagram is correct :) Also, yes, a replica is the scalable unit, and a replica contains both the Python and TensorFlow Serving containers, so both containers scale up together. This approach makes autoscaling a bit simpler and easier to configure (we're hoping to release v0.14 soon, which updates our autoscaling to be request-based rather than CPU-based). In addition, keeping both containers on the same pod ensures that the request from the Python container to the TensorFlow Serving container always has low latency.

If you are concerned about the TensorFlow Serving container being idle while the request is in the pre/post processing phase, once we release v0.14 (which we hope to do this week), you will be able to control the on-replica parallelism via the workers_per_replica and threads_per_worker configuration (each worker has its own process, and each worker runs threads_per_worker threads). I'll be writing up an explanation of these configuration parameters soon for the v0.14 release, but in the meantime you can see a short description of them here.
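
As a rough sketch of what that might look like once v0.14 is out (the field names are the ones described above, but where exactly they live in the API spec may differ, so treat this as an illustration only and check the release docs):

# illustrative sketch only: the placement of these fields is an assumption
- name: iris-classifier
  predictor:
    type: tensorflow
    path: predictor.py
    model: s3://cortex-examples/tensorflow/iris-classifier/nn
  autoscaling:
    workers_per_replica: 1   # number of worker processes per replica
    threads_per_worker: 4    # threads per worker process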

If you would like to control the parallelism now (i.e. on v0.13), you can do so like this:

- name: iris-classifier
  predictor:
    type: tensorflow
    path: predictor.py
    model: s3://cortex-examples/tensorflow/iris-classifier/nn
    config:
      waitress_threads: 4

I just did a quick check on the latency between the Python container and the TensorFlow Serving container. TensorFlow Serving doesn't seem to have a health check API (tensorflow/serving#671), so I called an actual API (GetModelStatus()), and the total round trip time was ~0.2 milliseconds. I am not sure how much of this time was spent in the TensorFlow Serving container versus in-transit.
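
For anyone curious, a measurement along those lines can be sketched roughly like this (not Cortex's internal code; it assumes the tensorflow-serving-api and grpcio packages, that TF Serving's gRPC endpoint is reachable at localhost:9000, and that the model is named iris-classifier):

import time

import grpc
from tensorflow_serving.apis import get_model_status_pb2, model_service_pb2_grpc

# open a channel to the TF Serving container (address and port are assumptions)
channel = grpc.insecure_channel("localhost:9000")
stub = model_service_pb2_grpc.ModelServiceStub(channel)

request = get_model_status_pb2.GetModelStatusRequest()
request.model_spec.name = "iris-classifier"  # model name is an assumption

# time a single GetModelStatus() round trip
start = time.time()
response = stub.GetModelStatus(request)
elapsed_ms = (time.time() - start) * 1000
print(f"round trip: {elapsed_ms:.2f} ms ({len(response.model_version_status)} version(s) reported)")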

The max queries per second will depend on how long a single inference takes and how parallelizable it is (e.g. if it is not CPU/GPU bound, running multiple threads and/or workers will increase the throughput). How long the inference takes depends on the pre/post processing, the model itself, and the resources allocated to the replica (e.g. whether it runs on a GPU or CPU, what type of GPU, etc). For GPU workloads, we have seen the best performance per dollar with the g4dn.xlarge instances (using spot instances saves a lot of money too, of course).

We currently do not have plans to rewrite the Python container in C++/Go. Is your motivation for this mostly about reducing latency?

Arnold1 (Author) commented Feb 24, 2020

Hi, thanks for the detailed reply. The motivation is mostly to optimize for throughput while keeping a reasonable latency. I guess a GPU will also help to optimize for throughput?

How is it when I use batching in TensorFlow Serving? Will the Python Docker container not limit the batching and be the limiting factor for achieving high throughput?

deliahu (Member) commented Feb 24, 2020

Yes, using a GPU will help with throughput, assuming your model is configured properly to utilize it (some of the TensorFlow APIs provide GPU support out of the box, like the pre-made estimators).

If you are referring to TensorFlow Serving's server-side batching feature (code here), we have #152 to add support for this. Still, even without this feature, as long as you run the Python container with enough threads (i.e. waitress_threads from above), the Python container will not be the limiting factor, since multiple in-flight requests can be sent through to the GPU concurrently.

deliahu (Member) commented Apr 15, 2020

Closing due to inactivity. Feel free to follow up here or on Gitter if you have any additional questions.

deliahu closed this as completed Apr 15, 2020
iborko commented May 5, 2020

> the Python container will not be the limiting factor, since multiple in-flight requests can be sent through to the GPU concurrently.

I don't see how multiple in-flight requests can be sent through to the GPU concurrently using Python. Can you provide an example of that? In theory perhaps, but it certainly can't be done using TensorFlow or PyTorch.

deliahu (Member) commented May 5, 2020

@iborko With our TensorFlow predictor type, there are two containers running: one (we can call it the "API server") receives prediction requests and does pre/post processing (this is where your predict() function runs), and we run TensorFlow Serving in a separate container to actually run the inference (the API server makes the TF Serving request when the user calls self.client.predict(payload)). The TF Serving container is the one that has access to the GPU since it runs the inference; the API server does not have a GPU.

The API server can be configured to run with multiple threads and/or workers, allowing incoming requests to be processed concurrently (here is the documentation on that, which has changed since I explained how to configure it above). Therefore TF Serving requests can be sent concurrently. This is helpful if there are any preprocessing or postprocessing steps (which happen in the API server), especially if they involve network requests. So this is what I meant by "multiple in-flight requests can be sent through to the GPU concurrently".

Whether multiple requests can be processed by TF Serving concurrently is a different matter. My understanding is that unless server-side batching is enabled, requests will be processed sequentially. We have #152 to add support for this.

This means that leveraging concurrency in the API server can have major benefits when there is significant pre/post processing (especially involving network requests), and smaller benefits if the prediction request goes straight through to TF Serving without pre/post processing.

With our Python predictor type, the discussion above applies in much the same way. In this setup, there is only one container: the API server, which receives the request and calls your predict() function (which runs the actual inference). In this case, concurrency can also be used to allow multiple predict() calls to run concurrently. As with the TensorFlow predictor, the benefit will depend on how much pre/post processing is done outside of the actual inference and whether this step is I/O bound.
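
For comparison, a minimal sketch of a Python predictor (the constructor signature mirrors the TensorFlowPredictor above; the TorchScript model, the "model_path" config key, and the payload's "input" key are illustrative assumptions, not part of Cortex's API):

import torch

class PythonPredictor:
    def __init__(self, config):
        # load a TorchScript model from a path supplied in the API config
        # ("model_path" is an illustrative key, not a Cortex-defined one)
        self.model = torch.jit.load(config["model_path"])
        self.model.eval()

    def predict(self, payload):
        # pre-processing, inference, and post-processing all run in this container
        tensor = torch.tensor(payload["input"], dtype=torch.float32)
        with torch.no_grad():
            output = self.model(tensor)
        return output.tolist()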

Does that make sense? Let me know if you still have questions!
