Submitting raw data via IPC #1
In general I think your assessment is correct: I/O can be a performance limiter for some models, and a primary way to fix this in many cases is to co-locate the pre-processing with the inference. Here are some variations we think about, and where we stand as far as current support:
I would suggest that you start with (1a) and see how much benefit that gets you. We are generally interested in improving TRTIS in this area and would welcome your experience and feedback as you experiment. If you think you could contribute something generally useful we would be very open to working with you on it; just be sure to include us in your plans early on so we can make sure we are all on the same page.

As for your question #3: yes, for experimenting it is probably fastest to hack up the gRPC service to pass a reference instead of the actual data (but keep the rest of the request/response message the same). infer.cc is where the data (raw_input) is read out of the request message, so you would need to change that to instead read from shared memory.
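For illustration only, here is a minimal C++ sketch of the receiving side of that idea, assuming the request carries a POSIX shared-memory name, an offset, and a byte size instead of the raw tensor bytes; the function and the encoding are hypothetical, not the actual infer.cc change:

```cpp
// Illustrative only, not actual TRTIS code: the request identifies a POSIX
// shared-memory region (e.g. created under /dev/shm by the client) instead of
// carrying the raw tensor bytes.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>
#include <string>

// Maps the named region and returns a pointer to the tensor bytes at 'offset'.
const uint8_t* AttachInput(const std::string& shm_name, size_t offset,
                           size_t byte_size) {
  int fd = shm_open(shm_name.c_str(), O_RDONLY, 0);
  if (fd == -1) return nullptr;
  void* base = mmap(nullptr, offset + byte_size, PROT_READ, MAP_SHARED, fd, 0);
  close(fd);  // the mapping stays valid after the descriptor is closed
  if (base == MAP_FAILED) return nullptr;
  return static_cast<const uint8_t*>(base) + offset;
}
```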
Hey! Thank you.
Ok, I've implemented it for gRPC (only) in a very hacky way: https://github.com/seovchinnikov/tensorrt-inference-server/tree/file-api
@mrjackbo - For my projects, I've done exactly what you are describing above. I've created a pre/post-processing service which uses SysV shared memory to avoid serializing, moving, and deserializing raw tensors over a gRPC message. There are two granularities of access control at which you can expose the shared-memory segments between processes:
In Kubernetes, you can use a …
In this example, the …
Note: the TRTIS service also needs to be customized to connect to the shared-memory service and handshake segment IDs. Using Docker, you can use the …
I'll post an example of the shared-memory service I have used and update this thread when it's ready. The example is currently in an older project that is actively being moved to a new GitHub project.
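As a rough sketch of the producer side of such a service (assuming a plain SysV segment identified by an agreed-upon key; the key, segment size, and handshake are placeholders rather than the actual service described above):

```cpp
// Minimal sketch: a pre-processing service creates a SysV shared-memory
// segment, writes a pre-processed batch into it, and hands the segment id to
// the consumer instead of sending the tensor bytes over gRPC.
#include <sys/ipc.h>
#include <sys/shm.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
  const key_t key = ftok("/tmp", 'T');  // any key both processes agree on
  const size_t segment_size = 32UL * 608 * 608 * 3 * sizeof(float);  // one batch

  int shm_id = shmget(key, segment_size, IPC_CREAT | 0666);
  if (shm_id == -1) { perror("shmget"); return 1; }

  void* addr = shmat(shm_id, nullptr, 0);
  if (addr == reinterpret_cast<void*>(-1)) { perror("shmat"); return 1; }
  std::memset(addr, 0, segment_size);   // stand-in for the real tensor data

  // The segment id (or the key) is what gets sent to the consumer, e.g. in a
  // small gRPC message, instead of the tensor bytes themselves.
  std::printf("segment id: %d\n", shm_id);

  shmdt(addr);                          // detach; the segment itself persists
  return 0;
}
```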
Hello, I am now tackling this, and before I get too far into the weeds I figured I would follow up here, given that the latest release, r19.04, supports custom operations at build time or startup. If I wanted to avoid having to write data to files before making a request to TRTIS, would I be looking at writing a custom operation, or at using the existing API somehow?
I assume when you say "custom operation" you mean "custom backend". You could create a custom backend that expects the input to identify the shared memory handle, offset, size, etc. The custom backend would extract the data from the shared memory handle into one or more output tensors. You could then ensemble this custom backend with your actual model. When you made a request, your input tensor would be just the shared memory handle, offset, size, etc. expected by your custom backend.

An example of using a custom backend with an ensemble will be included in the 19.05 release (coming later this week). Or you can find it now on master: https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-master-branch-guide/docs/client.html#ensemble-image-classification-example-application

Note that eventually we will have a new API (or an enhancement to the existing API) that will allow you to send a "handle" to the shared memory containing the input tensor instead of the actual tensor values, but we don't have a schedule yet for when that will be available.
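To make that concrete, a hypothetical layout for the descriptor input could look like the following; the struct and field names are purely illustrative and not part of any TRTIS API:

```cpp
// Hypothetical "descriptor" input tensor sent to the custom backend in place
// of the raw image bytes; the backend attaches to the segment, copies the
// referenced bytes into its output tensor, and the ensemble feeds that output
// to the real model.
#include <cstdint>

struct ShmInputDescriptor {
  char shm_key[64];     // name/key of the shared-memory segment
  uint64_t offset;      // byte offset of this tensor within the segment
  uint64_t byte_size;   // number of bytes to copy into the output tensor
};
```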
Thanks for the information. I'll check out master!
@deadeyegoodwin Anything new regarding this topic? Using shared memory would probably double (if not triple) my throughput at this point, so I will have to implement one of the many solutions mentioned above anyway. I will of course share my insights, so it would be useful to know what the current state and plan is API-wise. This is where I'm at right now:
This unfortunately does not increase performance much, as HTTP (and, to a lesser extent, gRPC) quite rapidly becomes the major bottleneck with large input tensors (608x608x3 in this case), even on localhost:
4.5 seconds of HTTP time versus 1 second of server compute time, at 1135 Mbit (!) per 32-batch. Not sure if allocation and initialization of the batch on the client side is included, though.
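For reference, that figure is consistent with uncompressed FP32 inputs: 608 x 608 x 3 values x 4 bytes x 32 images ≈ 142 MB ≈ 1135 Mbit per batch (assuming the tensors are sent as 32-bit floats).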
I think this could be a very fast solution, even compared with shared-memory approaches. A quick test with iperf on localhost indicates that transmission on loopback is mainly CPU bound:
So I suppose TCP slow start must be disabled or worked around, but this might be sufficient, even in comparison to shared memory.
Probably the fastest and "cleanest" solution. I would love to see this supported in the API, without the need for a custom backend receiving shared-memory handles and passing the data on in an ensemble. Right now that is the way to go though, I guess? So I will try that first before checking b). Any additional input is much appreciated.
We have just started work on implementing a shared-memory API (option c). Changes will start to come into master and we expect to have an initial minimal implementation in about 3 weeks. The API will allow input and output tensors to be passed to/from TRTIS via shared memory instead of over the network. It will be the responsibility of an outside "agent" to create and manage the lifetime of the shared-memory regions. TRTIS will provide APIs that allow that "agent" to register/unregister these shared-memory regions with TRTIS, after which they can be used in inference requests.
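As a rough sketch of what such an outside "agent" might do for system (CPU) shared memory, assuming a POSIX region under /dev/shm; the region name and size are placeholders, and the TRTIS register/unregister calls themselves are not shown here:

```cpp
// Minimal sketch: an outside "agent" creates and owns a POSIX shared-memory
// region, writes serialized input tensors into it, and is responsible for
// unlinking it when its lifetime ends. Registering the region with TRTIS is
// done through the client API and is omitted here.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
  const char* region_name = "/trtis_input";       // appears as /dev/shm/trtis_input
  const size_t region_size = 64UL * 1024 * 1024;  // 64 MB, placeholder

  int fd = shm_open(region_name, O_CREAT | O_RDWR, 0666);
  if (fd == -1) { perror("shm_open"); return 1; }
  if (ftruncate(fd, region_size) == -1) { perror("ftruncate"); return 1; }

  void* base = mmap(nullptr, region_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
  if (base == MAP_FAILED) { perror("mmap"); return 1; }

  std::memset(base, 0, region_size);              // stand-in for real tensor data

  munmap(base, region_size);
  close(fd);
  // shm_unlink(region_name) once no inference requests reference the region.
  return 0;
}
```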
@deadeyegoodwin @CoderHam What are the chances the three shared-memory branches will make it onto master this week? And will perf_client support a shared-memory test out of the box? I tried building the shared-memory branches, but I'm not sure I'm getting the combination of server and clients from the different branches right. What would be the easiest way to get a little test going? Building the server on "hemantj-sharedMemory-server" and then using the simple shm client from "hemantj-sharedMemory-test"? For now I'm only interested in the performance gains and resulting throughput, so I'm basically fine with an unstable, buggy demo if it at least runs somehow. The changes in code look great so far, thanks for the good work!
By this week we should have shared memory support for input tensors with some minimal testing. Output tensor support will follow shortly after. Adding support to perf_client plus much more extensive testing is needed after that before we can declare system memory (CPU) sharing complete. That will likely take a couple of weeks. After that we will start on GPU shared memory. |
The master branch now has the initial implementation of shared-memory support for input tensors, along with some minimal testing. Currently only the C++ client API supports shared memory (Python support is TBD, but you can always use gRPC to generate client code for many languages). The C++ API changes are here: 6d33c8c#diff-906ebe14e6f98b22609d12ac8433acc0. An example application is https://github.com/NVIDIA/tensorrt-inference-server/blob/master/src/clients/c%2B%2B/simple_shm_client.cc. The L0_simple_shared_memory_example test performs some minimal testing using that example application.
Works perfectly, thanks! I had to add "--ipc=host" to docker (it has been mentioned somewhere in a related pull request) and use the -I flag to prevent the simple_shm_client from using output shared memory, in case anyone else is trying this.
That's right, the "--ipc=host" option is needed. Yes, the -I flag prevents simple_shm_client from using shared memory for the output tensors.
@CoderHam For the documentation it might also be worth adding that Docker is apparently limited to 64 MB of shared memory by default, which is easily surpassed by even modest batch sizes for some models. And thanks for #541; I was about to dive deeper into the code when your commits started coming in ;)
@philipp-schmidt Thanks for bringing the memory limit to my attention. I will go ahead and document it in the client and server API docs.
System shared memory is complete and available on master branch and 19.10. CUDA shared memory is in progress and will be available in 19.11. Closing. |
Hi, thanks for open-sourcing this project!
I experimented with the TensorRT Inference Server, and I found that with my target model (a TensorRT execution plan that has FP16 inputs and outputs), to max out my system's two GPUs I need to send about 1.2 GB per second through the network stack. In my view, this means that scaling this architecture to a server with eight (or even more) GPUs either requires (multiple) InfiniBand interconnects, or a preprocessor co-located with the inference server, which receives compressed images and sends raw data to the TRT server.
Once we assume that a preprocessor is located on the same physical node as the TRT inference server (and hope that the CPUs do not become a bottleneck), it would be much preferable to submit raw data via IPC (e.g. through /dev/shm) to the inference server, and thus avoid the overhead introduced by gRPC. Here are my questions:

… Would it be enough to change GRPCInferRequestProvider::GetNextInputContent in tensorrt-inference-server/src/core/infer.cc accordingly? Did I overlook a place where changes are necessary?

Again, thanks for making this tool available.