Submitting raw data via IPC #1

Closed
mrjackbo opened this issue Nov 26, 2018 · 17 comments
Labels
enhancement New feature or request

Comments

@mrjackbo

Hi, thanks for open-sourcing this project!

I experimented with the TensorRT inference server and found that, with my target model (a TensorRT execution plan that has FP16 inputs and outputs), I need to send about 1.2 GBytes per second through the network stack to max out my system's two GPUs. In my view, this means that scaling this architecture to a server with eight (or even more) GPUs requires either (multiple) IB interconnects or a preprocessor co-located with the inference server that receives compressed images and sends raw data to the TRT server.
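
As a rough back-of-the-envelope check of this scaling argument, here is a small sketch. It assumes the required input bandwidth grows linearly with the number of GPUs, which is an assumption and not something measured; only the 1.2 GBytes/s figure for two GPUs comes from the observation above.

```python
# Back-of-the-envelope scaling of the required input bandwidth.
# Assumption (not measured): bandwidth scales linearly with GPU count.

MEASURED_GBYTES_PER_SEC = 1.2  # observed for the two-GPU system
MEASURED_GPUS = 2


def required_bandwidth_gbytes(num_gpus: int) -> float:
    """Estimate the input bandwidth in GBytes/s needed for num_gpus GPUs."""
    per_gpu = MEASURED_GBYTES_PER_SEC / MEASURED_GPUS
    return per_gpu * num_gpus


def to_gbits(gbytes_per_sec: float) -> float:
    """Convert GBytes/s to Gbit/s."""
    return gbytes_per_sec * 8


if __name__ == "__main__":
    for gpus in (2, 8):
        gbytes = required_bandwidth_gbytes(gpus)
        print(f"{gpus} GPUs: {gbytes:.1f} GBytes/s = {to_gbits(gbytes):.1f} Gbit/s")
```

Under that assumption, eight GPUs would need roughly 4.8 GBytes/s (about 38 Gbit/s).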

Once we assume that a preprocessor is located on the same physical node as the TRT inference server (and hope that the CPUs do not become a bottleneck now), it would be much preferable to submit raw data to the inference server via IPC (e.g. through /dev/shm), and thus avoid the overhead introduced by gRPC.

Here are my questions:

  1. Are the above assessment and the conclusions I draw from it reasonable?
  2. Do you have "submission of raw data via IPC mechanisms" on your roadmap? E.g. a feature where one submits a reference to the blob of preprocessed data in shared memory to the server via gRPC, and the server then loads this blob and uses it as input. If so, when do you plan on releasing it?
  3. If I were to implement a version of this myself, do you agree that a first quick-and-dirty approach would be to a) change the gRPC service proto, and then b) change GRPCInferRequestProvider::GetNextInputContent in tensorrt-inference-server/src/core/infer.cc accordingly? Did I overlook a place where changes are necessary?
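
To make the idea in question 2 concrete, here is a minimal sketch of handing over a reference instead of the tensor bytes. It uses Python's `multiprocessing.shared_memory` module as a stand-in for /dev/shm and is not TRTIS code; the descriptor fields (`name`, `offset`, `byte_size`) and both function names are hypothetical.

```python
import struct
from multiprocessing import shared_memory


def write_fp16_tensor(values):
    """Producer side: copy FP16 values into a new shared-memory segment.

    Returns the segment (the producer must keep it alive) and a small
    descriptor that could travel in the request instead of the raw bytes.
    """
    payload = struct.pack(f"<{len(values)}e", *values)  # "e" = IEEE half float
    segment = shared_memory.SharedMemory(create=True, size=len(payload))
    segment.buf[:len(payload)] = payload
    # Hypothetical descriptor layout, for illustration only.
    descriptor = {"name": segment.name, "offset": 0, "byte_size": len(payload)}
    return segment, descriptor


def read_fp16_tensor(descriptor):
    """Consumer side: attach to the segment by name and decode the values."""
    segment = shared_memory.SharedMemory(name=descriptor["name"])
    try:
        start = descriptor["offset"]
        end = start + descriptor["byte_size"]
        raw = bytes(segment.buf[start:end])  # copy out before detaching
    finally:
        segment.close()
    count = descriptor["byte_size"] // 2  # two bytes per FP16 element
    return list(struct.unpack(f"<{count}e", raw))


if __name__ == "__main__":
    segment, descriptor = write_fp16_tensor([0.5, 1.0, -2.0])
    try:
        print(read_fp16_tensor(descriptor))
    finally:
        segment.close()
        segment.unlink()  # the creator is responsible for cleanup
```

Only the descriptor would need to cross the gRPC boundary; the tensor bytes themselves stay in shared memory.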

Again, thanks for making this tool available.

@deadeyegoodwin
Contributor

In general I think your assessment is correct: I/O can be a performance limiter for some models and a primary way to fix this in many cases is to make the pre-processing local with the inference. Here are some variations we think about and where we stand as far as current support:

  1. Pre-processing "service" running on same node as TensorRT Inference Server (TRTIS).
    a. Use GRPC (or HTTP) to communicate from pre-processor -> TRTIS. Since communication is now local it may no longer be a bottleneck...
    b. For even higher BW between pre-processor -> TRTIS, remove the GRPC/HTTP protocol overhead by implementing a custom/raw socket API. The internal APIs to allow this are already available within TRTIS and we plan to formalize and document them better in the future. Another option here is a flatbuffer interface which we have also thought about but not done anything with as yet.
    c. Use shared-memory as you suggest... this would likely require a custom TRTIS API to communicate the shared-memory reference so is similar to (b).
    d. For maximum bandwidth you could share GPU memory between pre-processor and TRTIS and use that for communication. The pre-processor would leave the input tensors in GPU memory and just share the location (via CUDA IPC) with TRTIS. We want to add some functionality to TRTIS to support this but currently we have not.
  2. Avoid communication completely by implementing the pre-processor within TRTIS. Again, the internal APIs to allow this are already available within TRTIS and we plan to formalize and document them better in the future. In general we are interested in generic pre-processor "add-ons" of this kind that we can incorporate into TRTIS as build-time options.
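
For option (1b), a custom raw socket protocol can be as simple as a length prefix followed by the tensor bytes. The sketch below illustrates that framing over a local socket pair. It does not use the internal TRTIS APIs mentioned above; the 8-byte header layout and all function names are assumptions made for illustration.

```python
import socket
import struct
import threading

# Hypothetical wire format: 8-byte little-endian length, then the payload.
HEADER = struct.Struct("<Q")


def send_tensor(sock: socket.socket, payload: bytes) -> None:
    """Send one length-prefixed message."""
    sock.sendall(HEADER.pack(len(payload)))
    sock.sendall(payload)


def recv_exact(sock: socket.socket, count: int) -> bytes:
    """Read exactly count bytes, looping because recv may return less."""
    chunks = []
    remaining = count
    while remaining:
        chunk = sock.recv(min(remaining, 1 << 20))
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_tensor(sock: socket.socket) -> bytes:
    """Receive one length-prefixed message."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)


def roundtrip(payload: bytes) -> bytes:
    """Send payload over a local socket pair and return what arrives."""
    left, right = socket.socketpair()
    with left, right:
        # Send from a thread so a large payload cannot block on a full buffer.
        sender = threading.Thread(target=send_tensor, args=(left, payload))
        sender.start()
        received = recv_tensor(right)
        sender.join()
    return received


if __name__ == "__main__":
    data = bytes(range(256)) * 4096  # 1 MiB of sample bytes
    print(roundtrip(data) == data)
```

The explicit length prefix is what lets the receiver know where one tensor ends and the next begins, since a stream socket has no message boundaries of its own.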

I would suggest that you start with (1a) and see how much benefit that gets you. We are generally interested in improving TRTIS in this area and so would welcome your experience and feedback as you experiment. If you think you could contribute something generally useful we would be very open to working with you on it, just be sure to include us in your plans early on so we can make sure we are all on the same page.

As for your question #3: yes, for experimenting it is probably fastest to hack up the gRPC service to pass a reference instead of the actual data (but keep the rest of the request/response message the same). infer.cc is where the data (raw_input) is read out of the request message, so you would need to change that to read from shared memory instead.

@seovchinnikov

seovchinnikov commented Nov 30, 2018

Hey! Thank you.
I suggested the same idea here https://github.com/NVIDIA/dl-inference-server/issues/24#event-1980712680
so it's obviously a popular enhancement. It would be cool to have a basic implementation of point #3.

@seovchinnikov

Ok, I've implemented it for gRPC (only) in a very hacky way: https://github.com/seovchinnikov/tensorrt-inference-server/tree/file-api
But I've been to hell and back turning off all the sanity checks, because the server was not intended to take a dynamic-sized input, so I hope for a better solution.
@deadeyegoodwin thanks for the very well-structured code; it was not very difficult to figure out what to tweak.

@ryanolson
Contributor

ryanolson commented Dec 5, 2018

@mrjackbo - For my projects, I've done exactly what you are describing above. I've created a pre/post-processing service which uses sysv shared-memory to avoid serializing, moving, and deserializing raw tensors over a gRPC message.

You have two granularities of access control on which you can expose the shared-memory segments between processes:

  1. Node level - create shared-memory segments which are exposed to any process running on the node.
  2. Namespace level - create shared-memory segments which are only accessible to processes/containers that share the same IPC namespace.

In Kubernetes, you can use a DaemonSet to create node-level IPCs, or you can put multiple containers in one Pod; by default, all containers in a Pod share the same IPC namespace. Unfortunately, there is no API (that I am aware of) that allows different Pods to share the same IPC namespace. An example might look like:

      containers:
      - name: shared-memory
        image: my-shared-memory-service-image
        ports:
        - name: grpc
          containerPort: 50049
      - name: trtis
        image: my-customized-trtis-image
        command: ["wait-for-it.sh", "localhost:50049", "--timeout=0", "--", "/opt/tensorrtserver/bin/trtserver", "--model-store=/tmp/models"]
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - name: metrics
          containerPort: 8002

In this example, the shared-memory service receives incoming async gRPC requests, then uses an async gRPC client to forward them to TRTIS. You have to customize TRTIS's protobuf API definition so you can pass a segment_id and offset rather than a bytes object.

Note: the TRTIS service also needs to be customized to connect to the shared-memory service and handshake segment IDs.

Using Docker, you can use the --ipc flag directly. Create your shared memory service, then start TRTIS with --ipc=container:<shared_memory_service_container_name>. The Docker Compose file might look something like:

  sharedmemory:
    image: my-shared-memory-service-image
    ports:
    - 3333:50051
  tensorrt:
    depends_on:
    - sharedmemory
    image: my-customized-trtis-image
    ipc: container:inferencedemo_sharedmemory_1

I'll post an example of the shared-memory service I have used and update this thread when it's ready. The example is currently in an older project that is actively being moved to a new GitHub project.

@pmcgraw-lucidyne

Hello, I am now tackling this, and before I get too far into the weeds I figured I would follow up here, given that the latest release (r19.04) supports custom operations at build time or startup. If I want to avoid writing data to files before making a request to TRTIS, would I be looking at writing a custom operation, or at using the existing API somehow?

@deadeyegoodwin
Contributor

I assume when you say "custom operation" you mean "custom backend".

You could create a custom backend that expected the input to identify the shared memory handle, offset, size, etc. The custom backend would extract the data from the shared memory handle into one or more output tensors. You could then ensemble this custom backend with your actual model. When you made a request your input tensor would be just the shared memory handle, offset, size, etc. data expected by your custom backend. An example of using a custom backend with an ensemble will be included in the 19.05 release (coming later this week). Or you can find it now on master: https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-master-branch-guide/docs/client.html#ensemble-image-classification-example-application

Note that eventually we will have a new API (or enhancement to existing API) that will allow you to send a "handle" to the shared memory containing the input tensor instead of the actual tensor values. But we don't have a schedule yet for when that will be available.

@pmcgraw-lucidyne

pmcgraw-lucidyne commented May 22, 2019 via email

@philipp-schmidt

Note that eventually we will have a new API (or enhancement to existing API) that will allow you to send a "handle" to the shared memory containing the input tensor instead of the actual tensor values. But we don't have a schedule yet for when that will be available.

@deadeyegoodwin Anything new regarding this topic? Using shared memory would probably double (if not even triple) my throughput at this point, so I will have to implement one of the many mentioned solutions above anyway. I will of course share my insights if needed, so it would be useful to know what the current state and plan is API-wise.

This is where I'm at right now:

a. Use GRPC (or HTTP) to communicate from pre-processor -> TRTIS. Since communication is now local it may no longer be a bottleneck...

This unfortunately does not increase performance a lot, as HTTP (and to a lesser extent gRPC) seems to become the major bottleneck quite rapidly with large input tensors (608x608x3 in this case), even on localhost:

root@pc001:/workspace/build# ./perf_client -m yolov3_trt -t 16 -p 15000 -b 32              
*** Measurement Settings ***
  Batch size: 32
  Measurement window: 15000 msec
  Reporting average latency

Request concurrency: 16
  Client: 
    Request count: 54
    Throughput: 115 infer/sec
    Avg latency: 4532481 usec (standard deviation 2339987 usec)
    Avg HTTP time: 4524200 usec (send/recv 3499134 usec + response wait 1025066 usec)
  Server: 
    Request count: 64
    Avg request latency: 1036893 usec (overhead 15472 usec + queue 651073 usec + compute 370348 usec)

4.5 seconds HTTP time versus 1 second server compute time with 1135 MBit (!) per 32-batch. Not sure if allocation and initialization of the batch on client side is included though.
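
The per-batch size can be reproduced with a short calculation. The sketch below assumes 4-byte (FP32) elements, which the post does not state; that assumption is what makes the result line up with the 1135 MBit figure.

```python
def batch_payload_bytes(height, width, channels, batch_size, bytes_per_element):
    """Size in bytes of one dense input batch."""
    return height * width * channels * batch_size * bytes_per_element


def to_mbit(num_bytes):
    """Convert bytes to megabits (10**6 bits)."""
    return num_bytes * 8 / 1e6


if __name__ == "__main__":
    # 608x608x3 input, batch size 32, FP32 elements (assumed, see above).
    size = batch_payload_bytes(608, 608, 3, 32, 4)
    print(f"{size} bytes = {to_mbit(size):.1f} Mbit per batch")
```

That is about 142 MB per request, so 16 concurrent requests keep more than 2 GB of input data in flight.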

b. For even higher BW between pre-processor -> TRTIS, remove the GRPC/HTTP protocol overhead by implementing a custom/raw socket API. The internal APIs to allow this are already available within TRTIS and we plan to formalize and document them better in the future. Another option here is a flatbuffer interface which we have also thought about but not done anything with as yet.

I think this could be a very fast solution, even compared with shared-memory approaches. A quick test with iperf on localhost indicates that transmission on loopback is mainly CPU bound:

root@pc001:~$ iperf -c localhost
------------------------------------------------------------
Client connecting to localhost, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 127.0.0.1 port 44234 connected with 127.0.0.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  96.3 GBytes  82.7 Gbits/sec

So I suppose TCP slow start must be disabled or worked around, but this might be sufficient, even in comparison to shared memory.

c. Use shared-memory as you suggest... this would likely require a custom TRTIS API to communicate the shared-memory reference so is similar to (b).

Probably the fastest and "cleanest" solution. Would love to see this supported in the API, without the need for a custom backend receiving shared memory handles and passing the data on in an ensemble. Right now this is the way to go though I guess? So I will try that first before checking b). Any additional input is much appreciated.

@deadeyegoodwin
Contributor

We have just started work on implementing a shared-memory API (option c). Changes will start to come into master and we expect to have an initial minimal implementation in about 3 weeks. The API will allow input and output tensors to be passed to/from TRTIS via shared memory instead of over the network. It will be the responsibility of an outside "agent" to create and manage the lifetime of the shared-memory regions. TRTIS will provide APIs that allow that "agent" to register/unregister these shared-memory regions with TRTIS, and then they can be used in inference requests.
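
The register/unregister flow described here could look roughly like the following bookkeeping sketch. It is purely illustrative and not the actual TRTIS API: the class, method names, and region name are all hypothetical, and it only tracks region sizes without touching any real shared memory.

```python
class SharedMemoryRegistry:
    """Tracks regions that an external agent has registered (hypothetical)."""

    def __init__(self):
        self._regions = {}  # region name -> size in bytes

    def register(self, name, byte_size):
        """Record a region that the agent has already created."""
        if byte_size <= 0:
            raise ValueError("byte_size must be positive")
        if name in self._regions:
            raise ValueError(f"region {name!r} is already registered")
        self._regions[name] = byte_size

    def unregister(self, name):
        """Forget a region; the agent remains responsible for freeing it."""
        if name not in self._regions:
            raise KeyError(name)
        del self._regions[name]

    def resolve(self, name, offset, byte_size):
        """Validate a request's reference and return its (start, end) range."""
        if name not in self._regions:
            raise KeyError(name)
        if offset < 0 or byte_size <= 0:
            raise ValueError("offset must be >= 0 and byte_size > 0")
        end = offset + byte_size
        if end > self._regions[name]:
            raise ValueError("reference extends past the end of the region")
        return offset, end


if __name__ == "__main__":
    registry = SharedMemoryRegistry()
    registry.register("input_data", 1024)
    print(registry.resolve("input_data", 256, 512))
    registry.unregister("input_data")
```

Checking every reference against the registered size matters because the server did not allocate the region itself and cannot otherwise know whether a requested range is valid.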

@philipp-schmidt

@deadeyegoodwin @CoderHam what are the chances the three shared memory branches will make it onto master this week? And will perf client support a shared memory test out of the box?

I tried building the shared memory branches, but I'm not sure I'm getting the combination of server and clients from the different branches right. What would be the easiest way to get a little test going? Building the server on "hemantj-sharedMemory-server" and then using the simple shm client from "hemantj-sharedMemory-test"? For now I'm only interested in the performance gains and resulting throughput, so I'm basically fine with an unstable, buggy demo if it at least runs somehow. Changes in code look great so far, thanks for the good work!

@deadeyegoodwin
Contributor

By this week we should have shared memory support for input tensors with some minimal testing. Output tensor support will follow shortly after. Adding support to perf_client plus much more extensive testing is needed after that before we can declare system memory (CPU) sharing complete. That will likely take a couple of weeks. After that we will start on GPU shared memory.

@deadeyegoodwin
Contributor

deadeyegoodwin commented Aug 7, 2019

The master branch now has the initial implementation for shared memory support for input tensors and some minimal testing.

Currently only the C++ client API supports shared memory (Python support is TBD.. but you can always use grpc to generate client code for many languages). The C++ API changes are here: 6d33c8c#diff-906ebe14e6f98b22609d12ac8433acc0

An example application is: https://github.com/NVIDIA/tensorrt-inference-server/blob/master/src/clients/c%2B%2B/simple_shm_client.cc. The L0_simple_shared_memory_example test performs some minimal testing using that example application.

@philipp-schmidt

Works perfectly, thanks!

I had to add "--ipc=host" to docker (has been mentioned somewhere in a related pull request) and use the -I flag to prevent the simple_shm_client from using output shared memory, if anyone else is trying.

@CoderHam
Contributor

CoderHam commented Aug 8, 2019

That's right, the simple_shm_client is currently set up to use shared memory for both input and output by default (-I for input only and -O for output only).

Yes, --ipc=host is necessary. I will remember to add this to the docs when output shared memory is also completed.


@philipp-schmidt

@CoderHam for the documentation it might also be worth adding that Docker is apparently limited to 64MB of shared memory by default, which is easily exceeded by even modest batch sizes for some models.
--shm-size=256m increases this limit to e.g. 256MB.
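
A quick way to see whether a batch fits is to compare its size against the shared-memory capacity before creating any regions. The sketch below is a hypothetical helper, not part of TRTIS or Docker; it takes 64MB to mean 64 MiB, and the example batch reuses the 608x608x3, batch-32 input from earlier in the thread with FP32 elements assumed.

```python
import os
import shutil

# Docker's default /dev/shm size, taken here as 64 MiB.
DOCKER_DEFAULT_SHM_BYTES = 64 * 1024 * 1024


def fits_in_shm(required_bytes, capacity_bytes=DOCKER_DEFAULT_SHM_BYTES):
    """Return True if required_bytes fits in the given shm capacity."""
    return required_bytes <= capacity_bytes


def shm_capacity_bytes(path="/dev/shm"):
    """Return the total size of the shm mount, or None if it does not exist."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total


if __name__ == "__main__":
    # 608x608x3 input, FP32 elements (assumed), batch size 32.
    batch_bytes = 608 * 608 * 3 * 4 * 32
    print("fits in default:", fits_in_shm(batch_bytes))
    print("fits in 256 MiB:", fits_in_shm(batch_bytes, 256 * 1024 * 1024))
    print("capacity of /dev/shm here:", shm_capacity_bytes())
```

With these numbers, a single 32-image batch (about 142 MB) does not fit in the default but does fit with --shm-size=256m.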

https://stackoverflow.com/questions/30210362/how-to-increase-the-size-of-the-dev-shm-in-docker-container

And thanks for #541, was about to dive deeper into the code when your commits started coming in ;)

@CoderHam
Contributor

@philipp-schmidt Thanks for bringing the memory limit to my attention. I will go ahead and document the same in the client + server API docs.

@deadeyegoodwin
Contributor

System shared memory is complete and available on master branch and 19.10. CUDA shared memory is in progress and will be available in 19.11. Closing.

deadeyegoodwin pushed a commit that referenced this issue Nov 16, 2020
KrishnanPrash added a commit that referenced this issue Oct 18, 2024
KrishnanPrash added a commit that referenced this issue Nov 4, 2024