Submitting raw data via IPC #1
In general I think your assessment is correct: I/O can be a performance limiter for some models, and a primary way to fix this in many cases is to co-locate the pre-processing with the inference. Here are some variations we think about, and where we stand as far as current support:
I would suggest that you start with (1a) and see how much benefit that gets you. We are generally interested in improving TRTIS in this area and would welcome your experience and feedback as you experiment. If you think you could contribute something generally useful we would be very open to working with you on it; just be sure to include us in your plans early on so we can make sure we are all on the same page.

As for your question #3: yes, for experimenting it is probably fastest to hack up the gRPC service to pass a reference instead of the actual data (but keep the rest of the request/response message the same). infer.cc is where the data (raw_input) is read out of the request message, so you would need to change that to instead read from shared memory.
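For illustration only, here is a minimal C++ sketch of the receiving side of that idea, assuming the request carries a POSIX shared-memory name, an offset, and a byte size instead of the raw tensor bytes; the function and the encoding are hypothetical, not the actual infer.cc change:

```cpp
// Illustrative only, not actual TRTIS code: the request identifies a POSIX
// shared-memory region (e.g. created under /dev/shm by the client) instead of
// carrying the raw tensor bytes.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>
#include <string>

// Maps the named region and returns a pointer to the tensor bytes at 'offset'.
const uint8_t* AttachInput(const std::string& shm_name, size_t offset,
                           size_t byte_size) {
  int fd = shm_open(shm_name.c_str(), O_RDONLY, 0);
  if (fd == -1) return nullptr;
  void* base = mmap(nullptr, offset + byte_size, PROT_READ, MAP_SHARED, fd, 0);
  close(fd);  // the mapping stays valid after the descriptor is closed
  if (base == MAP_FAILED) return nullptr;
  return static_cast<const uint8_t*>(base) + offset;
}
```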
Hey! Thank you.
Ok, I've implemented it for gRPC (only) in a very hacky way: https://github.com/seovchinnikov/tensorrt-inference-server/tree/file-api
@mrjackbo - For my projects, I've done exactly what you are describing above. I've created a pre/post-processing service which uses SysV shared memory to avoid serializing, moving, and deserializing raw tensors over a gRPC message. There are two granularities of access control at which you can expose the shared-memory segments between processes:
In Kubernetes, you can use a …
In this example, the …
Note: the TRTIS service also needs to be customized to connect to the shared-memory service and handshake segment IDs. Using Docker, you can use the …
I'll post an example of the shared-memory service I have used and update this thread when it's ready. The example is currently in an older project that is actively being moved to a new GitHub project.
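As a rough sketch of the producer side of such a service (assuming a plain SysV segment identified by an agreed-upon key; the key, segment size, and handshake are placeholders rather than the actual service described above):

```cpp
// Minimal sketch: a pre-processing service creates a SysV shared-memory
// segment, writes a pre-processed batch into it, and hands the segment id to
// the consumer instead of sending the tensor bytes over gRPC.
#include <sys/ipc.h>
#include <sys/shm.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
  const key_t key = ftok("/tmp", 'T');  // any key both processes agree on
  const size_t segment_size = 32UL * 608 * 608 * 3 * sizeof(float);  // one batch

  int shm_id = shmget(key, segment_size, IPC_CREAT | 0666);
  if (shm_id == -1) { perror("shmget"); return 1; }

  void* addr = shmat(shm_id, nullptr, 0);
  if (addr == reinterpret_cast<void*>(-1)) { perror("shmat"); return 1; }
  std::memset(addr, 0, segment_size);   // stand-in for the real tensor data

  // The segment id (or the key) is what gets sent to the consumer, e.g. in a
  // small gRPC message, instead of the tensor bytes themselves.
  std::printf("segment id: %d\n", shm_id);

  shmdt(addr);                          // detach; the segment itself persists
  return 0;
}
```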
Hello, I am now tackling this, and before I get too far into the weeds I figured I would follow up here, given that the latest release, r19.04, supports custom operations at build time or startup. If I wanted to avoid having to write data to files before making a request to TRTIS, would I be looking at writing a custom operation, or at using the existing API somehow?
I assume when you say "custom operation" you mean "custom backend". You could create a custom backend that expects the input to identify the shared memory handle, offset, size, etc. The custom backend would extract the data from the shared memory handle into one or more output tensors. You could then ensemble this custom backend with your actual model. When you made a request, your input tensor would be just the shared memory handle, offset, size, etc. expected by your custom backend.

An example of using a custom backend with an ensemble will be included in the 19.05 release (coming later this week). Or you can find it now on master: https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-master-branch-guide/docs/client.html#ensemble-image-classification-example-application

Note that eventually we will have a new API (or an enhancement to the existing API) that will allow you to send a "handle" to the shared memory containing the input tensor instead of the actual tensor values, but we don't have a schedule yet for when that will be available.
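To make that concrete, a hypothetical layout for the descriptor input could look like the following; the struct and field names are purely illustrative and not part of any TRTIS API:

```cpp
// Hypothetical "descriptor" input tensor sent to the custom backend in place
// of the raw image bytes; the backend attaches to the segment, copies the
// referenced bytes into its output tensor, and the ensemble feeds that output
// to the real model.
#include <cstdint>

struct ShmInputDescriptor {
  char shm_key[64];     // name/key of the shared-memory segment
  uint64_t offset;      // byte offset of this tensor within the segment
  uint64_t byte_size;   // number of bytes to copy into the output tensor
};
```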
Thanks for the information. I'll check out master!
@deadeyegoodwin Anything new regarding this topic? Using shared memory would probably double (if not triple) my throughput at this point, so I will have to implement one of the many solutions mentioned above anyway. I will of course share my insights, so it would be useful to know what the current state and plan is API-wise. This is where I'm at right now:
This unfortunately does not increase performance much, as HTTP (and, to a lesser extent, gRPC) quite rapidly becomes the major bottleneck with large input tensors (608x608x3 in this case), even on localhost:
4.5 seconds of HTTP time versus 1 second of server compute time, at 1135 Mbit (!) per 32-batch. Not sure if allocation and initialization of the batch on the client side is included, though.
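For reference, that figure is consistent with uncompressed FP32 inputs: 608 x 608 x 3 values x 4 bytes x 32 images ≈ 142 MB ≈ 1135 Mbit per batch (assuming the tensors are sent as 32-bit floats).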
I think this could be a very fast solution, even compared with shared-memory approaches. A quick test with iperf on localhost indicates that transmission on loopback is mainly CPU bound:
So I suppose TCP slow start must be disabled or worked around, but this might be sufficient, even in comparison to shared memory.
Probably the fastest and "cleanest" solution. I would love to see this supported in the API, without the need for a custom backend receiving shared-memory handles and passing the data on in an ensemble. Right now that is the way to go though, I guess? So I will try that first before checking b). Any additional input is much appreciated.
We have just started work on implementing a shared-memory API (option c). Changes will start to come into master and we expect to have an initial minimal implementation in about 3 weeks. The API will allow input and output tensors to be passed to/from TRTIS via shared memory instead of over the network. It will be the responsibility of an outside "agent" to create and manage the lifetime of the shared-memory regions. TRTIS will provide APIs that allow that "agent" to register/unregister these shared-memory regions with TRTIS, after which they can be used in inference requests.
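As a rough sketch of what such an outside "agent" might do for system (CPU) shared memory, assuming a POSIX region under /dev/shm; the region name and size are placeholders, and the TRTIS register/unregister calls themselves are not shown here:

```cpp
// Minimal sketch: an outside "agent" creates and owns a POSIX shared-memory
// region, writes serialized input tensors into it, and is responsible for
// unlinking it when its lifetime ends. Registering the region with TRTIS is
// done through the client API and is omitted here.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
  const char* region_name = "/trtis_input";       // appears as /dev/shm/trtis_input
  const size_t region_size = 64UL * 1024 * 1024;  // 64 MB, placeholder

  int fd = shm_open(region_name, O_CREAT | O_RDWR, 0666);
  if (fd == -1) { perror("shm_open"); return 1; }
  if (ftruncate(fd, region_size) == -1) { perror("ftruncate"); return 1; }

  void* base = mmap(nullptr, region_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
  if (base == MAP_FAILED) { perror("mmap"); return 1; }

  std::memset(base, 0, region_size);              // stand-in for real tensor data

  munmap(base, region_size);
  close(fd);
  // shm_unlink(region_name) once no inference requests reference the region.
  return 0;
}
```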
@deadeyegoodwin @CoderHam What are the chances the three shared-memory branches will make it onto master this week? And will perf_client support a shared-memory test out of the box? I tried building the shared-memory branches, but I'm not sure I'm getting the combination of server and clients from the different branches right. What would be the easiest way to get a little test going? Building the server on "hemantj-sharedMemory-server" and then using the simple shm client from "hemantj-sharedMemory-test"? For now I'm only interested in the performance gains and resulting throughput, so I'm basically fine with an unstable, buggy demo if it at least runs somehow. The changes in code look great so far, thanks for the good work!
By this week we should have shared memory support for input tensors with some minimal testing. Output tensor support will follow shortly after. Adding support to perf_client plus much more extensive testing is needed after that before we can declare system memory (CPU) sharing complete. That will likely take a couple of weeks. After that we will start on GPU shared memory. |
The master branch now has the initial implementation of shared-memory support for input tensors, along with some minimal testing. Currently only the C++ client API supports shared memory (Python support is TBD, but you can always use gRPC to generate client code for many languages). The C++ API changes are here: 6d33c8c#diff-906ebe14e6f98b22609d12ac8433acc0. An example application is https://github.com/NVIDIA/tensorrt-inference-server/blob/master/src/clients/c%2B%2B/simple_shm_client.cc. The L0_simple_shared_memory_example test performs some minimal testing using that example application.
Works perfectly, thanks! I had to add "--ipc=host" to docker (it has been mentioned somewhere in a related pull request) and use the -I flag to prevent the simple_shm_client from using output shared memory, in case anyone else is trying this.
That's right, the "--ipc=host" option is needed. Yes, the -I flag prevents simple_shm_client from using shared memory for the output tensors.
@CoderHam For the documentation it might also be worth adding that Docker is apparently limited to 64 MB of shared memory by default, which is easily surpassed by even modest batch sizes for some models. And thanks for #541; I was about to dive deeper into the code when your commits started coming in ;)
@philipp-schmidt Thanks for bringing the memory limit to my attention. I will go ahead and document it in the client and server API docs.
System shared memory is complete and available on master branch and 19.10. CUDA shared memory is in progress and will be available in 19.11. Closing. |
Hi, thanks for open-sourcing this project!
I experimented with the TensorRT Inference Server, and I found that with my target model (a TensorRT execution plan that has FP16 inputs and outputs), to max out my system's two GPUs I need to send about 1.2 GB per second through the network stack. In my view, this means that scaling this architecture to a server with eight (or even more) GPUs either requires (multiple) InfiniBand interconnects, or a preprocessor co-located with the inference server, which receives compressed images and sends raw data to the TRT server.
Once we assume that a preprocessor is located on the same physical node as the TRT inference server (and hope that the CPUs do not become a bottleneck), it would be much preferable to submit raw data via IPC (e.g. through /dev/shm) to the inference server, and thus avoid the overhead introduced by gRPC. Here are my questions:

… Would it be enough to change GRPCInferRequestProvider::GetNextInputContent in tensorrt-inference-server/src/core/infer.cc accordingly? Did I overlook a place where changes are necessary?

Again, thanks for making this tool available.