
perf_client has limits on concurrency #14

Closed
mrjackbo opened this issue Dec 14, 2018 · 4 comments · Fixed by #57
Labels
bug Something isn't working

Comments

@mrjackbo

I cannot max out my two GPUs using one instance of perf_client, but using multiple instances I can.

A little bit more detail:
Using perf_client with GRPC and -a, the throughput does not increase when I go beyond -t 9.
But using four instances with -t 9 simultaneously for the same model, I get much higher total throughput. This can be verified using the TRTIS metrics.
Is this expected behavior? I can share the precise command I use when I am back at my desk, but I cannot share the model.
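
For illustration only (the model name, server URL, and the -u/-i/-m flags below are placeholders, not my actual command), the invocation looks roughly like this:

perf_client -u localhost:8001 -i grpc -m my_model -a -t 9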

@deadeyegoodwin
Contributor

Is there any particular reason that you are using the -a flag? It causes a single thread to attempt to drive the entire load, so throughput will eventually be limited by that thread. Try running perf_client without -a and with a larger -t value and see if that doesn't increase throughput. Also, are you aware of the -d flag, which allows perf_client to automatically search concurrency values and create a latency vs. infer/sec curve? See the documentation for more information on that.
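
For example (the model name and URL are placeholders, and please check the perf_client documentation for the exact flag behavior in your version), something along these lines:

# fixed concurrency, no -a
perf_client -u localhost:8001 -i grpc -m my_model -t 16 -p 5000

# let perf_client sweep concurrency up to a latency threshold (-l, in ms)
perf_client -u localhost:8001 -i grpc -m my_model -d -l 200 -p 5000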

@mrjackbo
Author

Using -a was the only way I could make it work at all; without it I was getting timeouts. I can try again next week.

@mrjackbo
Author

mrjackbo commented Dec 18, 2018

I managed to get perf_client working with high concurrency without the -a flag. The problem was that with high concurrency (say -t 19 or -t 36), perf_client needs some time before it starts sending data to the server. Let's call this period the "warm-up phase". During this phase, perf_client maxes out several CPU cores. My guess is that perf_client generates random inputs on a per-thread basis, which leads to a long wait for models with large inputs at high concurrency. Moreover, the warm-up time seems to count towards the measurement window set with -p.
In my particular situation, I had set -p 5000, which was much too short compared to the warm-up phase perf_client needed. Better documentation would help here (and perhaps a "please wait" message during warm-up). My model uses FP16 inputs of shape (1,800,1200,3), and inference only starts after 30-40 seconds.

As to my original issue with concurrency: I can now exceed the results previously obtained with -a by using -t 19 (and setting -p to account for warm-up). But still, on a larger server with 4 GPUs, I get noticeable gains using two instances of perf_client with -t 36 each, compared to one instance with -t 72. I am not sure what the bottleneck is now.
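
Concretely (the model name and URL are placeholders, and the -p value is just whatever comfortably exceeds the 30-40 second warm-up), what works for me now looks roughly like:

perf_client -u localhost:8001 -i grpc -m my_model -t 19 -p 60000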

@deadeyegoodwin added the bug Something isn't working label Jan 3, 2019
@deadeyegoodwin
Contributor

For the warm-up problem, we will look into excluding that phase from the measurement window so you don't need to use a large -p value.

For large concurrency values with -a, it is possible that the single thread used with -a is unable to keep up with handling the requests and responses for 72 concurrent requests.
