
perf_client has limits on concurrency #14

Closed
mrjackbo opened this issue Dec 14, 2018 · 4 comments · Fixed by #57
Labels
bug Something isn't working

Comments

@mrjackbo

I cannot max out my two GPUs using one instance of perf_client, but using multiple instances I can.

A little bit more detail:
Using perf_client with GRPC and -a, the throughput does not increase when I go beyond -t 9.
But using four instances with -t 9 simultaneously for the same model, I get much higher total throughput. This can be verified using the TRTIS metrics.
Is this expected behavior? I can share the precise command I use when I am back at my desk, but I cannot share the model.
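
For illustration only (the model name, server URL, and the -u/-i/-m flags below are placeholders, not my actual command), the invocation looks roughly like this:

perf_client -u localhost:8001 -i grpc -m my_model -a -t 9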

@deadeyegoodwin
Contributor

Is there any particular reason that you are using the -a flag? It causes a single thread to attempt to drive the entire load, so throughput will eventually be limited by that thread. Try running perf_client without -a and with a larger -t value and see if that doesn't increase throughput. Also, are you aware of the -d flag, which allows perf_client to automatically search concurrency values and create a latency vs. infer/sec curve? See the documentation for more information on that.
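
For example (the model name and URL are placeholders, and please check the perf_client documentation for the exact flag behavior in your version), something along these lines:

# fixed concurrency, no -a
perf_client -u localhost:8001 -i grpc -m my_model -t 16 -p 5000

# let perf_client sweep concurrency up to a latency threshold (-l, in ms)
perf_client -u localhost:8001 -i grpc -m my_model -d -l 200 -p 5000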

@mrjackbo
Author

Using -a was the only way I could make it work at all; without it I was getting timeouts. I can try again next week.

@mrjackbo
Author

mrjackbo commented Dec 18, 2018

I managed to get perf_client working with high concurrency without the -a flag. The problem was that with high concurrency (say -t 19 or -t 36), perf_client needs some time before it starts sending data to the server. Let's call this period the "warm-up phase". During this phase, perf_client maxes out several CPU cores. My guess is that perf_client generates random inputs on a per-thread basis, which leads to a long wait for models with large inputs at high concurrency. Moreover, the warm-up time seems to count towards the measurement window set with -p.
In my particular situation, I had set -p 5000, which was much too short compared to the warm-up phase perf_client needed. Better documentation would help here (and perhaps a "please wait" message during warm-up). My model uses FP16 inputs of shape (1,800,1200,3), and inference only starts after 30-40 seconds.

As to my original issue with concurrency: I can now exceed the results previously obtained with -a by using -t 19 (and setting -p to account for warm-up). But still, on a larger server with 4 GPUs, I get noticeable gains using two instances of perf_client with -t 36 each, compared to one instance with -t 72. I am not sure what the bottleneck is now.
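
Concretely (the model name and URL are placeholders, and the -p value is just whatever comfortably exceeds the 30-40 second warm-up), what works for me now looks roughly like:

perf_client -u localhost:8001 -i grpc -m my_model -t 19 -p 60000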

@deadeyegoodwin added the bug Something isn't working label Jan 3, 2019
@deadeyegoodwin
Contributor

For the warm-up problem, we will look into excluding that phase from the measurement window so you don't need to use a large -p value.

For large concurrency values with -a, it is possible that the single thread used with -a is unable to keep up with handling the requests and responses for 72 concurrent requests.
