Really appreciate the awesome work by the team: I have managed to get almost a 100x speedup so far with the fastertransformer_backend on triton compared to plain PyTorch with a fine-tuned T5-base model. This jupyter notebook from the team was great reference material in helping me achieve that.
To measure the throughput I am getting from the triton server, I used a locust load-testing script to hit the triton inference server directly with a couple of binary request files that I created from an original set of raw sample texts (requests that, in actual usage, would be generated by the python script I use to query the triton inference server).
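For context, here is a minimal sketch of that kind of locust script, assuming plain HTTP/JSON against Triton's v2 inference API on port 8000. The host, token ids, shapes, and generation length below are placeholders rather than my actual test data, and my real test posts pre-built binary request files instead of building JSON on the fly:

# Minimal locust sketch (placeholder payload; real tests use pre-serialized request files).
import json
from locust import HttpUser, task, between

# Placeholder request: dummy token ids plus the required sequence_length and max_output_len.
PAYLOAD = json.dumps({
    "inputs": [
        {"name": "input_ids", "shape": [1, 8], "datatype": "UINT32",
         "data": [21603, 10, 37, 1782, 19, 1826, 5, 1]},
        {"name": "sequence_length", "shape": [1, 1], "datatype": "UINT32", "data": [8]},
        {"name": "max_output_len", "shape": [1, 1], "datatype": "UINT32", "data": [32]},
    ]
})

class TritonT5User(HttpUser):
    host = "http://localhost:8000"  # placeholder; point at the triton server under test
    wait_time = between(0.0, 0.1)

    @task
    def infer(self):
        # POST against the standard v2 inference endpoint for the fastertransformer model.
        self.client.post(
            "/v2/models/fastertransformer/infer",
            data=PAYLOAD,
            headers={"Content-Type": "application/json"},
        )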
On a server with a single Tesla T4 GPU, after a little tuning (loading 2 model instances on that single GPU and turning on dynamic batching with a small max queue delay), I managed to get a throughput of about 8 RPS (requests per second).
The config.pbtxt used on the single-GPU server is as follows:
# Copyright (c) 2021-2022, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS "AS IS" AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
name: "fastertransformer"
backend: "fastertransformer"
default_model_filename: "t5"
max_batch_size: 96
input [
{
name: "input_ids"
data_type: TYPE_UINT32
dims: [ -1 ]
},
{
name: "sequence_length"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "runtime_top_k"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_search_diversity_rate"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "len_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "random_seed"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "is_return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "max_output_len"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "beam_width"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "start_id"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "end_id"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "bad_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
},
{
name: "stop_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
}
]
output [
{
name: "output_ids"
data_type: TYPE_UINT32
dims: [ -1, -1 ]
},
{
name: "sequence_length"
data_type: TYPE_UINT32
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
}
]
instance_group [
{
count: 2
kind : KIND_CPU
}
]
parameters {
key: "tensor_para_size"
value: {
string_value: "1"
}
}
parameters {
key: "pipeline_para_size"
value: {
string_value: "1"
}
}
parameters {
key: "is_half"
value: {
string_value: "1"
}
}
parameters {
key: "enable_custom_all_reduce"
value: {
string_value: "0"
}
}
parameters {
key: "model_type"
value: {
string_value: "T5"
}
}
parameters {
key: "model_checkpoint_path"
value: {
string_value: "./triton-model-store/t5/fastertransformer/ft_models_fp16/1-gpu/"
}
}
dynamic_batching {
max_queue_delay_microseconds: 500
}
As I am expecting to deal with an even larger workload with more concurrent users, I tried scaling the server up to 4 Tesla T4 GPUs and reconverted the PyTorch model to a FasterTransformer model for the 4-GPU setup using the steps below.
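A sketch of the conversion step, assuming the standard huggingface_t5_ckpt_convert.py script from the FasterTransformer examples; the script path, flags, and directories below are assumptions/placeholders, not the exact command:

python3 FasterTransformer/examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py \
    -in_file ./t5-base-finetuned/ \
    -saved_dir ./triton-model-store/t5/fastertransformer/ft_models_fp16/ \
    -inference_tensor_para_size 4 \
    -weight_data_type fp16

I also modified the following config.pbtxt parameters (sketched here as the tensor-parallel-4 counterparts of the single-GPU values above; the checkpoint path is a placeholder):

parameters {
key: "tensor_para_size"
value: {
string_value: "4"
}
}
parameters {
key: "model_checkpoint_path"
value: {
string_value: "./triton-model-store/t5/fastertransformer/ft_models_fp16/4-gpu/"
}
}

and started up the triton inference server with (again a sketch; the binary location and model repository path are placeholders for my local build):

CUDA_VISIBLE_DEVICES=0,1,2,3 ./tritonserver --model-repository=./triton-model-store/t5/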
(Yes, I am running a local build of triton-server instead of the pre-built docker image for both the single GPU and 4 GPU instances due to certain restrictions.)
One would expect at least some speedup just by virtue of having more GPUs (even before proper tuning), so you can imagine my surprise when I found that the RPS was still stuck at about 8, essentially no different from running on the 1-GPU server.
I then tried to increase the model instance count per GPU in config.pbtxt from 1 to 2 to match the single-GPU server config, but ran into another issue: utilization on all 4 GPUs would very quickly jump to 100%, and the CPU usage of the tritonserver process would also jump to 100%. The server would then stay pinned at 100% utilization without actually returning any responses.
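For completeness, the change that triggers the hang is just the instance count in the 4-GPU config, in the same form as the instance_group block shown above:

instance_group [
{
count: 2
kind: KIND_CPU
}
]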
Coming across this comment in issue #34, I wonder whether there is a known issue with fastertransformer_backend or triton itself being unable to run multiple model instances on a multi-GPU server when the GPUs (e.g. Tesla T4) do not natively support P2P connections, and whether this is also part of the reason the throughput on the 4-GPU server is little to no different from the throughput on the 1-GPU server.
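In case it helps with diagnosis, I am assuming the relevant check here is the standard GPU topology report, which on the 4-GPU box can be printed with:

nvidia-smi topo -m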
Would the team or anyone else have any clues as to why the multi-GPU resources are not being fully utilized in this case, and why I am having issues running multiple instances of the same T5 model on each GPU in a multi-GPU server?
Thanks in advance!