Flair Models on NVIDIA Triton Server

Scripts to help deploy the Flair ner-english-fast model on Triton Server as a TorchScript model.

Input

  1. Sentence(s)
NVIDIA is founded by Jensen Huang , Chris Malachowsky and Curtis Priem .

Output

  1. Decoded
{
  "NVIDIA is founded by Jensen Huang , Chris Malachowsky and Curtis Priem .": [
    {
      "entity_group": "ORG",
      "start": 0,
      "word": "NVIDIA",
      "end": 6,
      "score": 99
    },
    {
      "entity_group": "PER",
      "start": 21,
      "word": "Jensen Huang",
      "end": 33,
      "score": 99
    },
    {
      "entity_group": "PER",
      "start": 35,
      "word": "Chris Malachowsky",
      "end": 52,
      "score": 99
    },
    {
      "entity_group": "PER",
      "start": 57,
      "word": "Curtis Priem",
      "end": 69,
      "score": 99
    }
  ]
}
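
A request like the one above can be sent with Triton's Python client. The sketch below is illustrative rather than the repository's client script: the input/output tensor names (INPUT__0, OUTPUT__0), the BYTES datatype, and the single-JSON-string response are assumptions that should be checked against config.pbtxt.

import json

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# One sentence per batch element; Triton string tensors use the BYTES datatype.
sentences = np.array(
    ["NVIDIA is founded by Jensen Huang , Chris Malachowsky and Curtis Priem ."],
    dtype=np.object_,
)

# Assumed tensor names; the PyTorch backend defaults to INPUT__0 / OUTPUT__0.
infer_input = httpclient.InferInput("INPUT__0", list(sentences.shape), "BYTES")
infer_input.set_data_from_numpy(sentences)

result = client.infer("flair-ner-english-fast", inputs=[infer_input])
decoded = result.as_numpy("OUTPUT__0")[0]  # JSON string shaped like the example above
print(json.loads(decoded))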

Model download

You'll need to download the original model from Flair and place it at workspace/triton-models/ner-english-fast/1/model.bin
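
A minimal sketch of doing this with the flair Python package is shown below; the model identifier and target path simply mirror the instructions above, and saving the stock flair checkpoint is assumed to produce the file the conversion step expects.

from pathlib import Path

from flair.models import SequenceTagger

# Download the pretrained ner-english-fast tagger from the flair model hub.
tagger = SequenceTagger.load("flair/ner-english-fast")

# Save it where the Triton model repository expects the raw weights.
target = Path("workspace/triton-models/ner-english-fast/1/model.bin")
target.parent.mkdir(parents=True, exist_ok=True)
tagger.save(str(target))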

Launch Triton Server

The prior model conversion step places the model.pt file in the workspace/triton-models/flair-ner-english-fast/1 folder. This folder is then bound into the container and serves as the model repository (model_repo). The default configuration assumes serving on an NVIDIA GPU, but this can be changed in workspace/triton-models/flair-ner-english-fast/config.pbtxt

docker compose up triton-server
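
Once the container reports the model as loaded, readiness can also be checked with Triton's Python client. This is a minimal sketch; the URL assumes the default HTTP port mapping and the model name is assumed to match the repository folder.

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("server ready:", client.is_server_ready())
print("model ready: ", client.is_model_ready("flair-ner-english-fast"))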

Performance

There is a script that utilizes NVIDIA's perf_analyzer tool, which can quickly report latency and throughput details for your model setup.

docker compose up triton-client
docker compose run perf-analyzer
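
For reference, the measurement settings below roughly correspond to a perf_analyzer invocation like the following. The flag values are inferred from the reported settings rather than copied from the perf-analyzer service, so treat this as an illustrative example only.

perf_analyzer -m flair-ner-english-fast -i grpc --async -b 32 --concurrency-range 32:64:16 --measurement-interval 30000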

Below are some preliminary results to serve as a simple benchmark to be improved upon.

*** Measurement Settings ***
  Batch size: 32
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 30000 msec
  Latency limit: 0 msec
  Concurrency limit: 64 concurrent requests
  Using asynchronous calls for inference
  Stabilizing using p99 latency

Request concurrency: 32
  Client: 
    Request count: 6945
    Throughput: 2057.75 infer/sec
    p50 latency: 498487 usec
    p90 latency: 547042 usec
    p95 latency: 561698 usec
    p99 latency: 590426 usec
    Avg gRPC time: 496396 usec ((un)marshal request/response 3 usec + response wait 496393 usec)
  Server: 
    Inference count: 222240
    Execution count: 6945
    Successful request count: 6945
    Avg request latency: 496099 usec (overhead 7 usec + queue 465015 usec + compute input 11 usec + compute infer 31006 usec + compute output 59 usec)

Request concurrency: 48
  Client: 
    Request count: 6945
    Throughput: 2057.75 infer/sec
    p50 latency: 747909 usec
    p90 latency: 800052 usec
    p95 latency: 814290 usec
    p99 latency: 835362 usec
    Avg gRPC time: 744492 usec ((un)marshal request/response 3 usec + response wait 744489 usec)
  Server: 
    Inference count: 222240
    Execution count: 6945
    Successful request count: 6945
    Avg request latency: 744258 usec (overhead 6 usec + queue 713167 usec + compute input 11 usec + compute infer 31013 usec + compute output 60 usec)

Request concurrency: 64
  Client: 
    Request count: 6912
    Throughput: 2047.97 infer/sec
    p50 latency: 998946 usec
    p90 latency: 1057445 usec
    p95 latency: 1078742 usec
    p99 latency: 1121684 usec
    Avg gRPC time: 997158 usec ((un)marshal request/response 3 usec + response wait 997155 usec)
  Server: 
    Inference count: 221184
    Execution count: 6912
    Successful request count: 6912
    Avg request latency: 996923 usec (overhead 7 usec + queue 965683 usec + compute input 11 usec + compute infer 31161 usec + compute output 60 usec)

Inferences/Second vs. Client p99 Batch Latency
Concurrency: 32, throughput: 2057.75 infer/sec, latency 590426 usec
Concurrency: 48, throughput: 2057.75 infer/sec, latency 835362 usec
Concurrency: 64, throughput: 2047.97 infer/sec, latency 1121684 usec
