Optimize PyTorch engine inference with Falcon model #1234
Merged
Fix Falcon tensor parallelism (tp)
before:

```
concurrency: 256
elapsed_time: 128.932s
first token latency(s)(min, max, ave): 0.270, 10.948, 3.440
per-token latency(s) percentile(50, 75, 95, 99): [0.095, 0.102, 0.232, 0.441]
number of prompt tokens: 242197
number of completion tokens: 220686
token throughput (completion token): 1711.640 token/s
token throughput (prompt + completion token): 3590.119 token/s
RPS (request per second): 7.756 req/s
RPM (request per minute): 465.360 req/min
```
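
For reference, below is a minimal sketch of running a Falcon model with tensor parallelism on the PyTorch engine, assuming lmdeploy's `pipeline` API; the model path `tiiuae/falcon-7b`, the `tp=2` degree, and the prompt are illustrative assumptions, not values taken from this PR.

```python
# Sketch: Falcon inference on the PyTorch engine with tensor parallelism,
# assuming the lmdeploy pipeline API. Model path and tp degree are
# placeholders for illustration only.
from lmdeploy import pipeline, PytorchEngineConfig

# tp=2 shards the model weights across two GPUs; this is the code path
# the tp fix in this PR exercises for Falcon.
pipe = pipeline(
    'tiiuae/falcon-7b',                       # hypothetical model path
    backend_config=PytorchEngineConfig(tp=2)  # tensor-parallel degree
)

responses = pipe(['Hello, Falcon!'])
print(responses)
```

Benchmarks like the one above are typically gathered by driving such a pipeline (or the corresponding server endpoint) with many concurrent requests, here 256.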