Was chatting with @Chillee about our plans in AO today, and he mentioned we should focus on a few concrete problems, such as (a rough benchmarking sketch follows the list):
- Demonstrate compelling perf for fp8 gemm at a variety of batch sizes.
- Demonstrate compelling perf for weight-only int8 gemm at a variety of batch sizes.
- Demonstrate compelling perf for weight-only intX gemm at low batch sizes.
- Demonstrate compelling perf for intX weight, fp8 activation gemm at a variety of batch sizes.
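
To make "compelling perf at a variety of batch sizes" concrete, here's a minimal benchmarking sketch (my own illustration, not an AO kernel or its actual harness): it times a plain bf16 gemm against a naive weight-only int8 gemm (per-output-channel scales, dequantize-then-matmul) across batch sizes. The shapes, batch sizes, and dequant strategy are all assumptions for illustration.

```python
# Illustrative benchmark: bf16 gemm vs. a naive weight-only int8 gemm.
# A real weight-only kernel would fuse the dequant; this is just a reference.
import torch
from torch.utils.benchmark import Timer

torch.manual_seed(0)
device = "cuda"
in_features, out_features = 4096, 4096  # assumed shapes

w = torch.randn(out_features, in_features, device=device, dtype=torch.bfloat16)

# Per-output-channel symmetric int8 quantization of the weight.
scales = w.abs().amax(dim=1, keepdim=True) / 127.0
w_int8 = torch.clamp(torch.round(w / scales), -127, 127).to(torch.int8)

def bf16_gemm(x):
    return x @ w.t()

def int8_weight_only_gemm(x):
    # Dequantize on the fly, then matmul; memory-bound wins show up at low bs.
    return (x @ w_int8.t().to(x.dtype)) * scales.t()

for bs in (1, 4, 16, 64, 256):
    x = torch.randn(bs, in_features, device=device, dtype=torch.bfloat16)
    for name, fn in (("bf16", bf16_gemm), ("int8-wo", int8_weight_only_gemm)):
        t = Timer(stmt="fn(x)", globals={"fn": fn, "x": x}).blocked_autorange()
        print(f"bs={bs:4d} {name:8s} {t.median * 1e6:8.1f} us")
```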
As a baseline, we could extend gpt-fast to work with bs=n without doing any KV cache management work and measure perf there. I'm copying the feedback as is; open to discussing more and adding details as time progresses.
EDIT: gpt-fast already has a batched generation branch by Horace: https://github.com/pytorch-labs/gpt-fast/tree/batched_generation
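
For context on that baseline, here's a rough sketch of what "bs=n without KV cache management" could look like: greedy decoding that re-runs the model over the full sequence each step, so there is no cache bookkeeping at all. `ToyLM` is a hypothetical stand-in; in practice you'd drop in the gpt-fast model from the branch linked above.

```python
# Naive batched generation baseline: full recompute per step, no KV cache.
import time
import torch
import torch.nn as nn

class ToyLM(nn.Module):  # stand-in for the real gpt-fast model
    def __init__(self, vocab=32000, dim=512):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):  # (bs, seq) -> (bs, seq, vocab)
        return self.head(self.emb(tokens))

@torch.no_grad()
def naive_batched_generate(model, prompt, new_tokens):
    tokens = prompt
    for _ in range(new_tokens):
        logits = model(tokens)  # recompute over the full sequence each step
        next_tok = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ToyLM().to(device)
for bs in (1, 8, 32):
    prompt = torch.randint(0, 32000, (bs, 128), device=device)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    naive_batched_generate(model, prompt, new_tokens=64)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"bs={bs:3d}  {bs * 64 / elapsed:8.1f} tok/s")
```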