Was chatting with @Chillee about our plans in AO today, and he mentioned we should focus on a few concrete problems, such as (a rough benchmarking sketch follows the list):
- Demonstrate compelling perf for fp8 gemm at a variety of batch sizes.
- Demonstrate compelling perf for weight-only int8 gemm at a variety of batch sizes.
- Demonstrate compelling perf for weight-only intX gemm at low batch sizes.
- Demonstrate compelling perf for intX weight, fp8 activation gemm at a variety of batch sizes.
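
To make "compelling perf at a variety of batch sizes" concrete, here's a minimal benchmarking sketch (my own illustration, not an AO kernel or its actual harness): it times a plain bf16 gemm against a naive weight-only int8 gemm (per-output-channel scales, dequantize-then-matmul) across batch sizes. The shapes, batch sizes, and dequant strategy are all assumptions for illustration.

```python
# Illustrative benchmark: bf16 gemm vs. a naive weight-only int8 gemm.
# A real weight-only kernel would fuse the dequant; this is just a reference.
import torch
from torch.utils.benchmark import Timer

torch.manual_seed(0)
device = "cuda"
in_features, out_features = 4096, 4096  # assumed shapes

w = torch.randn(out_features, in_features, device=device, dtype=torch.bfloat16)

# Per-output-channel symmetric int8 quantization of the weight.
scales = w.abs().amax(dim=1, keepdim=True) / 127.0
w_int8 = torch.clamp(torch.round(w / scales), -127, 127).to(torch.int8)

def bf16_gemm(x):
    return x @ w.t()

def int8_weight_only_gemm(x):
    # Dequantize on the fly, then matmul; memory-bound wins show up at low bs.
    return (x @ w_int8.t().to(x.dtype)) * scales.t()

for bs in (1, 4, 16, 64, 256):
    x = torch.randn(bs, in_features, device=device, dtype=torch.bfloat16)
    for name, fn in (("bf16", bf16_gemm), ("int8-wo", int8_weight_only_gemm)):
        t = Timer(stmt="fn(x)", globals={"fn": fn, "x": x}).blocked_autorange()
        print(f"bs={bs:4d} {name:8s} {t.median * 1e6:8.1f} us")
```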
As a baseline, we could extend gpt-fast to work with bs=n without doing any KV cache management work and measure perf there. I'm copying the feedback as is; open to discussing more and adding details as time progresses.
EDIT: gpt-fast already has a batched generation branch by Horace: https://github.com/pytorch-labs/gpt-fast/tree/batched_generation
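
For context on that baseline, here's a rough sketch of what "bs=n without KV cache management" could look like: greedy decoding that re-runs the model over the full sequence each step, so there is no cache bookkeeping at all. `ToyLM` is a hypothetical stand-in; in practice you'd drop in the gpt-fast model from the branch linked above.

```python
# Naive batched generation baseline: full recompute per step, no KV cache.
import time
import torch
import torch.nn as nn

class ToyLM(nn.Module):  # stand-in for the real gpt-fast model
    def __init__(self, vocab=32000, dim=512):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):  # (bs, seq) -> (bs, seq, vocab)
        return self.head(self.emb(tokens))

@torch.no_grad()
def naive_batched_generate(model, prompt, new_tokens):
    tokens = prompt
    for _ in range(new_tokens):
        logits = model(tokens)  # recompute over the full sequence each step
        next_tok = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ToyLM().to(device)
for bs in (1, 8, 32):
    prompt = torch.randint(0, 32000, (bs, 128), device=device)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    naive_batched_generate(model, prompt, new_tokens=64)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"bs={bs:3d}  {bs * 64 / elapsed:8.1f} tok/s")
```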