Update model definition to support Flash-Decoding #177
Conversation
@@ -402,6 +403,10 @@ class BuildArgs:
            "action": "store_true",
        },
    )
    paged_kv_cache_type: str = field(
        default="vllm",
        metadata={"help": "The type of paged KV cache, either vllm or flash-decoding"},
This new option makes `--use_vllm_attention` obsolete. Since removing it is a breaking change, I'll do that later when I integrate Flash-Decoding into `mlc-serve`. @sunggg
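One hypothetical way to keep the old flag accepted until it is removed (a sketch under assumed flag spellings, not code from this PR): map `--use_vllm_attention` onto the new option and warn about the deprecation.

```python
# Hypothetical backward-compatibility shim, not from this repo: honor the old
# --use_vllm_attention flag while --paged_kv_cache_type becomes the source of truth.
import argparse
import warnings

parser = argparse.ArgumentParser()
parser.add_argument("--use_vllm_attention", action="store_true")
parser.add_argument("--paged_kv_cache_type", default="vllm",
                    choices=["vllm", "flash-decoding"])
args = parser.parse_args()

if args.use_vllm_attention:
    warnings.warn("--use_vllm_attention is deprecated; "
                  "use --paged_kv_cache_type=vllm instead.", DeprecationWarning)
    args.paged_kv_cache_type = "vllm"
```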
Force-pushed from c9d7fac to b9e41e1.
The repetition penalty (introduced in [CTRL](https://arxiv.org/abs/1909.05858)) can help prevent the LLM from generating repetitive tokens. This PR implements the repetition penalty. Note: previously the logits softmax was performed on the GPU; this PR moves it to the CPU to accommodate the repetition penalty.
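For reference, a minimal NumPy sketch of the CTRL-style repetition penalty applied to the logits on the CPU before the softmax; the helper names and the `theta` value are illustrative, not the actual implementation in this PR.

```python
import numpy as np

def apply_repetition_penalty(logits: np.ndarray, generated_token_ids, theta: float = 1.3):
    """Penalize tokens that have already been generated (CTRL-style)."""
    penalized = logits.copy()
    for tok in set(generated_token_ids):
        # Dividing positive logits and multiplying negative ones both reduce
        # the probability of sampling the token again.
        if penalized[tok] > 0:
            penalized[tok] /= theta
        else:
            penalized[tok] *= theta
    return penalized

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Example: penalize previously generated token ids before sampling.
probs = softmax(apply_repetition_penalty(np.random.randn(32000), [1, 5, 5, 42]))
```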
LGTM. In the follow-up PR, would you share some benchmark numbers? Thank you!
This PR integrates Flash-Decoding support from apache/tvm#16474. It is a drop-in replacement for the vLLM kernel; the only difference from the vLLM-based build is the shape of the KV cache blocks. In particular, the block size for vLLM is 16, while for Flash-Decoding it is 256.
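To make the block-size difference concrete, here is a rough sketch of how the paged KV cache allocation changes; the layout and sizes below are assumptions for illustration, not the exact shapes used in this repo.

```python
# Illustrative only: with a fixed memory budget, a larger block size simply
# yields fewer, larger blocks; the rest of the cache layout stays the same.
num_layers, num_kv_heads, head_dim = 32, 32, 128   # Llama-7B-like config (assumed)
dtype_bytes = 2                                    # fp16

def kv_cache_blocks(block_size: int, budget_bytes: int):
    # Bytes needed to store K and V for one token across all layers.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    num_blocks = budget_bytes // (bytes_per_token * block_size)
    # One block holds `block_size` token slots per layer for K (and likewise V).
    block_shape = (block_size, num_kv_heads, head_dim)
    return num_blocks, block_shape

print(kv_cache_blocks(block_size=16, budget_bytes=16 * 1024**3))    # vLLM
print(kv_cache_blocks(block_size=256, budget_bytes=16 * 1024**3))   # Flash-Decoding
```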
In addition, it supports decoding with multiple, fixed-length queries per request, which is necessary for speculative decoding. `evaluate_multi_query` from #156 can also be used for this purpose, but it supports variable-length queries per request and piggybacks on the prefill attention, which is not efficient when the number of queries is fixed and small. The changes in `run_llama_batched_vllm.py` demonstrate that the new Relax function, `decode_multi_query`, can do exactly the same thing as `evaluate_multi_query` when the query length is fixed (a conceptual sketch of the difference is included at the end of this description).

This PR only updates the model definition and the `run_llama_batched_vllm.py` example script. I'll follow up with the integration into `mlc-serve` next. Requires the latest https://github.com/octoml/tvm/tree/for-mlc-serve-jan12.
@sunggg @yelite @vinx13
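A conceptual sketch of the fixed-length versus variable-length multi-query layouts. This is not the signature of the Relax `decode_multi_query` or `evaluate_multi_query` functions; the shapes and names are assumptions for illustration only.

```python
import numpy as np

batch, query_len, num_heads, head_dim = 4, 5, 32, 128  # assumed sizes

# Fixed-length multi-query decode: every request verifies the same number of
# draft tokens, so the queries pack into a single dense tensor.
queries = np.zeros((batch, query_len, num_heads, head_dim), dtype="float16")

# Variable-length multi-query (the evaluate_multi_query case): requests
# contribute different numbers of query tokens, so they are flattened and
# addressed with per-request offsets, like prefill attention.
query_lens = [3, 5, 1, 7]
flat_queries = np.zeros((sum(query_lens), num_heads, head_dim), dtype="float16")
offsets = np.cumsum([0] + query_lens)  # start offset of each request's queries
```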