Allow block size 128 or 64 for flash decoding #186

masahi · 2024-02-01T21:44:00Z

The upstream code requires the block size to be a multiple of 256, but a block size of 128 or 64 also works, depending on head_dim. For Llama-based models, block size 128 can be used.
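A minimal sketch of how the head_dim-dependent check could look, not the actual change in this PR. The helper names `min_block_size` and `is_valid_block_size` are hypothetical, and the thresholds are an assumption: they suppose the minimum block size tracks the KV tile width of the split-KV kernel (256 for head_dim <= 64, 128 for head_dim <= 128, 64 otherwise). Only the Llama case (head_dim 128, block size 128) is stated above.

```python
def min_block_size(head_dim: int) -> int:
    """Smallest KV-cache block size assumed usable for a given head_dim.

    ASSUMPTION: the thresholds mirror the kernel's KV tile width; they are
    illustrative and not taken from this PR's diff.
    """
    if head_dim <= 64:
        return 256
    if head_dim <= 128:
        return 128
    return 64


def is_valid_block_size(block_size: int, head_dim: int) -> bool:
    """Accept any positive multiple of the assumed minimum for this head_dim."""
    return block_size > 0 and block_size % min_block_size(head_dim) == 0


# Llama-based models use head_dim 128, so block size 128 passes the check.
llama_ok = is_valid_block_size(128, head_dim=128)
```

Under these assumptions, a block size of 64 would be rejected for head_dim 128 but accepted for head_dim 256, and multiples of 256 stay valid for every head_dim, so the upstream behavior is preserved.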

@elvin-n @sunggg

Allow block size 128 or 64 for flash decoding

7c29571

masahi merged commit 4ebb5a3 into octoml:batch-serving Feb 1, 2024
1 check passed
