Have you ever wanted to inference a baby Llama 2 model in pure C? No? Well, now you can!
We can now run the full Llama 2 7B model! No memory mapping for now, so all the weights must fit in memory (~26 GB). On my Codespaces VM with 16 cores and 64 GB of RAM, inference runs at about 1.4 tokens per second.
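For context, the ~26 GB figure is roughly what you'd expect for ~7 billion float32 parameters at 4 bytes each. Below is a minimal, hypothetical sketch of how the checkpoint *could* be memory-mapped instead, so the OS pages weights in on demand rather than loading everything up front; the flat float32 file layout and the `map_weights` / `checkpoint_path` names are illustrative assumptions, not the repo's actual API.

```c
/* Hypothetical sketch: map a float32 checkpoint read-only with mmap so the
 * OS pages weights in lazily instead of loading ~26 GB up front.
 * File layout and names are illustrative, not llama2.c's actual code. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

float *map_weights(const char *checkpoint_path, size_t *n_floats) {
    int fd = open(checkpoint_path, O_RDONLY);
    if (fd == -1) { perror("open"); return NULL; }
    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); close(fd); return NULL; }
    /* map the whole file; pages are only brought into RAM when touched */
    void *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after close */
    if (data == MAP_FAILED) { perror("mmap"); return NULL; }
    *n_floats = (size_t)st.st_size / sizeof(float);
    return (float *)data;
}
```

A read-only mapping would also let multiple processes share the same physical pages of the weights.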
It's fast: ~1 token per second on the 7B chat model, on a 6-core machine.
I'm pretty sure that llama2.c is going to move fast and get lots of contributions. So any contribution is welcome here!
- 8-bit vectorization (see the sketch below).
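One rough way to approach the 8-bit item: store weights as int8 with a per-tensor scale, and keep the inner dot-product loop simple enough for the compiler to auto-vectorize. This is only a sketch of the idea; the `QTensor`, `quantize`, and `qdot` names below are hypothetical and not code from this repo.

```c
/* Hypothetical sketch of 8-bit quantization for the matmul inner loop:
 * weights are stored as int8 plus a float scale, and the dot product runs
 * over int8 values in a plain loop the compiler can auto-vectorize.
 * Illustration of the idea only, not llama2.c's implementation. */
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int8_t *q;   /* quantized weights */
    float scale; /* per-tensor dequantization scale */
} QTensor;

/* quantize float weights w[0..n) symmetrically into [-127, 127] */
void quantize(QTensor *t, const float *w, int n) {
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(w[i]);
        if (a > max_abs) max_abs = a;
    }
    t->scale = max_abs / 127.0f;
    if (t->scale == 0.0f) t->scale = 1.0f; /* avoid divide-by-zero on all-zero weights */
    t->q = malloc(n * sizeof(int8_t));
    for (int i = 0; i < n; i++) {
        t->q[i] = (int8_t)roundf(w[i] / t->scale);
    }
}

/* dot product of an int8 weight row with a float activation vector */
float qdot(const QTensor *t, const float *x, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc += (float)t->q[i] * x[i]; /* simple loop, auto-vectorizable */
    }
    return acc * t->scale; /* dequantize once at the end */
}
```

Cutting weight storage from 4 bytes to 1 byte per parameter would also shrink the 7B footprint from ~26 GB toward ~7 GB, which is most of the practical win.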
MIT