Release llamafile v0.8.7 · Mozilla-Ocho/llamafile

This release includes important performance enhancements for quants.

293a528 Performance improvements on Arm for legacy and k-quants (#453)
c38feb4 Optimized matrix multiplications for i-quants on __aarch64__ (#464)

This release also fixes several bugs. For example, we now use a brand new
memory manager, which we believe supports platforms like Android whose
virtual address space has fewer than 47 bits. This release also restores our
prebuilt Windows AMD GPU support, thanks to tinyBLAS.

0c0e72a Upgrade to Cosmopolitan v3.5.1
629e208 Fix server crash due to /dev/urandom
60404a8 Always use tinyBLAS with AMD GPUs on Windows
6d3590c Pacify --temp flag when running in server mode
a28250b Update GGML_HIP_UMA (#473)
e973fa2 Improve CPU brand detection
9cd8d70 Update server README build/testing instructions (#461)
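
To illustrate why the width of the virtual address space matters for a
memory manager, here is a minimal sketch in C. It is not llamafile or
Cosmopolitan code: the helper name `fits_in_va_bits` and the example
address are hypothetical, and the 39-bit figure is only an assumed example
of a platform with fewer than 47 bits. The idea is that an address chosen
with a 47-bit layout in mind may not be representable on a system with a
narrower address space.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper (not part of llamafile): returns true if `addr`
// can be represented in a user address space that is `bits` bits wide,
// i.e. no bit at position `bits` or above is set.
static bool fits_in_va_bits(uint64_t addr, unsigned bits) {
    if (bits >= 64) return true;  // shifting by >= 64 would be undefined
    return (addr >> bits) == 0;
}

// Illustrative address equal to 2^45. It fits in a 47-bit address space
// but not in an assumed 39-bit one.
static const uint64_t kExampleAddr = UINT64_C(0x200000000000);
```

With these definitions, `fits_in_va_bits(kExampleAddr, 47)` is true while
`fits_in_va_bits(kExampleAddr, 39)` is false, so a memory manager that
hard-codes such an address would need a different placement strategy on
the narrower system.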

In future releases, we plan to introduce a new server for llamafile. This
new server is being designed for performance and production-worthiness.
It's not included in this release, since it currently only supports a
tokenization endpoint. However, that endpoint is capable of serving
2 million requests per second, whereas the most we've ever seen with the
current server is a few thousand.

e0656ea Introduce new llamafile server
