llamafile v0.8.7
This release includes important performance enhancements for quants.
- 293a528 Performance improvements on Arm for legacy and k-quants (#453)
- c38feb4 Optimized matrix multiplications for i-quants on __aarch64__ (#464)
This release also fixes a number of bugs. For example, we're now using a
brand new memory manager, which we believe supports platforms like Android
whose virtual address space has fewer than 47 bits. This release also
restores our prebuilt Windows AMD GPU support, thanks to tinyBLAS.
- 0c0e72a Upgrade to Cosmopolitan v3.5.1
- 629e208 Fix server crash due to /dev/urandom
- 60404a8 Always use tinyBLAS with AMD GPUs on Windows
- 6d3590c Pacify --temp flag when running in server mode
- a28250b Update GGML_HIP_UMA (#473)
- e973fa2 Improve CPU brand detection
- 9cd8d70 Update server README build/testing instructions (#461)
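
To give a sense of scale for the address-space limit mentioned above, here is a minimal Python sketch. It is not llamafile code: the helper names are hypothetical, and the 39-bit width and the sample address are illustrative assumptions rather than values taken from these notes. It only shows how much smaller an address space becomes with fewer bits, and why an address that is valid with 47 bits may not exist on a smaller platform.

```python
def address_space_bytes(va_bits: int) -> int:
    """Size in bytes of a virtual address space with `va_bits` address bits."""
    return 1 << va_bits


def human(n: int) -> str:
    """Format a byte count using binary units (integer division only)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    i = 0
    while n >= 1024 and i < len(units) - 1:
        n //= 1024
        i += 1
    return f"{n} {units[i]}"


def fits(addr: int, va_bits: int) -> bool:
    """True if `addr` is representable in a `va_bits`-bit address space."""
    return 0 <= addr < address_space_bytes(va_bits)


# Hypothetical high address that a memory manager might pick when it
# assumes a 47-bit address space is available.
HIGH_ADDR = 0x7000_0000_0000

for bits in (47, 39):  # 39 is an assumed example of a smaller width
    size = human(address_space_bytes(bits))
    print(f"{bits}-bit: {size}, high address usable: {fits(HIGH_ADDR, bits)}")
```

A memory manager that hard-codes high addresses works with 47 bits but fails on the smaller space, which is the kind of problem the new memory manager is meant to avoid.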
In future releases, we plan to introduce a new server for llamafile. This
new server is being designed for performance and production-worthiness.
It's not included in this release, since it currently supports only a
tokenization endpoint. However, that endpoint is capable of handling
2 million requests per second, whereas the most we've ever seen from the
current server is a few thousand.
- e0656ea Introduce new llamafile server