Releases: Mozilla-Ocho/llamafile
llamafile v0.7.2
llamafile v0.7.1
This release fixes bugs in the 0.7.0 release.
- Fix 2 embeddings-related issues in server.cpp (#324)
- Detect search query to start webchat (#333)
- Use LLAMAFILE_GPU_ERROR value -2 instead of -1 (#291)
- Fix --silent-prompt flag regression #328
- Clamp out of range values in K quantizer ef0307e
- Update to latest q5_k quantization code a8b0b15
- Change file format magic number for recently bf16 file format introduced in 0.7.0. This is a breaking change. It's due to a numbering conflict with the upstream project. We're still waiting on a permanent assignment for bfloat16 so this could potentially change again. Follow ggerganov/llama.cpp#6412 for updates.
Mixtral 8x22b and Grok support are not available in this release, but they are available if you build llamafile from source on the main branch at HEAD. We're currently dealing with an AMD Windows GPU support regression there. Once it's resolved, a 0.8 release will ship.
llamafile v0.7
llamafile lets you distribute and run LLMs with a single file
This release improves the performance and accuracy of both CPU and GPU computations in addition to security.
- tinyBLAS now gives outputs consistent with the cuBLAS thanks to Kahan summation on matvec ops. This is good news for Windows users, because llamafile releases bundle tinyBLAS DLLs for driver-only GPU support. That support will now be faster, and more accurate than before, thereby reducing the need to install the CUDA / ROCm SDKs yourself.
- Prompt evaluation now goes much faster on CPU. For example, f16 weights on Raspberry Pi 5 are now 8x faster. These new optimizations mostly apply to
F16
,BF16
,Q8_0
,Q4_0
,Q4_0
, andF32
weights. Depending on the hardware and weights being used, we've observed llamafile-0.7 going anywhere between 30% to 500% faster than llama.cpp upstream. - Support for the bf16 data type has been introduced for CPU only, which is the Google Brain floating point format.
- Support for AVX512 has been introduced. Owners of CPUs like Zen4 can expect to see 10x faster prompt eval times.
- If you want to run
llamafile-0.7 [...] --recompile --gpu amd
support on Windows, this release requires that you use version 5.7+ of the ROCm HIP SDK, which may be downloaded here. - This release includes a security fix for CVE-2024-23496 (see #294).
- This release is synced with llama.cpp 2024-03-22 upstream.
llamafile v0.6.2
llamafile lets you distribute and run LLMs with a single file
This release synchronizes with llama.cpp upstream and polishes GPU
auto-configuration. Support for splitting a model onto multiple NVIDIA
GPUs has been restored.
- dfd3335 Synchronize with llama.cpp 2024-01-27
- c008e43 Synchronize with llama.cpp 2024-01-26
- e34b35c Make GPU auto configuration more resilient
- 79b88f8 Sanitize -ngl flag on Apple Metal
There's a known issue with support for splitting onto multiple AMD GPUs,
which currently doesn't work. This is an upstream issue we're working to
solve. The workaround is to set export HIP_VISIBLE_DEVICES=0
in your
environment when running llamafile, so it'll only see the first GPU.
Example llamafiles
Our llamafiles on Hugging Face are updated shortly after a release goes live.
Flagship models
Supreme models (highest-end consumer hardware)
- https://hf.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile
- https://hf.co/jartine/WizardCoder-Python-34B-V1.0-llamafile
Tiny models (small enough to use on raspberry pi)
- https://hf.co/jartine/phi-2-llamafile
- https://hf.co/jartine/rocket-3B-llamafile
- https://hf.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF
Other models:
- https://hf.co/jartine/jartine/wizardcoder-13b-python
- https://hf.co/jartine/jartine/Nous-Hermes-Llama2-llamafile
- https://hf.co/jartine/jartine/dolphin-2.5-mixtral-8x7b-llamafile
If you have a slow Internet connection and want to update your llamafiles without needing to redownload, then see the instructions here: #24 (comment) You can also download llamafile-0.6.2
and simply say ./llamafile-0.6.2 -m old.llamafile
to run your old weights.
llamafile v0.6.1
llamafile lets you distribute and run LLMs with a single file
This release fixes a crash that can happen on Apple Metal GPUs.
- 9c85d9c Fix free() related crash in ggml-metal.m
Windows users will see better performance with tinyBLAS. Please note we
still recommend installing the CUDA SDK (NVIDIA), or HIP/ROCm SDK (AMD)
for maximum performance and accuracy if you're in their support vector.
- df0b3ff Use thread-local register file for matmul speedups (#205)
- 4892494 Change BM/BN/BK to template parameters (#203)
- ed05ba9 Reduce server memory use on Windows
This release also synchronizes with llama.cpp upstream (as of Jan 9th)
along with other improvements.
- 133b05e Sync with llama.cpp upstream
- 67d97b5 Use hipcc on $PATH if it exists
- 15e2339 Do better job reporting AMD hipBLAS errors
- c617679 Don't crash when --image argument is invalid
- 3e8aa78 Clarify install/gpu docs/behavior per feedback
- eb4989a Fix typo in OpenAI API
Example llamafiles
Our llamafiles on Hugging Face are updated shortly after a release goes live.
Flagship models
Supreme models (highest-end consumer hardware)
- https://hf.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile
- https://hf.co/jartine/WizardCoder-Python-34B-V1.0-llamafile
Tiny models (small enough to use on raspberry pi)
- https://hf.co/jartine/phi-2-llamafile
- https://hf.co/jartine/rocket-3B-llamafile
- https://hf.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF
Other models:
- https://hf.co/jartine/jartine/wizardcoder-13b-python
- https://hf.co/jartine/jartine/Nous-Hermes-Llama2-llamafile
- https://hf.co/jartine/jartine/dolphin-2.5-mixtral-8x7b-llamafile
If you have a slow Internet connection and want to update your llamafiles without needing to redownload, then see the instructions here: #24 (comment) You can also download llamafile-0.6.1
and simply say ./llamafile-0.6.1 -m old.llamafile
to run your old weights.
llamafile v0.6
llamafile lets you distribute and run LLMs with a single file
This release features significant improvements to GPU support.
- 4616816 Introduce support for multiple GPUs
- 6559da6 Introduce AMD GPU support for Linux
- 20d5f46 Make CLIP GPU acceleration work on UNIX / Windows
The llamafile server is now more reliable. Invalid JSON won't crash the
server. Opening a browser tab won't prevent the server from starting.
- 3384234 Upgrade to cosmocc 3.2.4
- 585c2d8 Make browser tab launching more reliable
- 7a5ec37 Show IP addresses when binding to 0.0.0.0
- d39ec38 Enable setting thread affinity on NUMA systems
You can now say llamafile -m foo.llamafile
to load a model from a
llamafile without having to execute it, or extract the gguf file.
- bb136e1 Support opening weights from llamafiles
The documentation has been improved (but still a work in progress).
- 7ad00db Add more content to manual
Example llamafiles
Our llamafiles on Hugging Face are updated shortly after a release goes live.
Flagship models
Supreme models (highest-end consumer hardware)
- https://hf.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile
- https://hf.co/jartine/WizardCoder-Python-34B-V1.0-llamafile
Tiny models (small enough to use on raspberry pi)
- https://hf.co/jartine/phi-2-llamafile
- https://hf.co/jartine/rocket-3B-llamafile
- https://hf.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF
Other models:
- https://hf.co/jartine/jartine/wizardcoder-13b-python
- https://hf.co/jartine/jartine/Nous-Hermes-Llama2-llamafile
- https://hf.co/jartine/jartine/dolphin-2.5-mixtral-8x7b-llamafile
If you have a slow Internet connection and want to update your llamafiles without needing to redownload, then see the instructions here: #24 (comment) You can also download llamafile-0.6
and simply say ./llamafile-0.6 -m old.llamafile
to run your old weights.
llamafile v0.5
llamafile lets you distribute and run LLMs with a single file
The llamafile-server
command is now unified into llamafile
. This way
you won't need to upload your llamafiles to Hugging Face twice. We also
have rich man
page documentation for this command, which can be viewed
with pagination on all platforms via the llamafile --help
flag.
- b86dcb7 Unify llamafile-server command into llamafile
- 156f0a6 Embed man page into --help flag of each program
This release introduces support for AMD graphics cards on Windows. Our
release binaries include a prebuilt tinyBLAS DLL. Like our Nvidia DLL,
it works on stock installs and only depends on the graphics driver. GPU
on Windows is also much faster out of the box, thanks to improvements
we've made to our tinyBLAS kernels.
- 1f1c53f Get AMD GPU support working on Windows
- 1d9fa85 Add 2D blocking to tinyBLAS GemmEx (#153)
- c0589f0 Apply 2D blocking to all kernels (#156)
- c2bc6e6 Separate kernel for GemmStridedBatchedEx (#163)
- f6ee33c Read and write column-major matrices better (#164)
- d7cbaf7 Reduce BM/BN/BK to 64/32/64 to 48/12/48
- 04d6e93 Introduce --gpu flag
Apple Metal users should expect to see LLaVA image summarization go
roughly 33% faster. Complete support for Microsoft's new Phi-2 model is
now available, which works great on Raspberry Pi. FreeBSD ARM64 users
can now also enjoy this project. Shell scriptability is improved. We've
also introduced a llamafile-convert
command that makes it easier for
you to create your own llamafiles.
- 922c4f1 Add GPU acceleration of LLaVA image processing on MacOS
- 6423228 Add Phi-2 architecture support
- ce4aac6 Support FreeBSD ARM64
- 1dcf274 Add llamafile-convert command (#112)
- 50bdf69 7d23bc9 Make --log-disable work better
- 7843183 Make default thread count capped at 12 maximum
- 2e276a1 Sync with llama.cpp upstream
- dd4c9d7 Make JSON server crashes more informative
- 8762f13 474b44f Introduce --nocompile flag
- 5cf6e76 Introduce --cli flag
- f0e86e1 Don't schlep weights into CPU when using GPU
- f1410a1 Fix repeat_last_n in OpenAI server
- 3119f09 Increase server max payload size
Known Issues
- Multiple GPUs isn't supported yet.
- CLIP only supports GPU acceleration on Apple Silicon.
Example llamafiles
Our llamafiles on Hugging Face are updated shortly after a release goes live.
Flagship models
Supreme models (highest-end consumer hardware)
- https://hf.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile
- https://hf.co/jartine/WizardCoder-Python-34B-V1.0-llamafile
Tiny models (small enough to use on raspberry pi)
- https://hf.co/jartine/phi-2-llamafile
- https://hf.co/jartine/rocket-3B-llamafile
- https://hf.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF
Other models:
- https://hf.co/jartine/jartine/wizardcoder-13b-python
- https://hf.co/jartine/jartine/Nous-Hermes-Llama2-llamafile
- https://hf.co/jartine/jartine/dolphin-2.5-mixtral-8x7b-llamafile
If you have a slow Internet connection and want to update your llamafiles
without needing to redownload, then see the instructions here: #24 (comment)
llamafile v0.4.1
llamafile lets you distribute and run LLMs with a single file
If you had trouble generating filenames following the "bash one-liners"
blog post using the latest release, then please try again.
- 0984ed8 Fix regression with --grammar flag
Crashes on older Intel / AMD systems should be fixed:
- 3490afa Fix SIGILL on older Intel/AMD CPUs w/o F16C
The OpenAI API compatible endpoint has been improved.
- 9e4bf29 Fix OpenAI server sampling w.r.t. temp and seed
This release improves the documentation.
- 5c7ff6e Improve llamafile manual
- 658b18a Add WSL CUDA to GPU section (#105)
- 586b408 Update README.md so links and curl commands work (#136)
- a56ffd4 Update README to clarify Darwin kernel versioning
- 47d8a8f Fix README changing SSE3 to SSSE3
- 4da8e2e Fix README examples for certain UNIX shells
- faa7430 Change README to list Mixtral Q5 (instead of Q3)
- 6b0b64f Fix CLI README examples
We're making strides to automating our testing process.
Some other improvements:
- 9e972b2 Improve README examples
- 9de5686 Support bos token in llava-cli
- 3d81e22 Set logger callback for Apple Metal
- 9579b73 Make it easier to override CPPFLAGS
Our .llamafiles on Hugging Face have been updated to incorporate these
new release binaries. You can redownload here:
- https://huggingface.co/jartine/llava-v1.5-7B-GGUF/tree/main
- https://huggingface.co/jartine/Mistral-7B-Instruct-v0.2-llamafile/tree/main
- https://huggingface.co/jartine/wizardcoder-13b-python/tree/main
- https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile
Known Issues
LLaVA image processing using the builtin tinyBLAS library may go slow on Windows.
Here's the workaround for using the faster NVIDIA cuBLAS library instead.
- Delete the
.llamafile
directory in your home directory. - Install CUDA
- Install MSVC
- Open the "x64 MSVC command prompt" from Start
- Run llamafile there for the first invocation.
There's a YouTube video tutorial on doing this here: https://youtu.be/d1Fnfvat6nM?si=W6Y0miZ9zVBHySFj
llamafile v0.4
llamafile lets you distribute and run LLMs with a single file
This release features Mixtral support. Support has been added for Qwen
models too. The --chatml
, --samplers
, and other flags are added.
- 820d42d Synchronize with llama.cpp upstream
GPU now works out of the box on Windows. You still need to pass the
-ngl 35
flag, but you're no longer required to install CUDA/MSVC.
- a7de00b Make tinyBLAS go 95% as fast as cuBLAS for token generation (#97)
- 9d85a72 Improve GEMM performance by nearly 2x (#93)
- 72e1c72 Support CUDA without cuBLAS (#82)
- 2849b08 Make it possible for CUDA to extract prebuilt DSOs
Additional fixes and improvements:
- c236a71 Improve markdown and syntax highlighting in server (#88)
- 69ec1e4 Update the llamafile manual
- 782c81c Add SD ops, kernels
- 93178c9 Polyfill $HOME on some Windows systems
- fcc727a Write log to /dev/null when main.log fails to open
- 77cecbe Fix handling of characters that span multiple tokens when streaming
Our .llamafiles on Hugging Face have been updated to incorporate these
new release binaries. You can redownload here:
llamafile v0.3
llamafile lets you distribute and run LLMs with a single file
The llamafile-main
and llamafile-llava-cli
programs have been
unified into a single command named llamafile
. Man pages now exist in
pdf, troff, and postscript format. There's much better support for shell
scripting, thanks to a new --silent-prompt
flag. It's now possible to
shell script vision models like LLaVA using grammar constraints.
- d4e2388 Add --version flag
- baf216a Make ctrl-c work better
- 762ad79 Add
make install
build rule - 7a3e557 Write man pages for all commands
- c895a44 Remove stdout logging in llava-cli
- 6cb036c Make LLaVA more shell script friendly
- 28d3160 Introduce --silent-prompt flag to main
- 1cd334f Allow --grammar to be used on --image prompts
The OpenAI API in llamafile-server
has been improved.
- e8c92bc Make OpenAI API
stop
field optional (#36) - c1c8683 Avoid bind() conflicts on port 8080 w/ server
- 8cb9fd8 Recognize cache_prompt parameter in OpenAI API
Performance regressions have been fixed for Intel and AMD users.
- 73ee0b1 Add runtime dispatching for Q5 weights
- 36b103e Make Q2/Q3 weights go 2x faster on AMD64 AVX2 CPUs
- b4dea04 Slightly speed up LLaVA runtime dispatch on Intel
The zipalign
command is now feature complete.
- 76d47c0 Put finishing touches on zipalign tool
- 7b2fbcb Add support for replacing zip files to zipalign
Some additional improvements:
- 5f69bb9 Add SVG logo
- cd0fae0 Make memory map loader go much faster on MacOS
- c8cd8e1 Fix output path in llamafile-quantize
- dd1e0cd Support attention_bias on LLaMA architecture
- 55467d9 Fix integer overflow during quantization
- ff1b437 Have makefile download cosmocc automatically
- a7cc180 Update grammar-parser.cpp (#48)
- 61944b5 Disable pledge on systems with GPUs
- ccc377e Log cuda build command to stderr
Our .llamafiles on Hugging Face have been updated to incorporate these new release binaries. You can redownload here:
- https://huggingface.co/jartine/llava-v1.5-7B-GGUF/tree/main
- https://huggingface.co/jartine/mistral-7b.llamafile/tree/main
- https://huggingface.co/jartine/wizardcoder-13b-python/tree/main
If you have a slower Internet connection and don't want to re-download, then you don't have to! Instructions are here: