Releases: PygmalionAI/aphrodite-engine

v0.6.3.post1

02 Nov 19:11
f0e00f1

What's Changed

New Contributors

Full Changelog: v0.6.3...v0.6.3.post1

v0.6.3

02 Nov 13:21
76c05c5

What's Changed

Full Changelog: v0.6.2.post1...v0.6.3

v0.6.2.post1

16 Oct 16:39
4d3d819

What's Changed

New Contributors

Full Changelog: v0.6.2...v0.6.2.post1

v0.6.2

22 Sep 01:51
0e0bd02

What's Changed

New Contributors

Full Changelog: v0.6.1.post1...v0.6.2

v0.6.1.post1

13 Sep 08:09
c744443

What's Changed

  • chore: register custom torch ops for flash-attn and flashinfer by @AlpinDale in #724
  • feat: launch API server with uvloop by @AlpinDale in #725
  • chore: fix return statement in Detokenizer.decode_sequence_inplace by @AlpinDale in #727
  • Fix tensor parallelism, libcudart path for some versions of pytorch by @miku448 in #726
  • ci: bump to 0.6.1.post1 by @AlpinDale in #728

Full Changelog: v0.6.1...v0.6.1.post1

v0.6.1

12 Sep 03:48
8e0d376

Aphrodite Engine - v0.6.1

What's Changed

Full Changelog: v0.6.0.post1...v0.6.1

v0.6.0.post1

06 Sep 05:08

What's Changed

New Contributors

Full Changelog: v0.6.0...v0.6.0.post1

v0.6.0

03 Sep 07:11

v0.6.0 - "Kept you waiting, huh?" Edition

What's Changed

New Contributors

Full Changelog: v0.5.3...v0.6.0

v0.5.3

11 May 22:34

What's Changed

A new release, one that took too long again. We have some cool new features, however.

  • ExllamaV2 tensor parallel: You can now run ExllamaV2-quantized models on multiple GPUs. This should be the fastest multi-GPU option for ExllamaV2 models.
  • Support for Command-R+
  • Support for DBRX
  • Support for Llama-3
  • Support for Qwen 2 MoE
  • min_tokens sampling param: You can now set a minimum number of tokens to generate (first sketch after this list).
  • Fused MoE for AWQ and GPTQ quants: AWQ and GPTQ kernels have been updated with optimized fused MoE code. They should be significantly faster now.
  • CMake build system: Slightly faster, much cleaner builds.
  • CPU support: You can now run Aphrodite on CPU-only systems! Requires an AVX512-compatible CPU for now.
  • Speculative Decoding: Speculative Decoding is finally here! You can either use a draft model or use prompt lookup decoding with the built-in ngram model (second sketch below).
  • Chunked Prefill: Before this, Aphrodite would process prompts in chunks equal to the model's context length. You can now enable this option (via --enable-chunked-prefill) to process prompts in chunks of 768 tokens by default, massively increasing the amount of context you can fit (third sketch below). Does not currently work with context shift or FP8 KV cache.
  • Context Shift reworked: Context shift finally works now. Enable it with --context-shift and Aphrodite will cache processed prompts and re-use them.
  • FP8 E4M3 KV Cache: This is for ROCm only. Support will be extended to NVIDIA soon. E4M3 has higher quality compared to E5M2, but doesn't lead to any throughput increase.
  • Auto-truncation in API: The API server can now optionally left-truncate your prompts. Simply pass truncate_prompt_tokens=1024 to truncate any prompt longer than 1024 tokens (last sketch below).
  • Support for Llava vision models: Currently, 1.5 is supported. With the next release, we should have 1.6 along with a proper GPT-4V-compatible API.
  • LM Format Enforcer: You can now use LMFE for guided generations.
  • EETQ Quantization: Support for EETQ has been added - a SOTA 8-bit quantization method.
  • Arbitrary GGUF model support: We were limited to Llama models only for GGUF; now any GGUF model is supported. You will need to convert the model beforehand, however.
  • Aphrodite CLI app: You no longer have to type python -m aphrodite.... Simply type aphrodite run meta-llama/Meta-Llama-3-8B to get started. Pass extra flags as normal.
  • Sharded GGUF support: You can now load sharded GGUF models. Pre-conversion needed.
  • NVIDIA P100/GP100 support: Support has been restored.
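
A few of these are easiest to show in code. First, the min_tokens sampling param; a minimal sketch, assuming the usual LLM/SamplingParams entry points (the model name is just an example):

    from aphrodite import LLM, SamplingParams

    # Require at least 32 generated tokens before EOS may stop generation,
    # while still capping the output at 256 tokens.
    params = SamplingParams(temperature=0.8, min_tokens=32, max_tokens=256)

    llm = LLM(model="meta-llama/Meta-Llama-3-8B")  # example model
    output = llm.generate(["Once upon a time"], params)
    print(output[0].outputs[0].text)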
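
Next, prompt lookup decoding with the built-in ngram drafter. A sketch assuming Aphrodite keeps vLLM-style engine arguments here (speculative_model, num_speculative_tokens, ngram_prompt_lookup_max are the upstream names; treat them as assumptions):

    from aphrodite import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B",   # example target model
        speculative_model="[ngram]",          # built-in prompt-lookup drafter
        num_speculative_tokens=5,             # draft tokens proposed per step
        ngram_prompt_lookup_max=4,            # largest ngram to match in the prompt
        use_v2_block_manager=True,            # required for spec decoding upstream
    )
    out = llm.generate(["The quick brown fox"], SamplingParams(max_tokens=64))
    print(out[0].outputs[0].text)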
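
Chunked prefill can likewise be enabled from Python, assuming the --enable-chunked-prefill flag maps onto an engine argument of the same name (the flag is the documented interface; the keyword is an assumption):

    from aphrodite import LLM

    # Prefill in fixed-size chunks (768 tokens by default) instead of one
    # full-context pass; the keyword mirrors the CLI flag (assumed name).
    llm = LLM(model="meta-llama/Meta-Llama-3-8B", enable_chunked_prefill=True)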
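
Finally, API-side auto-truncation is just an extra field on the request body. A sketch against the OpenAI-compatible completions endpoint (the server address and model name are assumptions):

    import requests

    # truncate_prompt_tokens=1024 asks the server to keep only the last
    # 1024 tokens of the prompt before processing it.
    resp = requests.post(
        "http://localhost:2242/v1/completions",  # assumed server address
        json={
            "model": "meta-llama/Meta-Llama-3-8B",  # example model
            "prompt": "A very long prompt ...",
            "max_tokens": 128,
            "truncate_prompt_tokens": 1024,
        },
    )
    print(resp.json()["choices"][0]["text"])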

Thanks to all the new contributors!

Full Changelog: v0.5.2...v0.5.3

v0.5.2

16 Mar 22:50

What's Changed

A few fixes and new additions:

  • Support for CohereAI's command-r model: Currently, GGUF is unsupported. You can load the base model with --load-in-4bit, or with --load-in-smooth if you have an RTX 20xx-series (sm_75) or newer GPU (see the sketch after this list).
  • Fix an issue where some GPU blocks were missing. This should give a significant boost to how much context you can use.
  • Fix logprobs being -inf with some models.
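
A sketch of loading command-r with on-the-fly 4-bit quantization from Python, assuming the --load-in-4bit CLI flag maps onto a load_in_4bit engine argument (that mapping is an assumption; the checkpoint name is Cohere's real HF id):

    from aphrodite import LLM, SamplingParams

    # load_in_4bit mirrors the --load-in-4bit CLI flag (assumed keyword);
    # GGUF quants of command-r are not supported in this release.
    llm = LLM(model="CohereForAI/c4ai-command-r-v01", load_in_4bit=True)
    out = llm.generate(["Hello,"], SamplingParams(max_tokens=32))
    print(out[0].outputs[0].text)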

Full Changelog: v0.5.1...v0.5.2