Releases: PygmalionAI/aphrodite-engine
v0.6.3.post1
What's Changed
- build(deps): bump rollup from 4.21.0 to 4.24.3 in /docs by @dependabot in #796
- fix: compilation of gptq_marlin_gemm object by @AlpinDale in #800
- ci: bump to 0.6.3.post1 by @AlpinDale in #801
New Contributors
- @dependabot made their first contribution in #796
Full Changelog: v0.6.3...v0.6.3.post1
v0.6.3
What's Changed
- Stream models rather than load them completely into RAM. by @50h100a in #785
- feat: windows support by @AlpinDale in #790
- fix: windows wheel url by @AlpinDale in #794
- fix: kobold lite embedded UI on windows by @AlpinDale in #797
- feat: add HQQ quantization support by @AlpinDale in #795
- frontend: minor logging improvements by @AlpinDale in #787
- ci: bump version to 0.6.3 by @AlpinDale in #799
Full Changelog: v0.6.2.post1...v0.6.3
v0.6.2.post1
What's Changed
- fix: kobold api for horde by @AlpinDale in #763
- Fix for a crash from token bans by @Pyroserenus in #764
- Modified throughput benchmark to allow --max-num-seqs by @Pyroserenus in #770
- Simplify construction of sampling_metadata by @50h100a in #766
- Add OLMoE by @fizzAI in #772
- feat: ministral support by @AlpinDale in #776
- Make amd usable by @Naomiusearch in #775
- docker: apply AMD patch in the dockerfile by @AlpinDale in #777
- fix: demote skip_special_tokens assertion to logger error by @AlpinDale in #778
- ci: bump version to 0.6.2.post1 by @AlpinDale in #779
New Contributors
Full Changelog: v0.6.2...v0.6.2.post1
v0.6.2
What's Changed
- feat: FP8 quantization support for AMD ROCm by @AlpinDale in #729
- feat: add experts_int8 support by @AlpinDale in #730
- chore: move update_flash_attn_metadata to attn backend by @AlpinDale in #731
- chore: register lora functions as torch ops by @AlpinDale in #732
- feat: dynamo support for ScalarType by @AlpinDale in #733
- fix: types in AQLM and GGUF for dynamo support by @AlpinDale in #736
- fix: `custom_ar` check by @AlpinDale in #737
- fix: clear engine ref in RPC server by @AlpinDale in #738
- fix: use nvml to get consistent device names by @AlpinDale in #739
- feat: add Exaone model support by @shing100 in #743
- fix: minor bug fixes & clean-ups by @AlpinDale in #744
- chore: refactor `MultiModalConfig` initialization and profiling by @AlpinDale in #745
- chore: various TPU fixes and optimizations by @AlpinDale in #746
- fix: metrics endpoint with RPC server by @AlpinDale in #747
- chore: refactor llama3 rope by @AlpinDale in #748
- feat: add XTC Sampling by @AlpinDale in #740
- ci: fix dep install using pnpm by @ahme-dev in #749
- ci: fix docs deployment by @ahme-dev in #750
- chore: re-enable custom token bans by @AlpinDale in #751
- feat: bring back dynatemp by @AlpinDale in #754
- feat: quant_llm support by @AlpinDale in #755
- fix: add pandas to requirements by @AlpinDale in #756
- docs: update readme and quant docs by @AlpinDale in #757
- ci: bump version to 0.6.2 by @AlpinDale in #758
New Contributors
Full Changelog: v0.6.1.post1...v0.6.2
v0.6.1.post1
What's Changed
- chore: register custom torch ops for flash-attn and flashinfer by @AlpinDale in #724
- feat: launch API server with uvloop by @AlpinDale in #725
- chore: fix return statement in `Detokenizer.decode_sequence_inplace` by @AlpinDale in #727
- Fix tensor parallelism, libcudart path for some versions of pytorch by @miku448 in #726
- ci: bump to 0.6.1.post1 by @AlpinDale in #728
Full Changelog: v0.6.1...v0.6.1.post1
v0.6.1
Aphrodite Engine - v0.6.1
What's Changed
- ci: exclude cu118 from build and add py_limited_api by @AlpinDale in #639
- fix: better async request cancellation by @AlpinDale in #641
- fix: gracefully handle missing chat template by @AlpinDale in #642
- chore: deduplicate nvlink check to cuda platform by @AlpinDale in #643
- fix: hardcoded float16 in embedding mode check by @AlpinDale in #645
- quadratic sampling: separate diff from logits to filter out NaNs. by @50h100a in #644
- fix: RSLoRA support by @AlpinDale in #647
- feat: introduce `BaseAphroditeParameter` by @AlpinDale in #646
- fix: move zeromq rpc frontend to IPC instead of TCP by @AlpinDale in #652
- fix: input processor in internvl2 by @AlpinDale in #653
- fix: multiprocessing timeout by @AlpinDale in #654
- fix: GPTQ/AWQ on Colab by @AlpinDale in #655
- fix: make `merge_async_iterators.is_cancelled()` optional by @AlpinDale in #656
- fix: flashinfer outputs by @AlpinDale in #657
- fix: max_num_batched_tokens should not be limited for lora by @AlpinDale in #658
- fix: lora with pipeline parallel by @AlpinDale in #659
- fix: kill api server when pinging dead engine by @AlpinDale in #660
- fix: `get_num_blocks_touched` logic by @AlpinDale in #661
- chore: update the env.py script and the bug report template by @AlpinDale in #662
- feat: add INT8 W8A16 quant for TPU by @AlpinDale in #663
- feat: allow serving encoder-decoder models in the API server by @AlpinDale in #664
- fix: deps with TPU dockerfile by @AlpinDale in #665
- optimization: reduce end-to-end overhead from python obj allocation by @AlpinDale in #666
- fix: minor adjustments to scheduler and block manager by @AlpinDale in #667
- feat: enable using fp8 kv and prefix caching with chunked prefill by @AlpinDale in #668
- fix: mlpspeculator with padded vocab by @AlpinDale in #669
- feat: option to apply temperature scaling last by @AlpinDale in #670
- chore: decouple `should_modify_greedy_probs_inplace` by @AlpinDale in #671
- chore: better stream termination in async engine by @AlpinDale in #672
- chore: mamba cache single buffer by @AlpinDale in #673
- feat: mamba model support by @AlpinDale in #674
- fix: reinit procedure in `ModelInputForGPUBuilder` by @AlpinDale in #675
- feat: embeddings support for batched OAI endpoint by @AlpinDale in #676
- fix: fp8 checkpoints with fused linear modules by @AlpinDale in #677
- feat: add numpy implementation of `compute_slot_mapping` by @AlpinDale in #678
- fix: chunked prefill with v2 block manager by @AlpinDale in #679
- fix: phi3v batch inference with different aspect ratio images by @AlpinDale in #680
- chore: use mark_dynamic to reduce TPU compile times by @AlpinDale in #681
- chore: bump lmfe to v0.10.6 and include triton for tpu and xpu dockerfiles by @AlpinDale in #682
- refactor: base worker input refactor for multi-step by @AlpinDale in #683
- build: add empty device by @AlpinDale in #684
- chore: update flashinfer to v0.1.3 by @AlpinDale in #685
- feat: allow image embeddings for VLM input by @AlpinDale in #686
- feat: add progress bar for loading individual weight modules by @AlpinDale in #640
- chore: use public ECR for neuron image by @AlpinDale in #687
- fix: logit softcapping in flash-attn by @AlpinDale in #688
- chore: use scalar type to dispatch to different `gptq_marlin` kernels by @AlpinDale in #689
- fix: allow passing float for GiB arguments by @AlpinDale in #690
- build: bump cmake to 3.26 by @AlpinDale in #691
- fix: shut down ray dag workers cleanly by @AlpinDale in #692
- feat: add lora loading/unloading api endpoint by @AlpinDale in #693
- feat: add load/unload endpoints for soft-prompts by @AlpinDale in #694
- fix: loading chameleon model with TP>1 by @AlpinDale in #695
- fix: consolidated `is_tpu()` and suppress tpu import warning by @AlpinDale in #696
- fix: manually install triton for other devices to prevent outlines errors by @AlpinDale in #697
- feat: support for Audio modality by @AlpinDale in #698
- chore: migrate gptq_marlin to AphroditeParameters by @AlpinDale in #699
- chore: update fused MoE weight loading by @AlpinDale in #700
- feat: add Solar model support by @AlpinDale in #701
- feat: migrate awq and awq_marlin to AphroditeParameter by @AlpinDale in #702
- chore: spawn engine process from api server process by @AlpinDale in #703
- chore: use the `compressed-tensors` library to avoid code reuse by @AlpinDale in #704
- feat: add aphrodite plugin system by @AlpinDale in #705
- Revert "chore: use the
compressed-tensors
library to avoid code reuse (#704)" by @AlpinDale in #706 - feat: add support for multi-host TPU by @AlpinDale in #707
- fix: import ray under a guard by @AlpinDale in #708
- fix: empty sampler output when temperature is too low by @AlpinDale in #709
- fix: disable embeddings API for chat models by @AlpinDale in #710
- feat: implement mistral tokenizer mode by @AlpinDale in #711
- feat: support profiling with multiple multi-modal inputs per prompt by @AlpinDale in #712
- chore: multi-step args and sequence modifications by @AlpinDale in #713
- chore: set per-rank XLA cache for TPU by @AlpinDale in #714
- chore: add support for up to 2048 block size by @AlpinDale in #715
- fix: install protobuf for cpu by @AlpinDale in #716
- fix: weight loading for scalars by @AlpinDale in #718
- chore: quant config for speculative draft models by @AlpinDale in #719
- feat: enable prompt logprobs in OpenAI API by @AlpinDale in #720
- chore: update grafana template by @AlpinDale in #721
- ci: bump aphrodite to 0.6.1 by @AlpinDale in #722
Full Changelog: v0.6.0.post1...v0.6.1
v0.6.0.post1
What's Changed
- feat: add siglip encoder for llava family by @AlpinDale in #626
- readme: fix model name typo by @Trapper4888 in #627
- feat: multi-image input for minicpmv by @AlpinDale in #628
- feat: Add support for GPU device selection in SpecDecodeBaseSampler by @AlpinDale in #629
- feat: per-tensor token epilogue kernels by @AlpinDale in #630
- chore: optimize evictor v2 performance by @AlpinDale in #631
- feat: initial encoder-decoder support with BART model by @AlpinDale in #633
- fix: default api port and attention selector by @AlpinDale in #634
- fix: clean up incorrect log in worker by @AlpinDale in #636
- bump to v0.6.0.post1 by @AlpinDale in #635
New Contributors
- @Trapper4888 made their first contribution in #627
Full Changelog: v0.6.0...v0.6.0.post1
v0.6.0
v0.6.0 - "Kept you waiting, huh?" Edition
What's Changed
- Fix quants installation on ROCM by @Naomiusearch in #469
- chore: add contribution guidelines + Code of Conduct by @AlpinDale in #507
- Remove `$` from the shell code blocks in README by @matthusby in #538
- [0.6.0] Release Candidate by @AlpinDale in #481
New Contributors
- @matthusby made their first contribution in #538
Full Changelog: v0.5.3...v0.6.0
v0.5.3
What's Changed
A new release, one that took too long again. We have some cool new features, however.
- ExllamaV2 tensor parallel: You can now run ExllamaV2 quantized models on multiple GPUs. This should be the fastest multi-GPU experience with ExllamaV2 models.
- Support for Command-R+
- Support for DBRX
- Support for Llama-3
- Support for Qwen 2 MoE
- `min_tokens` sampling param: You can now set a minimum number of tokens to generate.
- Fused MoE for AWQ and GPTQ quants: AWQ and GPTQ kernels have been updated with optimized fused MoE code. They should be significantly faster now.
- CMake build system: Slightly faster, much cleaner builds.
- CPU support: You can now run Aphrodite on CPU-only systems! Needs an AVX512-compatible CPU for now.
- Speculative Decoding: Speculative Decoding is finally here! You can either use a draft model, or use prompt lookup decoding with an ngram model (built-in).
- Chunked Prefill: Before this, Aphrodite would process prompts in chunks equal to the model's context length. Now, you can enable this option (via `--enable-chunked-prefill`) to process in chunks of 768 tokens by default, massively increasing the amount of context you can fit. Does not currently work with context shift or FP8 KV cache. See the launch example after this list.
- Context Shift reworked: Context shift finally works now. Enable it with `--context-shift` and Aphrodite will cache processed prompts and re-use them.
- FP8 E4M3 KV Cache: This is for ROCm only. Support will be extended to NVIDIA soon. E4M3 has higher quality compared to E5M2, but doesn't lead to any throughput increase.
- Auto-truncation in API: The API server can now optionally left-truncate your prompts. Simply pass `truncate_prompt_tokens=1024` to truncate any prompt larger than 1024 tokens. See the request example after this list.
- Support for Llava vision models: Currently 1.5 is supported. With the next release, we should have 1.6 along with a proper GPT4-V compatible API.
- LM Format Enforcer: You can now use LMFE for guided generations.
- EETQ Quantization: EETQ support has been added - a SOTA 8bit quantization method.
- Arbitrary GGUF model support: We were limited to only Llama models for GGUF; now any GGUF is supported. You will need to convert the model beforehand, however.
- Aphrodite CLI app: You no longer have to type `python -m aphrodite...`. Simply type `aphrodite run meta-llama/Meta-Llama-3-8B` to get started. Pass extra flags as normal.
- Sharded GGUF support: You can now load sharded GGUF models. Pre-conversion needed.
- NVIDIA P100/GP100 support: Support has been restored.
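The notes above name a few flags without showing a full invocation. As a minimal sketch of the new CLI app together with the chunked prefill and context shift options (the model name is only a placeholder):

```sh
# Sketch only: model name is a placeholder; the flags are the ones named in the notes above.

# Launch with chunked prefill (prompts are processed in 768-token chunks by default).
aphrodite run meta-llama/Meta-Llama-3-8B --enable-chunked-prefill

# Or launch with context shift instead; per the notes, the two options
# do not currently work together.
aphrodite run meta-llama/Meta-Llama-3-8B --context-shift
```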
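And a hedged request example for the API-side parameters above. This assumes the OpenAI-compatible `/v1/completions` endpoint accepts `min_tokens` and `truncate_prompt_tokens` as extra JSON fields, as these notes suggest; adjust the host, port, and model name to your deployment.

```sh
# Assumption: the OpenAI-compatible completions endpoint accepts these extra
# sampling fields, as the notes above suggest. URL and model name are placeholders.
curl http://localhost:2242/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "prompt": "Once upon a time",
    "max_tokens": 128,
    "min_tokens": 16,
    "truncate_prompt_tokens": 1024
  }'
```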
Thanks to all the new contributors!
Full Changelog: v0.5.2...v0.5.3
v0.5.2
What's Changed
A few fixes and new additions:
- Support for CohereAI's command-r model: Currently, GGUF is unsupported. You can load the base model with `--load-in-4bit` or `--load-in-smooth` if you have an RTX 20xx series (or sm_75) card. See the example after this list.
- Fix an issue where some GPU blocks were missing. This should give a significant boost to how much context you can use.
- Fix logprobs when they are -inf with some models.
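A minimal sketch of the command-r loading flags above: `--load-in-4bit` and `--load-in-smooth` come from these notes, while the server module path and model ID are assumptions (this release predates the `aphrodite run` CLI added in v0.5.3); verify both against your install.

```sh
# Sketch only: module path and model ID are assumptions; the flags come from the notes above.
python -m aphrodite.endpoints.openai.api_server \
  --model CohereForAI/c4ai-command-r-v01 \
  --load-in-4bit
# On an RTX 20xx series (sm_75) card, the notes suggest --load-in-smooth instead.
```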
Full Changelog: v0.5.1...v0.5.2