Releases · vectorch-ai/ScaleLLM
v0.2.5
What's Changed
- ci: fix wheel build script by @guocuimi in #418
- kernel: added attention combine kernel to support split kv by @guocuimi in #419
- kernel: refactor and added more unittests for attn combine kernel by @guocuimi in #420
- moe: added token dispatcher interface for the MoE layer by @guocuimi in #421
- moe: added local token dispatcher pytorch implementation for testing (a sketch follows this list) by @guocuimi in #422
- nccl: added all2all for nccl process group by @guocuimi in #423
- moe: added all-to-all token dispatcher pytorch implementation by @guocuimi in #424
- upgrade cutlass to 3.9 by @guocuimi in #425
- kernel: added fused gate for moe by @guocuimi in #426
- chore: added pre-commit-config by @guocuimi in #427
- kernel: added moe permute kernels by @guocuimi in #428
- chore: clean up attn dependencies by @guocuimi in #429
- chore: clean up JinjaChatTemplate by @guocuimi in #430
- test: added different dtype unittests for moe permute kernels by @guocuimi in #431
- refactor: use __ldlu to load/store data and refactor code for moe permute kernels by @guocuimi in #432
- upgrade pytorch to 2.7 by @guocuimi in #434
- chore: build manylinux_2_28 builder image by @guocuimi in #435
- fix: fix manylinux_2_28 build by @guocuimi in #436
- upgrade vcpkg after the switch to manylinux_2_28 by @guocuimi in #437
- chore: add option to install py module into scalellm folder by @guocuimi in #438
- chore: add script to install zsh for devbox by @guocuimi in #439
- ci: enable docker cache by @guocuimi in #441
- kernel: add kernel for moe permutation with mask map by @guocuimi in #433
- kernel: added align block permutation kernel for moe by @guocuimi in #442
- build: added build for Blackwell by @guocuimi in #459
- chore: upgrade cutlass to v4.0 by @guocuimi in #460
- ci: change self-hosted runner tags by @guocuimi in #461
Full Changelog: v0.2.4...v0.2.5
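To make the dispatcher items above concrete, here is a minimal single-device sketch of MoE token dispatch and combine in PyTorch, in the spirit of the local dispatcher from #422. The function names, shapes, and top-1 routing are illustrative assumptions, not ScaleLLM's actual interface:

```python
import torch

def dispatch(tokens: torch.Tensor, expert_ids: torch.Tensor, num_experts: int):
    """Permute tokens so rows routed to the same expert are contiguous.

    tokens:     [n, hidden]
    expert_ids: [n] int64, top-1 expert per token (top-k omitted for brevity)
    """
    order = torch.argsort(expert_ids, stable=True)               # group rows by expert
    permuted = tokens[order]
    counts = torch.bincount(expert_ids, minlength=num_experts)   # rows per expert
    return permuted, order, counts

def combine(expert_out: torch.Tensor, order: torch.Tensor) -> torch.Tensor:
    """Scatter expert outputs back into the original token order."""
    out = torch.empty_like(expert_out)
    out[order] = expert_out
    return out

tokens = torch.randn(6, 16)
expert_ids = torch.tensor([2, 0, 1, 0, 2, 1])
permuted, order, counts = dispatch(tokens, expert_ids, num_experts=4)
assert torch.equal(combine(permuted, order), tokens)
```

The all-to-all variant (#423/#424) follows the same permute/combine shape, with an exchange over the NCCL process group between the two steps.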
v0.2.4
What's Changed
- ci: add option to skip nvbench build by @guocuimi in #390
- ci: build devel image with cuda 12.8 for Blackwell by @guocuimi in #391
- kernel: added query packing support for attention by @guocuimi in #392
- refactor: rename attention to mha to differentiate it from mla by @guocuimi in #393
- kernel: added triton aot compiler by @guocuimi in #394
- kernel: generate smaller kernel instantiations by @guocuimi in #395
- kernel: fix register spilling issue for attention head_dim=256 by @guocuimi in #397
- upgrade libtorch to 2.6.0 and cutlass to 3.8.0 by @guocuimi in #398
- kernel: added simple MLA kernel by @guocuimi in #396
- kernel: added pipeline support for mla by @guocuimi in #399
- kernel: added ping-pong rmem support for MLA by @guocuimi in #400
- kernel: revert experimental TiledMMA separation change by @guocuimi in #401
- kernel: put query always in registers for mha by @guocuimi in #402
- kernel: use 8 warps to avoid register spilling for mla with hdim=512 by @guocuimi in #403
- kernel: revert mla ping-pong rmem change by @guocuimi in #404
- kernel: refactor mask logic to avoid using a hard-coded stride by @guocuimi in #405
- kernel: added causal mask for MLA kernel by @guocuimi in #406
- kernel: added blk_n=16 for MLA to support sm_86/sm_89 with only 100KB smem by @guocuimi in #407
- kernel: fix mask bugs for MLA by @guocuimi in #408
- kernel: use different TiledMma for GEMM qk and pv by @guocuimi in #409
- kernel: added stage support for MLA kernel by @guocuimi in #410
- misc: upgrade cuda version and add devcontainer for manylinux by @guocuimi in #412
- kernel: added q and kv oob handling for MLA kernel by @guocuimi in #413
- kernel: optimize mask loop for MLA kernel by @guocuimi in #414
- kernel: added paged kv support for MLA kernel (see the sketch after this list) by @guocuimi in #415
- kernel: fix kv oob issue and added more unittests for paged MLA by @guocuimi in #416
- kernel: use FastDivmod in attention kernels by @guocuimi in #417
Full Changelog: v0.2.3...v0.2.4
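On the paged-kv support above (#415): with paged KV, the cache lives in fixed-size physical pages and a per-sequence page table maps logical token positions to pages. A hedged PyTorch illustration of the lookup an MLA kernel has to perform; the names and layout are assumptions for clarity, not ScaleLLM's kernel interface:

```python
import torch

def gather_paged_kv(kv_cache, page_table, seq_len, page_size):
    """kv_cache:   [num_pages, page_size, dim] physical pages
       page_table: physical page ids for one sequence
       returns:    [seq_len, dim] contiguous KV for that sequence
    """
    pos = torch.arange(seq_len)
    page = page_table[pos // page_size]   # which physical page
    slot = pos % page_size                # offset within the page
    return kv_cache[page, slot]

kv_cache = torch.randn(8, 4, 16)          # 8 pages of 4 slots each
page_table = torch.tensor([5, 2, 7])      # one sequence spanning 3 pages
kv = gather_paged_kv(kv_cache, page_table, seq_len=10, page_size=4)
print(kv.shape)  # torch.Size([10, 16])
```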
v0.2.3
What's Changed
- misc: remove legacy logic to support quantization for other types by @guocuimi in #350
- upgrade pytorch to 2.5.1 by @guocuimi in #351
- added cuda 12.6 build image by @guocuimi in #353
- fix cmake version issue for manylinux image by @guocuimi in #354
- kernel: added attention kernel for sm80 (Happy new year!) by @guocuimi in #355
- ci: fix package test workflow by @guocuimi in #357
- kernel: refactor attention kernel for readability by @guocuimi in #358
- dev: config dev container with proper extensions by @guocuimi in #359
- kernel: added attention bench for profiling before optimization by @guocuimi in #360
- kernel: added logits soft cap support for attention by @guocuimi in #362
- tools: added attention traits viewer by @guocuimi in #363
- kernel: added swizzle for shared memory to avoid bank conflict by @guocuimi in #364
- kernel: added causal, alibi, sliding window mask for attention (reference sketch after this list) by @guocuimi in #365
- kernel: refactor attention kernel and add more unittests by @guocuimi in #366
- kernel: added M/N OOB handling for attention by @guocuimi in #367
- tools: update svg build to generate smaller files by @guocuimi in #368
- kernel: added attention params and tile for different input types by @guocuimi in #369
- kernel: added mqa and gqa support for attention by @guocuimi in #370
- kernel: added var len and paged kv cache support for attention by @guocuimi in #371
- kernel: added varlen and pagedkv unittests for attention by @guocuimi in #372
- kernel: added attention kernel launch by @guocuimi in #373
- kernel: added build script to generate kernel instantiations for attention by @guocuimi in #374
- kernel: change attention input shape from [head, seq, dim] to [seq, head, dim] by @guocuimi in #375
- kernel: added head_dim=96 support for attention by @guocuimi in #376
- kernel: optimize attention kernel performance by @guocuimi in #377
- upgrade cutlass to 3.7.0 by @guocuimi in #379
- kernel: handle kv block range for attention kernel by @guocuimi in #382
- kernel: use cp_async_zfill instead of cute::clear for oob handling by @guocuimi in #383
- kernel: separate oob iterations for better performance by @guocuimi in #384
- refactor: remove batch_prefill interface by @guocuimi in #385
- refactor: stop building the flash_infer kernel by @guocuimi in #386
- feat: integrate in-house scale attention and use it by default by @guocuimi in #380
- kernel: only zfill k once to improve perf for attention by @guocuimi in #387
- refactor: skip flash_attn build by @guocuimi in #388
- refactor: clean up kv cache set/get apis and improve slot id calculation perf by @guocuimi in #389
Full Changelog: v0.2.2...v0.2.3
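As a companion to the masking work above (#365), here is a small CPU reference in PyTorch, in the spirit of the test-only CPU attention from #268. This is a hedged sketch with illustrative names, not the kernel's code; alibi is omitted for brevity:

```python
import torch

def ref_attention(q, k, v, causal=True, sliding_window=None):
    """q: [q_len, d]; k, v: [kv_len, d]; returns [q_len, d]."""
    q_len, kv_len = q.size(0), k.size(0)
    scores = (q @ k.T) / q.size(-1) ** 0.5
    # Right-align queries with kv positions, as in incremental decoding.
    q_pos = torch.arange(q_len).unsqueeze(1) + (kv_len - q_len)
    kv_pos = torch.arange(kv_len).unsqueeze(0)
    mask = torch.ones(q_len, kv_len, dtype=torch.bool)
    if causal:
        mask &= kv_pos <= q_pos                   # no attending ahead
    if sliding_window is not None:
        mask &= kv_pos > q_pos - sliding_window   # local attention only
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q, k, v = torch.randn(2, 8), torch.randn(6, 8), torch.randn(6, 8)
out = ref_attention(q, k, v, causal=True, sliding_window=4)
print(out.shape)  # torch.Size([2, 8])
```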
v0.2.2
What's Changed
- kernel: added flash infer attention impl by @guocuimi in #327
- refactor: flatten block tables to 1d tensor by @guocuimi in #328
- kernel: added script to generate instantiation for flashinfer kernels by @guocuimi in #329
- refactor: move flash attn and flash infer into attention folder by @guocuimi in #330
- kernel: port flash infer handler + wrapper logics by @guocuimi in #331
- ut: added unittests for flash infer kernels by @guocuimi in #332
- refactor: replaced last_page_len with kv_indptr for flash infer kernel (layout sketch after this list) by @guocuimi in #333
- feat: added support for passed-in alibi slopes in the flash infer kernel by @guocuimi in #334
- refactor: move paged kv related logic into paged_kv_t by @guocuimi in #335
- ut: added fp8 kv unittests for flash infer kernel by @guocuimi in #336
- ci: added pip cache to avoid redownloading by @guocuimi in #337
- upgrade pytorch to 2.4.1 by @guocuimi in #341
- ci: run package test in docker by @guocuimi in #345
- ci: build cuda 12.4 for scalellm cpp images by @guocuimi in #346
- upgrade pytorch to 2.5.0 by @guocuimi in #347
- ut: add more tests for different warp layouts by @guocuimi in #340
- misc: attention kernel refactoring by @guocuimi in #339
Full Changelog: v0.2.1...v0.2.2
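The 1-D block-table and kv_indptr changes above (#328, #333) amount to a CSR-style layout: every sequence's pages sit in one flattened array, and an indptr marks where each sequence begins and ends. A tiny illustration with assumed variable names:

```python
import torch

# Pages for a batch of 3 sequences, flattened into one tensor:
#   seq0 -> pages [4, 9], seq1 -> [1], seq2 -> [7, 0, 3]
paged_kv_indices = torch.tensor([4, 9, 1, 7, 0, 3])
kv_indptr = torch.tensor([0, 2, 3, 6])   # length = batch_size + 1

def pages_for(seq: int) -> torch.Tensor:
    """Slice out one sequence's physical page ids."""
    return paged_kv_indices[kv_indptr[seq]:kv_indptr[seq + 1]]

print(pages_for(2))  # tensor([7, 0, 3])
```

Unlike a per-sequence last_page_len, the indptr form directly encodes both the page count and the starting offset of every sequence.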
v0.2.1
What's Changed
- feat: added awq marlin qlinear by @guocuimi in #315
- build: speed up compilation for marlin kernels by @guocuimi in #316
- test: added unittests for marlin kernels by @guocuimi in #317
- refactor: clean up build warnings and refactor marlin kernels by @guocuimi in #318
- fix: clean up build warnings: "LOG" redefined by @guocuimi in #319
- cmake: make includes private and disable jinja2cpp build by @guocuimi in #320
- ci: allow build without requiring a physical gpu device by @guocuimi in #321
- fix: put item into asyncio.Queue in a thread-safe way (pattern shown after this list) by @guocuimi in #324
- refactor: added static switch for marlin kernel dispatch by @guocuimi in #325
- feat: fix and use marlin kernel for awq by default by @guocuimi in #326
Full Changelog: v0.2.0...v0.2.1
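For the asyncio.Queue fix above (#324), the standard pattern is worth spelling out: asyncio.Queue is not thread-safe, so a producer thread must hand items to the event loop via call_soon_threadsafe instead of calling put_nowait directly. A self-contained sketch of the pattern, not ScaleLLM's code:

```python
import asyncio
import threading

async def main():
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def producer():
        # Runs in a plain thread (e.g. a native callback); schedule the
        # put on the loop's thread rather than touching the queue here.
        for i in range(3):
            loop.call_soon_threadsafe(queue.put_nowait, i)

    threading.Thread(target=producer).start()
    for _ in range(3):
        print(await queue.get())

asyncio.run(main())
```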
v0.2.0
What's Changed
- kernel: port softcap support for flash attention (formula sketch after this list) by @guocuimi in #298
- test: added unittests for attention sliding window by @guocuimi in #299
- model: added gemma2 with softcap and sliding window support by @guocuimi in #300
- kernel: support kernel test in python via pybind by @guocuimi in #301
- test: added unittests for marlin fp16xint4 gemm by @guocuimi in #302
- fix: move eos out of stop token list to honor ignore_eos option by @guocuimi in #305
- refactor: move models to upper folder by @guocuimi in #306
- kernel: port gptq marlin kernel and fp8 marlin kernel by @guocuimi in #307
- rust: upgrade rust libs to latest version by @guocuimi in #309
- refactor: remove the logic that loads individual weights from shared partitions by @guocuimi in #311
- feat: added fused column parallel linear by @guocuimi in #313
- feat: added gptq marlin qlinear layer by @guocuimi in #312
- kernel: port awq repack kernel by @guocuimi in #314
Full Changelog: v0.1.9...v0.2.0
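The softcap ported in #298 (and used by the gemma2 support in #300) is a one-line transform: attention logits are squashed through tanh so they can never exceed ±cap. A minimal illustration; the cap value 50.0 is Gemma 2's published attention setting, not a ScaleLLM default:

```python
import torch

def soft_cap(scores: torch.Tensor, cap: float) -> torch.Tensor:
    """Bound attention logits smoothly to (-cap, cap) before softmax."""
    return cap * torch.tanh(scores / cap)

print(soft_cap(torch.tensor([1.0, 100.0]), cap=50.0))  # ~[1.00, 48.20]
```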
v0.1.9
What's Changed
- ci: cancel all previous runs if a new one is triggered by @guocuimi in #283
- pypi: fix invalid classifier by @guocuimi in #284
- refactor: remove exllama kernels by @guocuimi in #285
- kernel: added marlin dense and sparse kernels by @guocuimi in #287
- debug: added environment collection script by @guocuimi in #288
- kernel: added triton kernel build support by @guocuimi in #289
- feat: added THUDM/glm-4* support by @guocuimi in #292
- fix: handle unfinished utf8 bytes for tiktoken tokenizer (buffering sketch after this list) by @guocuimi in #293
- triton: fix build error and add example with unittest by @guocuimi in #294
- model: added qwen2 support by @guocuimi in #295
- feat: added sliding window support for QWen2 by @guocuimi in #296
- ci: fix pytest version to avoid flakiness by @guocuimi in #297
Full Changelog: v0.1.8...v0.1.9
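On the utf8 fix above (#293): a BPE token can end mid-way through a multi-byte UTF-8 character, so a streaming detokenizer has to buffer the trailing incomplete bytes instead of emitting replacement characters. A tokenizer-independent sketch of that buffering; the helper is hypothetical, not ScaleLLM's API:

```python
def flush_complete_utf8(buf: bytes) -> tuple[str, bytes]:
    """Decode the longest valid UTF-8 prefix; keep the tail for the next chunk."""
    for cut in range(len(buf), max(len(buf) - 4, -1), -1):
        try:
            return buf[:cut].decode("utf-8"), buf[cut:]
        except UnicodeDecodeError:
            continue
    return "", buf

# "é" is 0xC3 0xA9; the first chunk ends in the middle of it.
text, pending = flush_complete_utf8(b"caf\xc3")
print(repr(text), pending)               # 'caf' b'\xc3'
text, pending = flush_complete_utf8(pending + b"\xa9")
print(repr(text), pending)               # 'é' b'' — streamed as 'caf' + 'é'
```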
v0.1.8
v0.1.7
What's Changed
- build: fix build error with gcc-13 by @guocuimi in #264
- kernel: upgrade cutlass to 3.5.0 + cuda 12.4 for sm89 fp8 support by @guocuimi in #265
- cmake: define a header-only library instead of symlinks for cutlass and flashinfer by @guocuimi in #266
- feat: added range to support range-for loops by @guocuimi in #267
- kernel: added attention cpu implementation for testing by @guocuimi in #268
- build: added nvbench as submodule by @guocuimi in #269
- build: upgrade cmake required version from 3.18 to 3.26 by @guocuimi in #270
- ci: build and test in devel docker image by @guocuimi in #272
- ci: use manylinux image to build wheel and run pytest by @guocuimi in #271
- attention: added tiling logic using cute::local_tile to the cpu attention by @guocuimi in #273
- kernel: added playground for learning and experimenting with cute by @guocuimi in #274
- feat: added rope scaling support for llama3.1 (scaling-rule sketch after this list) by @guocuimi in #277
- update docs for llama3.1 support and bump up version by @guocuimi in #278
Full Changelog: v0.1.6...v0.1.7
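The rope scaling added for llama3.1 in #277 follows the published Llama 3.1 recipe: inverse frequencies with long wavelengths are slowed down by a factor, short ones are left alone, and the band in between is smoothly interpolated. A sketch using the reference constants; this mirrors the public recipe, not ScaleLLM's implementation:

```python
import math
import torch

def llama31_scale_inv_freq(inv_freq: torch.Tensor, factor=8.0,
                           low_freq_factor=1.0, high_freq_factor=4.0,
                           old_context_len=8192) -> torch.Tensor:
    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor
    wavelen = 2 * math.pi / inv_freq
    # Interpolation weight for the mid band between the two thresholds.
    smooth = (old_context_len / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor)
    scaled = torch.where(wavelen > low_freq_wavelen, inv_freq / factor, inv_freq)
    mid = (1 - smooth) * inv_freq / factor + smooth * inv_freq
    in_band = (wavelen <= low_freq_wavelen) & (wavelen >= high_freq_wavelen)
    return torch.where(in_band, mid, scaled)

base = 500000.0  # e.g. Llama 3.1's rope theta
inv_freq = 1.0 / (base ** (torch.arange(0, 64, 2) / 64))
print(llama31_scale_inv_freq(inv_freq)[:4])
```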
v0.1.6
What's Changed
- allow deploying docs when triggered on demand by @guocuimi in #253
- [model] support vision language model llava by @liutongxuan in #178
- dev: fix issues in run_in_docker script by @guocuimi in #254
- dev: added cuda 12.4 build support by @guocuimi in #255
- build: fix multiple definition issue by @guocuimi in #256
- fix: check against num_tokens instead of num_prompt_tokens for shared blocks by @guocuimi in #257
- bugfix: fix invalid max_cache_size when device is cpu by @liutongxuan in #259
- ci: fail test if not all tests were passed successfully by @guocuimi in #263
- Revert "[model] support vision language model llava. (#178)" by @guocuimi in #262
Full Changelog: v0.1.5...v0.1.6