
Releases: vectorch-ai/ScaleLLM

v0.2.5

27 May 07:58

What's Changed

Full Changelog: v0.2.4...v0.2.5

v0.2.4

02 Mar 02:34

What's Changed

  • ci: add option to skip nvbench build by @guocuimi in #390
  • ci: build devel image with cuda 12.8 for blackwell by @guocuimi in #391
  • kernel: added query packing support for attention by @guocuimi in #392
  • refactor: rename attention to mha to differentiate it from mla by @guocuimi in #393
  • kernel: added triton aot compiler by @guocuimi in #394
  • kernel: generate smaller kernel instantiations by @guocuimi in #395
  • kernel: fix register spilling issue for attention head_dim=256 by @guocuimi in #397
  • upgrade libtorch to 2.6.0 and cutlass to 3.8.0 by @guocuimi in #398
  • kernel: added simple MLA kernel by @guocuimi in #396
  • kernel: added pipeline support for mla by @guocuimi in #399
  • kernel: added ping-pong rmem support for MLA by @guocuimi in #400
  • kernel: revert experimental TiledMMA separation change. by @guocuimi in #401
  • kernel: put query always in registers for mha by @guocuimi in #402
  • kernel: use 8 warps to avoid register spilling for mla with hdim=512 by @guocuimi in #403
  • kernel: revert mla ping-pong rmem change by @guocuimi in #404
  • kernel: refactor mask logic to avoid using hard-coded stride. by @guocuimi in #405
  • kernel: added causal mask for MLA kernel by @guocuimi in #406
  • kernel: added blk_n=16 for MLA to support sm_86/sm_89 with only 100kb smem by @guocuimi in #407
  • kernel: fix mask bugs for MLA by @guocuimi in #408
  • kernel: use different TiledMma for GEMM qk and pv by @guocuimi in #409
  • kernel: added stage support for MLA kernel by @guocuimi in #410
  • misc: upgrade cuda version and add devcontainer for manylinux by @guocuimi in #412
  • kernel: added q and kv oob handling for MLA kernel by @guocuimi in #413
  • kernel: optimize mask loop for MLA kernel by @guocuimi in #414
  • kernel: added paged kv support for MLA kernel by @guocuimi in #415
  • kernel: fix kv oob issue and added more unittests for paged MLA by @guocuimi in #416
  • kernel: use FastDivmod in attention kernels by @guocuimi in #417 (sketched below)
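
The FastDivmod change (#417) removes per-element integer division from index math in the attention kernels. The idea, the same one behind CUTLASS's FastDivmod, is to turn division by a divisor fixed at launch time into a multiply and a shift. A minimal Python sketch of the round-up magic-number scheme, assuming 32-bit dividends; the class name and usage below are illustrative, not ScaleLLM's API:

```python
class FastDivmod:
    """Divide/mod by a fixed divisor using a precomputed multiply + shift.

    Useful in kernels to map a flat index to (row, col) without hardware
    integer division. Valid for dividends below 2**32.
    """

    def __init__(self, divisor: int):
        assert 0 < divisor < 2**32
        self.divisor = divisor
        # shift = ceil(log2(divisor)); multiplier = floor(2**(32+shift) / divisor) + 1
        self.shift = (divisor - 1).bit_length()
        self.multiplier = (1 << (32 + self.shift)) // divisor + 1

    def __call__(self, n: int) -> tuple[int, int]:
        q = (n * self.multiplier) >> (32 + self.shift)
        return q, n - q * self.divisor


# e.g. recover (batch, head) from a flat block index when the head count is fixed per launch
divmod_heads = FastDivmod(12)
assert divmod_heads(35) == (2, 11) == divmod(35, 12)
```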

Full Changelog: v0.2.3...v0.2.4

v0.2.3

26 Jan 22:13

What's Changed

  • misc: remove legacy logic to support quantization for other types. by @guocuimi in #350
  • upgrade pytorch to 2.5.1 by @guocuimi in #351
  • added cuda 12.6 build image by @guocuimi in #353
  • fix cmake version issue for manylinux image by @guocuimi in #354
  • kernel: added attention kernel for sm80 (Happy new year!) by @guocuimi in #355
  • ci: fix package test workflow by @guocuimi in #357
  • kernel: refactor attention kernel for readability by @guocuimi in #358
  • dev: config dev container with proper extensions by @guocuimi in #359
  • kernel: added attention bench for profiling before optimization by @guocuimi in #360
  • kernel: added logits soft cap support for attention by @guocuimi in #362
  • tools: added attention traits viewer by @guocuimi in #363
  • kernel: added swizzle for shared memory to avoid bank conflict by @guocuimi in #364
  • kernel: added causal, alibi, sliding window mask for attention by @guocuimi in #365 (see the reference sketch after this list)
  • kernel: refactor attention kernel and add more unittests by @guocuimi in #366
  • kernel: added M/N OOB handling for attention by @guocuimi in #367
  • tools: update svg build to generate small file by @guocuimi in #368
  • kernel: Added attention params and tile for different input types. by @guocuimi in #369
  • kernel: added mqa and gqa support for attention by @guocuimi in #370
  • kernel: added var len and paged kv cache support for attention by @guocuimi in #371
  • kernel: added varlen and pagedkv unittests for attention by @guocuimi in #372
  • kernel: added attention kernel launch by @guocuimi in #373
  • kernel: added build script to generate kernel instantiations for attention by @guocuimi in #374
  • kernel: change attention input shape from [head, seq, dim] to [seq, head, dim] by @guocuimi in #375
  • kernel: added head_dim=96 support for attention by @guocuimi in #376
  • kernel: optimize attention kernel performance by @guocuimi in #377
  • upgrade cutlass to 3.7.0 by @guocuimi in #379
  • kernel: handle kv block range for attention kernel by @guocuimi in #382
  • kernel: use cp_async_zfill instead of cute::clear for oob handling by @guocuimi in #383
  • kernel: separate oob iterations for better performance. by @guocuimi in #384
  • refactor: remove batch_prefill interface by @guocuimi in #385
  • refactor: stop building flash_infer kernel by @guocuimi in #386
  • feat: integrate in-house scale attention and use it by default by @guocuimi in #380
  • kernel: only zfill k once to improve perf for attention by @guocuimi in #387
  • refactor: skip flash_attn build by @guocuimi in #388
  • refactor: clean up kv cache set/get apis and improve slot id calculation perf by @guocuimi in #389
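
Several of the kernel changes above (#362, #365) concern the score-masking step of attention: logits soft cap, causal and sliding-window masks, and the ALiBi bias. A compact numpy reference in the spirit of a CPU test oracle, not the actual kernel; shapes, parameter names, and the sign convention for ALiBi are illustrative assumptions:

```python
import numpy as np

def ref_attention(q, k, v, *, causal=True, sliding_window=-1,
                  alibi_slope=0.0, logits_soft_cap=0.0):
    """Single-head reference attention: q [q_len, d], k/v [kv_len, d]."""
    q_len, d = q.shape
    kv_len = k.shape[0]
    scores = (q @ k.T) / np.sqrt(d)                      # [q_len, kv_len]

    if logits_soft_cap > 0:                              # tanh soft cap on logits
        scores = logits_soft_cap * np.tanh(scores / logits_soft_cap)

    # query i sits at absolute position (kv_len - q_len + i) when causal
    q_pos = np.arange(q_len)[:, None] + (kv_len - q_len)
    k_pos = np.arange(kv_len)[None, :]
    dist = q_pos - k_pos                                 # >= 0 means the key is in the past

    if alibi_slope != 0.0:                               # linear distance penalty
        scores = scores - alibi_slope * np.maximum(dist, 0)

    mask = np.ones_like(scores, dtype=bool)
    if causal:
        mask &= dist >= 0
    if sliding_window > 0:
        mask &= dist < sliding_window
    scores = np.where(mask, scores, -np.inf)

    p = np.exp(scores - scores.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v
```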

Full Changelog: v0.2.2...v0.2.3

v0.2.2

26 Oct 03:12

What's Changed

  • kernel: added flash infer attention impl by @guocuimi in #327
  • refactor: flatten block tables to 1d tensor by @guocuimi in #328 (indexing sketched after this list)
  • kernel: added script to generate instantiation for flashinfer kernels by @guocuimi in #329
  • refactor: move flash attn and flash infer into attention folder by @guocuimi in #330
  • kernel: port flash infer handler + wrapper logics by @guocuimi in #331
  • ut: added unittests for flash infer kernels by @guocuimi in #332
  • refactor: replaced last_page_len with kv_indptr for flash infer kernel by @guocuimi in #333
  • feat: added pass-in alibi slopes support for flash infer kernel by @guocuimi in #334
  • refactor: move paged kv related logic into paged_kv_t by @guocuimi in #335
  • ut: added fp8 kv unittests for flash infer kernel by @guocuimi in #336
  • ci: added pip cache to avoid redownloading by @guocuimi in #337
  • upgrade pytorch to 2.4.1 by @guocuimi in #341
  • ci: run package test in docker by @guocuimi in #345
  • ci: build cuda 12.4 for scalellm cpp images by @guocuimi in #346
  • Upgrade pytorch to 2.5.0 by @guocuimi in #347
  • ut: add more tests for different warp layout by @guocuimi in #340
  • misc: attention kernel refactoring by @guocuimi in #339
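
The paged-KV refactors in this release (#328, #333, #335) revolve around addressing each sequence's KV cache through a flat page table plus CSR-style offsets. A minimal sketch of that indexing scheme; the names (block_table, kv_indptr), the page layout, and the sizes are illustrative assumptions, not the paged_kv_t API:

```python
import numpy as np

PAGE_SIZE = 16   # tokens per KV cache page (illustrative)

def gather_kv(kv_cache, block_table, kv_indptr, kv_lens, seq_idx):
    """Gather the contiguous K (or V) tensor for one sequence.

    kv_cache:    [num_pages, PAGE_SIZE, head_dim] physical pages
    block_table: 1-D array of page ids for all sequences, concatenated
    kv_indptr:   [num_seqs + 1] offsets into block_table (CSR-style)
    kv_lens:     [num_seqs] number of cached tokens per sequence
    """
    pages = block_table[kv_indptr[seq_idx]:kv_indptr[seq_idx + 1]]
    n_tokens = kv_lens[seq_idx]
    # logical token t lives at (pages[t // PAGE_SIZE], t % PAGE_SIZE)
    flat = kv_cache[pages].reshape(-1, kv_cache.shape[-1])
    return flat[:n_tokens]


# toy usage: two sequences sharing one physical cache
head_dim = 4
kv_cache = np.arange(8 * PAGE_SIZE * head_dim, dtype=np.float32).reshape(8, PAGE_SIZE, head_dim)
block_table = np.array([3, 5, 1, 7, 2])    # seq0 uses pages 3,5,1; seq1 uses 7,2
kv_indptr = np.array([0, 3, 5])
kv_lens = np.array([40, 20])               # seq0: 40 cached tokens, seq1: 20
k0 = gather_kv(kv_cache, block_table, kv_indptr, kv_lens, 0)
assert k0.shape == (40, head_dim)
```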

Full Changelog: v0.2.1...v0.2.2

v0.2.1

04 Sep 23:00

What's Changed

  • feat: added awq marlin qlinear by @guocuimi in #315
  • build: speed up compilation for marlin kernels by @guocuimi in #316
  • test: added unittests for marlin kernels by @guocuimi in #317
  • refactor: clean up build warnings and refactor marlin kernels by @guocuimi in #318
  • fix: clean up build warnings: "LOG" redefined by @guocuimi in #319
  • cmake: make includes private and disable jinja2cpp build by @guocuimi in #320
  • ci: allow build without requiring a physical gpu device by @guocuimi in #321
  • fix: put item into asyncio.Queue in a thread-safe way by @guocuimi in #324 (see the sketch after this list)
  • refactor: added static switch for marlin kernel dispatch by @guocuimi in #325
  • feat: fix and use marlin kernel for awq by default by @guocuimi in #326
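
asyncio.Queue is not thread-safe, so pushing results from a worker thread into a queue consumed by the event loop (the situation #324 fixes) has to go through the loop itself via call_soon_threadsafe. A generic sketch of the pattern, not the ScaleLLM code:

```python
import asyncio
import threading

async def main():
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def producer():
        # Runs on a plain thread (e.g. a callback from a native engine).
        # asyncio.Queue is not thread-safe, so hand each put over to the loop.
        for i in range(3):
            loop.call_soon_threadsafe(queue.put_nowait, f"token-{i}")
        loop.call_soon_threadsafe(queue.put_nowait, None)  # sentinel: stream finished

    threading.Thread(target=producer, daemon=True).start()

    while (item := await queue.get()) is not None:
        print(item)

asyncio.run(main())
```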

Full Changelog: v0.2.0...v0.2.1

v0.2.0

22 Aug 01:49

What's Changed

  • kernel: port softcap support for flash attention by @guocuimi in #298
  • test: added unittests for attention sliding window by @guocuimi in #299
  • model: added gemma2 with softcap and sliding window support by @guocuimi in #300
  • kernel: support kernel test in python via pybind by @guocuimi in #301
  • test: added unittests for marlin fp16xint4 gemm by @guocuimi in #302
  • fix: move eos out of stop token list to honor ignore_eos option by @guocuimi in #305
  • refactor: move models to upper folder by @guocuimi in #306
  • kernel: port gptq marlin kernel and fp8 marlin kernel by @guocuimi in #307
  • rust: upgrade rust libs to latest version by @guocuimi in #309
  • refactor: remove the logic loading individual weight from shared partitions by @guocuimi in #311
  • feat: added fused column parallel linear by @guocuimi in #313 (see the sketch after this list)
  • feat: added gptq marlin qlinear layer by @guocuimi in #312
  • kernel: port awq repack kernel by @guocuimi in #314
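
The fused column-parallel linear in #313 follows the standard trick of concatenating the output dimensions of several projections so one GEMM replaces several, then splitting the result. A shape-level numpy sketch; the q/k/v example and all sizes are illustrative assumptions:

```python
import numpy as np

hidden, q_out, kv_out = 64, 64, 16      # illustrative sizes (e.g. GQA with fewer kv heads)
x = np.random.randn(8, hidden).astype(np.float32)            # [tokens, hidden]
wq = np.random.randn(q_out, hidden).astype(np.float32)
wk = np.random.randn(kv_out, hidden).astype(np.float32)
wv = np.random.randn(kv_out, hidden).astype(np.float32)

# Unfused: three separate GEMMs.
q, k, v = x @ wq.T, x @ wk.T, x @ wv.T

# Fused: stack the weights along the output (column-parallel) dim, run one GEMM, split.
w_fused = np.concatenate([wq, wk, wv], axis=0)                # [q_out + 2*kv_out, hidden]
fused = x @ w_fused.T
q2, k2, v2 = np.split(fused, [q_out, q_out + kv_out], axis=-1)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```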

Full Changelog: v0.1.9...v0.2.0

v0.1.9

04 Aug 00:38

What's Changed

Full Changelog: v0.1.8...v0.1.9

v0.1.8

25 Jul 12:02
2e14170

What's Changed

Full Changelog: v0.1.7...v0.1.8

v0.1.7

24 Jul 06:12
f0f7e07

What's Changed

  • build: fix build error with gcc-13 by @guocuimi in #264
  • kernel: upgrade cutlass to 3.5.0 + cuda 12.4 for sm89 fp8 support by @guocuimi in #265
  • cmake: define header-only library instead of symlinks for cutlass and flashinfer by @guocuimi in #266
  • feat: added range to support Range-for loops by @guocuimi in #267
  • kernel: added attention cpu implementation for testing by @guocuimi in #268
  • build: added nvbench as submodule by @guocuimi in #269
  • build: upgrade cmake required version from 3.18 to 3.26 by @guocuimi in #270
  • ci: build and test in devel docker image by @guocuimi in #272
  • ci: use manylinux image to build wheel and run pytest by @guocuimi in #271
  • attention: added tile logic using cute::local_tile into cpu attention by @guocuimi in #273
  • kernel: added playground for learning and experimenting with cute. by @guocuimi in #274
  • feat: added rope scaling support for llama3.1 by @guocuimi in #277 (see the sketch after this list)
  • update docs for llama3.1 support and bump up version by @guocuimi in #278
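
#277 adds llama3.1-style RoPE frequency rescaling: high-frequency (short-wavelength) components are kept as-is, low-frequency ones are divided by the scale factor, and the band in between is interpolated smoothly. A hedged Python sketch; the default constants (scale factor 8, low/high frequency factors 1 and 4, original context 8192) are the commonly published llama3.1 values, assumed here rather than taken from ScaleLLM's config:

```python
import math

def llama3_scale_rope_freqs(inv_freqs, scale_factor=8.0, low_freq_factor=1.0,
                            high_freq_factor=4.0, original_max_pos=8192):
    """Rescale RoPE inverse frequencies llama3.1-style (defaults assumed, see above)."""
    low_freq_wavelen = original_max_pos / low_freq_factor    # above this: fully rescaled
    high_freq_wavelen = original_max_pos / high_freq_factor  # below this: left untouched
    out = []
    for inv_freq in inv_freqs:
        wavelen = 2 * math.pi / inv_freq
        if wavelen < high_freq_wavelen:          # high frequency: keep as-is
            out.append(inv_freq)
        elif wavelen > low_freq_wavelen:         # low frequency: interpolate positions
            out.append(inv_freq / scale_factor)
        else:                                    # smooth ramp between the two regimes
            smooth = (original_max_pos / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
            out.append((1 - smooth) * inv_freq / scale_factor + smooth * inv_freq)
    return out


# usual RoPE base frequencies for head_dim = 128, theta = 500000 (illustrative)
head_dim, theta = 128, 500000.0
inv_freqs = [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]
scaled = llama3_scale_rope_freqs(inv_freqs)
```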

Full Changelog: v0.1.6...v0.1.7

v0.1.6

04 Jul 00:34
7aeb7fa

What's Changed

  • allow deploying docs when triggered on demand by @guocuimi in #253
  • [model] support vision language model llava. by @liutongxuan in #178
  • dev: fix issues in run_in_docker script by @guocuimi in #254
  • dev: added cuda 12.4 build support by @guocuimi in #255
  • build: fix multiple definition issue by @guocuimi in #256
  • fix: check against num_tokens instead of num_prompt_tokens for shared blocks by @guocuimi in #257
  • bugfix: fix invalid max_cache_size when device is cpu. by @liutongxuan in #259
  • ci: fail test if not all tests passed successfully by @guocuimi in #263
  • Revert "[model] support vision language model llava. (#178)" by @guocuimi in #262

Full Changelog: v0.1.5...v0.1.6