Releases · vectorch-ai/ScaleLLM
v0.2.5
What's Changed
- ci: fix wheel build script by @guocuimi in #418
- kernel: added attention combine kernel to support split kv by @guocuimi in #419
- kernel: refactor and added more unittests for attn combine kernel by @guocuimi in #420
- moe: added token dispatcher interface for the MoE layer by @guocuimi in #421
- moe: added local token dispatcher pytorch implementation for testing (a sketch follows this list) by @guocuimi in #422
- nccl: added all2all for nccl process group by @guocuimi in #423
- moe: added all-to-all token dispatcher pytorch implementation by @guocuimi in #424
- upgrade cutlass to 3.9 by @guocuimi in #425
- kernel: added fused gate for moe by @guocuimi in #426
- chore: added pre-commit-config by @guocuimi in #427
- kernel: added moe permute kernels by @guocuimi in #428
- chore: clean up attn dependencies by @guocuimi in #429
- chore: clean up JinjaChatTemplate by @guocuimi in #430
- test: added different dtype unittests for moe permute kernels by @guocuimi in #431
- refactor: use __ldlu to load/store data and refactor code for moe permute kernels by @guocuimi in #432
- upgrade pytorch to 2.7 by @guocuimi in #434
- chore: build manylinux_2_28 builder image by @guocuimi in #435
- fix: fix manylinux_2_28 build by @guocuimi in #436
- upgrade vcpkg after the switch to manylinux_2_28 by @guocuimi in #437
- chore: add option to install py module into scalellm folder by @guocuimi in #438
- chore: add script to install zsh for devbox by @guocuimi in #439
- ci: enable docker cache by @guocuimi in #441
- kernel: add kernel for moe permutation with mask map by @guocuimi in #433
- kernel: added align block permutation kernel for moe by @guocuimi in #442
- build: added build for Blackwell by @guocuimi in #459
- chore: upgrade cutlass to v4.0 by @guocuimi in #460
- ci: change self-hosted runner tags by @guocuimi in #461
Full Changelog: v0.2.4...v0.2.5
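To make the dispatcher items above concrete, here is a minimal single-device sketch of MoE token dispatch and combine in PyTorch, in the spirit of the local dispatcher from #422. The function names, shapes, and top-1 routing are illustrative assumptions, not ScaleLLM's actual interface:

```python
import torch

def dispatch(tokens: torch.Tensor, expert_ids: torch.Tensor, num_experts: int):
    """Permute tokens so rows routed to the same expert are contiguous.

    tokens:     [n, hidden]
    expert_ids: [n] int64, top-1 expert per token (top-k omitted for brevity)
    """
    order = torch.argsort(expert_ids, stable=True)               # group rows by expert
    permuted = tokens[order]
    counts = torch.bincount(expert_ids, minlength=num_experts)   # rows per expert
    return permuted, order, counts

def combine(expert_out: torch.Tensor, order: torch.Tensor) -> torch.Tensor:
    """Scatter expert outputs back into the original token order."""
    out = torch.empty_like(expert_out)
    out[order] = expert_out
    return out

tokens = torch.randn(6, 16)
expert_ids = torch.tensor([2, 0, 1, 0, 2, 1])
permuted, order, counts = dispatch(tokens, expert_ids, num_experts=4)
assert torch.equal(combine(permuted, order), tokens)
```

The all-to-all variant (#423/#424) follows the same permute/combine shape, with an exchange over the NCCL process group between the two steps.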
v0.2.4
What's Changed
- ci: add option to skip nvbench build by @guocuimi in #390
- ci: build devel image with cuda 12.8 for Blackwell by @guocuimi in #391
- kernel: added query packing support for attention by @guocuimi in #392
- refactor: rename attention to mha to differentiate it from mla by @guocuimi in #393
- kernel: added triton aot compiler by @guocuimi in #394
- kernel: generate smaller kernel instantiations by @guocuimi in #395
- kernel: fix register spilling issue for attention head_dim=256 by @guocuimi in #397
- upgrade libtorch to 2.6.0 and cutlass to 3.8.0 by @guocuimi in #398
- kernel: added simple MLA kernel by @guocuimi in #396
- kernel: added pipeline support for mla by @guocuimi in #399
- kernel: added ping-pong rmem support for MLA by @guocuimi in #400
- kernel: revert experimental TiledMMA separation change by @guocuimi in #401
- kernel: put query always in registers for mha by @guocuimi in #402
- kernel: use 8 warps to avoid register spilling for mla with hdim=512 by @guocuimi in #403
- kernel: revert mla ping-pong rmem change by @guocuimi in #404
- kernel: refactor mask logic to avoid using a hard-coded stride by @guocuimi in #405
- kernel: added causal mask for MLA kernel by @guocuimi in #406
- kernel: added blk_n=16 for MLA to support sm_86/sm_89 with only 100KB smem by @guocuimi in #407
- kernel: fix mask bugs for MLA by @guocuimi in #408
- kernel: use different TiledMma for GEMM qk and pv by @guocuimi in #409
- kernel: added stage support for MLA kernel by @guocuimi in #410
- misc: upgrade cuda version and add devcontainer for manylinux by @guocuimi in #412
- kernel: added q and kv oob handling for MLA kernel by @guocuimi in #413
- kernel: optimize mask loop for MLA kernel by @guocuimi in #414
- kernel: added paged kv support for MLA kernel (see the sketch after this list) by @guocuimi in #415
- kernel: fix kv oob issue and added more unittests for paged MLA by @guocuimi in #416
- kernel: use FastDivmod in attention kernels by @guocuimi in #417
Full Changelog: v0.2.3...v0.2.4
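On the paged-kv support above (#415): with paged KV, the cache lives in fixed-size physical pages and a per-sequence page table maps logical token positions to pages. A hedged PyTorch illustration of the lookup an MLA kernel has to perform; the names and layout are assumptions for clarity, not ScaleLLM's kernel interface:

```python
import torch

def gather_paged_kv(kv_cache, page_table, seq_len, page_size):
    """kv_cache:   [num_pages, page_size, dim] physical pages
       page_table: physical page ids for one sequence
       returns:    [seq_len, dim] contiguous KV for that sequence
    """
    pos = torch.arange(seq_len)
    page = page_table[pos // page_size]   # which physical page
    slot = pos % page_size                # offset within the page
    return kv_cache[page, slot]

kv_cache = torch.randn(8, 4, 16)          # 8 pages of 4 slots each
page_table = torch.tensor([5, 2, 7])      # one sequence spanning 3 pages
kv = gather_paged_kv(kv_cache, page_table, seq_len=10, page_size=4)
print(kv.shape)  # torch.Size([10, 16])
```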
v0.2.3
What's Changed
- misc: remove legacy logic to support quantization for other types by @guocuimi in #350
- upgrade pytorch to 2.5.1 by @guocuimi in #351
- added cuda 12.6 build image by @guocuimi in #353
- fix cmake version issue for manylinux image by @guocuimi in #354
- kernel: added attention kernel for sm80 (Happy new year!) by @guocuimi in #355
- ci: fix package test workflow by @guocuimi in #357
- kernel: refactor attention kernel for readability by @guocuimi in #358
- dev: config dev container with proper extensions by @guocuimi in #359
- kernel: added attention bench for profiling before optimization by @guocuimi in #360
- kernel: added logits soft cap support for attention by @guocuimi in #362
- tools: added attention traits viewer by @guocuimi in #363
- kernel: added swizzle for shared memory to avoid bank conflict by @guocuimi in #364
- kernel: added causal, alibi, sliding window mask for attention (reference sketch after this list) by @guocuimi in #365
- kernel: refactor attention kernel and add more unittests by @guocuimi in #366
- kernel: added M/N OOB handling for attention by @guocuimi in #367
- tools: update svg build to generate smaller files by @guocuimi in #368
- kernel: added attention params and tile for different input types by @guocuimi in #369
- kernel: added mqa and gqa support for attention by @guocuimi in #370
- kernel: added var len and paged kv cache support for attention by @guocuimi in #371
- kernel: added varlen and pagedkv unittests for attention by @guocuimi in #372
- kernel: added attention kernel launch by @guocuimi in #373
- kernel: added build script to generate kernel instantiations for attention by @guocuimi in #374
- kernel: change attention input shape from [head, seq, dim] to [seq, head, dim] by @guocuimi in #375
- kernel: added head_dim=96 support for attention by @guocuimi in #376
- kernel: optimize attention kernel performance by @guocuimi in #377
- upgrade cutlass to 3.7.0 by @guocuimi in #379
- kernel: handle kv block range for attention kernel by @guocuimi in #382
- kernel: use cp_async_zfill instead of cute::clear for oob handling by @guocuimi in #383
- kernel: separate oob iterations for better performance by @guocuimi in #384
- refactor: remove batch_prefill interface by @guocuimi in #385
- refactor: stop building the flash_infer kernel by @guocuimi in #386
- feat: integrate in-house scale attention and use it by default by @guocuimi in #380
- kernel: only zfill k once to improve perf for attention by @guocuimi in #387
- refactor: skip flash_attn build by @guocuimi in #388
- refactor: clean up kv cache set/get apis and improve slot id calculation perf by @guocuimi in #389
Full Changelog: v0.2.2...v0.2.3
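As a companion to the masking work above (#365), here is a small CPU reference in PyTorch, in the spirit of the test-only CPU attention from #268. This is a hedged sketch with illustrative names, not the kernel's code; alibi is omitted for brevity:

```python
import torch

def ref_attention(q, k, v, causal=True, sliding_window=None):
    """q: [q_len, d]; k, v: [kv_len, d]; returns [q_len, d]."""
    q_len, kv_len = q.size(0), k.size(0)
    scores = (q @ k.T) / q.size(-1) ** 0.5
    # Right-align queries with kv positions, as in incremental decoding.
    q_pos = torch.arange(q_len).unsqueeze(1) + (kv_len - q_len)
    kv_pos = torch.arange(kv_len).unsqueeze(0)
    mask = torch.ones(q_len, kv_len, dtype=torch.bool)
    if causal:
        mask &= kv_pos <= q_pos                   # no attending ahead
    if sliding_window is not None:
        mask &= kv_pos > q_pos - sliding_window   # local attention only
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q, k, v = torch.randn(2, 8), torch.randn(6, 8), torch.randn(6, 8)
out = ref_attention(q, k, v, causal=True, sliding_window=4)
print(out.shape)  # torch.Size([2, 8])
```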
v0.2.2
What's Changed
- kernel: added flash infer attention impl by @guocuimi in #327
- refactor: flatten block tables to 1d tensor by @guocuimi in #328
- kernel: added script to generate instantiation for flashinfer kernels by @guocuimi in #329
- refactor: move flash attn and flash infer into attention folder by @guocuimi in #330
- kernel: port flash infer handler + wrapper logics by @guocuimi in #331
- ut: added unittests for flash infer kernels by @guocuimi in #332
- refactor: replaced last_page_len with kv_indptr for flash infer kernel (layout sketch after this list) by @guocuimi in #333
- feat: added support for passed-in alibi slopes in the flash infer kernel by @guocuimi in #334
- refactor: move paged kv related logic into paged_kv_t by @guocuimi in #335
- ut: added fp8 kv unittests for flash infer kernel by @guocuimi in #336
- ci: added pip cache to avoid redownloading by @guocuimi in #337
- upgrade pytorch to 2.4.1 by @guocuimi in #341
- ci: run package test in docker by @guocuimi in #345
- ci: build cuda 12.4 for scalellm cpp images by @guocuimi in #346
- upgrade pytorch to 2.5.0 by @guocuimi in #347
- ut: add more tests for different warp layouts by @guocuimi in #340
- misc: attention kernel refactoring by @guocuimi in #339
Full Changelog: v0.2.1...v0.2.2
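The 1-D block-table and kv_indptr changes above (#328, #333) amount to a CSR-style layout: every sequence's pages sit in one flattened array, and an indptr marks where each sequence begins and ends. A tiny illustration with assumed variable names:

```python
import torch

# Pages for a batch of 3 sequences, flattened into one tensor:
#   seq0 -> pages [4, 9], seq1 -> [1], seq2 -> [7, 0, 3]
paged_kv_indices = torch.tensor([4, 9, 1, 7, 0, 3])
kv_indptr = torch.tensor([0, 2, 3, 6])   # length = batch_size + 1

def pages_for(seq: int) -> torch.Tensor:
    """Slice out one sequence's physical page ids."""
    return paged_kv_indices[kv_indptr[seq]:kv_indptr[seq + 1]]

print(pages_for(2))  # tensor([7, 0, 3])
```

Unlike a per-sequence last_page_len, the indptr form directly encodes both the page count and the starting offset of every sequence.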
v0.2.1
What's Changed
- feat: added awq marlin qlinear by @guocuimi in #315
- build: speed up compilation for marlin kernels by @guocuimi in #316
- test: added unittests for marlin kernels by @guocuimi in #317
- refactor: clean up build warnings and refactor marlin kernels by @guocuimi in #318
- fix: clean up build warnings: "LOG" redefined by @guocuimi in #319
- cmake: make includes private and disable jinja2cpp build by @guocuimi in #320
- ci: allow build without requiring a physical gpu device by @guocuimi in #321
- fix: put item into asyncio.Queue in a thread-safe way (pattern shown after this list) by @guocuimi in #324
- refactor: added static switch for marlin kernel dispatch by @guocuimi in #325
- feat: fix and use marlin kernel for awq by default by @guocuimi in #326
Full Changelog: v0.2.0...v0.2.1
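For the asyncio.Queue fix above (#324), the standard pattern is worth spelling out: asyncio.Queue is not thread-safe, so a producer thread must hand items to the event loop via call_soon_threadsafe instead of calling put_nowait directly. A self-contained sketch of the pattern, not ScaleLLM's code:

```python
import asyncio
import threading

async def main():
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def producer():
        # Runs in a plain thread (e.g. a native callback); schedule the
        # put on the loop's thread rather than touching the queue here.
        for i in range(3):
            loop.call_soon_threadsafe(queue.put_nowait, i)

    threading.Thread(target=producer).start()
    for _ in range(3):
        print(await queue.get())

asyncio.run(main())
```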
v0.2.0
What's Changed
- kernel: port softcap support for flash attention (formula sketch after this list) by @guocuimi in #298
- test: added unittests for attention sliding window by @guocuimi in #299
- model: added gemma2 with softcap and sliding window support by @guocuimi in #300
- kernel: support kernel test in python via pybind by @guocuimi in #301
- test: added unittests for marlin fp16xint4 gemm by @guocuimi in #302
- fix: move eos out of stop token list to honor ignore_eos option by @guocuimi in #305
- refactor: move models to upper folder by @guocuimi in #306
- kernel: port gptq marlin kernel and fp8 marlin kernel by @guocuimi in #307
- rust: upgrade rust libs to latest version by @guocuimi in #309
- refactor: remove the logic that loads individual weights from shared partitions by @guocuimi in #311
- feat: added fused column parallel linear by @guocuimi in #313
- feat: added gptq marlin qlinear layer by @guocuimi in #312
- kernel: port awq repack kernel by @guocuimi in #314
Full Changelog: v0.1.9...v0.2.0
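The softcap ported in #298 (and used by the gemma2 support in #300) is a one-line transform: attention logits are squashed through tanh so they can never exceed ±cap. A minimal illustration; the cap value 50.0 is Gemma 2's published attention setting, not a ScaleLLM default:

```python
import torch

def soft_cap(scores: torch.Tensor, cap: float) -> torch.Tensor:
    """Bound attention logits smoothly to (-cap, cap) before softmax."""
    return cap * torch.tanh(scores / cap)

print(soft_cap(torch.tensor([1.0, 100.0]), cap=50.0))  # ~[1.00, 48.20]
```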
v0.1.9
What's Changed
- ci: cancel all previous runs if a new one is triggered by @guocuimi in #283
- pypi: fix invalid classifier by @guocuimi in #284
- refactor: remove exllama kernels by @guocuimi in #285
- kernel: added marlin dense and sparse kernels by @guocuimi in #287
- debug: added environment collection script by @guocuimi in #288
- kernel: added triton kernel build support by @guocuimi in #289
- feat: added THUDM/glm-4* support by @guocuimi in #292
- fix: handle unfinished utf8 bytes for tiktoken tokenizer (buffering sketch after this list) by @guocuimi in #293
- triton: fix build error and add example with unittest by @guocuimi in #294
- model: added qwen2 support by @guocuimi in #295
- feat: added sliding window support for QWen2 by @guocuimi in #296
- ci: fix pytest version to avoid flakiness by @guocuimi in #297
Full Changelog: v0.1.8...v0.1.9
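On the utf8 fix above (#293): a BPE token can end mid-way through a multi-byte UTF-8 character, so a streaming detokenizer has to buffer the trailing incomplete bytes instead of emitting replacement characters. A tokenizer-independent sketch of that buffering; the helper is hypothetical, not ScaleLLM's API:

```python
def flush_complete_utf8(buf: bytes) -> tuple[str, bytes]:
    """Decode the longest valid UTF-8 prefix; keep the tail for the next chunk."""
    for cut in range(len(buf), max(len(buf) - 4, -1), -1):
        try:
            return buf[:cut].decode("utf-8"), buf[cut:]
        except UnicodeDecodeError:
            continue
    return "", buf

# "é" is 0xC3 0xA9; the first chunk ends in the middle of it.
text, pending = flush_complete_utf8(b"caf\xc3")
print(repr(text), pending)               # 'caf' b'\xc3'
text, pending = flush_complete_utf8(pending + b"\xa9")
print(repr(text), pending)               # 'é' b'' — streamed as 'caf' + 'é'
```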
v0.1.8
v0.1.7
What's Changed
- build: fix build error with gcc-13 by @guocuimi in #264
- kernel: upgrade cutlass to 3.5.0 + cuda 12.4 for sm89 fp8 support by @guocuimi in #265
- cmake: define a header-only library instead of symlinks for cutlass and flashinfer by @guocuimi in #266
- feat: added range to support range-for loops by @guocuimi in #267
- kernel: added attention cpu implementation for testing by @guocuimi in #268
- build: added nvbench as submodule by @guocuimi in #269
- build: upgrade cmake required version from 3.18 to 3.26 by @guocuimi in #270
- ci: build and test in devel docker image by @guocuimi in #272
- ci: use manylinux image to build wheel and run pytest by @guocuimi in #271
- attention: added tiling logic using cute::local_tile to the cpu attention by @guocuimi in #273
- kernel: added playground for learning and experimenting with cute by @guocuimi in #274
- feat: added rope scaling support for llama3.1 (scaling-rule sketch after this list) by @guocuimi in #277
- update docs for llama3.1 support and bump up version by @guocuimi in #278
Full Changelog: v0.1.6...v0.1.7
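The rope scaling added for llama3.1 in #277 follows the published Llama 3.1 recipe: inverse frequencies with long wavelengths are slowed down by a factor, short ones are left alone, and the band in between is smoothly interpolated. A sketch using the reference constants; this mirrors the public recipe, not ScaleLLM's implementation:

```python
import math
import torch

def llama31_scale_inv_freq(inv_freq: torch.Tensor, factor=8.0,
                           low_freq_factor=1.0, high_freq_factor=4.0,
                           old_context_len=8192) -> torch.Tensor:
    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor
    wavelen = 2 * math.pi / inv_freq
    # Interpolation weight for the mid band between the two thresholds.
    smooth = (old_context_len / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor)
    scaled = torch.where(wavelen > low_freq_wavelen, inv_freq / factor, inv_freq)
    mid = (1 - smooth) * inv_freq / factor + smooth * inv_freq
    in_band = (wavelen <= low_freq_wavelen) & (wavelen >= high_freq_wavelen)
    return torch.where(in_band, mid, scaled)

base = 500000.0  # e.g. Llama 3.1's rope theta
inv_freq = 1.0 / (base ** (torch.arange(0, 64, 2) / 64))
print(llama31_scale_inv_freq(inv_freq)[:4])
```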
v0.1.6
What's Changed
- allow deploying docs when triggered on demand by @guocuimi in #253
- [model] support vision language model llava by @liutongxuan in #178
- dev: fix issues in run_in_docker script by @guocuimi in #254
- dev: added cuda 12.4 build support by @guocuimi in #255
- build: fix multiple definition issue by @guocuimi in #256
- fix: check against num_tokens instead of num_prompt_tokens for shared blocks by @guocuimi in #257
- bugfix: fix invalid max_cache_size when device is cpu by @liutongxuan in #259
- ci: fail test if not all tests were passed successfully by @guocuimi in #263
- Revert "[model] support vision language model llava. (#178)" by @guocuimi in #262
Full Changelog: v0.1.5...v0.1.6