
[SYNC] Sync with CentML/hidet -> hidet-org/hidet #486

Closed
wants to merge 9 commits into from

Conversation

vadiklyutiy
Collaborator

As a follow-up to #485, sync to bring in all torch-related changes.

P.S. For some reason two old commits, 5d50e5b and de47412, also appeared here, plus an empty commit 06d903b. To be honest, I don't know why they are here; maybe I resolved the conflicts incorrectly.

vadiklyutiy and others added 8 commits December 24, 2024 11:14
Promote version of hidet 0.4.0.dev -> 0.5.0.dev
- sync `requirements.txt` with the requirements in `setup.py`
- add `extras_require`
- require torch >= 2.3.0
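The packaging change above could look roughly like this (a hypothetical sketch: only the torch >= 2.3.0 pin comes from the commit message; the extras group and other entries are illustrative):

```python
# Hypothetical sketch of the setup.py dependency declarations.
install_requires = [
    'torch>=2.3.0',   # the pin stated in the commit message
]
extras_require = {
    # optional dependency groups, installable as `pip install hidet[dev]`
    'dev': ['pytest'],
}

# In setup.py these would be passed to setuptools.setup(...), e.g.:
# setup(name='hidet', install_requires=install_requires,
#       extras_require=extras_require)
```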
Adding accuracy check for huggingface LLMs in Regression

`rtol=0.01` and `atol=0.065` were chosen so that previously "accurate" models do not fail the check.
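A minimal sketch of what such a tolerance check can look like, using `numpy.allclose` with the same `rtol`/`atol` (the sample values here are made up; the actual regression harness presumably compares model outputs):

```python
import numpy as np

# Reference outputs vs. outputs under test (illustrative values only).
ref = np.array([1.00, 0.50, -2.00])
out = np.array([1.005, 0.49, -2.06])

# allclose checks |out - ref| <= atol + rtol * |ref| elementwise.
accurate = np.allclose(out, ref, rtol=0.01, atol=0.065)
print(accurate)  # True: every deviation is within the tolerance
```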

---------

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail.com>
1. Add an option to enable experimental features. Usage:
```python
with hidet.option.context():
    hidet.option.hexcute_matmul(strategy='enable')
    ...
```
Valid values are `enable`, `disable`, and `auto`. With `auto`, a heuristic determines whether to enable the Hexcute kernel on the current GPU. The heuristic is not implemented yet and will be added in a follow-up PR.
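The enable/disable/auto resolution could be implemented along these lines (a sketch with hypothetical names; only the three option values come from the PR, and the `auto` branch is a placeholder since the real heuristic is not implemented yet):

```python
def resolve_hexcute_matmul(strategy: str, gpu_arch: str) -> bool:
    """Decide whether to use the Hexcute kernel (hypothetical helper)."""
    if strategy not in ('enable', 'disable', 'auto'):
        raise ValueError(f'invalid strategy: {strategy!r}')
    if strategy != 'auto':
        # Explicit user choice overrides any heuristic.
        return strategy == 'enable'
    # Placeholder heuristic: assume newer architectures benefit.
    return gpu_arch in ('sm_80', 'sm_89', 'sm_90')
```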

Related #640

---------

Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
Fixes:
- use the identical Docker image for all Actions
- disable several useless quant tests
- promote version to v0.6.0dev
… current stream (#629)

Use the torch C++ API to set the current stream to the current torch stream.

Implementation:
- Build a hidet-torch shared library that wraps the original torch C++ API (the original API involves torch-defined types like `CUDAStream`, so it cannot easily be dlopened and accessed at runtime)
- dlopen the newly added hidet-torch library and access torch's current
stream
- Add an option `use_torch_stream` to hidet's options to dynamically select the current torch stream or hidet's stream at runtime
- When hidet's CUDA graph mode is on, hidet will still create a new
hidet stream and capture the graph on that stream instead of using the
torch stream.
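The stream-selection logic in the bullets above can be sketched as follows (pure-Python stubs; the names are illustrative, not hidet's actual API):

```python
def select_stream(use_torch_stream: bool,
                  capturing_cuda_graph: bool,
                  torch_current_stream,
                  hidet_stream):
    """Pick the stream to launch kernels on (illustrative stub)."""
    if capturing_cuda_graph:
        # CUDA graph mode: capture on a fresh hidet stream,
        # never on the torch stream.
        return hidet_stream
    if use_torch_stream:
        # In the real implementation this would come from the
        # dlopen'd hidet-torch wrapper around the torch C++ API.
        return torch_current_stream
    return hidet_stream
```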

Benefits:
- Removes the overhead of querying and calling torch's current-stream API from the Python side
- Could also reduce the overhead incurred in the Hexcute integration, because `set_to_torch_stream` is called in the launch function; the stream query/switch on the Python side can be removed

Performance improvement (measured on L4 with locked frequencies: 6250 MHz compute / 1500 MHz memory):
1. For the Hexcute kernel (without CUDA graph), I manually disabled CUDA graph on the DMWL (vLLM) side; the prefill and decoding stages both use the generic model and call the Hexcute kernel directly.
command: `python3 benchmark_latency.py --model
hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 --input-len 1024
--output-len 128 --batch-size 8 --num-iters-warmup 5 --num-iters 10
--max-model-len 32768 --quantization awq_hidet`
Comparison before and after removing the stream query and stream switch before the Hexcute kernel call (CentML/DMWL#121):
Before avg latency: 12.624572871897545 seconds
After avg latency: 11.764245539499097 seconds

2. Profile small kernels in hidet and measure latency:
- Enable CUDA graph
`python bench_op_torch_api.py --params 16x16,16x16 --mode max-autotune
matmul`
Before: 0.27151119 second
After: 0.25410826999999997 second
- Disable CUDA graph
`python bench_op_torch_api.py --params 16x16,16x16 --mode
max-autotune-no-cudagraphs matmul`
Before: 0.14555310999999999 second
After: 0.11648335 second
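For reference, the relative improvements implied by the numbers above (straightforward arithmetic on the reported latencies):

```python
def improvement_pct(before: float, after: float) -> float:
    # Percentage reduction in latency relative to the baseline.
    return (before - after) / before * 100

vllm = improvement_pct(12.624572871897545, 11.764245539499097)   # ~6.8%
graph = improvement_pct(0.27151119, 0.25410826999999997)         # ~6.4%
no_graph = improvement_pct(0.14555310999999999, 0.11648335)      # ~20.0%
```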

This is related to #563
@vadiklyutiy vadiklyutiy added the Task Or "Story" in JIRA's term. label Dec 24, 2024
@vadiklyutiy vadiklyutiy self-assigned this Dec 24, 2024
@vadiklyutiy
Collaborator Author

Still not satisfied: a new rebase brings in the same commits.

@vadiklyutiy vadiklyutiy deleted the vadim/sync2 branch December 26, 2024 19:16
vadiklyutiy added a commit that referenced this pull request Dec 26, 2024
…487)

Fix for #486

The rule below in `rule_based_simplifier`:
`((e1 // c1) // c2, e1 // (c1 * c2))`
was applied in cases where `e1` is an int Var, `c1` is an int constant, and `c2` is an fp constant, but the rule is incorrect in that case.

Apply the rule for `int` constants only.
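A quick counterexample (values chosen for illustration) showing why the rewrite is unsound when `c2` is a floating-point constant:

```python
e1, c1, c2 = 5, 2, 1.2

lhs = (e1 // c1) // c2   # floor(floor(5/2) / 1.2) = floor(2 / 1.2) = 1.0
rhs = e1 // (c1 * c2)    # floor(5 / 2.4) = 2.0

print(lhs, rhs)  # 1.0 2.0 -- the two expressions disagree
```

With integer `c1` and `c2` the nested floor-division identity holds, which is why restricting the rule to `int` constants fixes the bug.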