
[SYNC] Sync with CentML/hidet -> hidet-org/hidet #486

Closed
wants to merge 9 commits into from

Conversation

vadiklyutiy
Collaborator

As a follow-up to #485, sync to bring in all torch-related changes.

P.S. For some reason two old commits, 5d50e5b and de47412, also appeared here, plus an empty commit 06d903b. To be honest, I don't know why they are here; maybe I resolved the conflicts incorrectly.

vadiklyutiy and others added 8 commits December 24, 2024 11:14
Promote version of hidet 0.4.0.dev -> 0.5.0.dev
- sync `requirements.txt` with the requirements in `setup.py`
- add `extras_require`
- require torch >= 2.3.0
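The packaging change above could look roughly like this (a hypothetical sketch: only the torch >= 2.3.0 pin comes from the commit message; the extras group and other entries are illustrative):

```python
# Hypothetical sketch of the setup.py dependency declarations.
install_requires = [
    'torch>=2.3.0',   # the pin stated in the commit message
]
extras_require = {
    # optional dependency groups, installable as `pip install hidet[dev]`
    'dev': ['pytest'],
}

# In setup.py these would be passed to setuptools.setup(...), e.g.:
# setup(name='hidet', install_requires=install_requires,
#       extras_require=extras_require)
```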
Adding accuracy check for huggingface LLMs in Regression

`rtol=0.01` and `atol=0.065` were chosen so that previously "accurate" models do not fail the check.
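A minimal sketch of what such a tolerance check can look like, using `numpy.allclose` with the same `rtol`/`atol` (the sample values here are made up; the actual regression harness presumably compares model outputs):

```python
import numpy as np

# Reference outputs vs. outputs under test (illustrative values only).
ref = np.array([1.00, 0.50, -2.00])
out = np.array([1.005, 0.49, -2.06])

# allclose checks |out - ref| <= atol + rtol * |ref| elementwise.
accurate = np.allclose(out, ref, rtol=0.01, atol=0.065)
print(accurate)  # True: every deviation is within the tolerance
```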

---------

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail.com>
1. Add an option to enable experimental features. Usage:
```python
with hidet.option.context():
    hidet.option.hexcute_matmul(strategy='enable')
    ...
```
Valid values are `enable`, `disable`, and `auto`. With `auto`, a heuristic determines whether to enable the Hexcute kernel on the current GPU. The heuristic is not implemented yet and will be added in a follow-up PR.
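The enable/disable/auto resolution could be implemented along these lines (a sketch with hypothetical names; only the three option values come from the PR, and the `auto` branch is a placeholder since the real heuristic is not implemented yet):

```python
def resolve_hexcute_matmul(strategy: str, gpu_arch: str) -> bool:
    """Decide whether to use the Hexcute kernel (hypothetical helper)."""
    if strategy not in ('enable', 'disable', 'auto'):
        raise ValueError(f'invalid strategy: {strategy!r}')
    if strategy != 'auto':
        # Explicit user choice overrides any heuristic.
        return strategy == 'enable'
    # Placeholder heuristic: assume newer architectures benefit.
    return gpu_arch in ('sm_80', 'sm_89', 'sm_90')
```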

Related #640

---------

Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
Fixes:
- use the identical Docker image for all Actions
- disable several useless quant tests
- promote version to v0.6.0dev
… current stream (#629)

Use the torch C++ API to set the current stream to the current torch stream.

Implementation:
- Build a hidet-torch shared library that wraps the original torch C++ API (the original API involves torch-defined types like `CUDAStream`, so it cannot easily be dlopened and accessed at runtime)
- dlopen the newly added hidet-torch library and access torch's current
stream
- Add an option `use_torch_stream` to hidet's options to dynamically select the current torch stream or hidet's stream at runtime
- When hidet's CUDA graph mode is on, hidet will still create a new
hidet stream and capture the graph on that stream instead of using the
torch stream.
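The stream-selection logic in the bullets above can be sketched as follows (pure-Python stubs; the names are illustrative, not hidet's actual API):

```python
def select_stream(use_torch_stream: bool,
                  capturing_cuda_graph: bool,
                  torch_current_stream,
                  hidet_stream):
    """Pick the stream to launch kernels on (illustrative stub)."""
    if capturing_cuda_graph:
        # CUDA graph mode: capture on a fresh hidet stream,
        # never on the torch stream.
        return hidet_stream
    if use_torch_stream:
        # In the real implementation this would come from the
        # dlopen'd hidet-torch wrapper around the torch C++ API.
        return torch_current_stream
    return hidet_stream
```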

Benefits:
- Removes the overhead of querying and calling torch's current-stream API from the Python side
- Could also reduce the overhead incurred in the Hexcute integration, because `set_to_torch_stream` is called in the launch function; the stream query/switch on the Python side can be removed

Performance improvement (measured on L4 with locked frequencies: 6250 MHz compute / 1500 MHz memory):
1. For the Hexcute kernel (without CUDA graph), I manually disabled CUDA graph on the DMWL (vLLM) side; the prefill and decoding stages both use the generic model and call the Hexcute kernel directly.
command: `python3 benchmark_latency.py --model
hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 --input-len 1024
--output-len 128 --batch-size 8 --num-iters-warmup 5 --num-iters 10
--max-model-len 32768 --quantization awq_hidet`
Comparison before and after removing the stream query and stream switch before the Hexcute kernel call (CentML/DMWL#121):
Before avg latency: 12.624572871897545 seconds
After avg latency: 11.764245539499097 seconds

2. Profile small kernels in hidet and measure latency:
- Enable CUDA graph
`python bench_op_torch_api.py --params 16x16,16x16 --mode max-autotune
matmul`
Before: 0.27151119 second
After: 0.25410826999999997 second
- Disable CUDA graph
`python bench_op_torch_api.py --params 16x16,16x16 --mode
max-autotune-no-cudagraphs matmul`
Before: 0.14555310999999999 second
After: 0.11648335 second
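For reference, the relative improvements implied by the numbers above (straightforward arithmetic on the reported latencies):

```python
def improvement_pct(before: float, after: float) -> float:
    # Percentage reduction in latency relative to the baseline.
    return (before - after) / before * 100

vllm = improvement_pct(12.624572871897545, 11.764245539499097)   # ~6.8%
graph = improvement_pct(0.27151119, 0.25410826999999997)         # ~6.4%
no_graph = improvement_pct(0.14555310999999999, 0.11648335)      # ~20.0%
```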

This is related to #563
@vadiklyutiy vadiklyutiy added the Task Or "Story" in JIRA's term. label Dec 24, 2024
@vadiklyutiy vadiklyutiy self-assigned this Dec 24, 2024
@vadiklyutiy
Collaborator Author

Still not satisfied: a new rebase brings in the same commits.

@vadiklyutiy vadiklyutiy deleted the vadim/sync2 branch December 26, 2024 19:16
vadiklyutiy added a commit that referenced this pull request Dec 26, 2024
…487)

Fix for #486

The rule below in `rule_based_simplifier`:
`((e1 // c1) // c2, e1 // (c1 * c2))`
was applied in cases where `e1` is an int Var, `c1` is an int constant, and `c2` is an fp constant, but the rule is incorrect in that case.

Apply the rule for `int` constants only.
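A quick counterexample (values chosen for illustration) showing why the rewrite is unsound when `c2` is a floating-point constant:

```python
e1, c1, c2 = 5, 2, 1.2

lhs = (e1 // c1) // c2   # floor(floor(5/2) / 1.2) = floor(2 / 1.2) = 1.0
rhs = e1 // (c1 * c2)    # floor(5 / 2.4) = 2.0

print(lhs, rhs)  # 1.0 2.0 -- the two expressions disagree
```

With integer `c1` and `c2` the nested floor-division identity holds, which is why restricting the rule to `int` constants fixes the bug.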