[SYNC] Sync with CentML/hidet -> hidet-org/hidet #486
Closed
Promote version of hidet 0.4.0.dev -> 0.5.0.dev
- Sync `requirement.txt` with the requirements in `setup.py`
- Add `extras_require`
- The requirement is `torch >= 2.3.0`
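For illustration, a minimal sketch of what the synced packaging metadata might look like (this is a hypothetical `setup.py` fragment, not the repo's actual contents; the `dev` extra is invented for the example):

```python
# Illustrative setup.py sketch: install_requires kept in sync with
# requirement.txt, plus an extras_require group (names hypothetical).
from setuptools import setup, find_packages

setup(
    name="hidet",
    packages=find_packages(),
    install_requires=[
        "torch>=2.3.0",  # the minimum torch version pinned in this sync
    ],
    extras_require={
        "dev": ["pytest"],  # hypothetical extra group, for illustration only
    },
)
```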
Adding accuracy check for Hugging Face LLMs in Regression. `rtol=0.01` and `atol=0.065` were chosen so that previously "accurate" models do not fail the check.

---------

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail.com>
1. Add an option to enable experimental features. Usage:

```python
with hidet.option.context():
    hidet.option.hexcute_matmul(strategy='enable')
    ...
```

The valid values are `enable`, `disable`, and `auto`. The `auto` option uses a heuristic to determine whether to enable the Hexcute kernel on the current GPU. The heuristic is not implemented yet and will be added in a following PR. Related: #640

---------

Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
Fixes:
- Use the identical Docker image for all Actions
- Disable several useless quant tests
- Promote version to v0.6.0dev
… current stream (#629)

Use the torch C++ API to set the current stream to the current torch stream.

Implementation:
- Build a hidet-torch shared library to wrap the original torch C++ API (the original API contains torch-defined structures like `CUDAStream` and cannot easily be dlopened and accessed at runtime)
- dlopen the newly added hidet-torch library and access torch's current stream
- Add an option `use_torch_stream` to hidet's options to dynamically set the stream to the current torch stream or hidet's stream at runtime
- When hidet's CUDA graph mode is on, hidet will still create a new hidet stream and capture the graph on that stream instead of using the torch stream

Benefits:
- Removes the overhead of querying and calling torch's current-stream API from the Python side
- Could also reduce the overhead incurred in the Hexcute integration, because `set_to_torch_stream` is called in the launch function; we can remove the stream query/switch on the Python side

Performance improvement (measured on L4, locked frequency @ 6250 MHz compute / 1500 MHz memory):

1. For the Hexcute kernel (without CUDA graph), I manually disabled CUDA graph on the DMWL (vLLM) side, so the prefill and decode stages both use the generic model and call the Hexcute kernel directly.

   Command: `python3 benchmark_latency.py --model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 --input-len 1024 --output-len 128 --batch-size 8 --num-iters-warmup 5 --num-iters 10 --max-model-len 32768 --quantization awq_hidet`

   Comparison before and after removing the stream query and stream switch before the Hexcute kernel call (CentML/DMWL#121):
   - Before: avg latency 12.624572871897545 seconds
   - After: avg latency 11.764245539499097 seconds

2. Profile small kernels in hidet and measure latency:
   - CUDA graph enabled: `python bench_op_torch_api.py --params 16x16,16x16 --mode max-autotune matmul`
     - Before: 0.27151119 second
     - After: 0.25410826999999997 second
   - CUDA graph disabled: `python bench_op_torch_api.py --params 16x16,16x16 --mode max-autotune-no-cudagraphs matmul`
     - Before: 0.14555310999999999 second
     - After: 0.11648335 second

This is related to #563.
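For context, the reported latencies work out to roughly a 6.8% end-to-end reduction for the Hexcute benchmark and 6.4% / 20% reductions for the two small-kernel runs (simple arithmetic on the figures above):

```python
# Relative improvement implied by the latencies reported above.
def pct_faster(before: float, after: float) -> float:
    """Percentage reduction in latency going from `before` to `after`."""
    return 100.0 * (before - after) / before

# 1. Hexcute kernel without CUDA graph (avg latency, seconds)
hexcute = pct_faster(12.624572871897545, 11.764245539499097)
# 2. Small matmul, CUDA graph enabled (seconds)
with_graph = pct_faster(0.27151119, 0.25410826999999997)
# 3. Small matmul, CUDA graph disabled (seconds)
no_graph = pct_faster(0.14555310999999999, 0.11648335)

print(f"{hexcute:.1f}% / {with_graph:.1f}% / {no_graph:.1f}%")
# → 6.8% / 6.4% / 20.0%
```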
Still not satisfied. A new rebase brings the same commits.
As a follow-up to #485, make a sync to bring in all changes related to torch.

P.S. For some reason, two old commits (5d50e5b, de47412) plus an empty commit (06d903b) also appeared here. To be honest, I don't know why they are here; maybe I resolved the conflicts incorrectly.