Sync CentML/hidet -> hidet-org/hidet #476
Conversation
- Fix a bug in `normalize_launch_dims()` - increase the tolerance in tests/cuda/test_cudnn.py, which fails frequently.
Cast the return value of `get_parallel_num_workers` to float. Fixes #388
…k/run_config.json (#205) Added the successfully compiled models to the json file, as discussed in #192. Models that are runnable after compilation and passed the correctness checks:
- `mnasnet1_0`
- `resnext50_32x4d`
- `shufflenet_v2_x1_0`
- `squeezenet1_1`
- `mobilenet_v3_large`
- `resnet18`

(All tested with input shape `128x3x224x224` and data type `float16`.) Regression run: https://github.com/CentML/hidet/actions/runs/10112541015/job/27967049854 Issue about the correctness checks for models `alexnet` and `vgg16`: #375 --------- Co-authored-by: vadiklyutiy <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: zhumakhan <zhumakhan.nazir@centml.ai> Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com> Co-authored-by: c-fteixeira <163031151+c-fteixeira@users.noreply.github.com> Co-authored-by: xiaocenxiaocen <zx_math@yeah.net> Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai> Co-authored-by: Shang Wang <samshang.wang@mail.utoronto.ca>
- Sync `requirement.txt` with the requirements in `setup.py` - add `extras_require` - require torch >= 2.3.0
Save the `Task` pickle in the translation cache. The reason: during performance analysis it is very convenient to extract a smaller test case. Supporting scripts will come soon.
This issue (CentML/hidet#56) did not let us change the transformers version in hidet. Now it is fixed so that hidet works with new transformers versions. Once the above issue is closed we can relax the transformers version in hidet to the latest one. --------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
In the gpt-neo model (related issue: CentML/hidet#338), `torch.where` accepts tensors with different dtypes. Added type casting to fix the above issue. --------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
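A small reproduction (plain PyTorch, for illustration) of the behaviour the frontend now matches: `torch.where` promotes operands with different dtypes instead of rejecting them.
```python
# torch.where with mixed dtypes: PyTorch type-promotes the two branches.
import torch

cond = torch.tensor([True, False, True])
a = torch.tensor([1.0, 2.0, 3.0])   # float32
b = torch.tensor([10, 20, 30])      # int64
out = torch.where(cond, a, b)
print(out, out.dtype)               # promoted to torch.float32
```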
Some models initialize tensors during the first forward pass and reuse them in subsequent iterations. This causes the model to recompile. One temporary solution is to run the torch model once before compilation, as sketched below. Related issue is here: CentML/hidet#291 Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
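A minimal sketch of the warm-up workaround described above, with hypothetical `model` and `example_input` placeholders (not hidet code):
```python
# Run the eager model once so lazily-initialized tensors exist before tracing,
# which avoids a recompile on the second call; `model` and `example_input`
# are hypothetical placeholders.
import torch

def compile_with_warmup(model, example_input):
    model(example_input)                          # warm-up eager run
    return torch.compile(model, backend='hidet')  # then compile with hidet
```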
…autograd.function.FunctionCtx (#394) Fixed issues that appeared in deberta: CentML/hidet#340 --------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
Add an accuracy check for Hugging Face LLMs in regression. `rtol=0.01` and `atol=0.065` were chosen so that previously "accurate" models do not fail the check. --------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
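A hedged sketch of this kind of tolerance check, using the values quoted above; `hidet_out` and `torch_out` are hypothetical placeholders, not the regression harness itself:
```python
# Compare a compiled output against the eager reference with loose tolerances;
# `hidet_out` and `torch_out` are hypothetical placeholder tensors.
import torch

def check_accuracy(hidet_out, torch_out, rtol=0.01, atol=0.065):
    return torch.allclose(hidet_out, torch_out, rtol=rtol, atol=atol)
```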
The issue is caused by a wrong layout for the bias tensor. For example, consider a bias tensor of shape (64, ); its layout can be written as `(64, ): (1, )`. However, we can expand the layout by adding axes with shape 1, for example `(64, 1):(1, 1)`. Since the shape is 1, the stride can be any number: the stride corresponding to the size-1 axis does not affect the computation of the address. But two strides equal to one will influence instruction selection, and the invalid memory instruction leads to the misaligned access. To fix this issue, we force the stride paired with a size-1 axis to be 0. The layout is equivalent when computing the memory address, and this helps the compiler make the right decision in the instruction selection pass. closes #404 Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
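A minimal sketch in plain Python (no hidet APIs) of why the stride paired with a size-1 axis is irrelevant for addressing, which is what makes the stride-0 normalization safe:
```python
# The index along a size-1 axis is always 0, so any stride on that axis yields
# the same linear offset; only instruction selection sees the difference.
def offset(indices, strides):
    return sum(i * s for i, s in zip(indices, strides))

shape = (64, 1)
for strides in [(1, 1), (1, 0)]:      # original layout vs. normalized layout
    offsets = [offset((i, 0), strides) for i in range(shape[0])]
    print(strides, offsets[:4])       # identical: [0, 1, 2, 3]
```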
…tion `parallel_tune` is set to 1 (#406) We have several parallel build steps that are not controlled by the `hidet.option.parallel_tune` option. This makes it difficult to do a non-parallel build for debugging purposes. This PR fixes the issue.
1. Added `torch.Tensor.as_strided` and `torch.flip`
2. Added support for `rounding_mode == 'trunc'` in `torch.divide`
3. Registered `torch.new_ones`

Longformer model compilation fails with:
```
RuntimeError: cudaDeviceSynchronize failed with error: cudaErrorMisalignedAddress
```
after running the `fused_matmul_f16_pk_cute_rearrange_add` kernel. Nvidia Nsight Compute also shows that the matmul kernel fails to launch. This PR contains all changes needed to reproduce this issue. To reproduce:
1. check out the `zhumakhan/longformer` branch
2. run `python3 tests/benchmarks/bench_transformer.py longformer`
--------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
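A hedged sketch of the torch-level behaviour the new registrations cover (plain PyTorch calls; the corresponding hidet mappings are assumed to mirror them):
```python
import torch

x = torch.arange(6.0).reshape(2, 3)
print(x.as_strided((2, 2), (3, 1)))   # view with an explicit size/stride
print(torch.flip(x, dims=[1]))        # reverse along the last axis
print(torch.divide(torch.tensor(7), torch.tensor(2), rounding_mode='trunc'))
print(x.new_ones((2, 2)))             # ones with x's dtype/device
```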
Updated torch docker image from nvcr.io/nvidia/pytorch:24.04-py3 to nvcr.io/nvidia/pytorch:24.07-py3 to resolve the issue below: ``` /usr/local/lib/python3.10/dist-packages/transformer_engine_extensions.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE ``` --------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
- Add an `exp2` primitive for float types. This primitive can be useful when optimizing flash attention. Specifically, flash attention rewrites the exponential function as
```
# log2_e = 1.44269504
exp(a) = exp2(a * log2_e)
```
This transformation can hint the nvcc compiler to generate better code (more ffma instructions instead of fmuls and fadds). --------- Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
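A quick numerical check (plain PyTorch, for illustration only) of the rewrite that the flash-attention optimization relies on:
```python
# exp(a) == exp2(a * log2(e)) up to floating-point rounding.
import torch

log2_e = 1.44269504
a = torch.randn(1024)
assert torch.allclose(torch.exp(a), torch.exp2(a * log2_e), rtol=1e-5)
```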
…barrier (#414) Add primitives:
- `prmt`
- `lop3`
- `sub_f16x2`, `fma_f16x2`
- `barrier`

See the tests and function documentation for the usage of each primitive.
`torch.pow` accepts the following types of operands:
1. (a: Tensor, b: Tensor)
2. (a: Numeric, b: Tensor) -> the missing case, handled by this PR
3. (a: Tensor, b: Numeric)
--------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
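A short illustration in plain PyTorch of the three operand forms; the second form, a scalar base with a tensor exponent, is the case this PR adds:
```python
import torch

t = torch.tensor([1.0, 2.0, 3.0])
print(torch.pow(t, t))     # Tensor ** Tensor
print(torch.pow(2.0, t))   # Numeric ** Tensor (the newly handled case)
print(torch.pow(t, 2.0))   # Tensor ** Numeric
```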
This PR adds two utility functions that allow us to profile a function with nsys/ncu from within the code. Usage:
```python
import torch
from hidet.utils.nsys_utils import nsys_run, nsys_get_path


def func(b):
    a = torch.randn(1000, device='cuda')
    a + b


def main():
    # profile with nsight systems, store the result to 'nsys-reports/...'
    report = nsys_run(func, b=1)
    # we can visualize the profiling result by calling the `visualize` method
    report.visualize()


if __name__ == '__main__':
    # we need to wrap the call to `main` inside '__main__' so that this script can be
    # "imported" without side effects, which is one requirement for the utility
    main()
```
In regression tests, an accuracy comparison was added for LLMs, but it broke the vision model benchmarking script:
```
Running command: python /workspace/hidet/tests/benchmarks/bench_transformer.py bert-base-uncased --params bs=16,seqlen=256 --dtype float16 --backend hidet --mode max-autotune
  File "/workspace/hidet/tests/benchmarks/run_tests.py", line 82, in <module>
    latency = float(outputs.split('\n')[-2])
ValueError: could not convert string to float: " device='cuda:0', dtype=torch.float16))"
```
Fixed it! --------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
Allow a prologue for fp32 `reduce`. `reduce` uses vectorized calculations that do not allow fusion (it is possible but not implemented yet). For fp32 there are no vectors, so we can enable fusion (with a small modification to the `reduce` kernels themselves). Motivation: in llama2 part of the calculation is fp32, including `pow`+`reduce`. Performance improvement on llama2-7B: +0.241%.
Right now `pow` with a constant exponent argument is implemented naively: we convert the constant to a constant tensor and run an elementwise `pow` of two tensors. This is simple but not always efficient. llama2 (the RMSNorm part) has `x*x`, which is implemented as `tensor.pow(2)`. Convert `pow(x, 2)` to `x*x`. Improvement on llama2-7B is around **0.237%**.
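A minimal illustration (plain PyTorch) of the pattern being rewritten; the two forms are numerically equivalent, but the product avoids materializing a constant-exponent tensor and an elementwise pow kernel:
```python
import torch

x = torch.randn(4, 4096)
assert torch.allclose(x.pow(2), x * x)
```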
….Tensor.scatter_add_` (#429) Closes #424 The additional bug described in the comments of the linked issue ([here](CentML/hidet#424 (comment))) is caused by accessing a PyTorch tensor in [this line](https://github.com/CentML/hidet/blob/18f68ae34d8a08ca1b38ee00ac2ca7f15e599d0b/python/hidet/runtime/compiled_task.py#L161) while we were supposed to be accessing a Hidet tensor.
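For reference, a small plain-PyTorch example of the op being registered, `torch.Tensor.scatter_add_`: values from `src` are accumulated into `out` at the rows given by `index` along dim 0:
```python
import torch

out = torch.zeros(3, 4)
index = torch.tensor([[0, 1, 2, 0]])
src = torch.ones(1, 4)
out.scatter_add_(0, index, src)
print(out)   # row 0: +1 in cols 0 and 3; row 1: +1 in col 1; row 2: +1 in col 2
```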
With torch==2.5, all weights are treated as inputs. This PR changes the compilation process so that weights are constant tensors in our FlowGraph. More context in the issue: CentML/hidet#557 and the closed & reverted PR: CentML/hidet#609 --------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
This PR speeds up the hot start of hidet. As opposed to a cold start, which means "nothing in the translation cache", a hot start means compilation when all tasks are already compiled and cached. Historically we never optimized the hot start, but in the vllm use case it takes significant time and slows down model loading. Hot-start compilation, vllm+hidet, Llama-3.1-8B-Instruct, g5.xlarge AWS instance with 4 vCPUs, hidet compilation time: **Before 121s, After 51s**. Changes:
- use a manual implementation of tree copy that ignores existing files/folders
- cache the string representation of `Task`

P.S. With debug dumping we copy a lot of files, so the first point also helps there.
**Summary of Changes:** The `build_task` and `build_module` functions were refactored to support parallel build synchronization using file-based locks (`flock`). This ensures that in a multiprocessing environment only one process compiles the task from scratch, while the others wait for the lock to be released and then load the compiled task from disk. Test script:
```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

# Global constants
VECTOR_SIZE = 8
MATRIX_SIZE = (8, 16)

# Set environment variables for the distributed setting
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12355"
os.environ["WORLD_SIZE"] = "2"


# Vector-to-matrix multiplication class
class VectorMatrixMultiplication(torch.nn.Module):
    def __init__(self, rank):
        super(VectorMatrixMultiplication, self).__init__()
        # Determine device for this rank
        self.device = torch.device(f"cuda:{rank}")
        # Initialize vector on this device
        self.vector = torch.rand(1, VECTOR_SIZE // dist.get_world_size(), device=self.device)
        # Split matrix so each rank only holds half on its device
        part_size = MATRIX_SIZE[0] // dist.get_world_size()
        self.part_matrix = torch.rand((part_size, MATRIX_SIZE[1]), device=self.device)
        print(f"Rank {rank} vector size: {self.vector.shape}, part_matrix size: {self.part_matrix.shape}")

    def forward(self, rank):
        # Perform multiplication for the assigned part
        result_part = torch.matmul(self.vector, self.part_matrix)
        # Aggregate results across GPUs
        handle = dist.all_reduce(result_part, op=dist.ReduceOp.SUM, async_op=True)
        handle.wait()
        result = result_part
        return result


# Process function for parallel computation
def run(rank, world_size):
    # Set the device for this rank
    torch.cuda.set_device(rank)
    # Initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # Instantiate the model and move it to the specified GPU
    model = VectorMatrixMultiplication(rank).to(rank)
    model_opt = torch.compile(model, backend='hidet', mode='default')
    result = model_opt(rank)
    # Print the result for each rank
    if rank == 0:
        print("Result from GPU 0:", result)
    elif rank == 1:
        print("Result from GPU 1:", result)
    # Cleanup
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```
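A minimal sketch (POSIX `fcntl`, not hidet's actual implementation; `compile_fn` and `load_fn` are hypothetical callbacks) of the compile-or-wait pattern described in the summary:
```python
# The first process to take the exclusive lock compiles; the others block on
# the lock and then load the artifact that is now on disk.
import fcntl
import os

def build_with_lock(task_dir, compile_fn, load_fn):
    os.makedirs(task_dir, exist_ok=True)
    lock_path = os.path.join(task_dir, '.lock')
    done_path = os.path.join(task_dir, '.done')
    with open(lock_path, 'w') as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)    # blocks until the lock is free
        try:
            if not os.path.exists(done_path):    # we won the race: compile once
                compile_fn(task_dir)
                open(done_path, 'w').close()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return load_fn(task_dir)                     # everyone loads the same artifact
```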
Turn the `parallel_k` option default to `'disabled'`. We have at least the following problems:
1. We have this issue CentML/hidet#446 with `parallel_k`
2. `parallel_k` is disabled for dynamic shapes (always set to 1 if dynamic dims are present)
3. On the following layer
```python
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(4096, 6144)

    def forward(self, x):
        y = self.linear1(x)
        return y
```
I found that the dynamic shape works faster than the static shape: latency 105us vs 122us. The reason is `parallel_k`. When `parallel_k` is disabled, static shapes run at the same 105us. This is a real layer from llama3-8B. It caused a slowdown of vllm+hidet when I tried to enable shape specialization in vllm.
…. ) (#294) [Ir][Primitives] add vectorized conversion instructions
[Ir][CuTe] add reduce primitives in cute (#295)
[Ir][CuTe] add mma primitives (#296)
[Ir][CuTe] add other primitives in cute (#297)
[Transforms][CuTe] add instruction selection pass (#298)
[Transforms][CuTe] add resolve bank conflict pass (#299)
[Transforms][CuTe] add resolve auto keywords pass (#300)
[Transforms][CuTe] add shared memory allocation pass (#301)
[Transforms][CuTe] add vectorize elementwise operation pass (#302)
[Transforms][CuTe] add analysis pass (#303)
[Transforms][CuTe] add canonicalization pass (#304)
[Transforms][CuTe] add deadcode elimination pass (#305)
[Transforms][CuTe] refactor cute lowering pass (#306)
[Graph][Ops] matmul cute (#307)
[Ir] cute miscs (#308)
[Tests] cute tests (#309)
[Chore] fix ci (#313)
--------- Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
Originally, this pass extracts all the kernels from the IR and adds additional information to the names of the kernels, so we should update all the places where the function variable is used. Previously, this pass only updated `LaunchKernelStmt`, which is only one of the places that may use function variables. We should replace function variables in call expressions as well, since they may also use the modified function variables. --------- Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
…sor map creation (#643) On Hopper architectures, copying multi-dimensional arrays using TMA instructions requires a tensor map created using the `cuTensorMapEncodeTiled` [driver API](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html#group__CUDA__TENSOR__MEMORY_1ga7c7d2aaac9e49294304e755e6f341d7). Previously, the `swizzle`, `interleave` and `l2Promotion` arguments passed to the API call were hard-coded to the value `NONE`. This PR adds support for other values of those arguments, as specified by the documentation.
vLLM (and hence DMWL) passes fake tensors instead of real weight tensors during the compilation process. Such fake weight tensors will be treated as traceable inputs by hidet. --------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
Currently supports shapes that are a multiple of 16. Currently uses unswizzled smem.
**vllm+hidet+TP works now!** (with a workaround for #650) Fixes:
1. Added `getattr` ops support
2. Added support for `vllm.all_reduce`
3. In `'interpreter'` mode, `torch.full` under vllm returns a `FakeTensor` and `'interpreter'` mode fails. Made a workaround for that. Further investigation regarding that in #645
4. For `bool` `not` we did `bitwise_inverse`, but `bitwise_inverse != logical_not`. Fixed.
5. For `42*tensor(dtype=bool)` we returned a `bool` tensor but it should be an `int64` tensor. Fixed. (See the sketch below for the expected PyTorch semantics of items 4 and 5.)
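A short plain-PyTorch reference for the expected semantics in items 4 and 5: negating a bool tensor is a logical negation, and multiplying a Python int by a bool tensor promotes the result to int64 rather than keeping it bool:
```python
import torch

mask = torch.tensor([True, False, True])
print(torch.logical_not(mask))   # tensor([False,  True, False])
print((42 * mask).dtype)         # torch.int64
print(42 * mask)                 # tensor([42,  0, 42])
```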
Add `scipy` as a requirement for comp server. It was added to regular `requirements.txt` but not to comp server `requirements.txt`.
Promote the version 0.4.0.dev -> 0.4.0 (#456)
…ers (#475) The old version of `diffusers` used a deprecated (now completely removed) API in `huggingface_hub`. Thus, it is better to keep the versions up to date to prevent such problems in the future. The error log in our CI:
```text
ImportError while importing test module '/__w/hidet/hidet/tests/apps/test_pretrained.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/__w/_tool/Python/3.8.18/x64/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/apps/test_pretrained.py:14: in <module>
    from hidet.apps import PretrainedModel, hf
/__w/_tool/Python/3.8.18/x64/lib/python3.8/site-packages/hidet/apps/hf.py:16: in <module>
    from diffusers import StableDiffusionPipeline
/__w/_tool/Python/3.8.18/x64/lib/python3.8/site-packages/diffusers/__init__.py:5: in <module>
    from .utils import (
/__w/_tool/Python/3.8.18/x64/lib/python3.8/site-packages/diffusers/utils/__init__.py:38: in <module>
    from .dynamic_modules_utils import get_class_from_dynamic_module
/__w/_tool/Python/3.8.18/x64/lib/python3.8/site-packages/diffusers/utils/dynamic_modules_utils.py:28: in <module>
    from huggingface_hub import cached_download, hf_hub_download, model_info
E   ImportError: cannot import name 'cached_download' from 'huggingface_hub' (/__w/_tool/Python/3.8.18/x64/lib/python3.8/site-packages/huggingface_hub/__init__.py)
```
Also increase the tolerance in one flaky test.
@vadiklyutiy I see you are in the author list of the 131 commits. This might be because of the first two commits ("Sync" and "lint") that remove and add the EOF in the version file. Could you remove those two commits and try again?
… enabled on hidet-org/hidet
@yaoyaoding please switch to the Commits tab and take a look there
You can also look at the previous release #455
The conversation screen shows the union of authors and committers. Because I am going to commit, I am added everywhere. But in the history mainly the authors matter.
In the previous release #455 there were 98 commits. All of them were committed by me.
Yes, it seems both the authors and committers are merged in my view. Then it looks good to me. Previously I thought it might be because of the first two commits, but it turned out that is not related. Everything looks good to me; just a kind reminder to bump the version after merging this PR and to check the wheel submitted to PyPI.
Please formally approve this PR.