Sync CentML/hidet -> hidet-org/hidet #476
Conversation
- Fix a bug in `normalize_launch_dims()` - increase the tolerance in tests/cuda/test_cudnn.py, which fails frequently.
Cast the return value of `get_parallel_num_workers` to float. Fixes #388
…k/run_config.json (#205) Added the successfully compiled models to the json file, as discussed in #192. Models that are runnable after compilation and passed the correctness checks:
- `mnasnet1_0`
- `resnext50_32x4d`
- `shufflenet_v2_x1_0`
- `squeezenet1_1`
- `mobilenet_v3_large`
- `resnet18`

(All tested with input shape `128x3x224x224` and data type `float16`.) Regression run: https://github.com/CentML/hidet/actions/runs/10112541015/job/27967049854 Issue about the correctness checks for models `alexnet` and `vgg16`: #375 --------- Co-authored-by: vadiklyutiy <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: zhumakhan <zhumakhan.nazir@centml.ai> Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com> Co-authored-by: c-fteixeira <163031151+c-fteixeira@users.noreply.github.com> Co-authored-by: xiaocenxiaocen <zx_math@yeah.net> Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai> Co-authored-by: Shang Wang <samshang.wang@mail.utoronto.ca>
- Sync `requirement.txt` with the requirements in `setup.py` - add `extras_require` - require torch >= 2.3.0
Save the `Task` pickle in the translation cache. The reason: during performance analysis it is very convenient to extract a smaller test case. Supporting scripts will come soon.
This issue (CentML/hidet#56) did not let us change the transformers version in hidet. Now it is fixed so that hidet works with new transformers versions. Once the above issue is closed we can relax the transformers version in hidet to the latest one. --------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
In the gpt-neo model (related issue: CentML/hidet#338), `torch.where` accepts tensors with different dtypes. Added type casting to fix the above issue. --------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
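A small reproduction (plain PyTorch, for illustration) of the behaviour the frontend now matches: `torch.where` promotes operands with different dtypes instead of rejecting them.
```python
# torch.where with mixed dtypes: PyTorch type-promotes the two branches.
import torch

cond = torch.tensor([True, False, True])
a = torch.tensor([1.0, 2.0, 3.0])   # float32
b = torch.tensor([10, 20, 30])      # int64
out = torch.where(cond, a, b)
print(out, out.dtype)               # promoted to torch.float32
```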
Some models initialize tensors during the first forward pass and reuse them in subsequent iterations. This causes the model to recompile. One temporary solution is to run the torch model once before compilation, as sketched below. Related issue is here: CentML/hidet#291 Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
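A minimal sketch of the warm-up workaround described above, with hypothetical `model` and `example_input` placeholders (not hidet code):
```python
# Run the eager model once so lazily-initialized tensors exist before tracing,
# which avoids a recompile on the second call; `model` and `example_input`
# are hypothetical placeholders.
import torch

def compile_with_warmup(model, example_input):
    model(example_input)                          # warm-up eager run
    return torch.compile(model, backend='hidet')  # then compile with hidet
```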
…autograd.function.FunctionCtx (#394) Fixed issues that appeared in deberta: CentML/hidet#340 --------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
Add an accuracy check for Hugging Face LLMs in regression. `rtol=0.01` and `atol=0.065` were chosen so that previously "accurate" models do not fail the check. --------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
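A hedged sketch of this kind of tolerance check, using the values quoted above; `hidet_out` and `torch_out` are hypothetical placeholders, not the regression harness itself:
```python
# Compare a compiled output against the eager reference with loose tolerances;
# `hidet_out` and `torch_out` are hypothetical placeholder tensors.
import torch

def check_accuracy(hidet_out, torch_out, rtol=0.01, atol=0.065):
    return torch.allclose(hidet_out, torch_out, rtol=rtol, atol=atol)
```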
The issue is caused by a wrong layout for the bias tensor. For example, consider a bias tensor of shape (64, ); its layout can be written as `(64, ): (1, )`. However, we can expand the layout by adding axes with shape 1, for example `(64, 1):(1, 1)`. Since the shape is 1, the stride can be any number: the stride corresponding to the size-1 axis does not affect the computation of the address. But two strides equal to one will influence instruction selection, and the invalid memory instruction leads to the misaligned access. To fix this issue, we force the stride paired with a size-1 axis to be 0. The layout is equivalent when computing the memory address, and this helps the compiler make the right decision in the instruction selection pass. closes #404 Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
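A minimal sketch in plain Python (no hidet APIs) of why the stride paired with a size-1 axis is irrelevant for addressing, which is what makes the stride-0 normalization safe:
```python
# The index along a size-1 axis is always 0, so any stride on that axis yields
# the same linear offset; only instruction selection sees the difference.
def offset(indices, strides):
    return sum(i * s for i, s in zip(indices, strides))

shape = (64, 1)
for strides in [(1, 1), (1, 0)]:      # original layout vs. normalized layout
    offsets = [offset((i, 0), strides) for i in range(shape[0])]
    print(strides, offsets[:4])       # identical: [0, 1, 2, 3]
```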
…tion `parallel_tune` is set to 1 (#406) We have several parallel build steps that are not controlled by the `hidet.option.parallel_tune` option. This makes it difficult to do a non-parallel build for debugging purposes. This PR fixes the issue.
1. Added `torch.Tensor.as_strided` and `torch.flip`
2. Added support for `rounding_mode == 'trunc'` in `torch.divide`
3. Registered `torch.new_ones`

Longformer model compilation fails with:
```
RuntimeError: cudaDeviceSynchronize failed with error: cudaErrorMisalignedAddress
```
after running the `fused_matmul_f16_pk_cute_rearrange_add` kernel. Nvidia Nsight Compute also shows that the matmul kernel fails to launch. This PR contains all changes needed to reproduce this issue. To reproduce:
1. check out the `zhumakhan/longformer` branch
2. run `python3 tests/benchmarks/bench_transformer.py longformer`
--------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
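A hedged sketch of the torch-level behaviour the new registrations cover (plain PyTorch calls; the corresponding hidet mappings are assumed to mirror them):
```python
import torch

x = torch.arange(6.0).reshape(2, 3)
print(x.as_strided((2, 2), (3, 1)))   # view with an explicit size/stride
print(torch.flip(x, dims=[1]))        # reverse along the last axis
print(torch.divide(torch.tensor(7), torch.tensor(2), rounding_mode='trunc'))
print(x.new_ones((2, 2)))             # ones with x's dtype/device
```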
Updated torch docker image from nvcr.io/nvidia/pytorch:24.04-py3 to nvcr.io/nvidia/pytorch:24.07-py3 to resolve the issue below: ``` /usr/local/lib/python3.10/dist-packages/transformer_engine_extensions.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE ``` --------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
- Add an `exp2` primitive for float types. This primitive can be useful when optimizing flash attention. Specifically, flash attention rewrites the exponential function as
```
# log2_e = 1.44269504
exp(a) = exp2(a * log2_e)
```
This transformation can hint the nvcc compiler to generate better code (more ffma instructions instead of fmuls and fadds). --------- Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
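A quick numerical check (plain PyTorch, for illustration only) of the rewrite that the flash-attention optimization relies on:
```python
# exp(a) == exp2(a * log2(e)) up to floating-point rounding.
import torch

log2_e = 1.44269504
a = torch.randn(1024)
assert torch.allclose(torch.exp(a), torch.exp2(a * log2_e), rtol=1e-5)
```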
…barrier (#414) Add primitives:
- `prmt`
- `lop3`
- `sub_f16x2`, `fma_f16x2`
- `barrier`

See the tests and function documentation for the usage of each primitive.
`torch.pow` accepts the following types of operands:
1. (a: Tensor, b: Tensor)
2. (a: Numeric, b: Tensor) -> the missing case, handled by this PR
3. (a: Tensor, b: Numeric)
--------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
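A short illustration in plain PyTorch of the three operand forms; the second form, a scalar base with a tensor exponent, is the case this PR adds:
```python
import torch

t = torch.tensor([1.0, 2.0, 3.0])
print(torch.pow(t, t))     # Tensor ** Tensor
print(torch.pow(2.0, t))   # Numeric ** Tensor (the newly handled case)
print(torch.pow(t, 2.0))   # Tensor ** Numeric
```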
This PR adds two utility functions that allow us to profile a function with nsys/ncu from within the code. Usage:
```python
import torch
from hidet.utils.nsys_utils import nsys_run, nsys_get_path


def func(b):
    a = torch.randn(1000, device='cuda')
    a + b


def main():
    # profile with nsight systems, store the result to 'nsys-reports/...'
    report = nsys_run(func, b=1)
    # we can visualize the profiling result by calling the `visualize` method
    report.visualize()


if __name__ == '__main__':
    # we need to wrap the call to `main` inside '__main__' so that this script can be
    # "imported" without side effects, which is one requirement for the utility
    main()
```
In regression tests, an accuracy comparison was added for LLMs, but it broke the vision model benchmarking script:
```
Running command: python /workspace/hidet/tests/benchmarks/bench_transformer.py bert-base-uncased --params bs=16,seqlen=256 --dtype float16 --backend hidet --mode max-autotune
  File "/workspace/hidet/tests/benchmarks/run_tests.py", line 82, in <module>
    latency = float(outputs.split('\n')[-2])
ValueError: could not convert string to float: " device='cuda:0', dtype=torch.float16))"
```
Fixed it! --------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
Allow a prologue for fp32 `reduce`. `reduce` uses vectorized calculations that do not allow fusion (it is possible but not implemented yet). For fp32 there are no vectors, so we can enable fusion (with a small modification to the `reduce` kernels themselves). Motivation: in llama2 part of the calculation is fp32, including `pow`+`reduce`. Performance improvement on llama2-7B: +0.241%.
Right now `pow` with a constant exponent argument is implemented naively: we convert the constant to a constant tensor and run an elementwise `pow` of two tensors. This is simple but not always efficient. llama2 (the RMSNorm part) has `x*x`, which is implemented as `tensor.pow(2)`. Convert `pow(x, 2)` to `x*x`. Improvement on llama2-7B is around **0.237%**.
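A minimal illustration (plain PyTorch) of the pattern being rewritten; the two forms are numerically equivalent, but the product avoids materializing a constant-exponent tensor and an elementwise pow kernel:
```python
import torch

x = torch.randn(4, 4096)
assert torch.allclose(x.pow(2), x * x)
```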
….Tensor.scatter_add_` (#429) Closes #424 The additional bug described in the comments of the linked issue ([here](CentML/hidet#424 (comment))) is caused by accessing a PyTorch tensor in [this line](https://github.com/CentML/hidet/blob/18f68ae34d8a08ca1b38ee00ac2ca7f15e599d0b/python/hidet/runtime/compiled_task.py#L161) while we were supposed to be accessing a Hidet tensor.
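For reference, a small plain-PyTorch example of the op being registered, `torch.Tensor.scatter_add_`: values from `src` are accumulated into `out` at the rows given by `index` along dim 0:
```python
import torch

out = torch.zeros(3, 4)
index = torch.tensor([[0, 1, 2, 0]])
src = torch.ones(1, 4)
out.scatter_add_(0, index, src)
print(out)   # row 0: +1 in cols 0 and 3; row 1: +1 in col 1; row 2: +1 in col 2
```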
With torch==2.5, all weights are treated as inputs. This PR changes the compilation process so that weights are constant tensors in our FlowGraph. More context in the issue: CentML/hidet#557 and the closed & reverted PR: CentML/hidet#609 --------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
This PR speeds up the hot start of hidet. As opposed to a cold start, which means "nothing in the translation cache", a hot start means compilation when all tasks are already compiled and cached. Historically we never optimized the hot start, but in the vllm use case it takes significant time and slows down model loading. Hot-start compilation, vllm+hidet, Llama-3.1-8B-Instruct, g5.xlarge AWS instance with 4 vCPUs, hidet compilation time: **Before 121s, After 51s**. Changes:
- use a manual implementation of tree copy that ignores existing files/folders
- cache the string representation of `Task`

P.S. With debug dumping we copy a lot of files, so the first point also helps there.
**Summary of Changes:** The `build_task` and `build_module` functions were refactored to support parallel build synchronization using file-based locks (`flock`). This ensures that in a multiprocessing environment only one process compiles the task from scratch, while the others wait for the lock to be released and then load the compiled task from disk. Test script:
```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

# Global constants
VECTOR_SIZE = 8
MATRIX_SIZE = (8, 16)

# Set environment variables for the distributed setting
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12355"
os.environ["WORLD_SIZE"] = "2"


# Vector-to-matrix multiplication class
class VectorMatrixMultiplication(torch.nn.Module):
    def __init__(self, rank):
        super(VectorMatrixMultiplication, self).__init__()
        # Determine device for this rank
        self.device = torch.device(f"cuda:{rank}")
        # Initialize vector on this device
        self.vector = torch.rand(1, VECTOR_SIZE // dist.get_world_size(), device=self.device)
        # Split matrix so each rank only holds half on its device
        part_size = MATRIX_SIZE[0] // dist.get_world_size()
        self.part_matrix = torch.rand((part_size, MATRIX_SIZE[1]), device=self.device)
        print(f"Rank {rank} vector size: {self.vector.shape}, part_matrix size: {self.part_matrix.shape}")

    def forward(self, rank):
        # Perform multiplication for the assigned part
        result_part = torch.matmul(self.vector, self.part_matrix)
        # Aggregate results across GPUs
        handle = dist.all_reduce(result_part, op=dist.ReduceOp.SUM, async_op=True)
        handle.wait()
        result = result_part
        return result


# Process function for parallel computation
def run(rank, world_size):
    # Set the device for this rank
    torch.cuda.set_device(rank)
    # Initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # Instantiate the model and move it to the specified GPU
    model = VectorMatrixMultiplication(rank).to(rank)
    model_opt = torch.compile(model, backend='hidet', mode='default')
    result = model_opt(rank)
    # Print the result for each rank
    if rank == 0:
        print("Result from GPU 0:", result)
    elif rank == 1:
        print("Result from GPU 1:", result)
    # Cleanup
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```
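A minimal sketch (POSIX `fcntl`, not hidet's actual implementation; `compile_fn` and `load_fn` are hypothetical callbacks) of the compile-or-wait pattern described in the summary:
```python
# The first process to take the exclusive lock compiles; the others block on
# the lock and then load the artifact that is now on disk.
import fcntl
import os

def build_with_lock(task_dir, compile_fn, load_fn):
    os.makedirs(task_dir, exist_ok=True)
    lock_path = os.path.join(task_dir, '.lock')
    done_path = os.path.join(task_dir, '.done')
    with open(lock_path, 'w') as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)    # blocks until the lock is free
        try:
            if not os.path.exists(done_path):    # we won the race: compile once
                compile_fn(task_dir)
                open(done_path, 'w').close()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return load_fn(task_dir)                     # everyone loads the same artifact
```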
Turn the `parallel_k` option default to `'disabled'`. We have at least the following problems:
1. We have this issue CentML/hidet#446 with `parallel_k`
2. `parallel_k` is disabled for dynamic shapes (always set to 1 if dynamic dims are present)
3. On the following layer
```python
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(4096, 6144)

    def forward(self, x):
        y = self.linear1(x)
        return y
```
I found that the dynamic shape works faster than the static shape: latency 105us vs 122us. The reason is `parallel_k`. When `parallel_k` is disabled, static shapes run at the same 105us. This is a real layer from llama3-8B. It caused a slowdown of vllm+hidet when I tried to enable shape specialization in vllm.
…. ) (#294) [Ir][Primitives] add vectorized conversion instructions
[Ir][CuTe] add reduce primitives in cute (#295)
[Ir][CuTe] add mma primitives (#296)
[Ir][CuTe] add other primitives in cute (#297)
[Transforms][CuTe] add instruction selection pass (#298)
[Transforms][CuTe] add resolve bank conflict pass (#299)
[Transforms][CuTe] add resolve auto keywords pass (#300)
[Transforms][CuTe] add shared memory allocation pass (#301)
[Transforms][CuTe] add vectorize elementwise operation pass (#302)
[Transforms][CuTe] add analysis pass (#303)
[Transforms][CuTe] add canonicalization pass (#304)
[Transforms][CuTe] add deadcode elimination pass (#305)
[Transforms][CuTe] refactor cute lowering pass (#306)
[Graph][Ops] matmul cute (#307)
[Ir] cute miscs (#308)
[Tests] cute tests (#309)
[Chore] fix ci (#313)
--------- Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
Originally, this pass extracts all the kernels from the IR and adds additional information to the names of the kernels, so we should update all the places where the function variable is used. Previously, this pass only updated `LaunchKernelStmt`, which is only one of the places that may use function variables. We should replace function variables in call expressions as well, since they may also use the modified function variables. --------- Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
…sor map creation (#643) On Hopper architectures, copying multi-dimensional arrays using TMA instructions requires a tensor map created using the `cuTensorMapEncodeTiled` [driver API](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html#group__CUDA__TENSOR__MEMORY_1ga7c7d2aaac9e49294304e755e6f341d7). Previously, the `swizzle`, `interleave` and `l2Promotion` arguments passed to the API call were hard-coded to the value `NONE`. This PR adds support for other values of those arguments, as specified by the documentation.
vLLM (and hence DMWL) passes fake tensors instead of real weight tensors during the compilation process. Such fake weight tensors will be treated as traceable inputs by hidet. --------- Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
Currently supports shapes that are a multiple of 16. Currently uses unswizzled smem.
**vllm+hidet+TP works now!** (with a workaround for #650) Fixes:
1. Added `getattr` ops support
2. Added support for `vllm.all_reduce`
3. In `'interpreter'` mode, `torch.full` under vllm returns a `FakeTensor` and `'interpreter'` mode fails. Made a workaround for that. Further investigation regarding that in #645
4. For `bool` `not` we did `bitwise_inverse`, but `bitwise_inverse != logical_not`. Fixed.
5. For `42*tensor(dtype=bool)` we returned a `bool` tensor but it should be an `int64` tensor. Fixed. (See the sketch below for the expected PyTorch semantics of items 4 and 5.)
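A short plain-PyTorch reference for the expected semantics in items 4 and 5: negating a bool tensor is a logical negation, and multiplying a Python int by a bool tensor promotes the result to int64 rather than keeping it bool:
```python
import torch

mask = torch.tensor([True, False, True])
print(torch.logical_not(mask))   # tensor([False,  True, False])
print((42 * mask).dtype)         # torch.int64
print(42 * mask)                 # tensor([42,  0, 42])
```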
Add `scipy` as a requirement for comp server. It was added to regular `requirements.txt` but not to comp server `requirements.txt`.
Promote the version 0.4.0.dev -> 0.4.0 (#456)
…ers (#475) The old version of `diffusers` used a deprecated (now completely removed) API in `huggingface_hub`. Thus, it is better to keep the versions up to date to prevent such problems in the future. The error log in our CI:
```text
ImportError while importing test module '/__w/hidet/hidet/tests/apps/test_pretrained.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/__w/_tool/Python/3.8.18/x64/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/apps/test_pretrained.py:14: in <module>
    from hidet.apps import PretrainedModel, hf
/__w/_tool/Python/3.8.18/x64/lib/python3.8/site-packages/hidet/apps/hf.py:16: in <module>
    from diffusers import StableDiffusionPipeline
/__w/_tool/Python/3.8.18/x64/lib/python3.8/site-packages/diffusers/__init__.py:5: in <module>
    from .utils import (
/__w/_tool/Python/3.8.18/x64/lib/python3.8/site-packages/diffusers/utils/__init__.py:38: in <module>
    from .dynamic_modules_utils import get_class_from_dynamic_module
/__w/_tool/Python/3.8.18/x64/lib/python3.8/site-packages/diffusers/utils/dynamic_modules_utils.py:28: in <module>
    from huggingface_hub import cached_download, hf_hub_download, model_info
E   ImportError: cannot import name 'cached_download' from 'huggingface_hub' (/__w/_tool/Python/3.8.18/x64/lib/python3.8/site-packages/huggingface_hub/__init__.py)
```
Also increase the tolerance in one flaky test.
@vadiklyutiy I see you are in the author list of the 131 commits. This might be because of the first two commits ("Sync" and "lint") that remove and add the EOF in the version file. Could you remove those two commits and try again?
… enabled on hidet-org/hidet
@yaoyaoding please switch to the Commits tab and take a look there
You can also look at the previous release #455
The conversation screen shows the union of authors and committers. Because I am going to commit, I am added everywhere. But in the history mainly the authors matter.
In the previous release #455 there were 98 commits. All of them were committed by me.
Yes, it seems both the authors and committers are merged in my view. Then it looks good to me. Previously I thought it might be because of the first two commits, but it turned out that is not related. Everything looks good to me; just a kind reminder to bump the version after merging this PR and to check the wheel submitted to PyPI.
Please formally approve this PR.