[SYNC] Sync CentML -> hidet-org #455
Conversation
Instead of mocking a ctypes type like `c_pointer_compatible`, these changes make the transformation between Python values and ctypes values more explicit, with direct function calls inside `CompiledFunction`.
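For illustration, here is a minimal sketch of what such explicit conversion helpers could look like; the names `to_ctypes_arg` and `from_ctypes_ret` are hypothetical and not the actual `CompiledFunction` internals.

```python
# Hypothetical sketch of explicit Python <-> ctypes conversion helpers.
import ctypes

def to_ctypes_arg(value):
    # Convert a Python value into the ctypes value expected by the C entry point.
    if isinstance(value, bool):
        return ctypes.c_bool(value)
    if isinstance(value, int):
        return ctypes.c_int32(value)
    if isinstance(value, float):
        return ctypes.c_float(value)
    raise TypeError(f'unsupported argument type: {type(value)}')

def from_ctypes_ret(value):
    # ctypes already unwraps simple return types; pass them through.
    return value
```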
Define the complete UNet, with the forward pass broken into down, mid, and up sections. Useful diagrams [here](http://jalammar.github.io/illustrated-stable-diffusion/). Uses blocks defined in #97. A heavily reduced version of the diffusers implementation, containing only the features necessary for stable diffusion v2-1. Towards #57.

---------

Co-authored-by: vadiklyutiy <156319763+vadiklyutiy@users.noreply.github.com>
Stable diffusion uses fundamentally the same positional embeddings, but since timesteps change, a cache is not possible. There are also small differences in tensor layouts and calculation parameters between the diffusers version and the one from Llama, so I've recreated it here for now. An abstract version that combines both versions is TODO. Towards #57.
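For reference, a minimal sketch of the standard sinusoidal timestep embedding used by diffusers-style UNets; the exact channel ordering and scaling in the hidet implementation may differ.

```python
import math
import torch

def timestep_embedding(timesteps: torch.Tensor, dim: int, max_period: int = 10000) -> torch.Tensor:
    # Standard sinusoidal embedding: half the channels are cos, half are sin.
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = timesteps[:, None].float() * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

# Example: embeddings for timesteps 0..3 with 320 channels.
print(timestep_embedding(torch.arange(4), 320).shape)  # torch.Size([4, 320])
```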
With CentML/hidet#69 there will be a lot more C++ code introduced into the runtime, so I think it's a good idea to have some standardization. For now this only does formatting (no linting, which takes more work to set up and is more opinionated about right vs. wrong).

Summary of changes:
- Update `format.sh` to support formatting just Python, C++, or both
- Add `clang-format` to the existing lint/format workflow
- Apply `clang-format` to existing code; the configuration is set up to minimize the number of changes, and the float16/bfloat16 code is excluded

Example workflow failure @ 4cc430c:
<img width="1155" alt="image" src="https://github.com/CentML/hidet/assets/43303581/9566e9dd-bd01-4638-b556-11afaf7e6e52">
Add UNet Down, Up, and Mid block definitions and an attention transformer utility layer. Modules are designed so that the kwargs passed to constructors are all the same config from Hugging Face with minimal changes; there are lots of shared values and too many parameters to list individually. The same kwargs are passed to nested objects. Open to other suggestions, although this is a single-use-case problem. Towards #57.
Adds support for the LLaMA, GPT-2, and OPT tokenizers using the Hugging Face configuration.
Infrastructure for the compiled stable diffusion app. Towards #57.
**Context:** I made these changes to help with debugging Gemma; the dump produces many operators, and this makes it easier, for example, to find which operators involve the input IDs / position IDs / KV-cache.

**Summary of changes:**
- Add the missing `dump_op` parameter to `ctx.debug()`
- Dump input indices (e.g. `@23`) in the operator dump
- Prevent `dump_op` and `dump_outputs` from overriding each other in the single-output case

This is an example `41_Concat_def.txt` taken from my Gemma implementation, which corresponds to concatenating past keys in the KV-cache with the current keys. The `Inputs` field shows the indices of the operator inputs, which might be another operator's output `@n` or some graph input `@in:n`.

```
Operator: Concat(0: float32(bs, 1, past_seq_len, 256), 1: float32(bs, 1, seq_len, 256), axis=2)
Inputs:
  0 <- @in:2
  1 <- @40
Task: Task(
  name: concat
  parameters:
    x0: tensor(float32, [bs, 1, past_seq_len, 256])
    x1: tensor(float32, [bs, 1, seq_len, 256])
    out: tensor(float32, [bs, 1, (past_seq_len + seq_len), 256])
  inputs: [x0, x1]
  outputs: [out]
  computations:
    out: float32[bs, 1, (past_seq_len + seq_len), 256] where out[v, v_1, v_2, v_3] = ((v_2 < past_seq_len) ? x0[v, v_1, v_2, v_3] : x1[v, v_1, (v_2 - past_seq_len), v_3])
  attributes: {}
)
```
…ion(`implement*`) (#127)

This PR parallelizes:
- `apply_prolog_epilog` (fusion)
- IR generation (`implement*`)

Right now this is implemented for the host only (no offload to the compilation server).

resnet50 compilation speed on g5.16xlarge
`time python tests/benchmarks/bench_vision.py resnet50 --params 1x3x224x224 --dtype float16`
Before: 14m 45s
After: 12m 51s
Speedup: 14.8%

matmul compilation speed on g5.16xlarge
`time python tests/benchmarks/bench_op.py batch_matmul --params 1x4096x4096,1x4096x4096 --dtype float16`
Before: 5m 54s
After: 5m 31s
Speedup: 6.9%
Fix `__shfl_xor_sync`. I don't know why `__shfl_xor_sync` was an alias of `__shfl_down_sync`; is this intentional?

Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
Closes #450. Output of the example code provided in the issue:

```
/home/jack/dev/hidet/venv/bin/python3.8 /home/jack/.config/JetBrains/RemoteDev-PY/_home_jack_dev_hidet/scratches/scratch_2.py
Compiling cpu task tan(x=float32(2, 2), y=float32(2, 2))...
Tensor(shape=(2, 2), dtype='float32', device='cpu')
[[  0.2568644  -1.0825194]
 [-32.35311    -1.5977247]]
```
The previous implementation is incorrect when dealing with a pair of dimensions that are both symbolic. Minimal example:

```python
import hidet

if __name__ == "__main__":
    x = hidet.symbol(["n"])
    y = hidet.symbol(["m"])
    z = x + y
    print(x.shape, y.shape, z.shape)  # before: (n,) (m,) (m,)
```
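For illustration, a minimal sketch of a pairwise broadcast rule that refuses to silently pick a side when both dimensions are symbolic; `is_constant` is a simplification, and the actual hidet fix may resolve such pairs differently.

```python
# Simplification: treat plain Python ints as constants, anything else as symbolic.
def is_constant(d) -> bool:
    return isinstance(d, int)

def broadcast_dim(a, b):
    if a == b:
        return a
    if is_constant(a) and a == 1:
        return b
    if is_constant(b) and b == 1:
        return a
    if is_constant(a) and is_constant(b):
        raise ValueError(f"cannot broadcast dimensions {a} and {b}")
    # Two distinct symbolic dimensions: the old code silently picked one side,
    # yielding (m,) in the example above; refusing is the safe behavior.
    raise ValueError(f"cannot broadcast symbolic dimensions {a} and {b}")

print(broadcast_dim(1, "n"))    # n
print(broadcast_dim("n", "n"))  # n
# broadcast_dim("n", "m") now raises instead of silently returning "m"
```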
**Overview**
Specialize the function `Constant._binary()` for a compilation speedup.

**Compilation time improvement results**
matmul_f16 with `max_parallel_jobs=1`
Before: 2m 11.2s
After: 2m 4.4s
Speedup: 5.5%

**Additional test**
matmul_f16 has 177 candidates. I checked that all of them remained the same (no functional changes).
- The attention scaling factor should be based on the head dimension (see the sketch below).
- The option name `tokens.for_huggingface` is incorrect; see the following: https://github.com/CentML/hidet/blob/eefc9d81afe687e9173c65c68fc3c7eb4e3019a7/python/hidet/option.py#L299-L304

With these changes the LLM app runs correctly before tracing into FlowGraph. Those changes will come later; I'm isolating these minor changes into their own PR here.
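For reference, a minimal sketch of the first fix, assuming the usual convention of scaling attention scores by the square root of the head dimension rather than the full hidden size:

```python
import math
import torch

def attention_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # q, k: [batch, heads, seq, head_dim]. Scale by head_dim, not by the
    # full hidden size (heads * head_dim).
    head_dim = q.shape[-1]
    return (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)

q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 8, 16, 64)
print(attention_scores(q, k).shape)  # torch.Size([1, 8, 16, 16])
```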
Allow access to cluster attributes inside Hidet kernels. Launch kernels with distributed shared memory. See the docs:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#distributed-shared-memory
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-block-clusters
API: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cluster-group-cg

Towards supporting #102 by adding a cluster rank primitive in Hidet. See `test_cluster.py` for example usage. To run the test on Hopper machines, use `pytest --hopper`.
Gemma + torch.compile fixes:
- process `_enter_autocast` and `_exit_autocast` as no-ops
- support `truediv(float, Tensor)`
- add eager-mode support to `tests/benchmarks`
The current exit hook is a no-op
Removes kwargs from stable diffusion app components. Adds documentation and sample code.
Support the transpose operator only for rank == 2 tensors.

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-46-104.us-east-2.compute.internal>
Co-authored-by: Max Hu <hyoung2991@gmail.com>
Revive dynamic shape support with `torch.compile`. It was broken due to changes in the PyTorch interface.
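A minimal usage sketch, assuming a CUDA device is available and that importing `hidet` registers the `hidet` backend for `torch.compile`:

```python
import torch
import hidet  # registers the 'hidet' backend for torch.compile

model = torch.nn.Linear(16, 16).cuda()
compiled = torch.compile(model, backend='hidet', dynamic=True)

# With dynamic shapes, different batch sizes reuse the same compiled artifact.
print(compiled(torch.randn(4, 16, device='cuda')).shape)
print(compiled(torch.randn(8, 16, device='cuda')).shape)
```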
Adds ResNet and image classifier pipeline functionality. Includes changes from #428. See the Hugging Face implementation for the original API inspiration. Resolves CentML/hidet#60.
…d` (#175)

1. Add `torch.Tensor.sin` and `torch.Tensor.cos` to `register_method` (see the sketch below). Gemma passes after that.
2. Add `torch._C._nn.pad`. The test workflow works with torch 2.3.0 after that.
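For illustration, a sketch of the kind of registration involved in item 1; the import path and decorator signature are assumed from hidet's torch frontend conventions and may differ from the actual code.

```python
import torch
from hidet.graph.frontend.torch.registry import register_method  # assumed import path
from hidet.graph import ops

@register_method(torch.Tensor.sin)
def tensor_sin(self):
    # Dispatch torch.Tensor.sin to hidet's elementwise sin operator.
    return ops.sin(self)

@register_method(torch.Tensor.cos)
def tensor_cos(self):
    return ops.cos(self)
```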
Co-authored-by: zita <zita.zhang@mail.utoronto.ca>
Co-authored-by: Kevin Tong <kevintong0821@gmail.com>
Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
Introduces a `SyncLLM` and `AsyncLLM` interface to interact with the LLM. Closes #164.

### SyncLLM.generate

Takes in 1 or a list of n prompts, and 0, 1, or a list of n sampling parameters.
- If no sampling parameter is provided, greedy sampling is used.
- If 1 prompt and 1 sampling parameter are provided, the return is a single `SequenceOutput`.
- If a list of n prompts and 1 sampling parameter are provided, the sampling parameter is applied to all prompts and the return is a list of `SequenceOutput`.
- If a list of n prompts and a list of n sampling parameters are provided, the sampling parameters are applied respectively to each prompt.
- Any other configuration is invalid.

### AsyncLLM.generate

Takes in 1 prompt and 0 or 1 sampling parameters. The same default as the synchronous version applies if no sampling parameters are provided. _Without blocking_, returns an async iterator over `SequenceOutput`, which is updated with every token generated.

### Usage

Here's an example script to demonstrate the API.

```py
import asyncio
import random

from hidet.apps.llm import create_llm
from hidet.apps.llm.sampler import SamplingParams


async def _demo_async():
    llm = create_llm("meta-llama/Llama-2-7b-chat-hf", is_async=True)
    prompts = [
        "Hello, how are you?",
        "How do you feel about the current political climate?",
        "What is your favorite food?",
        "What is your favorite color?",
        "What is your favorite movie?",
        "What is your favorite book?",
        "What is your favorite song?",
        "What is your favorite animal?",
        "What is your favorite hobby?",
        "When is your birthday?",
    ]

    coros = []
    for prompt in prompts:

        async def f(prompt):
            await asyncio.sleep(random.randint(1, 60))
            print("Incoming request: ", prompt)
            params = SamplingParams(temperature=0.0, max_tokens=random.randint(10, 100))
            stream = llm.generate(prompt, sampling_params=params)
            final = None
            async for output in stream:
                # print(output.tokens)
                final = output
            print("=====")
            print("Completed request: ", prompt)
            print("Output: ", final.output_text)
            print("=====")

        coros.append(f(prompt))

    await asyncio.gather(*coros)


def demo_async():
    asyncio.run(_demo_async())


def demo_sync():
    llm = create_llm("meta-llama/Llama-2-7b-chat-hf", is_async=False)
    prompts = [
        "Hello, how are you?",
        "How do you feel about the current political climate?",
        "What is your favorite food?",
        "What is your favorite color?",
        "What is your favorite movie?",
        "What is your favorite book?",
        "What is your favorite song?",
        "What is your favorite animal?",
        "What is your favorite hobby?",
        "When is your birthday?",
    ]
    for output in llm.generate(prompts):
        print("=====")
        print("Completed request: ", output.prompt)
        print("Output: ", output.output_text)
        print("=====")


if __name__ == "__main__":
    demo_sync()
    # demo_async()
```

---------

Co-authored-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
- Shuffle the workload (candidates) to avoid imbalance in compilation time
- Modify the workload grouping so that the number of jobs matches the number of CPUs

Co-authored-by: Ubuntu <ubuntu@ip-172-31-46-104.us-east-2.compute.internal>
I noticed that we spend significant time on the task creation process in `parallel_imap`. Add a `chunksize` arg to `pool.imap` to decrease the overhead.

**Results.**
`time python bench_op.py matmul_f16 --params 1x4096x4096,1x4096x4096 --dtype float16`
`time python bench_op.py batch_matmul --params 1x4096x4096,1x4096x4096 --dtype float16`

| Test | Before | After | Improvement |
|--------|--------|--------|--------|
| matmul_f16 | 42.768s | 42.138s | 1.5% |
| batch_matmul | 34m29.1s | 34m10.1s | 0.9% |
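For context, a self-contained illustration of the `chunksize` effect using the standard library's `multiprocessing.Pool.imap` (not hidet's `parallel_imap` itself):

```python
from multiprocessing import Pool

def square(x: int) -> int:
    return x * x

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # chunksize batches items per IPC round trip, cutting per-item dispatch
        # overhead; chunksize=1 (the default) sends tasks one at a time.
        results = list(pool.imap(square, range(10_000), chunksize=64))
    print(results[:5])  # [0, 1, 4, 9, 16]
```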
In #342 we accidentally disabled `search_space=2` for the `bench_op.py` regression script. Fixed it.
…332)

[Edit: the issue was encountered while attempting to compile the `yolov3` model.]

Currently the [`setitem`](https://github.com/CentML/hidet/blob/566f0fe55f441326c3034b7eed44b3fa0b03f38d/python/hidet/graph/frontend/torch/register_functions.py#L280) function in Hidet fails in two special scenarios when `setvalue` is a tensor:

1. When `setvalue` and `x` have different dtypes, there is currently an error that looks like:

> RuntimeError: If-then-else operand 1 and 2 have different types (hidet.float16 vs hidet.float32) ((((v < 0) || (2 <= v)) ? false : (((v_1 < 0) || (3 <= v_1)) ? false : (((v_2 < 0) || (3 <= v_2)) ? false : true))) ? setvalue[v_2, v_1, v] : data[v_2, v_1, v]), occurred when interpreting operator.setitem with
> setitem(tensor(...), (Ellipsis, slice(None, 2, None)), tensor(...))

In PyTorch, by contrast, `setvalue` appears to be cast to the same dtype as `x` if possible.

2. When `setvalue` and `x` are on different devices, this currently results in an error:

> RuntimeError: All inputs of an operator must be on the same device, occurred when interpreting operator.setitem with
> setitem(tensor(...), (Ellipsis, slice(None, 2, None)), tensor(...))

In PyTorch, by contrast, `setvalue` is moved to the same device as `x`.
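A hypothetical sketch of the normalization that would match the PyTorch behavior described above; `Tensor.astype` and `Tensor.to` are assumed hidet APIs, and the actual fix may differ.

```python
# Hypothetical helper applied to setvalue before the setitem lowering.
def normalize_setvalue(x, setvalue):
    if setvalue.dtype != x.dtype:
        # Match PyTorch semantics: cast setvalue to x's dtype when possible.
        setvalue = setvalue.astype(x.dtype)
    if setvalue.device != x.device:
        # Match PyTorch semantics: move setvalue to x's device.
        setvalue = setvalue.to(device=x.device)
    return setvalue
```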
Previously, an error was encountered during a model compilation attempt:

> torch._dynamo.exc.BackendCompilerFailed: backend='hidet' raised:
> RuntimeError: Can not interpreting max given arguments:
> max(tensor(...))
> Possible candidates are:
> torch_max_v3(x: hidet.Tensor, dim: Union[int, hidet.ir.expr.Expr], keepdim: bool = False, *, out: Union[hidet.Tensor, Tuple[hidet.Tensor, ...], List[hidet.Tensor]] = None) -> Tuple[hidet.Tensor, hidet.Tensor]
> File "/home/bolin/Desktop/hidet/python/hidet/graph/frontend/torch/register_functions.py", line 1067

This happened even though we do have a [function](https://github.com/CentML/hidet/blob/13a806608d40de2de1fcc682adeea8d204189f3c/python/hidet/graph/frontend/torch/register_functions.py#L1056-L1060) that can interpret `torch.Tensor.max` with the described arguments.
… for conv-bert-base model (#351)

Added support for `torch.multiply` and `torch.nn.functional.unfold`. These ops are needed by the `conv-bert-base` model.

---------

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
- Fixed a CUDA declaration/definition dtype mismatch
- Added 3 more LLMs: mpt-7b, codellama-7b, and mixtral-8x7b. The first two are tested and working fine.

---------

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
Promote the NVIDIA Docker container to version 24.4, which brings PyTorch 2.3. Regression passed: https://github.com/CentML/hidet/actions/runs/9964867474
Introduce `add_hint_pass`. It adds `__builtin_assume(...)` to the generated .cu code, which helps nvcc understand the bounds of `threadIdx` and `blockIdx` and optimize the code better.

**Performance improvements.**

Models
| model | latency | prev_latency | ratio |
|--------|--------|--------|--------|
| bert-base-uncased | 19.8138 | 20.2316 | 2.109 |
| densenet121 | 35.1161 | 36.7627 | 4.689 |
| efficientnet_b0 | 18.9451 | 19.278 | 1.757 |
| mobilenet_v2 | 11.5944 | 11.8764 | 2.432 |
| resnet50 | 29.4878 | 29.9935 | 1.715 |
| vit_b_16 | 125.787 | 123.672 | -1.681 |

Operators
| operator | latency | prev_latency | ratio |
|--------|--------|--------|--------|
| attn | 1.50402 | 1.50131 | -0.18 |
| attn | 0.219707 | 0.227568 | 3.578 |
| attn_mask_add | 1.5892 | 1.62516 | 2.263 |
| attn_mask_add | 0.226317 | 0.226507 | 0.084 |
| batch_matmul | 5.2399 | 5.11547 | -2.375 |
| batch_matmul | 0.0216016 | 0.0223425 | 3.43 |
| conv2d | 0.0347093 | 0.0341758 | -1.537 |
| conv2d | 0.310521 | 0.308458 | -0.664 |
| conv2d_gemm_f16 | 0.142542 | 0.146412 | 2.715 |
| conv2d_gemm_f16 | 2.0421 | 2.07043 | 1.387 |
| matmul_f16 | 2.22432 | 2.30458 | 3.608 |
| matmul_f16 | 0.00888628 | 0.00892615 | 0.449 |
| reduce | 0.01375 | 0.0138618 | 0.813 |
…ents are supported by Hidet (#347)

Currently Hidet cannot compile the `doctr_reco_predictor` model due to the unsupported `torch.Tensor.min`, even though we have already registered the functionally equivalent `torch.min` function. This PR registers all the missing `torch.Tensor` methods whose PyTorch function equivalents are already registered.
When we used `__builtin_unreachable()` as the hint, the bounds information was lost after some code, and a workaround was introduced that added additional hints after loops. After switching to `__builtin_assume()` the issue disappeared, so this PR removes the workaround.

No performance changes. http://10.24.10.108:8868/Build_History
66fd65c (after), 3f955de (before)
Recently, regressions have frequently failed because `start_instance` fails with "Insufficient capacity". Retry starting instances up to 300 times, with a 60-second sleep between attempts. Tested here: https://github.com/CentML/hidet/actions/runs/10000711025/job/27664169588
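For illustration, a Python sketch of the retry policy; the actual change lives in the CI workflow, and `start_instance` here is a stand-in for the real instance-launch step.

```python
import time

def start_instance_with_retries(start_instance, max_attempts: int = 300, sleep_s: int = 60):
    # Retry on transient capacity errors, sleeping between attempts.
    for attempt in range(1, max_attempts + 1):
        try:
            return start_instance()
        except RuntimeError:  # e.g. "Insufficient capacity"
            if attempt == max_attempts:
                raise
            time.sleep(sleep_s)
```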
Seems the CI failed on the self-hosted runners.
Yes, @c-fteixeira is looking into it.
Kind reminder that we need to use "merge" instead of "squash and merge" for this PR.
Sure, I already asked Shang and he enabled the merge and rebase options on this repo.
@yaoyaoding @wangshangsam @hjjq |
LGTM, thanks @vadiklyutiy!
…_fpn` (#455)

Closes #264

The error encountered in the linked issue was due to a subtle difference in type promotion when calling `torch.div` with the argument `rounding_mode='floor'`. Specifically, if both operands are of integer type, the output is still of integer type. This differs from my original implementation, which first calls `truediv` and then `ops.floor`, making the output dtype `float32`.

After fixing this issue, another error was encountered:

```
  File "/home/bolin/Desktop/hidet/python/hidet/graph/frontend/torch/interpreter.py", line 70, in __call__
    return self.forward(*args)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/bolin/Desktop/hidet/python/hidet/graph/frontend/torch/interpreter.py", line 237, in forward
    self._raise_exception(e, node.target, exec_func, hidet_args, hidet_kwargs)
  File "/home/bolin/Desktop/hidet/python/hidet/graph/frontend/torch/interpreter.py", line 186, in _raise_exception
    raise RuntimeError('\n'.join(msg))
torch._dynamo.exc.BackendCompilerFailed: backend='hidet' raised:
RuntimeError: Can not interpret torch.nn.functional.batch_norm given arguments:
  torch.nn.functional.batch_norm(tensor(...), tensor(...), tensor(...), tensor(...), tensor(...), training=False, eps=1e-05)
Possible candidates are:
  batch_norm(x: hidet.Tensor, running_mean: Optional[hidet.Tensor], running_var: Optional[hidet.Tensor], weight: Optional[hidet.Tensor], bias: Optional[hidet.Tensor], training: bool, momentum: float, eps: float)
    File "/home/bolin/Desktop/hidet/python/hidet/graph/frontend/torch/register_functions.py", line 302
```

This PR also fixes that error by adding default values to some parameters of the `batch_norm` function registered for `torch.nn.functional.batch_norm`, to match the signature in the [PyTorch documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.batch_norm.html).
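For reference, the type-promotion difference can be reproduced directly in PyTorch:

```python
import torch

a = torch.tensor([7, 8])  # int64
b = torch.tensor([2, 3])  # int64

# Integer inputs stay integer with rounding_mode='floor' ...
print(torch.div(a, b, rounding_mode='floor'))  # tensor([3, 2]), dtype=torch.int64

# ... while truediv followed by floor promotes to float32.
print(torch.floor(a / b))  # tensor([3., 2.]), dtype=torch.float32
```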
Closes #228

Additionally, while working on PR #455, I noticed that we hadn't registered the function/method `floor_divide`. Adding support for it is straightforward, since it is functionally equivalent to `torch.div(..., rounding_mode='floor')`. I forgot to include the change in that PR, so I am including it here.
Regular sync CentML -> hidet-org