[SYNC] Sync CentML -> hidet-org #455
Conversation
Instead of mocking a ctypes type like `c_pointer_compatible`, these changes make the transformation between Python values and ctypes values more explicit, with direct function calls inside `CompiledFunction`.
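For illustration, here is a minimal sketch of what such explicit conversion helpers could look like; the names `to_ctypes_arg` and `from_ctypes_ret` are hypothetical and not the actual `CompiledFunction` internals.

```python
# Hypothetical sketch of explicit Python <-> ctypes conversion helpers.
import ctypes

def to_ctypes_arg(value):
    # Convert a Python value into the ctypes value expected by the C entry point.
    if isinstance(value, bool):
        return ctypes.c_bool(value)
    if isinstance(value, int):
        return ctypes.c_int32(value)
    if isinstance(value, float):
        return ctypes.c_float(value)
    raise TypeError(f'unsupported argument type: {type(value)}')

def from_ctypes_ret(value):
    # ctypes already unwraps simple return types; pass them through.
    return value
```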
Define the complete UNet, with the forward pass broken into down, mid, and up sections. Useful diagrams [here](http://jalammar.github.io/illustrated-stable-diffusion/). Uses blocks defined in #97. A heavily reduced version of the diffusers implementation, containing only the features necessary for stable diffusion v2-1. Towards #57.

---------

Co-authored-by: vadiklyutiy <156319763+vadiklyutiy@users.noreply.github.com>
Stable diffusion uses fundamentally the same positional embeddings, but since timesteps change, a cache is not possible. There are also small differences in tensor layouts and calculation parameters between the diffusers version and the one from Llama, so I've recreated it here for now. An abstract version that combines both versions is TODO. Towards #57.
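For reference, a minimal sketch of the standard sinusoidal timestep embedding used by diffusers-style UNets; the exact channel ordering and scaling in the hidet implementation may differ.

```python
import math
import torch

def timestep_embedding(timesteps: torch.Tensor, dim: int, max_period: int = 10000) -> torch.Tensor:
    # Standard sinusoidal embedding: half the channels are cos, half are sin.
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = timesteps[:, None].float() * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

# Example: embeddings for timesteps 0..3 with 320 channels.
print(timestep_embedding(torch.arange(4), 320).shape)  # torch.Size([4, 320])
```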
With CentML/hidet#69 there will be a lot more C++ code introduced into the runtime, so I think it's a good idea to have some standardization. For now this only does formatting (no linting, which takes more work to set up and is more opinionated about right vs. wrong).

Summary of changes:
- Update `format.sh` to support formatting just Python, C++, or both
- Add `clang-format` to the existing lint/format workflow
- Apply `clang-format` to existing code; the configuration is set up to minimize the number of changes, and the float16/bfloat16 code is excluded

Example workflow failure @ 4cc430c:
<img width="1155" alt="image" src="https://github.com/CentML/hidet/assets/43303581/9566e9dd-bd01-4638-b556-11afaf7e6e52">
Add UNet Down, Up, and Mid block definitions and an attention transformer utility layer. Modules are designed so that the kwargs passed to constructors are all the same config from Hugging Face with minimal changes; there are lots of shared values and too many parameters to list individually. The same kwargs are passed to nested objects. Open to other suggestions, although this is a single-use-case problem. Towards #57.
Adds support for the LLaMA, GPT-2, and OPT tokenizers using the Hugging Face configuration.
Infrastructure for the compiled stable diffusion app. Towards #57.
**Context:** I made these changes to help with debugging Gemma; the dump produces many operators, and this makes it easier, for example, to find which operators involve the input IDs / position IDs / KV-cache.

**Summary of changes:**
- Add the missing `dump_op` parameter to `ctx.debug()`
- Dump input indices (e.g. `@23`) in the operator dump
- Prevent `dump_op` and `dump_outputs` from overriding each other in the single-output case

This is an example `41_Concat_def.txt` taken from my Gemma implementation, which corresponds to concatenating past keys in the KV-cache with the current keys. The `Inputs` field shows the indices of the operator inputs, which might be another operator's output `@n` or some graph input `@in:n`.

```
Operator: Concat(0: float32(bs, 1, past_seq_len, 256), 1: float32(bs, 1, seq_len, 256), axis=2)
Inputs:
  0 <- @in:2
  1 <- @40
Task: Task(
  name: concat
  parameters:
    x0: tensor(float32, [bs, 1, past_seq_len, 256])
    x1: tensor(float32, [bs, 1, seq_len, 256])
    out: tensor(float32, [bs, 1, (past_seq_len + seq_len), 256])
  inputs: [x0, x1]
  outputs: [out]
  computations:
    out: float32[bs, 1, (past_seq_len + seq_len), 256] where out[v, v_1, v_2, v_3] = ((v_2 < past_seq_len) ? x0[v, v_1, v_2, v_3] : x1[v, v_1, (v_2 - past_seq_len), v_3])
  attributes: {}
)
```
…ion(`implement*`) (#127)

This PR parallelizes:
- `apply_prolog_epilog` (fusion)
- IR generation (`implement*`)

Right now this is implemented for the host only (no offload to the compilation server).

resnet50 compilation speed on g5.16xlarge
`time python tests/benchmarks/bench_vision.py resnet50 --params 1x3x224x224 --dtype float16`
Before: 14m 45s
After: 12m 51s
Speedup: 14.8%

matmul compilation speed on g5.16xlarge
`time python tests/benchmarks/bench_op.py batch_matmul --params 1x4096x4096,1x4096x4096 --dtype float16`
Before: 5m 54s
After: 5m 31s
Speedup: 6.9%
Fix `__shfl_xor_sync`. I don't know why `__shfl_xor_sync` was an alias of `__shfl_down_sync`; is this intentional?

Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
Closes #450. Output of the example code provided in the issue:

```
/home/jack/dev/hidet/venv/bin/python3.8 /home/jack/.config/JetBrains/RemoteDev-PY/_home_jack_dev_hidet/scratches/scratch_2.py
Compiling cpu task tan(x=float32(2, 2), y=float32(2, 2))...
Tensor(shape=(2, 2), dtype='float32', device='cpu')
[[  0.2568644  -1.0825194]
 [-32.35311    -1.5977247]]
```
The previous implementation is incorrect when dealing with a pair of dimensions that are both symbolic. Minimal example:

```python
import hidet

if __name__ == "__main__":
    x = hidet.symbol(["n"])
    y = hidet.symbol(["m"])
    z = x + y
    print(x.shape, y.shape, z.shape)  # before: (n,) (m,) (m,)
```
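For illustration, a minimal sketch of a pairwise broadcast rule that refuses to silently pick a side when both dimensions are symbolic; `is_constant` is a simplification, and the actual hidet fix may resolve such pairs differently.

```python
# Simplification: treat plain Python ints as constants, anything else as symbolic.
def is_constant(d) -> bool:
    return isinstance(d, int)

def broadcast_dim(a, b):
    if a == b:
        return a
    if is_constant(a) and a == 1:
        return b
    if is_constant(b) and b == 1:
        return a
    if is_constant(a) and is_constant(b):
        raise ValueError(f"cannot broadcast dimensions {a} and {b}")
    # Two distinct symbolic dimensions: the old code silently picked one side,
    # yielding (m,) in the example above; refusing is the safe behavior.
    raise ValueError(f"cannot broadcast symbolic dimensions {a} and {b}")

print(broadcast_dim(1, "n"))    # n
print(broadcast_dim("n", "n"))  # n
# broadcast_dim("n", "m") now raises instead of silently returning "m"
```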
**Overview**
Specialize the function `Constant._binary()` for a compilation speedup.

**Compilation time improvement results**
matmul_f16 with `max_parallel_jobs=1`
Before: 2m 11.2s
After: 2m 4.4s
Speedup: 5.5%

**Additional test**
matmul_f16 has 177 candidates. I checked that all of them remained the same (no functional changes).
- The attention scaling factor should be based on the head dimension (see the sketch below).
- The option name `tokens.for_huggingface` is incorrect; see the following: https://github.com/CentML/hidet/blob/eefc9d81afe687e9173c65c68fc3c7eb4e3019a7/python/hidet/option.py#L299-L304

With these changes the LLM app runs correctly before tracing into FlowGraph. Those changes will come later; I'm isolating these minor changes into their own PR here.
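For reference, a minimal sketch of the first fix, assuming the usual convention of scaling attention scores by the square root of the head dimension rather than the full hidden size:

```python
import math
import torch

def attention_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # q, k: [batch, heads, seq, head_dim]. Scale by head_dim, not by the
    # full hidden size (heads * head_dim).
    head_dim = q.shape[-1]
    return (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)

q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 8, 16, 64)
print(attention_scores(q, k).shape)  # torch.Size([1, 8, 16, 16])
```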
Allow access to cluster attributes inside Hidet kernels. Launch kernels with distributed shared memory. See the docs:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#distributed-shared-memory
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-block-clusters
API: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cluster-group-cg

Towards supporting #102 by adding a cluster rank primitive in Hidet. See `test_cluster.py` for example usage. To run the test on Hopper machines, use `pytest --hopper`.
Gemma + torch.compile fixes:
- process `_enter_autocast` and `_exit_autocast` as no-ops
- support `truediv(float, Tensor)`
- add eager-mode support to `tests/benchmarks`
The current exit hook is a no-op
Removes kwargs from stable diffusion app components. Adds documentation and sample code.
Support the transpose operator only for rank == 2 tensors.

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-46-104.us-east-2.compute.internal>
Co-authored-by: Max Hu <hyoung2991@gmail.com>
Revive dynamic shape support with `torch.compile`. It was broken due to changes in the PyTorch interface.
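A minimal usage sketch, assuming a CUDA device is available and that importing `hidet` registers the `hidet` backend for `torch.compile`:

```python
import torch
import hidet  # registers the 'hidet' backend for torch.compile

model = torch.nn.Linear(16, 16).cuda()
compiled = torch.compile(model, backend='hidet', dynamic=True)

# With dynamic shapes, different batch sizes reuse the same compiled artifact.
print(compiled(torch.randn(4, 16, device='cuda')).shape)
print(compiled(torch.randn(8, 16, device='cuda')).shape)
```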
Adds ResNet and image classifier pipeline functionality. Includes changes from #428. See the Hugging Face implementation for the original API inspiration. Resolves CentML/hidet#60.
…d` (#175)

1. Add `torch.Tensor.sin` and `torch.Tensor.cos` to `register_method` (see the sketch below). Gemma passes after that.
2. Add `torch._C._nn.pad`. The test workflow works with torch 2.3.0 after that.
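For illustration, a sketch of the kind of registration involved in item 1; the import path and decorator signature are assumed from hidet's torch frontend conventions and may differ from the actual code.

```python
import torch
from hidet.graph.frontend.torch.registry import register_method  # assumed import path
from hidet.graph import ops

@register_method(torch.Tensor.sin)
def tensor_sin(self):
    # Dispatch torch.Tensor.sin to hidet's elementwise sin operator.
    return ops.sin(self)

@register_method(torch.Tensor.cos)
def tensor_cos(self):
    return ops.cos(self)
```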
Co-authored-by: zita <zita.zhang@mail.utoronto.ca>
Co-authored-by: Kevin Tong <kevintong0821@gmail.com>
Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
Introduces a `SyncLLM` and `AsyncLLM` interface to interact with the LLM. Closes #164.

### SyncLLM.generate

Takes in 1 or a list of n prompts, and 0, 1, or a list of n sampling parameters.
- If no sampling parameter is provided, greedy sampling is used.
- If 1 prompt and 1 sampling parameter are provided, the return is a single `SequenceOutput`.
- If a list of n prompts and 1 sampling parameter are provided, the sampling parameter is applied to all prompts and the return is a list of `SequenceOutput`.
- If a list of n prompts and a list of n sampling parameters are provided, the sampling parameters are applied respectively to each prompt.
- Any other configuration is invalid.

### AsyncLLM.generate

Takes in 1 prompt and 0 or 1 sampling parameters. The same default as the synchronous version applies if no sampling parameters are provided. _Without blocking_, returns an async iterator over `SequenceOutput`, which is updated with every token generated.

### Usage

Here's an example script to demonstrate the API.

```py
import asyncio
import random

from hidet.apps.llm import create_llm
from hidet.apps.llm.sampler import SamplingParams


async def _demo_async():
    llm = create_llm("meta-llama/Llama-2-7b-chat-hf", is_async=True)
    prompts = [
        "Hello, how are you?",
        "How do you feel about the current political climate?",
        "What is your favorite food?",
        "What is your favorite color?",
        "What is your favorite movie?",
        "What is your favorite book?",
        "What is your favorite song?",
        "What is your favorite animal?",
        "What is your favorite hobby?",
        "When is your birthday?",
    ]

    coros = []
    for prompt in prompts:

        async def f(prompt):
            await asyncio.sleep(random.randint(1, 60))
            print("Incoming request: ", prompt)
            params = SamplingParams(temperature=0.0, max_tokens=random.randint(10, 100))
            stream = llm.generate(prompt, sampling_params=params)
            final = None
            async for output in stream:
                # print(output.tokens)
                final = output
            print("=====")
            print("Completed request: ", prompt)
            print("Output: ", final.output_text)
            print("=====")

        coros.append(f(prompt))

    await asyncio.gather(*coros)


def demo_async():
    asyncio.run(_demo_async())


def demo_sync():
    llm = create_llm("meta-llama/Llama-2-7b-chat-hf", is_async=False)
    prompts = [
        "Hello, how are you?",
        "How do you feel about the current political climate?",
        "What is your favorite food?",
        "What is your favorite color?",
        "What is your favorite movie?",
        "What is your favorite book?",
        "What is your favorite song?",
        "What is your favorite animal?",
        "What is your favorite hobby?",
        "When is your birthday?",
    ]
    for output in llm.generate(prompts):
        print("=====")
        print("Completed request: ", output.prompt)
        print("Output: ", output.output_text)
        print("=====")


if __name__ == "__main__":
    demo_sync()
    # demo_async()
```

---------

Co-authored-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
- Shuffle the workload (candidates) to avoid imbalance in compilation time
- Modify the workload grouping so that the number of jobs matches the number of CPUs

Co-authored-by: Ubuntu <ubuntu@ip-172-31-46-104.us-east-2.compute.internal>
I noticed that we spend significant time on the task creation process in `parallel_imap`. Add a `chunksize` arg to `pool.imap` to decrease the overhead.

**Results.**
`time python bench_op.py matmul_f16 --params 1x4096x4096,1x4096x4096 --dtype float16`
`time python bench_op.py batch_matmul --params 1x4096x4096,1x4096x4096 --dtype float16`

| Test | Before | After | Improvement |
|--------|--------|--------|--------|
| matmul_f16 | 42.768s | 42.138s | 1.5% |
| batch_matmul | 34m29.1s | 34m10.1s | 0.9% |
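For context, a self-contained illustration of the `chunksize` effect using the standard library's `multiprocessing.Pool.imap` (not hidet's `parallel_imap` itself):

```python
from multiprocessing import Pool

def square(x: int) -> int:
    return x * x

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # chunksize batches items per IPC round trip, cutting per-item dispatch
        # overhead; chunksize=1 (the default) sends tasks one at a time.
        results = list(pool.imap(square, range(10_000), chunksize=64))
    print(results[:5])  # [0, 1, 4, 9, 16]
```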
In #342 we accidentally disabled `search_space=2` for the `bench_op.py` regression script. Fixed it.
…332)

[Edit: the issue was encountered while attempting to compile the `yolov3` model.]

Currently the [`setitem`](https://github.com/CentML/hidet/blob/566f0fe55f441326c3034b7eed44b3fa0b03f38d/python/hidet/graph/frontend/torch/register_functions.py#L280) function in Hidet fails in two special scenarios when `setvalue` is a tensor:

1. When `setvalue` and `x` have different dtypes, there is currently an error that looks like:

> RuntimeError: If-then-else operand 1 and 2 have different types (hidet.float16 vs hidet.float32) ((((v < 0) || (2 <= v)) ? false : (((v_1 < 0) || (3 <= v_1)) ? false : (((v_2 < 0) || (3 <= v_2)) ? false : true))) ? setvalue[v_2, v_1, v] : data[v_2, v_1, v]), occurred when interpreting operator.setitem with
> setitem(tensor(...), (Ellipsis, slice(None, 2, None)), tensor(...))

In PyTorch, by contrast, `setvalue` appears to be cast to the same dtype as `x` if possible.

2. When `setvalue` and `x` are on different devices, this currently results in an error:

> RuntimeError: All inputs of an operator must be on the same device, occurred when interpreting operator.setitem with
> setitem(tensor(...), (Ellipsis, slice(None, 2, None)), tensor(...))

In PyTorch, by contrast, `setvalue` is moved to the same device as `x`.
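A hypothetical sketch of the normalization that would match the PyTorch behavior described above; `Tensor.astype` and `Tensor.to` are assumed hidet APIs, and the actual fix may differ.

```python
# Hypothetical helper applied to setvalue before the setitem lowering.
def normalize_setvalue(x, setvalue):
    if setvalue.dtype != x.dtype:
        # Match PyTorch semantics: cast setvalue to x's dtype when possible.
        setvalue = setvalue.astype(x.dtype)
    if setvalue.device != x.device:
        # Match PyTorch semantics: move setvalue to x's device.
        setvalue = setvalue.to(device=x.device)
    return setvalue
```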
Previously, an error was encountered during a model compilation attempt:

> torch._dynamo.exc.BackendCompilerFailed: backend='hidet' raised:
> RuntimeError: Can not interpreting max given arguments:
> max(tensor(...))
> Possible candidates are:
> torch_max_v3(x: hidet.Tensor, dim: Union[int, hidet.ir.expr.Expr], keepdim: bool = False, *, out: Union[hidet.Tensor, Tuple[hidet.Tensor, ...], List[hidet.Tensor]] = None) -> Tuple[hidet.Tensor, hidet.Tensor]
> File "/home/bolin/Desktop/hidet/python/hidet/graph/frontend/torch/register_functions.py", line 1067

This happened even though we do have a [function](https://github.com/CentML/hidet/blob/13a806608d40de2de1fcc682adeea8d204189f3c/python/hidet/graph/frontend/torch/register_functions.py#L1056-L1060) that can interpret `torch.Tensor.max` with the described arguments.
… for conv-bert-base model (#351)

Added support for `torch.multiply` and `torch.nn.functional.unfold`. These ops are needed by the `conv-bert-base` model.

---------

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
- Fixed a CUDA declaration/definition dtype mismatch
- Added 3 more LLMs: mpt-7b, codellama-7b, and mixtral-8x7b. The first two are tested and working fine.

---------

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
Promote the NVIDIA Docker container to version 24.4, which brings PyTorch 2.3. Regression passed: https://github.com/CentML/hidet/actions/runs/9964867474
Introduce `add_hint_pass`. It adds `__builtin_assume(...)` to the generated .cu code, which helps nvcc understand the bounds of `threadIdx` and `blockIdx` and optimize the code better.

**Performance improvements.**

Models
| model | latency | prev_latency | ratio |
|--------|--------|--------|--------|
| bert-base-uncased | 19.8138 | 20.2316 | 2.109 |
| densenet121 | 35.1161 | 36.7627 | 4.689 |
| efficientnet_b0 | 18.9451 | 19.278 | 1.757 |
| mobilenet_v2 | 11.5944 | 11.8764 | 2.432 |
| resnet50 | 29.4878 | 29.9935 | 1.715 |
| vit_b_16 | 125.787 | 123.672 | -1.681 |

Operators
| operator | latency | prev_latency | ratio |
|--------|--------|--------|--------|
| attn | 1.50402 | 1.50131 | -0.18 |
| attn | 0.219707 | 0.227568 | 3.578 |
| attn_mask_add | 1.5892 | 1.62516 | 2.263 |
| attn_mask_add | 0.226317 | 0.226507 | 0.084 |
| batch_matmul | 5.2399 | 5.11547 | -2.375 |
| batch_matmul | 0.0216016 | 0.0223425 | 3.43 |
| conv2d | 0.0347093 | 0.0341758 | -1.537 |
| conv2d | 0.310521 | 0.308458 | -0.664 |
| conv2d_gemm_f16 | 0.142542 | 0.146412 | 2.715 |
| conv2d_gemm_f16 | 2.0421 | 2.07043 | 1.387 |
| matmul_f16 | 2.22432 | 2.30458 | 3.608 |
| matmul_f16 | 0.00888628 | 0.00892615 | 0.449 |
| reduce | 0.01375 | 0.0138618 | 0.813 |
…ents are supported by Hidet (#347)

Currently Hidet cannot compile the `doctr_reco_predictor` model due to the unsupported `torch.Tensor.min`, even though we have already registered the functionally equivalent `torch.min` function. This PR registers all the missing `torch.Tensor` methods whose PyTorch function equivalents are already registered.
When we used `__builtin_unreachable()` as the hint, the bounds information was lost after some code, and a workaround was introduced that added additional hints after loops. After switching to `__builtin_assume()` the issue disappeared, so this PR removes the workaround.

No performance changes. http://10.24.10.108:8868/Build_History
66fd65c (after), 3f955de (before)
Recently, regressions have frequently failed because `start_instance` fails with "Insufficient capacity". Retry starting instances up to 300 times, with a 60-second sleep between attempts. Tested here: https://github.com/CentML/hidet/actions/runs/10000711025/job/27664169588
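For illustration, a Python sketch of the retry policy; the actual change lives in the CI workflow, and `start_instance` here is a stand-in for the real instance-launch step.

```python
import time

def start_instance_with_retries(start_instance, max_attempts: int = 300, sleep_s: int = 60):
    # Retry on transient capacity errors, sleeping between attempts.
    for attempt in range(1, max_attempts + 1):
        try:
            return start_instance()
        except RuntimeError:  # e.g. "Insufficient capacity"
            if attempt == max_attempts:
                raise
            time.sleep(sleep_s)
```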
Seems the CI failed on the self-hosted runners.
Yes, @c-fteixeira is looking into it.
Kind reminder that we need to use "merge" instead of "squash and merge" for this PR.
Sure, I already asked Shang and he enabled the merge and rebase options on this repo.
@yaoyaoding @wangshangsam @hjjq |
LGTM, thanks @vadiklyutiy!
…_fpn` (#455)

Closes #264

The error encountered in the linked issue was due to a subtle difference in type promotion when calling `torch.div` with the argument `rounding_mode='floor'`. Specifically, if both operands are of integer type, the output is still of integer type. This differs from my original implementation, which first calls `truediv` and then `ops.floor`, making the output dtype `float32`.

After fixing this issue, another error was encountered:

```
  File "/home/bolin/Desktop/hidet/python/hidet/graph/frontend/torch/interpreter.py", line 70, in __call__
    return self.forward(*args)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/bolin/Desktop/hidet/python/hidet/graph/frontend/torch/interpreter.py", line 237, in forward
    self._raise_exception(e, node.target, exec_func, hidet_args, hidet_kwargs)
  File "/home/bolin/Desktop/hidet/python/hidet/graph/frontend/torch/interpreter.py", line 186, in _raise_exception
    raise RuntimeError('\n'.join(msg))
torch._dynamo.exc.BackendCompilerFailed: backend='hidet' raised:
RuntimeError: Can not interpret torch.nn.functional.batch_norm given arguments:
  torch.nn.functional.batch_norm(tensor(...), tensor(...), tensor(...), tensor(...), tensor(...), training=False, eps=1e-05)
Possible candidates are:
  batch_norm(x: hidet.Tensor, running_mean: Optional[hidet.Tensor], running_var: Optional[hidet.Tensor], weight: Optional[hidet.Tensor], bias: Optional[hidet.Tensor], training: bool, momentum: float, eps: float)
    File "/home/bolin/Desktop/hidet/python/hidet/graph/frontend/torch/register_functions.py", line 302
```

This PR also fixes that error by adding default values to some parameters of the `batch_norm` function registered for `torch.nn.functional.batch_norm`, to match the signature in the [PyTorch documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.batch_norm.html).
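For reference, the type-promotion difference can be reproduced directly in PyTorch:

```python
import torch

a = torch.tensor([7, 8])  # int64
b = torch.tensor([2, 3])  # int64

# Integer inputs stay integer with rounding_mode='floor' ...
print(torch.div(a, b, rounding_mode='floor'))  # tensor([3, 2]), dtype=torch.int64

# ... while truediv followed by floor promotes to float32.
print(torch.floor(a / b))  # tensor([3., 2.]), dtype=torch.float32
```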
Closes #228

Additionally, while working on PR #455, I noticed that we hadn't registered the function/method `floor_divide`. Adding support for it is straightforward, since it is functionally equivalent to `torch.div(..., rounding_mode='floor')`. I forgot to include the change in that PR, so I am including it here.
Regular sync CentML -> hidet-org