Sync CentML/hidet -> hidet-org/hidet #476

Merged
132 commits merged into main on Dec 20, 2024

Conversation

vadiklyutiy
Collaborator

Sync CentML/hidet -> hidet-org/hidet

vadiklyutiy and others added 30 commits December 19, 2024 22:06
- Fix a bug in `normalize_launch_dims()`
- Increase the tolerance in tests/cuda/test_cudnn.py, which fails frequently.
Cast the return value of `get_parallel_num_workers` to float. Fixes #388
…k/run_config.json (#205)

Added the successfully compiled models to the json file, as discussed in
#192

Models that are runnable after compilation and pass the correctness checks:

- `mnasnet1_0`
- `resnext50_32x4d`
- `shufflenet_v2_x1_0`
- `squeezenet1_1`
- `mobilenet_v3_large`
- `resnet18`

(All tested with input shape `128x3x224x224`, and data type `float16`)

Regression run:
https://github.com/CentML/hidet/actions/runs/10112541015/job/27967049854

Issue about the correctness checks for models `alexnet` and `vgg16`:
#375

---------

Co-authored-by: vadiklyutiy <156319763+vadiklyutiy@users.noreply.github.com>
Co-authored-by: zhumakhan <zhumakhan.nazir@centml.ai>
Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
Co-authored-by: c-fteixeira <163031151+c-fteixeira@users.noreply.github.com>
Co-authored-by: xiaocenxiaocen <zx_math@yeah.net>
Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
Co-authored-by: Shang Wang <samshang.wang@mail.utoronto.ca>
- Sync `requirements.txt` with the requirements in `setup.py`
- Add `extras_require`
- Require torch >= 2.3.0
Save the `Task` pickle in the translation cache.

The reason: during performance analysis it is very convenient to extract a
smaller test case from the cached task. Supporting scripts will come soon.
This issue (CentML/hidet#56) did not let us change the transformers version
in hidet. It is now fixed so that hidet works with newer transformers
versions. Once the above issue is closed we can relax the transformers
version constraint in hidet to allow the latest release.

---------

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
In the gpt-neo model (related issue:
CentML/hidet#338), torch.where receives tensors
with different dtypes. Added type casting to fix the issue.
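
As a hedged illustration of the issue in plain PyTorch (not hidet's exact code): `torch.where` promotes mismatched operand dtypes in eager mode, so a backend that maps the op directly needs an explicit cast to a common dtype.

```python
import torch

cond = torch.tensor([True, False])
a = torch.tensor([1.0, 2.0], dtype=torch.float16)
b = torch.tensor([3, 4], dtype=torch.int64)

# Eager PyTorch promotes the branches to a common dtype automatically.
out = torch.where(cond, a, b)

# The fix in spirit: cast both branches explicitly before lowering the op.
common = torch.promote_types(a.dtype, b.dtype)
out_cast = torch.where(cond, a.to(common), b.to(common))
assert out.dtype == out_cast.dtype
```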

---------

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
Some models initialize tensors during the first forward pass and reuse them
in subsequent iterations. This causes the model to recompile. One temporary
solution is to run the torch model once before compilation. Related issue:
CentML/hidet#291
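
A minimal sketch of the workaround, assuming a toy model that lazily creates a buffer on its first call (the model and shapes are made up for illustration):

```python
import torch
import torch.nn as nn

class LazyInitModel(nn.Module):
    """Toy model that creates a tensor on the first forward pass and reuses it."""
    def __init__(self):
        super().__init__()
        self.scale = None

    def forward(self, x):
        if self.scale is None:          # lazily initialized on the first call
            self.scale = torch.ones_like(x)
        return x * self.scale

model = LazyInitModel().cuda().eval()
x = torch.randn(8, 16, device='cuda')

model(x)                                           # eager warm-up creates `self.scale`
compiled = torch.compile(model, backend='hidet')   # tracing now sees a stable graph
print(compiled(x).shape)
```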

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
…autograd.function.FunctionCtx (#394)

Fixed issues that appeared in deberta:
CentML/hidet#340

---------

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
Add an accuracy check for Hugging Face LLMs in the regression tests.

`rtol=0.01` and `atol=0.065` were chosen so that previously "accurate"
models do not fail the check.
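
A minimal sketch of what such a tolerance check could look like (the tensor names are placeholders; the actual regression script may differ):

```python
import torch

torch_out = torch.randn(4, 128, dtype=torch.float16)          # eager reference output
hidet_out = torch_out + 0.01 * torch.randn_like(torch_out)    # stand-in for the compiled output

ok = torch.allclose(hidet_out, torch_out, rtol=0.01, atol=0.065)
print("accuracy check passed:", ok)
```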

---------

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
The issue is caused by a wrong layout for the bias tensor.
For example, consider a bias tensor of shape (64,); its layout can be written as
`(64,): (1,)`
We can expand this layout by adding size-1 axes, for example
`(64, 1): (1, 1)`
Since the extent of the added axis is 1, its stride can be any number: a stride
paired with a size-1 axis does not affect the address computation. However, two
strides equal to one do influence instruction selection, and the invalid memory
instruction leads to the misaligned access.
To fix this issue, we force the stride paired with a size-1 axis to be 0. The
layout is equivalent when computing the memory address, and this helps the
compiler make the right decision in the instruction-selection pass.
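
A small illustration (not hidet code) of why the stride next to a size-1 axis is free for addressing but still visible to instruction selection:

```python
def address(indices, strides):
    # Linear address of an element given its indices and the layout strides.
    return sum(i * s for i, s in zip(indices, strides))

shape = (64, 1)
for strides in [(1, 1), (1, 0)]:
    # The second index is always 0 because its extent is 1, so both stride
    # choices produce identical addresses for every element ...
    assert all(address((i, 0), strides) == i for i in range(shape[0]))
# ... yet `(1, 1)` vs `(1, 0)` can steer vectorized-instruction selection
# differently, which is what led to the misaligned access.
```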
closes #404

Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
…tion `parallel_tune` is set to 1 (#406)

We have several parallel build steps that are not controlled by the
`hidet.option.parallel_tune` option. This makes it difficult to do a
non-parallel build for debugging purposes. This PR fixes the issue.
1. Added  `torch.Tensor.as_strided` and `torch.flip`
2. Added support for `rounding_mode == 'trunc'` in `torch.divide`
3. Registered `torch.new_ones`
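
A hedged sketch exercising the newly registered ops through the hidet backend (shapes and values are arbitrary):

```python
import torch

def f(x):
    y = x.flip(dims=[0])                             # torch.flip
    z = torch.divide(y, 3, rounding_mode='trunc')    # truncating division
    w = x.new_ones(x.shape)                          # Tensor.new_ones
    v = x.as_strided(size=(2, 2), stride=(1, 1))     # Tensor.as_strided
    return z.sum() + w.sum() + v.sum()

x = torch.arange(8, dtype=torch.float32, device='cuda')
compiled = torch.compile(f, backend='hidet')
print(compiled(x))
```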




Longformer model compilation fails with:
```
RuntimeError: cudaDeviceSynchronize failed with error: cudaErrorMisalignedAddress
```
after running the `fused_matmul_f16_pk_cute_rearrange_add` kernel. NVIDIA
Nsight Compute also shows that the matmul kernel fails to launch. This PR
contains all changes needed to reproduce this issue.

To reproduce:
1. Check out the `zhumakhan/longformer` branch
2. Run `python3 tests/benchmarks/bench_transformer.py longformer`

---------

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
Updated the torch Docker image from nvcr.io/nvidia/pytorch:24.04-py3 to
nvcr.io/nvidia/pytorch:24.07-py3 to resolve the issue below:
```
/usr/local/lib/python3.10/dist-packages/transformer_engine_extensions.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
```

---------

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
- Add an `exp2` primitive for float types.

This primitive is useful when optimizing flash attention.
Specifically, flash attention rewrites the exponential function as
```
# log2_e = 1.44269504
exp(a) = exp2(a * log2_e)
```
This transformation hints the nvcc compiler to generate better code
(more ffma instructions instead of separate fmul and fadd instructions).
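
A quick numerical check of the identity behind the rewrite (plain Python, independent of the new primitive):

```python
import math

log2_e = 1.44269504
for a in (-3.0, 0.0, 1.5, 10.0):
    # exp(a) == 2 ** (a * log2(e)) up to the precision of the truncated constant
    assert math.isclose(math.exp(a), 2.0 ** (a * log2_e), rel_tol=1e-6)
```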

---------

Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
…barrier (#414)

Add primitives:
- `prmt`
- `lop3`
- `sub_f16x2`, `fma_f16x2`
- `barrier`

See the tests and function documentation for the usage of each
primitive.
`torch.pow` accepts the following types of operands:
1. (a: Tensor, b: Tensor)
2. (a: Numeric, b: Tensor) -> the missing case handled by this PR (see the sketch below)
3. (a: Tensor, b: Numeric)
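
A hedged sketch of the previously missing case, a numeric base with a tensor exponent, run through the hidet backend:

```python
import torch

def f(exponent):
    return torch.pow(2.0, exponent)          # case 2: (Numeric, Tensor)

exp = torch.arange(4, dtype=torch.float32, device='cuda')
compiled = torch.compile(f, backend='hidet')
print(compiled(exp))                          # tensor([1., 2., 4., 8.], device='cuda:0')
```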

---------

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
…#421)

Closes #265 

Additionally, slightly changed the code supporting
`torch.nn.EmbeddingBag` so that it supports more data types.
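
A small usage sketch (the shapes and dtype are illustrative assumptions) of `torch.nn.EmbeddingBag` with a non-default dtype under the hidet backend:

```python
import torch
import torch.nn as nn

bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode='sum',
                      device='cuda', dtype=torch.float16)
indices = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9], device='cuda')
offsets = torch.tensor([0, 4], device='cuda')     # two bags of four indices each

compiled = torch.compile(bag, backend='hidet')
print(compiled(indices, offsets).shape)           # torch.Size([2, 4])
```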
This PR adds two utility functions that let us profile a function
with nsys/ncu from within the code.

Usage:

```python
import os
import pytest
import torch
from hidet.utils.nsys_utils import nsys_run, nsys_get_path

def func(b):
    a = torch.randn(1000, device='cuda')
    a + b


def main():
    # profile with nsight system, store the result to 'nsys-reports/...'
    report = nsys_run(func, b=1)

    # we can visualize the profiling result by calling the `visualize` method.
    report.visualize()

if __name__ == '__main__':
    # we need to wrap the call to `main` inside '__main__' so that this script can be "imported" without 
    # side effect, which is one requirement for the utility
    main()
```
In the regression tests, an accuracy comparison was added for LLMs, but it
broke the vision model benchmarking script:
```
Running command: python /workspace/hidet/tests/benchmarks/bench_transformer.py bert-base-uncased --params bs=16,seqlen=256 --dtype float16 --backend hidet --mode max-autotune
  File "/workspace/hidet/tests/benchmarks/run_tests.py", line 82, in <module>
    latency = float(outputs.split('\n')[-2])
ValueError: could not convert string to float: "       device='cuda:0', dtype=torch.float16))"
```
Fixed it!

---------

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
Allow a prologue for fp32 `reduce`.

`reduce` uses vectorized calculations that do not allow fusing (it is possible but not implemented yet). For fp32 there is no vectorization, so we can enable fusion (with a small modification to the `reduce` kernels themselves).

Motivation: in llama2, part of the calculation is fp32, including `pow` + `reduce`.

Performance improvement on llama2-7B: +0.241%
Fix forgotten version promotion
Right now `pow` with a constant exponent is implemented naively: we convert
the constant to a constant tensor and run an elementwise `pow` of two tensors.
It is simple but not always efficient.

llama2 (the RMSNorm part) has `x*x`, which is implemented as `tensor.pow(2)`.

Convert `pow(x, 2)` to `x*x` (see the sketch below).

The improvement on llama2-7B is around **0.237%**.
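
A hedged sketch of the rewrite idea (not hidet's actual pass), shown as a plain Python helper:

```python
import torch

def rewrite_pow(x: torch.Tensor, exponent):
    # Replace pow(x, 2) with x * x when the exponent is the constant 2;
    # otherwise fall back to the general elementwise pow.
    if isinstance(exponent, (int, float)) and exponent == 2:
        return x * x
    return torch.pow(x, exponent)

x = torch.randn(8, device='cuda')
assert torch.allclose(rewrite_pow(x, 2), torch.pow(x, 2))
```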
….Tensor.scatter_add_` (#429)

Closes #424 

The additional bug described in the comments of the linked
issue ([here](CentML/hidet#424 (comment)))
is caused by accessing a PyTorch tensor in [this
line](https://github.com/CentML/hidet/blob/18f68ae34d8a08ca1b38ee00ac2ca7f15e599d0b/python/hidet/runtime/compiled_task.py#L161)
while we were supposed to be accessing a hidet tensor.
zhumakhan and others added 16 commits December 19, 2024 22:13
With torch==2.5, all weights are treated as inputs. This PR changes the
compilation process so that weights are constant tensors in our
FlowGraph. More context in the issue
CentML/hidet#557 and the closed & reverted PR
CentML/hidet#609.

---------

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
This PR speeds up the hot start of hidet. In contrast to a cold start, which
means "nothing in the translation cache", a hot start means compilation when
all tasks are already compiled and cached.

Historically we never optimized the hot start, but in the vllm use case it
takes significant time and slows down model loading.

Hot-start compilation, vllm+hidet, Llama-3.1-8B-Instruct, g5.xlarge AWS
instance with 4 vCPUs, hidet compilation time:
**Before: 121s
After: 51s**

Changes:
- use a manual implementation of tree copy that skips existing
files/folders (see the sketch below)
- cache the string representation of `Task`

PS: With debug dumping we copy a lot of files, so the first point also
helps there.
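
A minimal sketch of the first change, assuming the goal is simply to skip files that already exist in the destination (not hidet's exact code):

```python
import os
import shutil

def copy_tree_skip_existing(src: str, dst: str) -> None:
    # Walk the source tree and copy only files that are missing in `dst`,
    # so a hot start does not re-copy an already-populated cache.
    for root, _dirs, files in os.walk(src):
        target_dir = os.path.join(dst, os.path.relpath(root, src))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            target = os.path.join(target_dir, name)
            if not os.path.exists(target):
                shutil.copy2(os.path.join(root, name), target)
```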
**Summary of Changes:**
The `build_task` and `build_module` functions were refactored to support
parallel-build synchronization using file-based locks (`flock`). This
ensures that in a multiprocessing environment only one process compiles
the task from scratch, while the others wait for the lock to be released and
then load the compiled task from disk (see the sketch after the test script).

Test script:
```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

# Global constants
VECTOR_SIZE = 8
MATRIX_SIZE = (8, 16)

# Set environment variables for distributed setting
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12355"
os.environ["WORLD_SIZE"] = "2"

# Vector-to-matrix multiplication class
class VectorMatrixMultiplication(torch.nn.Module):
    def __init__(self, rank):
        super(VectorMatrixMultiplication, self).__init__()
        # Determine device for this rank
        self.device = torch.device(f"cuda:{rank}")
        
        # Initialize vector on this device
        self.vector = torch.rand(1, VECTOR_SIZE // dist.get_world_size(), device=self.device)
        
        # Split matrix so each rank only holds half on its device
        part_size = MATRIX_SIZE[0] // dist.get_world_size()
        self.part_matrix = torch.rand((part_size, MATRIX_SIZE[1]), device=self.device)
        
        print(f"Rank {rank} vector size: {self.vector.shape}, part_matrix size: {self.part_matrix.shape}")

    def forward(self, rank):
        # Perform multiplication for the assigned part
        result_part = torch.matmul(self.vector, self.part_matrix)
        
        # Aggregate results across GPUs
        handle = dist.all_reduce(result_part, op=dist.ReduceOp.SUM, async_op=True)
        handle.wait()
        result = result_part
        return result

# Process function for parallel computation
def run(rank, world_size):
    # Set the device for this rank
    torch.cuda.set_device(rank)
    
    # Initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    
    # Instantiate the model and move it to the specified GPU
    model = VectorMatrixMultiplication(rank).to(rank)
    model_opt = torch.compile(model, backend='hidet', mode='default')
    result = model_opt(rank)
    # Print the result for each rank
    if rank == 0:
        print("Result from GPU 0:", result)
    elif rank == 1:
        print("Result from GPU 1:", result)
    
    # Cleanup
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
    
```
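
A hedged sketch of the file-lock synchronization described above (the directory layout, `build_fn`, and `load_fn` are placeholders, not hidet's real API):

```python
import fcntl
import os

def build_with_lock(task_dir: str, build_fn, load_fn):
    # Only one process builds the task; the others block on the lock and then load.
    os.makedirs(task_dir, exist_ok=True)
    lock_path = os.path.join(task_dir, '.lock')
    done_path = os.path.join(task_dir, 'lib.so')   # marker for a finished build
    with open(lock_path, 'w') as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)      # blocks until the lock is free
        try:
            if not os.path.exists(done_path):      # we won the race: compile from scratch
                build_fn(task_dir)
            return load_fn(task_dir)               # everyone loads the compiled task
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```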
Turn the `parallel_k` option default to `'disabled'`.

We have at least the following problems:
1. We have issue CentML/hidet#446 with
`parallel_k`
2. `parallel_k` is disabled for dynamic shapes (always set to 1 if dynamic
dimensions are present)
3. On the following layer
``` python
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(4096, 6144)

    def forward(self, x):
        y = self.linear1(x)
        return y
```
I found that the dynamic shape works faster than the static shape: latency
105us vs 122us. The reason is `parallel_k`: when `parallel_k` is disabled,
static shapes run at the same 105us.

This is a real layer from llama3-8B. It slowed down vllm+hidet when I tried
to enable shape specialization in vllm.
…. ) (#294)

[Ir][Primitives] add vectorized conversion instructions
[Ir][CuTe] add reduce primitives in cute (#295)
[Ir][CuTe] add mma primitives (#296)
[Ir][CuTe] add other primitives in cute (#297)
[Transforms][CuTe] add instruction selection pass (#298)
[Transforms][CuTe] add resolve bank conflict pass (#299)
[Transforms][CuTe] add resolve auto keywords pass (#300)
[Transforms][CuTe] add shared memory allocation pass (#301)
[Transforms][CuTe] add vectorize elementwise operation pass (#302)
[Transforms][CuTe] add analysis pass (#303)
[Transforms][CuTe] add canonicalization pass (#304)
[Transforms][CuTe] add deadcode elimination pass (#305)
[Transforms][CuTe] refactor cute lowering pass (#306)
[Graph][Ops] matmul cute (#307)
[Ir] cute miscs (#308)
[Tests] cute tests (#309)
[Chore] fix ci (#313)
---------

Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
This pass extracts all the kernels from the IR and adds additional
information to the kernel names, so we must update every place where the
function variable is used. Previously, the pass only updated
`LaunchKernelStmt`, which is just one place that may use function variables.
We should replace function variables in call expressions as well, since they
may also reference the modified function variables.

---------

Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
…sor map creation (#643)

On Hopper architectures, copying multi-dimensional arrays using TMA
instructions requires a tensor map created using the
`cuTensorMapEncodeTiled` [driver
API](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html#group__CUDA__TENSOR__MEMORY_1ga7c7d2aaac9e49294304e755e6f341d7).
Previously, the `swizzle`, `interleave` and `l2Promotion` arguments
passed to the API call were hard-coded as the value `NONE`.

This PR adds support for other values of those arguments, as
specified in the documentation.
vLLM (and hence DMWL) passes fake tensors instead of real weight tensors
during the compilation process. Such fake weight tensors are treated as
traceable inputs by hidet.

---------

Co-authored-by: Zhumakhan <nazirzhumakhan@gmail,.com>
Currently supports shapes that are multiples of 16.
Currently uses unswizzled shared memory.
**vllm+hidet+TP works now!** (with a workaround for #650)

Fixes:
1. Added support for `getattr` ops
2. Added support for `vllm.all_reduce`
3. In `'interpreter'` mode, `torch.full` under vllm returns a `FakeTensor`
and `'interpreter'` mode fails. Added a workaround for that; further
investigation is tracked in #645
4. For `bool` `not` we did `bitwise_inverse`, but `bitwise_inverse !=
logical_not`. Fixed.
5. For `42 * tensor(dtype=bool)` we returned a `bool` tensor but it should be
an `int64` tensor. Fixed. (Items 4 and 5 are illustrated in the sketch below.)
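
A plain-PyTorch illustration of the two dtype rules fixed in items 4 and 5 (the reference behaviour the backend should match):

```python
import torch

b = torch.tensor([True, False])

# Logical negation of a bool tensor must be logical_not, not a bitwise inverse
# of the underlying byte.
print(torch.logical_not(b))      # tensor([False,  True])

# Multiplying a Python int by a bool tensor promotes to int64, not bool.
print((42 * b).dtype)            # torch.int64
print(42 * b)                    # tensor([42,  0])
```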
Add `scipy` as a requirement for the comp server. It was added to the regular
`requirements.txt` but not to the comp server `requirements.txt`.
Promote the version 0.4.0.dev -> 0.4.0 (#456)
…ers (#475)

The old version of `diffusers` used a deprecated (now completely
removed) API in `huggingface_hub`. Thus, it is better to keep the
versions up to date to prevent such problems in the future.

The error log in our ci:
```text
ImportError while importing test module '/__w/hidet/hidet/tests/apps/test_pretrained.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/__w/_tool/Python/3.8.18/x64/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/apps/test_pretrained.py:14: in <module>
    from hidet.apps import PretrainedModel, hf
/__w/_tool/Python/3.8.18/x64/lib/python3.8/site-packages/hidet/apps/hf.py:16: in <module>
    from diffusers import StableDiffusionPipeline
/__w/_tool/Python/3.8.18/x64/lib/python3.8/site-packages/diffusers/__init__.py:5: in <module>
    from .utils import (
/__w/_tool/Python/3.8.18/x64/lib/python3.8/site-packages/diffusers/utils/__init__.py:38: in <module>
    from .dynamic_modules_utils import get_class_from_dynamic_module
/__w/_tool/Python/3.8.18/x64/lib/python3.8/site-packages/diffusers/utils/dynamic_modules_utils.py:28: in <module>
    from huggingface_hub import cached_download, hf_hub_download, model_info
E   ImportError: cannot import name 'cached_download' from 'huggingface_hub' (/__w/_tool/Python/3.8.18/x64/lib/python3.8/site-packages/huggingface_hub/__init__.py)
```

Also increase the tolerance in one flaky test.
@yaoyaoding
Member

@vadiklyutiy I see you are in the author list of the 131 commits. This might be because of the first two commits ("Sync" and "lint") that remove and add the EOF in the version file. Could you remove those two commits and try again?

@vadiklyutiy
Collaborator Author

Could you clarify where you looked?
In the commits section I see the following:

[image]

So, some people are the authors, and I am the committer.

@yaoyaoding
Member

Interesting, here is what it looks like from my side:

[image]

@vadiklyutiy
Collaborator Author

@yaoyaoding please switch to the Commits tab and take a look there.

@vadiklyutiy
Collaborator Author

You can also look at the previous release, #455.

@vadiklyutiy
Collaborator Author

The conversation screen shows the union of authors and committers. Because I am the one committing, I am listed everywhere. But in the history it is mainly the authors that matter.

@vadiklyutiy
Collaborator Author

In the previous release, #455, there were 98 commits, all of which I committed.
But, for example, in the contributors graph https://github.com/hidet-org/hidet/graphs/contributors I have 45 commits: only the commits where I am the author are counted.

@yaoyaoding
Member

> The conversation screen shows the union of authors and committers. Because I am the one committing, I am listed everywhere. But in the history it is mainly the authors that matter.

Yes, it seems both the authors and committers are merged in my view. Then it looks good to me. Previously I thought it might be because of the first two commits, but it turned out to be unrelated.

Everything looks good to me then; just a kind reminder to bump the version after merging this PR and check the wheel submitted to PyPI.

@vadiklyutiy
Collaborator Author

> Everything looks good to me then; just a kind reminder to bump the version after merging this PR and check the wheel submitted to PyPI.

Please formally approve this PR.
I will change the version in a separate PR; I don't want to intermix the two.

@vadiklyutiy merged commit 4508cbd into main on Dec 20, 2024
22 checks passed
@vadiklyutiy deleted the vadim/sync branch on December 26, 2024 19:16