transpose 2d v1 #434

Closed
wants to merge 1 commit into from
Conversation

zhiwei-fang
Contributor

This is a specialized version of the current transpose operator. The existing operator handles a general N-dimensional transpose, while this PR implements a dedicated 2D version to speed up 2D transposes.
Thread coarsening and (static) shared memory are used.
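
For reference, below is a minimal CUDA sketch of the technique: a tiled transpose that stages each block through static shared memory, with each thread coarsened to copy several rows of the tile. The kernel name, the tile size of 32, the coarsening factor of 4, and the `__half` element type are illustrative assumptions and need not match the kernel hidet actually generates.

```cuda
#include <cuda_fp16.h>

#define TILE 32     // tile edge processed by one thread block (assumed)
#define COARSEN 4   // rows of the tile handled by each thread (assumed)

// Tiled 2D transpose: `in` is rows x cols, `out` is cols x rows.
__global__ void transpose2d_kernel(const __half* __restrict__ in,
                                   __half* __restrict__ out,
                                   int rows, int cols) {
    // Static shared-memory tile; the +1 padding avoids bank conflicts.
    __shared__ __half tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;   // column index into `in`
    int y = blockIdx.y * TILE + threadIdx.y;   // row index into `in`

    // Coarsened load: the block is TILE x (TILE / COARSEN) threads,
    // so each thread loads COARSEN elements strided by blockDim.y.
    for (int j = 0; j < TILE; j += TILE / COARSEN) {
        if (x < cols && (y + j) < rows)
            tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * cols + x];
    }
    __syncthreads();

    // Coarsened store of the transposed tile: block indices are swapped
    // so that both the load and the store stay coalesced.
    x = blockIdx.y * TILE + threadIdx.x;       // column index into `out`
    y = blockIdx.x * TILE + threadIdx.y;       // row index into `out`
    for (int j = 0; j < TILE; j += TILE / COARSEN) {
        if (x < rows && (y + j) < cols)
            out[(y + j) * rows + x] = tile[threadIdx.x][threadIdx.y + j];
    }
}

// Launch sketch:
//   dim3 block(TILE, TILE / COARSEN);
//   dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
//   transpose2d_kernel<<<grid, block>>>(in, out, rows, cols);
```

The shared-memory padding and the swapped block indices on the store are the standard tricks for a bank-conflict-free, coalesced transpose; coarsening simply amortizes index arithmetic over several elements per thread.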
Benchmark result:

Running command: python /home/zhiwei/hidet/.github/scripts/bench/bench_op.py transpose2d --params 3000x4000 --dtype float16
type        id  name         runfile        param_id  param_name      dtype_id  dtype_name    hardware_config      latency
--------  ----  -----------  -----------  ----------  ------------  ----------  ------------  -----------------  ---------
operator     3  transpose2d  bench_op.py           7  3000x4000              1  float16                           0.181748

zhiwei-fang requested review from xinli-git and hjjq on March 1, 2024 at 18:51
zhiwei-fang closed this on March 1, 2024
zhiwei-fang deleted the transpose branch on March 1, 2024 at 22:54
vadiklyutiy added a commit that referenced this pull request Dec 19, 2024
Right now, `pow` with a constant exponent argument is implemented naively: we convert the constant to a constant tensor and run an elementwise `pow` of two tensors. This is simple but not always efficient.

llama2 (in the RMSNorm part) computes `x*x`, which is currently implemented as `tensor.pow(2)`.

This change converts `pow(x, 2)` to `x*x`.

The improvement on llama2-7B is around **0.237%**.
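
For intuition on why the rewrite helps, here is a minimal CUDA sketch comparing an elementwise square written with the generic `powf` call against a plain multiply. These kernels are illustrative only; the actual optimization is a graph-level rewrite performed before code generation, and the compiler may already specialize small constant exponents in some cases.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Elementwise square via the generic pow routine. Unless the compiler
// specializes the constant exponent, powf expands to a noticeably more
// expensive instruction sequence than a single multiply.
__global__ void square_via_pow(const float* __restrict__ x,
                               float* __restrict__ y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = powf(x[i], 2.0f);
}

// Elementwise square via a plain multiply: this is what the
// pow(x, 2) -> x*x rewrite ultimately produces, and on the graph side it
// also avoids materializing a constant exponent tensor.
__global__ void square_via_mul(const float* __restrict__ x,
                               float* __restrict__ y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] * x[i];
}
```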
vadiklyutiy added a commit that referenced this pull request Dec 20, 2024
vadiklyutiy added a commit that referenced this pull request Dec 26, 2024