CPU AVX implementation for Softmax, Norm #357
Conversation
Commits:
- works on 8x8 at least but bad exp
- save for omp changes
- working and faster than pytorch
- works and is fast but exp is WIP
- remove useless files
- minor changes for rebase
- delete trash
- fix trash
- fix trash
- initial commit
- change imports
- fix for diff size, compiledmodule error fix
Thanks @fishingguy456 for your first PR to hidet!
I left some comments. In general,
- do not forget to run tests, lint and format before submitting the PR.
- for the new operators, we should add some tests for them. See the examples in tests/operators/... .
- our current design allows one task to have both a CPU and a CUDA implementation, and they share the same properties for whether prologue and epilogue are allowed. When we want to change these allow properties, it is better to create a new task that overrides the original one, so that it does not interfere with the original operator. In the future, we might want to add a device parameter to these functions (like allow_prologue(self, device) -> bool) so that we do not need to create a new class. But for now, let's create a new class and add a resolve rule.
def run_batch_matmul(self, a: Tensor, b: Tensor, is_cpu: bool) -> Tensor:
    if is_cpu:
We can directly check the device of the tensor, without needing to pass is_cpu as a parameter:
is_cpu = a.device.is_cpu()
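As a sketch, the quoted helper could then drop the flag entirely (the branch bodies below are placeholders, not the PR's code):

def run_batch_matmul(self, a: Tensor, b: Tensor) -> Tensor:
    # The tensor already carries its device, so the flag can be derived here
    # instead of being threaded through every call site.
    is_cpu = a.device.is_cpu()
    if is_cpu:
        ...  # CPU path (unchanged from the PR)
    else:
        ...  # CUDA path (unchanged from the PR)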
def allow_epilogue(self) -> bool:
    return True
If the CPU and CUDA tasks have different behavior, it is better to create a subclass of the task and override the relevant methods in the subclass:
class CPUNormalizeTask(NormalizeTask):
...
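Spelled out a bit further, the subclass could simply pin the allow_* properties for the CPU backend (a sketch reusing the overrides quoted elsewhere in this PR, not code taken from it):

class CPUNormalizeTask(NormalizeTask):
    # CPU-specific task: disable prologue/epilogue fusion without changing
    # the behaviour of the original (CUDA-oriented) NormalizeTask.
    def allow_prologue(self) -> bool:
        return False

    def allow_epilogue(self) -> bool:
        return False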
And implement the resolve rule to convert the normalize operator to the corresponding cpu_normalize operator.
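A rough sketch of such a rule, assuming hidet's ResolveRule / register_resolve_rule mechanism; NormalizeOp and cpu_normalize are illustrative names, not definitions from this PR:

from typing import List, Optional

from hidet.graph import Operator, Tensor
from hidet.graph.transforms import ResolveRule, register_resolve_rule


@register_resolve_rule(NormalizeOp)  # NormalizeOp: the existing normalize operator class (name assumed)
class CPUNormalizeResolveRule(ResolveRule):
    def resolve(self, op: Operator) -> Optional[List[Tensor]]:
        x = op.inputs[0]
        if x.device.is_cpu():
            # cpu_normalize is a hypothetical factory that builds an operator
            # backed by CPUNormalizeTask instead of the default task.
            return [cpu_normalize(x, **op.attrs)]
        return None  # keep the original operator on other devices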
norm_cpu_kernel.kind = "cpu_kernel"
avx_f32x8_find_sum.kind = "cpu_internal"
It is better to avoid setting the function attributes outside of the function's definition. Instead, use:
from hidet.lang import attrs

@hidet.script
def norm_cpu_kernel(...):
    attrs.func_kind = "cpu_kernel"
    ...
python/hidet/graph/ops/softmax.py (outdated)
@@ -16,6 +16,9 @@
 from hidet.ir.builders import StmtBuilder
 from hidet.ir.primitives import active_mask, shfl_down_sync, shfl_sync
 from .utils import Task, TensorNode, compute, reduce
+from typing import List, Union
Move to the top.
Remember to run format & lint, see https://docs.hidet.org/stable/developer-guides/contributing.html#contributing
python/hidet/graph/ops/softmax.py (outdated)
@@ -153,3 +156,143 @@ def softmax_kernel(xs: xdtype[shape], ys: xdtype[shape]):
        ir_module = module.ir_module()
        return ir_module

    def implement_cpu(self, working_dir: str) -> Union[IRModule, List[IRModule]]:
        # if not all(is_constant(dim) for dim in self.inputs[0].shape)\
Suggested change: remove the commented-out line "# if not all(is_constant(dim) for dim in self.inputs[0].shape)\".
def allow_epilogue(self) -> bool:
    return False

def allow_prologue(self) -> bool:
    return False
Create a CPU version of the operator, because the CUDA version allows prologue & epilogue.
python/hidet/graph/ops/softmax.py (outdated)
softmax_cpu_kernel.kind = "cpu_kernel"
apply_exponent.kind = "cpu_internal"
ditto
python/hidet/ir/expr.py (outdated)
if not (isinstance(func_var, Var) and isinstance(args, tuple)):
    print(func_var, args)
    print(type(args[0]))
    print(type(func_var), type(args))
Suggested change: remove these debug print statements.
from hidet.ir.func import Function


@script
def avx_x86_f32x8_find_sum(x: f32x8) -> f32:
Is there any convention for using "find" in the function name?
If not, I would prefer to name them directly as "avx_x86_f32x8_sum" and "avx_x86_f32x8_max".
Thanks @fishingguy456 ! Could you also add a test for softmax? Hi @BolinSNLHM, could you have a look at this PR? I did not check the kernel implementation details.
def avx_x86_f32x8_sum(x: f32x8) -> f32:
    attrs.func_kind = "cpu_internal"
    attrs.func_name = "avx_x86_float32x8_sum"
    sum_vec = call_primitive_func(
        'avx_x86_float32x4_add',
        [
            call_primitive_func('avx_x86_float32x8_extract_half', [x, 0b0]),
            call_primitive_func('avx_x86_float32x8_extract_half', [x, 0b1]),
        ],
    )
    sum_vec = call_primitive_func('avx_x86_float32x4_hadd', [sum_vec, sum_vec])
    sum_vec = call_primitive_func('avx_x86_float32x4_hadd', [sum_vec, sum_vec])
    return call_primitive_func('avx_x86_float32x4_extract_last', [sum_vec])


assert isinstance(avx_x86_f32x8_sum, Function)
register_primitive_function(avx_x86_f32x8_sum.name, avx_x86_f32x8_sum)


@script
def avx_x86_f32x8_scalar_max(x: f32x8) -> f32:
    attrs.func_kind = "cpu_internal"
    attrs.func_name = "avx_x86_float32x8_scalar_max"
    y = call_primitive_func('avx_x86_float32x8_permute_2f128', [x, x, 1])
    m1 = call_primitive_func('avx_x86_float32x8_max', [x, y])
    m2 = call_primitive_func('avx_x86_float32x8_permute', [m1, 0b01001110])
    m3 = call_primitive_func('avx_x86_float32x8_max', [m1, m2])
    m4 = call_primitive_func('avx_x86_float32x8_permute', [m3, 0b10110001])
    m = call_primitive_func('avx_x86_float32x8_max', [m3, m4])
    return call_primitive_func('avx_x86_float32x8_extract_last', [m])
Would it be possible to only declare the primitives (e.g., avx_x86_f32x8_extract_half) in this file, and then define functions like avx_x86_f32x8_sum as helper functions in Hidet Script in a separate file where they are needed? The code should work as it is, but it looks a bit odd to have the hidet.script decorator and multiple calls to call_primitive_func here...
@@ -73,7 +73,7 @@ def __init__(self, a: TensorNode, b: TensorNode):
         )

     def allow_epilogue(self) -> bool:
-        return True
+        return False
Why should we change this to False? 🤔
from hidet.lang import script, attrs
from hidet.ir.dtypes import f32x8, f32
from hidet.ir.func import Function


@script
def avx_x86_f32x8_sum(x: f32x8) -> f32:
    attrs.func_kind = "cpu_internal"
    attrs.func_name = "avx_x86_float32x8_sum"
    sum_vec = call_primitive_func(
        'avx_x86_float32x4_add',
        [
            call_primitive_func('avx_x86_float32x8_extract_half', [x, 0b0]),
            call_primitive_func('avx_x86_float32x8_extract_half', [x, 0b1]),
        ],
    )
    sum_vec = call_primitive_func('avx_x86_float32x4_hadd', [sum_vec, sum_vec])
    sum_vec = call_primitive_func('avx_x86_float32x4_hadd', [sum_vec, sum_vec])
    return call_primitive_func('avx_x86_float32x4_extract_last', [sum_vec])


assert isinstance(avx_x86_f32x8_sum, Function)
register_primitive_function(avx_x86_f32x8_sum.name, avx_x86_f32x8_sum)


@script
def avx_x86_f32x8_scalar_max(x: f32x8) -> f32:
    attrs.func_kind = "cpu_internal"
    attrs.func_name = "avx_x86_float32x8_scalar_max"
    y = call_primitive_func('avx_x86_float32x8_permute_2f128', [x, x, 1])
    m1 = call_primitive_func('avx_x86_float32x8_max', [x, y])
    m2 = call_primitive_func('avx_x86_float32x8_permute', [m1, 0b01001110])
    m3 = call_primitive_func('avx_x86_float32x8_max', [m1, m2])
    m4 = call_primitive_func('avx_x86_float32x8_permute', [m3, 0b10110001])
    m = call_primitive_func('avx_x86_float32x8_max', [m3, m4])
    return call_primitive_func('avx_x86_float32x8_extract_last', [m])
I recommend moving these user-defined functions built on top of AVX (not the ones provided by the underlying vector library), like avx_x86_f32x8_sum, to another file called avx_helpers.py.
For functions like avx_x86_float32x4_extract_last, we also need to define a wrapper function like:
def avx_x86_float32x4_extract_last(x: Expr) -> Call:
    return call_primitive_func('avx_x86_float32x4_extract_last', [x])
In the new file, we directly use avx_x86_float32x4_extract_last(...) in the Hidet script, instead of calling call_primitive_func.
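For illustration, a hedged sketch of how avx_x86_f32x8_sum could read in such an avx_helpers.py once per-primitive wrappers exist; the wrapper import path is an assumption, and the reduction steps simply mirror the code quoted above:

# avx_helpers.py (proposed; not part of this PR)
from hidet.lang import script, attrs
from hidet.ir.dtypes import f32, f32x8
# Assumed location of the thin wrappers suggested above, each of which just
# forwards to call_primitive_func for the primitive of the same name.
from hidet.ir.primitives.cpu.avx import (
    avx_x86_float32x8_extract_half,
    avx_x86_float32x4_add,
    avx_x86_float32x4_hadd,
    avx_x86_float32x4_extract_last,
)


@script
def avx_x86_f32x8_sum(x: f32x8) -> f32:
    attrs.func_kind = 'cpu_internal'
    attrs.func_name = 'avx_x86_float32x8_sum'
    # Same horizontal-sum reduction as in the PR, written against the wrappers
    # instead of repeated call_primitive_func calls.
    sum_vec = avx_x86_float32x4_add(
        avx_x86_float32x8_extract_half(x, 0b0), avx_x86_float32x8_extract_half(x, 0b1)
    )
    sum_vec = avx_x86_float32x4_hadd(sum_vec, sum_vec)
    sum_vec = avx_x86_float32x4_hadd(sum_vec, sum_vec)
    return avx_x86_float32x4_extract_last(sum_vec)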
Thanks @fishingguy456 !
Working but inefficient batch matmul. It takes the path of matmul_f32_x86 instead of the CPU auto-scheduler.