
Mthreads/master #298

Open · wants to merge 75 commits into master
Conversation

machuanjiang (Collaborator)

PR Category

Operator | OP Test | Benchmark

Type of Change

Bug Fix | Performance Optimization | Refactor

Description

Compatibility modifications for the Mthreads MUSA backend.

Issue

N/A

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change responds to an issue.
  • Change is fully covered by a UT.

Performance

N/A

yuzhe-wu and others added 30 commits October 17, 2024 17:30
Signed-off-by: Jian Li <jian.li@mthreads.com>
config: {BLOCK_M: 8, num_warps: 8} causes the number of registers used by a single thread
to be exceeded when the tensor shape is 4096 * 2304,
so reduce BLOCK_M to 4 to support cumsum.
Signed-off-by: Jian Li <jian.li@mthreads.com>
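For illustration only, a minimal Triton autotune-config sketch of the change described in the commit above; the BLOCK_N value and the config list name are assumptions, not the PR's actual code:

import triton

# Shrink BLOCK_M so per-thread register usage stays within the hardware limit
# on large shapes such as 4096 x 2304 (BLOCK_M: 8 exceeded it).
CUMSUM_CONFIGS = [
    triton.Config({"BLOCK_M": 4, "BLOCK_N": 1024}, num_warps=8),
]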
Signed-off-by: Jian Li <jian.li@mthreads.com>
- Torch_musa does not support fp64 input type, so CPU is used as a reference
- Does not support test_accuracy_groupnorm

- Some test cases in test_embedding have accuracy issues
Signed-off-by: Jian Li <jian.li@mthreads.com>
Signed-off-by: Jian Li <jian.li@mthreads.com>
Signed-off-by: Jian Li <jian.li@mthreads.com>
ZaccurLi and others added 23 commits October 30, 2024 19:38
Modify the function parameter type declarations so that the code runs on Python 3.8.

---------

Co-authored-by: zhengyang <zhengyang@baai.ac.cn>
Signed-off-by: jiaqi.wang <jiaqi.wang@mthreads.com>
Add a _weight_norm op; the original _weight_norm op is renamed to _weight_norm_interface.
* add Ops & UT & Bench

* add full zero ones Ops & UT & Bench

* split normal op

* [Operator] init slice&select scatter

* code format

* PR comment

* split test_special_ops

* add K-S test

* split special perf

* Exponential added. (#138)

* exponential added.
* Added K-S tests to exponential_, fp64 corrected.
* aligned with aten prototype
* Exponential_ uses uint64 offsets in Triton kernel.
* Update pyproject config for new test dependencies.

* resolve conflict

* Use int64 indexing when needed & fix argmax (#146)

 1. fix amax, argmax and triu: use int64 indexing when the largest tensor's size_in_bytes exceeds int32's max;
2. change the tiling scheme for argmax to loop over the reduction dimension instead of using a data-size-dependent tile size
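As a rough sketch of the indexing rule in item 1 above (the helper name is hypothetical, not the PR's code):

import torch

def needs_int64_indexing(t: torch.Tensor) -> bool:
    # Switch to 64-bit offsets once the tensor's total byte size no longer
    # fits in a signed 32-bit integer.
    return t.numel() * t.element_size() > torch.iinfo(torch.int32).max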

* test for op

* test for op

* Making libentry thread safe (#136)

* libentry now is lock protected.

* Add multithreading tests for libentry.

* polish code.
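A minimal sketch of what a lock-protected kernel cache can look like; the class and method names are placeholders, not libentry's real interface:

import threading

class CachedEntry:
    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def get_or_build(self, key, build_fn):
        # Serialize cache insertion so concurrent callers never race to build
        # the same entry or observe a half-initialized one.
        with self._lock:
            if key not in self._cache:
                self._cache[key] = build_fn()
            return self._cache[key]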

* add argparse

* fix desc

* fix num

* Update test_specific_ops.py

* split UT files

* fix

* fix

* [Operator] Optimize CrossEntropyLoss (#131)

reimplement cross_entropy_loss forward and backward;
support indices/probabilities/weight/reduction/ignore_index/label_smoothing; performs better than torch eager on large-scale tensors
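For reference, the torch-eager call whose semantics the reimplementation targets; shapes are illustrative only:

import torch
import torch.nn.functional as F

logits = torch.randn(8, 10)          # the probabilities path would pass soft targets instead
target = torch.randint(0, 10, (8,))  # the indices path
loss = F.cross_entropy(
    logits, target,
    weight=None, reduction="mean", ignore_index=-100, label_smoothing=0.1,
)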


* [Test] Test for op (#151)

* [chore] solve slice&select scatter's test cases

* [fix] fix slice&select scatter's test cases

* [chore] remove out-of-range indices in select_scatter's test cases

* [chore] simplify slice_scatter's test cases

* [fix] Added range that is deleted by mistake

* Merge branch 'master' into slice&select_scatter

* [chore] reformat

* [fix] typo

* [chore] Considering perf, pause the replacement of some ATen operators
* slice_scatter
* select_scatter
* index_select

* [fix] Add libentry in op.cumsum

* [fix] Del slice&select scatter's perf tests

* [Chore] Add pytest mark for slice&select scatter's test

* [Fix] Correct slice_scatter test

* [Fix] Replace CPU Tensor

---------

Co-authored-by: Bowen12992 <zhangbluestars@gmail.com>
Co-authored-by: Tongxin Bai <waffle.bai@gmail.com>
Co-authored-by: Clement Chan <iclementine@outlook.com>
Co-authored-by: Bowen <81504862+Bowen12992@users.noreply.github.com>
Co-authored-by: StrongSpoon <35829812+StrongSpoon@users.noreply.github.com>
* benchmark fix

*  add seven new testing parameters

* move shapes info to yaml file

* Added the BenchmarkMetrics & BenchmarkResult  abstraction
* [Bugfix] Handle negative input dimensions in 'cat' operator

Co-authored-by: 2niuhe<tang.kang1@zte.com.cn>
* Add Script to Calculate Summary Information for Benchmark Results
* specializing slice_scatter. WIP.

* polish and refine 2d_inner cases.

* fix slice_scatter error on 1d inputs.

* test slice_scatter fallback
* Relocate select and slice benchmarks to test_select_and_slice_perf.py

* sort keys for summary result

* clean cuda cache after benchmark

* fix repeat_interleave

* modify format for summary info
Signed-off-by: jiaqi.wang <jiaqi.wang@mthreads.com>
Signed-off-by: jiaqi.wang <jiaqi.wang@mthreads.com>
Signed-off-by: chuanjiang.ma <chuanjiang.ma@mthreads.com>
Signed-off-by: jiaqi.wang <jiaqi.wang@mthreads.com>
Signed-off-by: chuanjiang.ma <chuanjiang.ma@mthreads.com>
Signed-off-by: chuanjiang.ma <chuanjiang.ma@mthreads.com>
@StrongSpoon (Collaborator) left a comment:

review done.

@@ -99,7 +99,7 @@ def weight_norm_input_fn(shape, dtype, device):
weight_norm_interface_input_fn,
),
("weight_norm", torch._weight_norm, weight_norm_input_fn),
("vector_norm", torch.linalg.vector_norm, unary_input_fn),
# ("vector_norm", torch.linalg.vector_norm, unary_input_fn),
Collaborator:

why not support vector_norm and var_mean?

Collaborator (Author):

Performance is currently poor and probably needs more optimization work in the compiler, so we prefer to regard these ops as not supported yet.

# Complex Operations
("resolve_neg", torch.resolve_neg, [torch.cfloat], resolve_neg_input_fn),
("resolve_conj", torch.resolve_conj, [torch.cfloat], resolve_conj_input_fn),
# ("resolve_neg", torch.resolve_neg, [torch.cfloat], resolve_neg_input_fn),
Collaborator:

ditto

Collaborator (Author):

same reason as vector_norm

@@ -236,7 +236,7 @@ def norm_kernel(
v_shape0,
v_shape1,
v_shape2,
eps,
eps: tl.constexpr,
Collaborator:

recommend not to hint eps as tl.constexpr

Collaborator (Author):

This is not our change but a code-sync issue: a previous version in FlagGems marked "eps" as tl.constexpr. We will sync the change, thanks.

Collaborator (Author):

fixed
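For context on the eps discussion above, a minimal hypothetical kernel (not the PR's norm kernel) with eps as an ordinary runtime scalar; marking it tl.constexpr would instead specialize and recompile the kernel for every distinct eps value:

import triton
import triton.language as tl

@triton.jit
def scale_kernel(X, Y, N, eps, BLOCK: tl.constexpr):
    # eps is a plain runtime argument, so one compiled kernel serves all eps values.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(X + offs, mask=mask, other=0.0)
    tl.store(Y + offs, x / tl.sqrt(x * x + eps), mask=mask)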

@@ -74,6 +74,7 @@ def test_accuracy_groupnorm(N, C, H, W, num_groups, dtype):
gems_assert_close(res_bias_grad, ref_bias_grad, dtype, reduce_dim=N * HW)


@pytest.mark.skip("triton_musa unsupport")
Collaborator:

Please figure out why LayerNorm is not supported, since group norm, which uses a similar algorithm, is supported.

Collaborator (Author):

It may be a compiler bug that we are still locating; we will support this op after the bug is fixed.

@@ -137,6 +138,7 @@ def test_accuracy_cross_entropy_loss_indices(
gems_assert_close(res_in_grad, ref_in_grad, dtype, reduce_dim=shape[dim])


@pytest.mark.skip("random error")
Collaborator:

what's the absolute difference between result and reference?

Collaborator (Author):

This is already fixed in our latest Triton, but for this round of testing we still ship an older Triton version, so we skip it for now. Is that okay with you?

value_tensor = torch.tensor(value, device="cuda", dtype=dtype)
ref_out_tensor = torch.fill(ref_x, value_tensor)
value_tensor = torch.tensor(value, device="musa", dtype=dtype)
ref_value_tensor = to_reference(value_tensor, False)
Collaborator:

this could be merged to master as a bug fix.

@@ -127,6 +127,15 @@ def to_reference(inp, upcast=False):
return ref_inp


def to_reference_gpu(inp, upcast=False, device='musa'):
Collaborator:

is it used in test code?

Collaborator (Author):

confirmed, should be removed, we will fix it.

@@ -1075,7 +1075,8 @@ def __init__(self, op_desc: FunctionSchema, scalar_fn: JITFunction, config=None)

assert isinstance(scalar_fn, JITFunction)
self._scalar_fn = scalar_fn
self._scalar_fn_cache_key = scalar_fn.cache_key
# FIXME: cache_key is too long and make open file failed.
self._scalar_fn_cache_key = scalar_fn.cache_key[:33]
Collaborator:

I'm not sure whether slicing brings risk. Theoretically, there is a small probability that two keys share the same prefix.

Collaborator (Author):

There is a risk, but the impact is tiny: the collision probability is roughly $P ≈ 2^{-128}$. If you insist it should be modified, we will fix it. In any case, we will dig in again and find the root cause of the open-file failure, sorry about that.
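One hedged alternative to slicing a raw prefix, assuming the only goal is a shorter file name (the helper name is hypothetical): derive a fixed-length digest from the full key so every character of the key still contributes to the result.

import hashlib

def short_cache_key(cache_key: str, length: int = 32) -> str:
    # Hash the full key and keep a fixed-length hex prefix instead of slicing
    # the raw key, so two keys that merely share their first characters
    # do not collide.
    return hashlib.sha256(cache_key.encode("utf-8")).hexdigest()[:length]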

@@ -9,6 +9,11 @@
torch.bfloat16: 0.016,
}

RESOLUTION_DROPOUT = {
Collaborator:

is it used in the test?

Collaborator (Author):

We will remove it along with "to_reference_gpu"

return y - x * alpha


def rsub(A, B, *, alpha=1):
Collaborator:

Actually, aten::rsub calls the sub kernel. Now that you've reimplemented it, just register it into the library ;)

Collaborator (Author):

"rsub.py" will be removed.

StrongSpoon and others added 3 commits November 21, 2024 12:11
Signed-off-by: machuanjiang <chuanjiang.ma@mthreads.com>
Signed-off-by: chuanjiang.ma <chuanjiang.ma@mthreads.com>
@StrongSpoon (Collaborator) left a comment:

Done. Just fix the eps hint and we can start testing.

@@ -274,7 +274,7 @@ def norm_bwd_kernel(
v_shape0,
v_shape1,
v_shape2,
eps,
eps: tl.constexpr,
Collaborator:

eps hint

Signed-off-by: chuanjiang.ma <chuanjiang.ma@mthreads.com>
1. One test in the special_op tests changes the device type from cuda to musa

Signed-off-by: chuanjiang.ma <chuanjiang.ma@mthreads.com>