Dequant sycl Kernel #1300

Open

sunjiweiswift wants to merge 15 commits into main from jiwei/dequant_op

Contributor

sunjiweiswift commented Jan 20, 2025

Use a standalone dequant kernel together with the oneDNN matmul to compute the first token.
The dequant kernel can be modified to support more weight-only quantization (WOQ) schemes.

sunjiweiswift added 11 commits

January 15, 2025 06:25


          save

2ffcfde


          save

68d1ea6


          add M > 1

d8d1b03


          add M > 1

f74acae


          add M > 1

1fbb638


          save

c927650


          save

ca010ac


          pass UT

4fc4a14


          add M=1024

a5ae28a


          modify M> 1for dequant+matmul


          add bfloat16

f2168eb

sunjiweiswift force-pushed the jiwei/dequant_op branch from 82c2d23 to f2168eb

January 20, 2025 02:59

sunjiweiswift requested review from mingfeima, airMeng and EikanWang

January 20, 2025 03:00

airMeng requested a review from xytintel

January 20, 2025 13:28

airMeng reviewed

Contributor

airMeng left a comment

I suggest reusing as much code as possible between the dequantization kernel and the dequantized GEMM.

test/xpu/test_linalg_xpu.py

		@@ -232,7 +232,7 @@ def _test(m, k, n, transpose_a, transpose_b, test_equal=True):


		@unittest.skipIf(IS_WINDOWS, "Skipped on Windows!")

Contributor

airMeng Jan 20, 2025

It should work on Windows too, right? Can you validate?

src/ATen/native/xpu/sycl/Dequant_int4.cpp (resolved)

sunjiweiswift enabled auto-merge

January 21, 2025 02:43


          Merge branch 'main' into jiwei/dequant_op

c31d2d6

sunjiweiswift changed the title ~~Jiwei/dequant op~~ Dequant sycl Kernel of Int4

sunjiweiswift changed the title ~~Dequant sycl Kernel of Int4~~ Dequant sycl Kernel

sunjiweiswift added 2 commits

January 22, 2025 18:31


          Merge branch 'main' into jiwei/dequant_op

e01518b


          use select_from_group

5133ef3

EikanWang reviewed

src/ATen/native/xpu/LinearInt4.cpp (resolved)

src/ATen/native/xpu/sycl/Dequant_int4.cpp (resolved)

src/ATen/native/xpu/sycl/Dequant_int4.cpp

+                    int n,
+                    int k,
+                    const uint8_t* weight_int4,
+                    const scalar_t* ScaleAndZeros,

Contributor

EikanWang Jan 22, 2025

It would be better to unify the coding style. The coding style of torch-xpu-ops is admittedly a mess, but we are working on it and will enable the linter ASAP. It would therefore be nice if you could keep the coding style consistent. @xytintel, @fengyuan14 FYI.

Contributor Author

sunjiweiswift Jan 23, 2025

Can you provide a reference cpp?

mingfeima reviewed

src/ATen/native/xpu/LinearInt4.cpp (resolved)

src/ATen/native/xpu/sycl/Dequant_int4.cpp

+                  float tmp[TileN];
+                  bool high4 = sg_id % 2 != 0;
+                  for (int in = 0; in < TileN; in++) {

mingfeima Jan 23, 2025

Can we increment by 2 instead of 1 to remove this high4 check?

for (int in = 0; in < TileN; in += 2) {
  low4 = tmp[in + 0];
  high4 = tmp[in + 1];
}

Also, since `TileN` is constexpr, is it possible to `unroll` the loop?

Contributor Author

sunjiweiswift Jan 23, 2025

sure

Contributor Author

sunjiweiswift Jan 24, 2025

Sorry, I looked at it more carefully. Because the loop runs along the N direction, stepping by 2 would make the code more complicated.
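
For readers following this thread, here is a minimal host-side sketch of the "step by 2" idea in plain C++ rather than SYCL. It assumes two consecutive int4 values share one byte with the low nibble first, and it reuses the `(q - 8) * scale + zero_point` formula visible in the diff. The helper name `dequant_row_pairs`, the packing order, and the single per-row scale are assumptions for illustration; the actual kernel iterates along N, which is why the author found this harder to apply there.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: dequantize `count` int4 values packed two per byte.
// Assumption: the low nibble holds the even element, the high nibble the odd
// one, and `count` is even.
std::vector<float> dequant_row_pairs(const std::vector<uint8_t>& packed,
                                     std::size_t count,
                                     float scale,
                                     float zero_point) {
  std::vector<float> out(count);
  // Step by 2 so each iteration consumes both nibbles of one byte, which
  // removes the per-element "is this the high nibble?" branch.
  for (std::size_t i = 0; i + 1 < count; i += 2) {
    const uint8_t byte = packed[i / 2];
    const int low4 = static_cast<int>(byte & 0x0f) - 8;
    const int high4 = static_cast<int>(byte >> 4) - 8;
    out[i + 0] = static_cast<float>(low4) * scale + zero_point;
    out[i + 1] = static_cast<float>(high4) * scale + zero_point;
  }
  return out;
}
```

Because the trip count is known at compile time when the tile size is constexpr, a loop of this shape is also a natural candidate for unrolling.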

src/ATen/native/xpu/sycl/Dequant_int4.cpp

+                static_assert(TileK == 1);
+                int k = weight.size(0);
+                int n = weight.size(1);
+                int nsg_k = k / GroupK;

mingfeima Jan 23, 2025

Shall we check divisibility before doing the integer division here:

TORCH_CHECK(k % GroupK == 0 && n % GroupN == 0);
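
The reason for the check is that integer division truncates, so a shape that is not a multiple of the group size would silently leave a tail of the weight matrix unprocessed. A minimal sketch of the guard in plain C++ follows; the helper `tiles_along` is hypothetical, and a standard exception stands in for `TORCH_CHECK` so the snippet compiles without PyTorch.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the launch-size computation. In the real kernel
// the guard would be a TORCH_CHECK rather than a C++ exception.
int tiles_along(int extent, int group, const char* name) {
  if (group <= 0 || extent % group != 0) {
    throw std::invalid_argument(std::string(name) +
                                " must be a multiple of its group size");
  }
  // Safe: the division is exact, so no elements are dropped.
  return extent / group;
}
```

Without the guard, `100 / 32` would evaluate to 3 and the last 4 rows would never be visited.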

src/ATen/native/xpu/sycl/Dequant_int4.cpp

+                  float tmp[TileN];
+                  bool high4 = sg_id % 2 != 0;
+                  for (int in = 0; in < TileN; in++) {
+                    int scale_offset =

mingfeima Jan 23, 2025

Be aware of the indexing here: integer division might be slow, and I am not sure whether the compiler will optimize it. A more promising approach is to move k / block_size and sg_id & TileK / block_size out of the loop.
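
The hoisting suggested here can be sketched as follows in plain C++. The scale layout (`[k / block_size][n]`, row-major) and the helper `gather_scales` are assumptions made for illustration only and may not match the layout of `ScaleAndZeros` in the PR; the point is just that the quotient does not depend on the loop variable and can be computed once.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical layout: scales stored as [k / block_size][n], row-major.
// Gathers the scales for columns [n0, n0 + tile_n) at row k.
std::vector<float> gather_scales(const std::vector<float>& scales,
                                 int n,
                                 int block_size,
                                 int k,
                                 int n0,
                                 int tile_n) {
  std::vector<float> out(tile_n);
  // Loop-invariant: the integer division is done once, not once per column.
  const std::size_t row_base =
      static_cast<std::size_t>(k / block_size) * static_cast<std::size_t>(n);
  for (int in = 0; in < tile_n; ++in) {
    out[in] = scales[row_base + n0 + in];
  }
  return out;
}
```

A compiler may perform this hoisting on its own, but writing it explicitly removes the dependence on that optimization.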

src/ATen/native/xpu/sycl/Dequant_int4.cpp Outdated

+                        : static_cast<int8_t>((srcu8 & 0x0f) - 8) * scale + zero_point;
+                  }
+                  float tmpT[TileN];

mingfeima Jan 23, 2025

Does SYCL have an equivalent of __shared__?

Contributor

airMeng Jan 23, 2025

Yes, sycl::local_accessor. We shall update these.


          fix some reiview commits

2d93e1e
