Sns enable lp kn blocking #26957

IvanNovoselov · 2024-10-08T14:28:57Z

Details:

Perform weights repacking outside of the blocking cycles
Functionally enable blocking for I8, U8 and BF16
The blocking will be temporarily disabled until the blocking heuristic is updated in 156014

Tickets:

154729

IvanNovoselov · 2024-10-18T09:28:28Z

src/common/snippets/src/lowered/loop_port.cpp

Note: there was a bug in the LoopPort comparison operators: they compared the values of the shared pointers instead of comparing the expression ports themselves.
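To illustrate the kind of bug described above, here is a minimal sketch. `Port` and `LoopPortSketch` are hypothetical stand-ins, not the actual snippets classes; the only assumption carried over from the note is that the port is held through a `std::shared_ptr`. Comparing two `std::shared_ptr` objects with `==` compares the stored addresses, so two distinct allocations that describe the same port compare as different.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for an expression port: an owner id plus a port index.
struct Port {
    int expr_id;
    int index;
    bool operator==(const Port& other) const {
        return expr_id == other.expr_id && index == other.index;
    }
};

// Hypothetical stand-in for LoopPort: it holds the port through a shared pointer.
struct LoopPortSketch {
    std::shared_ptr<Port> port;
};

// Buggy variant: operator== on shared_ptr compares the stored addresses.
inline bool equal_by_pointer(const LoopPortSketch& a, const LoopPortSketch& b) {
    return a.port == b.port;
}

// Fixed variant: compare the pointed-to ports.
inline bool equal_by_value(const LoopPortSketch& a, const LoopPortSketch& b) {
    if (a.port == b.port) return true;     // same object, or both null
    if (!a.port || !b.port) return false;  // exactly one is null
    return *a.port == *b.port;
}
```

With two separately allocated but identical ports, `equal_by_pointer` returns false while `equal_by_value` returns true.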

IvanNovoselov · 2024-10-24T14:58:24Z

src/common/snippets/src/lowered/expressions/buffer_expression.cpp

Note: this improves serialization of Buffer expressions. Previously they didn't print any expression info.

IvanNovoselov · 2024-10-24T14:59:50Z

src/plugins/intel_cpu/src/emitters/snippets/x64/jit_kernel_emitter.cpp

Note: this change fixes a bug where the kernel clobbered the runtime_args register in the static pipeline, so brgemm tried to read the AMX config from an invalid address.

IvanNovoselov · 2024-10-24T15:50:12Z

@v-Golubev, @a-sidorova the PR is ready for review. Please take a look.

a-sidorova · 2024-10-25T04:57:10Z

src/plugins/intel_cpu/src/transformations/snippets/x64/op/brgemm_utils.hpp

+template<typename T, typename = typename std::enable_if<(std::is_same<T, size_t>::value || std::is_same<T, int64_t>::value), bool>::type>
+T compute_out_leading_dim(T n_block, const ov::element::Type& precision) {
+    return snippets::utils::is_dynamic_value<T>(n_block) ?
+           n_block :
+           std::max(n_block, static_cast<T>(compute_inner_n_block(precision)));
+}


Comment for discussion:
I believe LDB is the stride for the second input of the Brgemm block (not of MatMul).

If there is no CopyB, LDB = snippets::utils::get_dim_stride(expr->get_input_port(1)).

If there is a CopyB, we should take the stride that CopyB wrote. As far as I understand, CopyB stores the data in a blocked layout, which means that LDB in this case should be aligned with inner_block: rnd_up(N, inner_block). Since we currently use max, our LDB is not aligned with the block size, and we cannot support avx_vnni_2 in the bf16 case (it expects blocked weights).
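The difference between the two formulas in the comment above can be sketched as follows. This is an illustration only: `rnd_up`, `ldb_with_max` and `ldb_with_rnd_up` are hypothetical helpers written for this example, and the inner block value of 32 used below is an assumed number, not one taken from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Round `value` up to the nearest multiple of `multiple` (multiple must be > 0).
constexpr std::size_t rnd_up(std::size_t value, std::size_t multiple) {
    return (value + multiple - 1) / multiple * multiple;
}

// Mirrors the static-shape branch of the diff: max(n_block, inner_n_block).
inline std::size_t ldb_with_max(std::size_t n_block, std::size_t inner_n_block) {
    return std::max(n_block, inner_n_block);
}

// Variant suggested in the comment: align the leading dimension to the inner block.
inline std::size_t ldb_with_rnd_up(std::size_t n_block, std::size_t inner_n_block) {
    return rnd_up(n_block, inner_n_block);
}
```

For n_block = 40 and an assumed inner block of 32, the max-based version yields 40 (not a multiple of the block), while the rounded version yields 64. The two agree whenever n_block is at most one inner block or is already a multiple of it.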

IvanNovoselov · 2024-10-25T09:55:26Z

Please note: I reverted the "Disable K,N blocking until blocking heuristic is updated" commit in order to run the CI on the updated AdjustBrgemmCopyBLoopPorts::update_loop_info.
I will re-apply the commit when the CI is green.

src/plugins/intel_cpu/src/emitters/snippets/cpu_runtime_configurator.hpp

src/plugins/intel_cpu/src/emitters/snippets/cpu_runtime_configurator.cpp

.../intel_cpu/src/transformations/snippets/x64/pass/lowered/adjust_brgemm_copy_b_loop_ports.cpp

src/plugins/intel_cpu/src/transformations/snippets/x64/pass/lowered/brgemm_cpu_blocking.cpp

a-sidorova · 2024-10-25T07:24:50Z

src/plugins/intel_cpu/src/emitters/snippets/cpu_runtime_configurator.cpp

    if (linear_ir->is_dynamic())
+        loopPortsAdjuster.optimize();


Note: one more argument for ticket 148891. If we had a PassPipeline in the base class, CPURuntimeConfigurator could just add loopPortsAdjuster as a PositionedPass, and there would be no need to copy code from the base class into the derived one 🤔

IvanNovoselov · 2024-10-29

Copy of what? The update method?
If so, then I totally agree.
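The design idea discussed in this thread could look roughly like the sketch below. All names here (`ConfiguratorSketch`, `CpuConfiguratorSketch`, `register_pass_at`) are hypothetical and do not correspond to the real snippets API. The pass bodies only log their names, and placing the adjuster after the base pass is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base configurator that owns an ordered list of passes.
class ConfiguratorSketch {
public:
    using Log = std::vector<std::string>;
    using Pass = std::function<void(Log&)>;
    virtual ~ConfiguratorSketch() = default;

    // update() is written once in the base class and just runs the pipeline.
    void update(Log& log) const {
        for (const auto& pass : m_pipeline) {
            pass(log);
        }
    }

protected:
    ConfiguratorSketch() {
        m_pipeline.push_back([](Log& log) { log.push_back("base_update"); });
    }

    // Insert a pass at a given position; derived classes use this instead of
    // copying update().
    void register_pass_at(std::size_t pos, Pass pass) {
        m_pipeline.insert(m_pipeline.begin() + static_cast<std::ptrdiff_t>(pos), std::move(pass));
    }

    std::size_t pass_count() const { return m_pipeline.size(); }

private:
    std::vector<Pass> m_pipeline;
};

// Hypothetical CPU-specific configurator: it only registers its extra pass.
class CpuConfiguratorSketch : public ConfiguratorSketch {
public:
    CpuConfiguratorSketch() {
        register_pass_at(pass_count(), [](Log& log) { log.push_back("loop_ports_adjuster"); });
    }
};
```

With this shape the derived class never re-implements `update()`; it only declares where its pass belongs in the shared pipeline.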

.../intel_cpu/src/transformations/snippets/x64/pass/lowered/adjust_brgemm_copy_b_loop_ports.hpp

.../intel_cpu/src/transformations/snippets/x64/pass/lowered/adjust_brgemm_copy_b_loop_ports.cpp

a-sidorova · 2024-10-25T12:27:17Z

src/plugins/intel_cpu/src/transformations/snippets/x64/pass/lowered/brgemm_cpu_blocking.cpp


    const auto copy_b_expr = linear_ir.get_expr_by_node(brgemm->get_brgemm_copy());
-    copy_b_expr->get_input_port_descriptor(0)->set_subtensor({k_block, n_block});
-    copy_b_expr->get_output_port_descriptor(0)->set_subtensor({k_block, n_block});
+    copy_b_expr->get_input_port_descriptor(0)->set_subtensor({get_full_dim_value(), get_full_dim_value()});


Just a question to think about: if the loop by M is missing (dimension M < M_block_size), can we execute CopyB inside the loops by K and N together with Brgemm? Could there be some perf improvement?

IvanNovoselov · 2024-10-29

It's a fair question, but I don't think so, because in order to repack the first line (or block) we need to access the whole vnni_factor * N subtensor. This means that to produce a repacked K_blk x N_blk subtensor for brgemm, we would need to access a div_up(K_blk/vnni_factor) x N subtensor. So basically, repacking inside the blocking cycles requires higher memory bandwidth than the matrix multiplication itself.
BTW, did we measure performance when low-precision blocking was initially introduced in PR 23292? @v-Golubev, maybe you have some data?
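The point about the first repacked line can be illustrated with a simplified VNNI-style repacking routine. This is a sketch under stated assumptions, not the CopyB implementation: `repack_vnni` is a hypothetical helper, it uses `int` elements instead of bf16/int8, assumes K is divisible by vnni_factor, and ignores any additional blocking along N.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Repack a row-major K x N matrix so that every group of `vnni_factor`
// consecutive K rows is interleaved element by element.
// Assumes K % vnni_factor == 0.
std::vector<int> repack_vnni(const std::vector<int>& src,
                             std::size_t K,
                             std::size_t N,
                             std::size_t vnni_factor) {
    std::vector<int> dst(K * N);
    for (std::size_t k = 0; k < K; ++k) {
        const std::size_t row_group = k / vnni_factor;  // which repacked row
        const std::size_t lane = k % vnni_factor;       // position inside the group
        for (std::size_t n = 0; n < N; ++n) {
            dst[row_group * N * vnni_factor + n * vnni_factor + lane] = src[k * N + n];
        }
    }
    return dst;
}
```

In this layout each repacked row of length N * vnni_factor is built from vnni_factor complete source rows, which is the vnni_factor * N access mentioned in the reply.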

src/plugins/intel_cpu/src/emitters/snippets/cpu_runtime_configurator.hpp

.../intel_cpu/src/transformations/snippets/x64/pass/lowered/adjust_brgemm_copy_b_loop_ports.hpp

src/plugins/intel_cpu/src/emitters/snippets/cpu_runtime_configurator.cpp

.../intel_cpu/src/transformations/snippets/x64/pass/lowered/adjust_brgemm_copy_b_loop_ports.cpp

src/plugins/intel_cpu/tests/unit/snippets_transformations/x64/lowered/brgemm_blocking.cpp

src/plugins/intel_cpu/src/transformations/snippets/x64/op/brgemm_utils.hpp

IvanNovoselov · 2024-10-31T11:05:13Z

@v-Golubev, @a-sidorova I addressed your comments, please take a second look.

a-sidorova

👍🏼

.../intel_cpu/src/transformations/snippets/x64/pass/lowered/adjust_brgemm_copy_b_loop_ports.hpp

src/plugins/intel_cpu/src/emitters/snippets/cpu_runtime_configurator.cpp

.../intel_cpu/src/transformations/snippets/x64/pass/lowered/adjust_brgemm_copy_b_loop_ports.cpp

github-actions bot added the category: CPU OpenVINO CPU plugin label Oct 8, 2024

IvanNovoselov force-pushed the sns_enable_lp_kn_blocking branch from ffb0cb7 to bfe24f4 Compare October 18, 2024 09:16

IvanNovoselov marked this pull request as ready for review October 18, 2024 09:17

IvanNovoselov requested review from a team as code owners October 18, 2024 09:17

IvanNovoselov added the do_not_review label Oct 18, 2024

IvanNovoselov force-pushed the sns_enable_lp_kn_blocking branch 3 times, most recently from 30e077d to bbd612b Compare October 24, 2024 11:20

IvanNovoselov removed the do_not_review label Oct 24, 2024

IvanNovoselov assigned a-sidorova and v-Golubev Oct 24, 2024

IvanNovoselov force-pushed the sns_enable_lp_kn_blocking branch from c0d5c02 to f5e678c Compare October 24, 2024 16:01

a-sidorova reviewed Oct 25, 2024

IvanNovoselov force-pushed the sns_enable_lp_kn_blocking branch from f5e678c to 5843a07 Compare October 25, 2024 09:53

v-Golubev reviewed Oct 25, 2024

a-sidorova reviewed Oct 28, 2024

IvanNovoselov requested review from v-Golubev and a-sidorova October 31, 2024 11:05

a-sidorova approved these changes Nov 1, 2024

v-Golubev approved these changes Nov 4, 2024

IvanNovoselov force-pushed the sns_enable_lp_kn_blocking branch 2 times, most recently from 6cabbbe to d96b5ae Compare November 7, 2024 10:25

IvanNovoselov force-pushed the sns_enable_lp_kn_blocking branch from d96b5ae to 4697fa0 Compare November 7, 2024 11:20

a-sidorova mentioned this pull request Nov 7, 2024

[Snippets][CPU] Added first input of BrgemmCPU repacking on K_tail tensor #27452

Open

1 task

IvanNovoselov force-pushed the sns_enable_lp_kn_blocking branch from 4697fa0 to a2a6d83 Compare November 7, 2024 12:30

[Snippets] Perform weights repacking outside of blocking loops.

a2a6d83

IvanNovoselov added this pull request to the merge queue Nov 7, 2024

Merged via the queue into openvinotoolkit:master with commit 8edb150 Nov 7, 2024
166 checks passed

IvanNovoselov deleted the sns_enable_lp_kn_blocking branch November 7, 2024 16:47
