Sns enable lp kn blocking #26957
Note: this was a bug in the LoopPort comparison operators. We compared the shared pointers themselves instead of the expression ports they point to.
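To illustrate the difference, here is a minimal sketch. The `ExpressionPort` and `LoopPort` structs below are simplified, hypothetical stand-ins (not the real snippets classes), and `equal_by_pointer` / `equal_by_port` are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-ins for the real snippets classes; only the parts
// needed to illustrate the comparison bug are modelled.
struct ExpressionPort {
    const void* expr;   // identity of the owning expression
    std::size_t index;  // port index within that expression
};

inline bool operator==(const ExpressionPort& a, const ExpressionPort& b) {
    return a.expr == b.expr && a.index == b.index;
}

struct LoopPort {
    std::shared_ptr<ExpressionPort> expr_port;
};

// Buggy variant: compares the shared_ptr objects, i.e. the stored addresses.
inline bool equal_by_pointer(const LoopPort& a, const LoopPort& b) {
    return a.expr_port == b.expr_port;
}

// Fixed variant: compares the expression ports the pointers refer to.
inline bool equal_by_port(const LoopPort& a, const LoopPort& b) {
    assert(a.expr_port && b.expr_port);
    return *a.expr_port == *b.expr_port;
}
```

Two loop ports that describe the same expression port but hold separately allocated `ExpressionPort` objects compare unequal with the first variant and equal with the second.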
Note: this change improves serialization of Buffer expressions. Previously they didn't print any expression info.
Note: this change fixes a bug where the kernel clobbered the runtime_args register in the static pipeline, so brgemm tried to read the AMX config from an invalid address.
@v-Golubev, @a-sidorova the PR is ready for review. Please take a look.
template<typename T, typename = typename std::enable_if<(std::is_same<T, size_t>::value || std::is_same<T, int64_t>::value), bool>::type>
T compute_out_leading_dim(T n_block, const ov::element::Type& precision) {
    return snippets::utils::is_dynamic_value<T>(n_block) ?
           n_block :
           std::max(n_block, static_cast<T>(compute_inner_n_block(precision)));
}
A comment for discussion:
I believe that LDB
- is the stride for the second input of the Brgemm block (not of MatMul).
- If there is no CopyB, LDB = snippets::utils::get_dim_stride(expr->get_input_port(1)).
- If there is a CopyB, we should take the stride as CopyB wrote it. As far as I understand, CopyB stores the data in a blocked layout. It means that LDB in this case should be aligned with inner_block: rnd_up(N, inner_block). Since at the moment we use max, our LDB is not aligned with the block size and we cannot support avx_vnni_2 in the bf16 case (it expects blocked weights).
Please note: I reverted "Disable K,N blocking until blocking heuristic is updated" in order to run the CI on the updated AdjustBrgemmCopyBLoopPorts::update_loop_info.
if (linear_ir->is_dynamic())
    loopPortsAdjuster.optimize();
Note: one more argument for ticket 148891. If we had a PassPipeline in the base class, CPURuntimeConfigurator could just add a PositionedPass for loopPortsAdjuster, and there would be no need to copy code from the base class into the derived one 🤔
Copy of what? The update method? If so, then I totally agree.
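A rough sketch of the idea discussed above, assuming a much simplified design: the base configurator owns an ordered list of passes and its `update` runs them, so a derived class only registers its extra pass instead of duplicating `update`. All class, method and pass names here are hypothetical and do not mirror the actual snippets API.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base class: owns the pass list and the single update() method.
class RuntimeConfiguratorSketch {
public:
    RuntimeConfiguratorSketch() {
        register_pass("common_update", [](bool) { return true; });
    }
    virtual ~RuntimeConfiguratorSketch() = default;

    // Runs every registered pass in order; returns the names of those that ran.
    std::vector<std::string> update(bool is_dynamic) {
        std::vector<std::string> executed;
        for (const auto& pass : m_passes) {
            if (pass.second(is_dynamic))
                executed.push_back(pass.first);
        }
        return executed;
    }

protected:
    using Pass = std::pair<std::string, std::function<bool(bool)>>;

    void register_pass(std::string name, std::function<bool(bool)> run) {
        m_passes.emplace_back(std::move(name), std::move(run));
    }

private:
    std::vector<Pass> m_passes;
};

// Hypothetical derived class: adds one pass, no copy of update() needed.
class CpuConfiguratorSketch : public RuntimeConfiguratorSketch {
public:
    CpuConfiguratorSketch() {
        // The adjuster is only applied for dynamic shapes, as in the snippet above.
        register_pass("loop_ports_adjuster",
                      [](bool is_dynamic) { return is_dynamic; });
    }
};
```

In this model, calling `update(true)` on the derived object runs both passes, and `update(false)` runs only the common one.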
const auto copy_b_expr = linear_ir.get_expr_by_node(brgemm->get_brgemm_copy());
copy_b_expr->get_input_port_descriptor(0)->set_subtensor({k_block, n_block});
copy_b_expr->get_output_port_descriptor(0)->set_subtensor({k_block, n_block});
copy_b_expr->get_input_port_descriptor(0)->set_subtensor({get_full_dim_value(), get_full_dim_value()});
Just a question to think about: if the loop by M is absent (dimension M < M_block_size), can we execute CopyB inside the loops by K and N together with Brgemm? Could there be some perf improvements?
It's a fair question, but I don't think so: in order to repack the first line (or block) we need to access the whole vnni_factor * N subtensor. It means that to produce a repacked K_blk x N_blk subtensor for brgemm, we would need to access a div_up(K_blk, vnni_factor) x N subtensor. So basically, repacking inside the blocking loops requires higher memory bandwidth than the matrix multiplication itself.
BTW, did we measure performance when the low precision blocking was initially introduced in PR 23292? @v-Golubev, maybe you have some data?
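For readers unfamiliar with the repacking, here is a simplified model of a VNNI-style layout. It is an assumption made for illustration, not the actual CopyB kernel, and `repack_vnni` is a function written for this example. It shows why one repacked line of N * vnni_factor elements draws on vnni_factor full input rows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified model of VNNI-style repacking (not the actual CopyB kernel):
// a row-major K x N matrix is rewritten so that `vnni_factor` consecutive
// K-elements of each column become adjacent in memory.
inline std::vector<int> repack_vnni(const std::vector<int>& src,
                                    std::size_t K, std::size_t N,
                                    std::size_t vnni_factor) {
    assert(vnni_factor > 0 && src.size() == K * N);
    const std::size_t k_groups = (K + vnni_factor - 1) / vnni_factor;
    // Positions that correspond to rows beyond K stay zero (padding).
    std::vector<int> dst(k_groups * N * vnni_factor, 0);
    for (std::size_t k = 0; k < K; ++k) {
        for (std::size_t n = 0; n < N; ++n) {
            const std::size_t group = k / vnni_factor;  // output line
            const std::size_t lane = k % vnni_factor;   // slot inside the pair/quad
            dst[(group * N + n) * vnni_factor + lane] = src[k * N + n];
        }
    }
    return dst;
}
```

In this model, output line `group` is filled from input rows `group * vnni_factor` through `group * vnni_factor + vnni_factor - 1`, each read across all N columns.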
@v-Golubev, @a-sidorova I addressed your comments, please take a second look.
👍🏼
Details:
Tickets: