Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sns enable lp kn blocking #26957

Conversation

IvanNovoselov
Copy link
Contributor

@IvanNovoselov IvanNovoselov commented Oct 8, 2024

Details:

  • Perform weights repacking outside of the blocking cycles
  • Functionally enable blocking for I8,U8 and BF16
  • The blocking will be temporary disabled until blocking heuristic is updated in 156014

Tickets:

  • 154729

@github-actions github-actions bot added the category: CPU OpenVINO CPU plugin label Oct 8, 2024
@IvanNovoselov IvanNovoselov marked this pull request as ready for review October 18, 2024 09:17
@IvanNovoselov IvanNovoselov requested review from a team as code owners October 18, 2024 09:17
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, it was a bug in LoopPort comparison operators: we compared values of shared pointers instead of comparing expression ports

@IvanNovoselov IvanNovoselov force-pushed the sns_enable_lp_kn_blocking branch 3 times, most recently from 30e077d to bbd612b Compare October 24, 2024 11:20
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: to improve serialization of Buffer expressions. They didn't print any expression info

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this change is to fix a bug when the kernel spoiled runtime_args register in the static pipeline and brgemm tried to read amx config from an invalid address.

@IvanNovoselov
Copy link
Contributor Author

@v-Golubev, @a-sidorova the PR is ready for review. Please, take a look

Comment on lines 53 to 58
template<typename T, typename = typename std::enable_if<(std::is_same<T, size_t>::value || std::is_same<T, int64_t>::value), bool>::type>
T compute_out_leading_dim(T n_block, const ov::element::Type& precision) {
return snippets::utils::is_dynamic_value<T>(n_block) ?
n_block :
std::max(n_block, static_cast<T>(compute_inner_n_block(precision)));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment for the discussion:
I believe that LDB - is stride for second input of Brgemm block (not of MatMul).

  1. If there is not CopyB, LDB = snippets::utils::get_dim_stride(expr->get_input_port(1))
  2. If there is CopyB, we should take a stride as CopyB wrote. As far as I understand, CopyB stores the data in blocked layout. It means that LDB in this case should be aligned with inner_block: rnd_up(N, inner_block). Since at the moment we use max, our LDB is not aligned with block size and we cannot support avx_vnni_2 in bf16 case (it expects blocked weights).

@IvanNovoselov
Copy link
Contributor Author

IvanNovoselov commented Oct 25, 2024

Please note, I reverted Disable K,N blocking until blocking heuristic is updated in order to run the CI on updated AdjustBrgemmCopyBLoopPorts::update_loop_info.
Will re-apply the commit when the CI is green

if (linear_ir->is_dynamic())
loopPortsAdjuster.optimize();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: One more argument to the ticket 148891. If we have PassPipeline in the base class, CPURuntimeConfigurator could just add PositionedPass loopPortsAdjuster and no need to make a copy from the base class to the derived 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy of what? The update method?
If so, then I totally agree


const auto copy_b_expr = linear_ir.get_expr_by_node(brgemm->get_brgemm_copy());
copy_b_expr->get_input_port_descriptor(0)->set_subtensor({k_block, n_block});
copy_b_expr->get_output_port_descriptor(0)->set_subtensor({k_block, n_block});
copy_b_expr->get_input_port_descriptor(0)->set_subtensor({get_full_dim_value(), get_full_dim_value()});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just question to think: If loop by M is missed (dimension M < M_block_size), can we execute CopyB in loop by K, N with Brgemm? Can be there some perf improvements?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a fair question, but I don't think so because in order to repack the first line (or block) we need to access the whole vnni_factor * N subtensor. It means that to produce repacked K_blk x N_blk subtensor for brgemm, we would need to access div_up(K_blk/vnni_factor) x N subtensor. So basically repacking inside blocking cycles requires higher memory bandwidth than matrix multiplication.
BTW did me measure performance when the low precision blocking was initially introduced in PR 23292. @v-Golubev, maybe you have some data?

@IvanNovoselov
Copy link
Contributor Author

@v-Golubev, @a-sidorova I addressed your comments, please take a second look.

Copy link
Contributor

@a-sidorova a-sidorova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏼

@IvanNovoselov IvanNovoselov force-pushed the sns_enable_lp_kn_blocking branch 2 times, most recently from 6cabbbe to d96b5ae Compare November 7, 2024 10:25
@IvanNovoselov IvanNovoselov added this pull request to the merge queue Nov 7, 2024
Merged via the queue into openvinotoolkit:master with commit 8edb150 Nov 7, 2024
166 checks passed
@IvanNovoselov IvanNovoselov deleted the sns_enable_lp_kn_blocking branch November 7, 2024 16:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: CPU OpenVINO CPU plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants