Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIEX] Improve iterative scheduling convergence strategy #212

Merged
merged 5 commits into from
Oct 15, 2024

Conversation

gbossu
Copy link
Collaborator

@gbossu gbossu commented Oct 15, 2024

This brings a new convergence strategy that can bias the depth of some SUnits. Doing so can shift part of the schedule up or down in the scoreboard and avoid resource conflicts without necessarily increasing the latency of the whole region.

Looking at Add2D, this helps us place the VLD instructions better and avoid bank conflicts with the next iteration of the loop. We can then disable the heuristic in WAWStickyRegistersEdges and more aggressively remove WAW edges on status registers. This is what the first commit does.

QoR is fine, there are still major regressions coming from Neg_aie2_1 and HardSigmoidTemplated_int8_0 (same as already mentioned in #183 (comment)), but other regressions are avoided.
And those two benchmarks are expected to be tackled in the new post-pipeliner soon.

| Core_Compute_Cycle_Count   | Neg_aie2_1    | HardSigmoidTemplated_int8_0 |     | Add2D_0      | Add2D_Standalone_0 | Add2D_Standalone_1 |     | Shrink_aie2_0 | Tanh_int8_0  | TanhTemplated_aie2_int8 | Tanh_int8_1  | Add2D_1      | HardSigmoidTemplated_bf16_0 | Average diff |
| -------------------------- | ------------- | --------------------------- | --- | ------------ | ------------------ | ------------------ | --- | ------------- | ------------ | ----------------------- | ------------ | ------------ | --------------------------- | ------------ |
| Baseline                   | 343(+0.00%)   | 257(+0.00%)                 | ... | 217(+0.00%)  | 322(+0.00%)        | 482(+0.00%)        | ... | 672(+0.00%)   | 349(+0.00%)  | 310(+0.00%)             | 421(+0.00%)  | 466(+0.00%)  | 617(+0.00%)                 | +0.00%       |
| Disabling heursitics       | 463(+34.99%)  | 284(+10.51%)                | ... | 230(+5.99%)  | 335(+4.04%)        | 511(+6.02%)        | ... | 672(+0.00%)   | 339(-2.87%)  | 300(-3.23%)             | 407(-3.33%)  | 466(+0.00%)  | 556(-9.89%)                 | +0.12%       |
| New convergence heuristic  | 462(-0.22%)   | 284(+0.00%)                 |     | 217(-5.65%)  | 322(-3.88%)        | 482(-5.68%)        |     | 658(-2.08%)   | 339(+0.00%)  | 300(+0.00%)             | 407(+0.00%)  | 434(-6.87%)  | 556(+0.00%)                 | -0.07%       |
| Overall diff               | REGR(+34.69%) | REGR(+10.51%)               |     | SAME(+0.00%) | SAME(+0.00%)       | SAME(+0.00%)       |     | IMPR(-2.08%)  | IMPR(-2.87%) | IMPR(-3.23%)            | IMPR(-3.33%) | IMPR(-6.87%) | IMPR(-9.89%)                | +0.05%       |

Better review commit by commit :)

We are now expecting the schedulers to have improved enough to handle
most regressions on their own. (Some of these improvements in next
commits)
Ultimately we will have:
createTopDownScoreboard
checkResourceConflictsTopDown
createBottomUpScoreboard
checkResourceConflictsBottomUp

Note that createTopDownScoreboard is now extra careful and blocks extra
cycles for tiny loops. But this is still an NFC change because that
scoreboard is only used to determine the number of resrouces that "stick
out".
This allows to bias the depth of some SUnits in the hope of moving them
up or down in the scoreboard and avoid resource conflicts without
necessarily increasing the latency of the whole region.
MachineInstr *ConflictMI = nullptr;
for (const MachineBundle &B : SuccBundles) {
for (MachineInstr *MI : B.getInstrs()) {
if (HR.getHazardType(Scoreboard, MI->getDesc(), HR.getMemoryBanks(MI),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of MI->getDesc(), please use *HR.getSelectedAltDescs().getDesc(&MI)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I didn't realise that part was already merged. Will do

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And that would take away the pre condition of no MSPs?

for (int C = FirstBlockedCycle; C <= LastBlockedCycle; ++C) {
Scoreboard[C].blockResources();
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this NFC?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned it in the commit, in the context of how this was used, this is an NFC.

This prepares for future work where Multi-slot pseudos won't necessarily
be materialized.
Copy link
Collaborator

@andcarminati andcarminati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@gbossu gbossu merged commit ab2b5d2 into aie-public Oct 15, 2024
8 checks passed
@gbossu gbossu deleted the gaetan.loopaware.convergence branch October 15, 2024 18:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants