Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-104635: Eliminate redundant STORE_FAST instructions in the compiler #105040

Closed
wants to merge 13 commits into from

Conversation

corona10
Copy link
Member

@corona10 corona10 commented May 28, 2023

import pyperf

runner = pyperf.Runner()
runner.timeit(name="bench dead_store",
              stmt="""
_, _, _, a = 1, 2, 3, 4
""")


Mean +- std dev: [base] 10.8 ns +- 0.2 ns -> [super] 10.1 ns +- 0.1 ns: 1.07x faster

@corona10 corona10 changed the title gh-104635: Naive dead store elimination in cfg opt gh-104635: Naive dead store elimination in cfg opt (experiment) May 28, 2023
Copy link
Member

@carljm carljm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems worth doing if the reported 1% win from pyperformance is real. We can also do a bit better here by allowing this optimization to help apply_static_swaps.

I'm not sure if there are reasons to avoid new super-instructions? Would like feedback from @markshannon on that.

@corona10 corona10 force-pushed the dead_store branch 2 times, most recently from 103c2c7 to a9f902c Compare May 31, 2023 07:45
@corona10 corona10 requested a review from carljm May 31, 2023 09:23
@corona10 corona10 marked this pull request as ready for review May 31, 2023 09:24
@corona10 corona10 changed the title gh-104635: Naive dead store elimination in cfg opt (experiment) gh-104635: Naive dead store elimination in cfg opt May 31, 2023
@corona10 corona10 changed the title gh-104635: Naive dead store elimination in cfg opt gh-104635: Naive dead store elimination in basic block optimization May 31, 2023
Copy link
Member

@carljm carljm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable to me! Seems ok to do the swap thing either in this PR or separately. Interested in any feedback from @iritkatriel / @markshannon / @brandtbucher.

@carljm
Copy link
Member

carljm commented May 31, 2023

We might also want to expand this so it can look more than one instruction ahead for another write to the same location, skipping over an allowlist of instructions that we know don't read that same local, don't execute arbitrary code, and can't raise an exception. E.g. LOAD_FAST of a different local, POP_TOP, NOP, probably others.

@corona10
Copy link
Member Author

corona10 commented Jun 1, 2023

@mdboom Hi, Can we run the pyperformance benchmark for this PR?

@corona10 corona10 changed the title gh-104635: Naive dead store elimination in basic block optimization gh-104635: Eliminate redundant :STORE_FAST instructions in the compiler Jun 1, 2023
@corona10 corona10 changed the title gh-104635: Eliminate redundant :STORE_FAST instructions in the compiler gh-104635: Eliminate redundant STORE_FAST instructions in the compiler Jun 1, 2023
@corona10 corona10 requested review from carljm and iritkatriel June 1, 2023 06:44
@corona10
Copy link
Member Author

corona10 commented Jun 1, 2023

We might also want to expand this so it can look more than one instruction ahead for another write to the same location, skipping over an allowlist of instructions that we know don't read that same local, don't execute arbitrary code, and can't raise an exception.

Yeah, great idea.
Currently, we can not optimize the following case.

_, _, a, _ = foo

But if we expand the optimization, we can do it.
Do we have to expand the optimization in this PR or handle it on a separate PR?

Copy link
Member

@carljm carljm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, presuming pyperformance comes back looking neutral or better.

Do we have to expand the optimization in this PR or handle it on a separate PR?

Either way seems fine to me.

Comment on lines +1097 to +1098
('NOP', 0, 3),
('POP_TOP', 0, 4),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a cool benefit I didn't realize we'd also get! LOAD_*/POP_TOP pairs can be eliminated entirely.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this would just become LOAD_CONST NOP NOP STORE_FAST though, right?

Copy link
Member

@carljm carljm Jun 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes; I think that's an orthogonal improvement to the LOAD/POP elimination optimization, allowing it to ignore intervening NOP. Or maybe it has to do rather with making it multi-pass? Or both? I haven't checked.

Copy link
Member Author

@corona10 corona10 Jun 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can run the loop of optimization until there are no opcode changes.
(I didn't test yet)

@corona10
Copy link
Member Author

corona10 commented Jun 2, 2023

@carljm

Looks good to me, presuming pyperformance comes back looking neutral or better.

I remeasured the benchmark with my bare metal machine,
As you expected, it shows 1.0x faster (neutral) with latest pyperformance 1.0.8.

I would like to merge this PR if possible, but it looks like we have to wait for Mark's task
And if the runtime super instruction is removed, we should re-consider adding POP_TOP__POP_TOP and POP_TOP__STORE_FAST.

The best thing would be to improve this PR with more optimization without adding super instruction or new single opcode concatenated by multiple opcodes.

Do you have any ideas or suggestions?

@carljm
Copy link
Member

carljm commented Jun 2, 2023

Do you have any ideas or suggestions?

I think we should wait for #105230 to land, given the motivations outlined in #105229. Then we should re-measure the performance impact of this. I think it would be good to do a stats run of pyperformance with this change and see how often POP_TOP is followed by another POP_TOP or by STORE_FAST, in order to help inform whether we should add POP_TOP_POP_TOP and POP_TOP_STORE_FAST superinstructions (in the compiler).

@corona10
Copy link
Member Author

corona10 commented Jun 4, 2023

I think it would be good to do a stats run of pyperformance with this change and see how often POP_TOP is followed by another POP_TOP or by STORE_FAST,

When I ran the pyperformance, they showed the very low ratio of executions.

@corona10 corona10 closed this Jun 5, 2023
@corona10
Copy link
Member Author

corona10 commented Jun 5, 2023

Replaced by #105320

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants