-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Single arity immediate per br_table instruction and make arity immediate 4-byte wide #354
Conversation
Codecov Report
@@ Coverage Diff @@
## master #354 +/- ##
=======================================
Coverage 99.55% 99.55%
=======================================
Files 49 49
Lines 14529 14535 +6
=======================================
+ Hits 14464 14470 +6
Misses 65 65 |
This should be benchmarked. |
There is 5% regression on mul benchmark:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only works provided br_table
is validated (i.e. all its targets have the same type). Otherwise, this case cause stack underflow (regression tests attached).
(I have not investigated if situation was better before).
58baadd
to
cdcf9af
Compare
Basically this change gives no speed benefit, just penalty on the mul256 benchmark. Should we merge this? |
|
I don't think llvm uses |
cdcf9af
to
cca8042
Compare
I don't have good news. These are execution benchmarks results. I removed statistically insignificant results, except for eli_interpreter (as it is relevant for br_table). I also repeated keccak and eli_interpreter cases separately, but the results were the same or worse. The issue is with keccak, and it actually uses
|
On laptop/skylake seems to be no regression. no_turbo:
turbo (unstable results):
Maybe change in immediates layout affects timings on memory loads (e.g. some immediates accesses cross cache line). I can also see increased branch prediction misses from 2% to 3% on Haswell (first hardware setup). I'm recommending creating a separate commit where only arity immediate value order is changed. |
The first commit here does only reordering. |
The reordering commit does not change anything. The "Tweak performance 1" lowers the regression by ~2x.
This PR should really not touch any instruction except |
So I also dumped alignments of 4-byte reads in
|
The "Tweak performance 3" make all reads aligned, but there is still 3-4% performance regression. |
I benchmarked every commit once again on Keccak workload and compared that with immediates alignment, but this rather does not explain the timings:
The |
In #376 I checked what is the impact of only changing arity immediate value from |
ee0459c
to
d31ab80
Compare
I removed last commit because was to invasive ( This can land in current form. |
d31ab80
to
e567b36
Compare
I checked the assembly before and after. Looks like for GCC the issues is with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be put on hold because of the found performance issues.
I have some other changes planned around branching, so it may make the decision easier to make later.
3195ab0
to
279e58a
Compare
Rebased. |
@chfast can you run the benchmarks again? Perhaps numbers are different now after a month's worth of changes. |
lib/fizzy/execute.cpp
Outdated
@@ -670,27 +669,31 @@ ExecutionResult execute(Instance& instance, FuncIdx func_idx, span<const uint64_ | |||
case Instr::br_if: | |||
case Instr::return_: | |||
{ | |||
const auto arity = read<uint8_t>(immediates); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also check if making this uint32_t
makes any difference, as attempted in #376.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need a commit for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed variant with uint32_t
arity.
lib/fizzy/execute.cpp
Outdated
break; | ||
} | ||
case Instr::br_table: | ||
{ | ||
const auto br_table_size = read<uint32_t>(immediates); | ||
const auto arity = read<uint8_t>(immediates); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here.
|
With arity immediate being 4 bytes, this is nice.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not ideal, because it mixes two different optimization changes. But it is good enough to merge.
test/unittests/parser_expr_test.cpp
Outdated
} | ||
|
||
TEST(parser_expr, instr_br_table) | ||
TEST(parser_expr, DISABLED_instr_br_table) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is annoying to update, should I just delete it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated it anyway.
03107a2
to
e0ae51d
Compare
Changed |
e0ae51d
to
0c795c7
Compare
Get rid of duplicated arity in
br_table
immediates.