Make emitted egal code more loopy #54121
Conversation
We do run the pass that inlines memcmps, but it's a backend pass. I've been chipping away at moving it to the middle end in llvm/llvm-project#77370, but it still needs some love.
That probably explains why the performance is better than I expected.
I played a bit with it here #52719, which is what made me start doing the LLVM work.
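For context on what that inlining buys, here is a minimal sketch (not the LLVM pass itself, which I believe is ExpandMemCmp; this only shows the shape of the result): a memcmp with a small constant size can be lowered to a few wide integer loads and compares instead of a library call, e.g. a 16-byte equality check:

```c
#include <stdint.h>
#include <string.h>

/* Rough shape of what inlining a small constant-size memcmp produces:
 * memcmp(a, b, 16) == 0 becomes two 8-byte loads per side plus a compare.
 * The real pass also handles alignment, odd sizes, and ordered (<, >)
 * results; this only illustrates the equality case. */
static int bytes16_equal(const void *a, const void *b)
{
    uint64_t a0, a1, b0, b1;
    memcpy(&a0, a, 8);                    /* memcpy = unaligned-safe load */
    memcpy(&a1, (const char *)a + 8, 8);
    memcpy(&b0, b, 8);
    memcpy(&b1, (const char *)b + 8, 8);
    return ((a0 ^ b0) | (a1 ^ b1)) == 0;
}
```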
@nanosoldier
The package evaluation job you requested has completed - possible new issues were detected.
@nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
Performance looks good
Performance is fine, but as shown in pkgeval, we're missing some edge cases, since we're double-booking the meaning of `haspadding` to also mean not bitwise-comparable. I need to think about whether to implement the edge case or just fall back.
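To illustrate why padding needs to be tracked at all (a standalone C sketch with a hypothetical struct, not Julia's layout code): padding bytes carry unspecified values, so a flat memcmp over a padded struct can disagree with field-wise equality.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical struct: on most ABIs there are padding bytes between
 * `tag` and `value`, and their contents are unspecified. */
struct padded {
    char tag;
    long long value;
};

int main(void)
{
    struct padded a, b;
    memset(&a, 0xAA, sizeof a);   /* different garbage in the padding */
    memset(&b, 0x55, sizeof b);
    a.tag = b.tag = 'x';
    a.value = b.value = 42;

    /* Field-wise equal, but memcmp can report a mismatch because the
     * padding bytes differ - which is why a padded type cannot be
     * compared with a single flat memcmp. */
    printf("fields equal: %d, memcmp equal: %d\n",
           a.tag == b.tag && a.value == b.value,
           memcmp(&a, &b, sizeof a) == 0);
    return 0;
}
```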
Force-pushed from b5a5ea1 to fe67c79
@nanosoldier
The package evaluation job you requested has completed - possible new issues were detected.
Force-pushed from ee760e0 to 1ac52df
The strategy here is to look at (data, padding) pairs and RLE them into loops, so that repeated adjacent patterns use a loop rather than getting unrolled. On the test case from #54109, this makes compilation essentially instant, while also being faster at runtime (it turns out LLVM spends a massive amount of time AND the answer is bad).

There are some obvious further enhancements possible here:

1. The `memcmp` size constant is small. LLVM has a pass to inline these with better code. However, we don't have it turned on. We should consider vendoring it, though we may want to add some shortcutting to it to avoid having it iterate through each function.
2. This only does one level of sequence matching. It could be recursed to turn things into nested loops.

However, this solves the immediate issue, so hopefully it's a useful start. Fixes #54109.
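To make the RLE idea concrete, here is a minimal sketch under assumed names (`segment_t` and `rle_segments` are hypothetical, not the actual codegen helpers): adjacent segments with the same (data, padding) shape collapse into one entry with a repeat count, and an entry whose count is large enough would be emitted as a compare loop instead of unrolled compares.

```c
#include <stddef.h>

/* Hypothetical segment descriptor: `data` bytes to compare, then
 * `padding` bytes to skip. Callers pass entries with count == 1. */
typedef struct {
    size_t data;
    size_t padding;
    size_t count;
} segment_t;

/* Collapse runs of identical adjacent (data, padding) pairs in place and
 * return the compacted length. Entries that end up with a large count are
 * candidates for a compare loop; count == 1 entries stay unrolled. A second
 * pass over the result could match repeated *sequences* and nest the loops. */
static size_t rle_segments(segment_t *segs, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (out > 0 &&
            segs[out - 1].data == segs[i].data &&
            segs[out - 1].padding == segs[i].padding) {
            segs[out - 1].count += segs[i].count;
        } else {
            segs[out++] = segs[i];
        }
    }
    return out;
}
```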
Very fancy! Sgtm (though I didn't review the recursive algorithm fully, I trust our tests would have caught any issues with it)
@@ -621,18 +624,17 @@ void jl_compute_field_offsets(jl_datatype_t *st)
     // if we have no fields, we can trivially skip the rest
     if (st == jl_symbol_type || st == jl_string_type) {
         // opaque layout - heap-allocated blob
-        static const jl_datatype_layout_t opaque_byte_layout = {0, 0, 1, -1, 1, {0}};
+        static const jl_datatype_layout_t opaque_byte_layout = {0, 0, 1, -1, 1, { .haspadding = 0, .fielddesc_type=0, .isbitsegal=1, .arrayelem_isboxed=0, .arrayelem_isunion=0 }};
I thought this was a GCC extension. Did it finally make it into the standard with reasonable behavior? It is certainly much better syntax for readability/maintainability!
I believe it's in C99. The C++ version was a GCC extension until C++20, so that may be what you're thinking of.
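For reference, the syntax in question on a toy struct (illustrative only, not the actual Julia layout type):

```c
#include <stdio.h>

struct layout_flags {
    unsigned haspadding : 1;
    unsigned isbitsegal : 1;
    unsigned fielddesc_type : 2;
};

int main(void)
{
    /* Designated initializers: fields are named, may appear in any order,
     * and anything omitted is zero-initialized. Standard C since C99;
     * C++ only gained a more restricted form (declaration order required)
     * in C++20. */
    struct layout_flags opaque = { .isbitsegal = 1, .haspadding = 0 };
    printf("haspadding=%u isbitsegal=%u fielddesc_type=%u\n",
           opaque.haspadding, opaque.isbitsegal, opaque.fielddesc_type);
    return 0;
}
```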
// Emit memcmp. TODO: LLVM has a pass to expand this for additional
// performance.
Value *this_answer = ctx.builder.CreateCall(prepare_call(memcmp_func),
    { ptr1,
This load from a GC value appears to be unrooted (rooting is required by the ABI for memcmp). Please copy the code from the other memcmp branch to add the roots here (unless I missed something).
As requested in post-commit review on #54121.