[stdlib] Optimize compile times by preventing huge loop unrolling when filling inline arrays. #4046
base: main
Conversation
Could we make unroll_threshold into a parameter that's defaulted to something a bit more reasonable, like 64? That way people can customize it, since unrolling 1000 iterations is already icache pollution.
```diff
     @parameter
-    for i in range(size):
+    for i in range(unrolled):
```
We probably want to unroll the inner loop, not the outer one.
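For illustration, a minimal sketch of that shape (my own example, not the PR's code): the outer loop stays a runtime loop over batches, so the emitted IR grows with `batch_size` rather than `size`, and only the small inner loop is unrolled at compile time. The real constructor fills uninitialized storage via `init_pointee_copy`; this sketch assumes plain `Int` stores through an `UnsafePointer`.

```mojo
from memory import UnsafePointer

fn fill_batched[size: Int, batch_size: Int = 64](ptr: UnsafePointer[Int], value: Int):
    # Elements covered by whole batches.
    alias full = size // batch_size * batch_size
    # Runtime outer loop: O(size / batch_size) iterations in the emitted IR.
    for base in range(0, full, batch_size):
        # Compile-time unrolled inner loop: exactly batch_size stores.
        @parameter
        for j in range(batch_size):
            ptr[base + j] = value
    # Unrolled remainder: fewer than batch_size trailing elements.
    @parameter
    for j in range(full, size):
        ptr[j] = value
```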
Done: msaelices@bb401c4
```diff
@@ -140,8 +140,21 @@ struct InlineArray[
     _inline_array_construction_checks[size]()
     __mlir_op.`lit.ownership.mark_initialized`(__get_mvalue_as_litref(self))

+    alias unroll_threshold = 1000
+    alias unrolled = size // unroll_threshold
```
`math.align_down` makes the intent clearer.
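To make the suggestion concrete, a small illustration (my own, with arbitrary values), assuming `math.align_down(value, alignment)` returns the largest multiple of `alignment` that is `<= value`:

```mojo
from math import align_down

alias size = 1000
alias batch_size = 64

# Spelled as arithmetic: elements covered by whole batches.
alias full_arith = size // batch_size * batch_size  # 960
# Spelled with align_down: same value, but the intent is explicit.
alias full_clear = align_down(size, batch_size)     # 960
```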
Sure: msaelices@bb401c4
Commits: "Unroll the inner loop" and "… a parameter" (Signed-off-by: Manuel Saelices <msaelices@gmail.com>)
Force-pushed from 57c599e to edadb98
Done: msaelices@edadb98
@owenhilyard Thanks for the review. Could you please take another look?
```diff
@@ -131,7 +132,7 @@ struct InlineArray[

     @always_inline
     @implicit
-    fn __init__(out self, fill: Self.ElementType):
+    fn __init__[batch_size: Int = 100](out self, fill: Self.ElementType):
```
I think this should be a power of 2 in order to keep the chunks aligned well. 32 or 64 is good enough to not bloat the icache but still be fast, especially since processing the loop variable will happen in parallel with the vector ops on any superscalar processor, which is most of them at this point.
If it's an inline array of bytes, then 100 will need to do some really odd things with instructions, especially on AVX512 since it will move 64, then 32, then 4.
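To spell out that arithmetic (my illustration, not compiler output): with 1-byte elements, a batch of 100 cannot map onto whole vector registers, while a power of two can.

```mojo
# Per-batch store shapes for a UInt8 fill on an AVX512 target, assuming the
# compiler vectorizes each unrolled batch:
alias odd_batch = 100   # 100 = 64 + 32 + 4 -> one zmm, one ymm, one dword store
alias pow2_batch = 64   # 64 bytes -> exactly one zmm store per batch
```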
Done: msaelices@212caea
FTR, this is another case for open source
@weiweichen @npanchen curious to get your thoughts on this PR when you get a chance
Stay tuned my friend 😄
I don't think I have a strong opinion on this: you trade compilation time for run time, and sometimes more unrolling gives you more performant code, sometimes it doesn't; it depends on how much additional optimization can be achieved by unrolling more.
Specifically to this PR, partial unroll is definitely better than full unroll. I'm curious why 64 has been chosen, as that unroll factor seems too large, especially if the HW has a loop-stream-detector-like feature. @msaelices? But generally, having any unrolling at compile time on parameterized …
64 was my choice. It's mostly chosen since 64 bytes is one AVX512 operation, meaning fast creation of byte arrays initialized to a single value without relying on the compiler to unroll the loop further. That's balanced against larger types, where 64 instances may take multiple instructions to fill per loop iteration. In that case, even with 8-byte pointers, one can still expect 8 items per iteration, which is a bit over optimal on most current CPUs since that gets close to bloating the icache, but CPUs are also getting deeper with wider decode. If we could have some kind of …
Does that pass handle 16-, 32- and 64-bit values as well? I think the goal here should be good general-purpose perf, since there is an override for the unroll factor as a parameter. If we trip a byte-specific optimization but give up vectorized handling of values of other common widths, then whether this is a performance gain becomes codebase-specific. From some test code in godbolt, it looks like code in this pattern can trip the memset detection in some other way, at least on clang trunk, but something, either Clang or LLVM, doesn't seem to bother for lower-width operations. Overall, I'd prefer to have the unroll control in there even if we set it to 1 by default.
The `InlineArray(fill=...)` constructor loops over the initialization of every pointee item in the array at compile time. This makes the compiler really slow when the size of the inline array is >2k, and it even emits a warning when the size exceeds 64k.

IMO, it's not useful to have 64k `init_pointee_copy` lines when we can unroll in batches, keeping the compiler fast and the actual runtime still fast.
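Putting the review threads together, a rough sketch of the resulting constructor shape (my reconstruction from the diff hunks above; the default of 64, `unsafe_ptr`, and the exact pointer plumbing are assumptions, not the merged code):

```mojo
# Inside `struct InlineArray[ElementType, size]`; `align_down` imported from `math`.
@always_inline
@implicit
fn __init__[batch_size: Int = 64](out self, fill: Self.ElementType):
    # ... existing construction checks elided ...
    alias full = align_down(size, batch_size)
    var ptr = self.unsafe_ptr()
    # Runtime loop over whole batches: the emitted IR grows with batch_size,
    # not with size, so a 64k-element fill no longer unrolls into 64k copies.
    for base in range(0, full, batch_size):
        @parameter
        for j in range(batch_size):
            (ptr + base + j).init_pointee_copy(fill)
    # Compile-time remainder: fewer than batch_size trailing elements.
    @parameter
    for j in range(full, size):
        (ptr + j).init_pointee_copy(fill)
```

The batch size stays overridable, so callers who know their element width can pick a power of two that maps onto whole vector stores, per the discussion above.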