
[stdlib] Optimize compile times by preventing huge loop unrolling when filling inline arrays. #4046

Open
wants to merge 8 commits into main from inlinearray-compiler-slowness
Conversation

msaelices
Contributor

The InlineArray(fill=...) constructor initializes every pointee item in the array in a compile-time (@parameter) loop.

This makes the compiler really slow when the size of the inline array is >2k, and it even emits a warning if the size is bigger than 64k:

[screenshot of the compiler warning]

IMO, it isn't useful to emit 64k init_pointee_copy lines when we can unroll in batches, keeping the compiler fast while the actual runtime stays just as fast.
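A rough sketch of the batched approach (not the exact diff in this PR; `_ptr_to_element` is a hypothetical stand-in for however the constructor obtains the address of element i):

```mojo
fn __init__[batch_size: Int = 64](out self, fill: Self.ElementType):
    # Number of elements covered by full batches (size rounded down to a
    # multiple of batch_size).
    alias full_batches_end = (size // batch_size) * batch_size

    # Runtime loop over batches: the emitted IR stays small even for very
    # large arrays, instead of `size` copies of init_pointee_copy.
    for base in range(0, full_batches_end, batch_size):
        # Only the fixed-size inner body is unrolled at compile time.
        @parameter
        for offset in range(batch_size):
            _ptr_to_element(self, base + offset).init_pointee_copy(fill)

    # Tail: the remaining size % batch_size elements.
    for i in range(full_batches_end, size):
        _ptr_to_element(self, i).init_pointee_copy(fill)
```

Setting batch_size = size recovers the original fully unrolled behavior, while batch_size = 1 turns the fill into a plain runtime loop.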

@owenhilyard
Contributor

Could we make unroll_threshold into a parameter that's defaulted to something a bit more reasonable, like 64? That way people can customize it, since unrolling 1000 iterations is already icache pollution.

@parameter
-for i in range(size):
+for i in range(unrolled):
Contributor

We probably want to unroll the inner loop, not the outer one.
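That is, keep the compile-time unrolling on the fixed-size inner range and let the outer loop over batches run at runtime. A sketch of the difference (element initialization elided):

```mojo
# Outer loop unrolled (what the diff above appears to do): emits
# `unrolled` copies of a small runtime loop.
@parameter
for batch in range(unrolled):
    for i in range(unroll_threshold):
        pass  # init element batch * unroll_threshold + i

# Inner loop unrolled (suggested): a single small runtime loop whose body
# is `unroll_threshold` straight-line initializations.
for batch in range(unrolled):
    @parameter
    for i in range(unroll_threshold):
        pass  # init element batch * unroll_threshold + i
```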


@@ -140,8 +140,21 @@ struct InlineArray[
_inline_array_construction_checks[size]()
__mlir_op.`lit.ownership.mark_initialized`(__get_mvalue_as_litref(self))

+alias unroll_threshold = 1000
+alias unrolled = size // unroll_threshold
Contributor

math.align_down makes intent more clear.
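i.e., something along these lines, assuming the `math.align_down(value, alignment)` helper the reviewer is referring to:

```mojo
from math import align_down

# Same value as (size // batch_size) * batch_size, but the name states the
# intent: round `size` down to the nearest multiple of `batch_size`.
alias full_batches_end = align_down(size, batch_size)
```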


Unroll the inner loop

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
… a parameter

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
@msaelices msaelices force-pushed the inlinearray-compiler-slowness branch from 57c599e to edadb98 on March 3, 2025 20:28
@msaelices
Contributor Author

> Could we make unroll_threshold into a parameter that's defaulted to something a bit more reasonable, like 64? That way people can customize it, since unrolling 1000 iterations is already icache pollution.

Done: msaelices@edadb98

@msaelices msaelices requested a review from owenhilyard March 3, 2025 20:29
@msaelices
Contributor Author

@owenhilyard Thanks for the review. Could you please take another look?

@@ -131,7 +132,7 @@ struct InlineArray[

@always_inline
@implicit
-fn __init__(out self, fill: Self.ElementType):
+fn __init__[batch_size: Int = 100](out self, fill: Self.ElementType):
Contributor

I think this should be a power of 2 in order to keep the chunks aligned well. 32 or 64 is good enough to not bloat the icache but still be fast, especially since processing the loop variable will happen in parallel with the vector ops on any superscalar processor, which is most of them at this point.
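For concreteness (rough arithmetic, assuming the fill lowers to plain vector stores): with Int8 elements, batch_size = 64 puts every batch at a multiple-of-64 offset (0, 64, 128, ...) and makes each batch exactly one 512-bit store on AVX512 or two 256-bit stores on AVX2, whereas batch_size = 100 places batches at offsets 0, 100, 200, ..., which straddle vector-width boundaries.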

Contributor

If it's an inline array of bytes, then 100 will need some really odd instruction sequences, especially on AVX512, since it will move 64 bytes, then 32, then 4.


Signed-off-by: Manuel Saelices <msaelices@gmail.com>
@msaelices msaelices requested a review from owenhilyard March 3, 2025 21:48
@soraros
Contributor

soraros commented Mar 6, 2025

FTR, this is another case for open source algorithm.functional.

@JoeLoser
Collaborator

@weiweichen @npanchen curious to get your thoughts on this PR when you get a chance

@JoeLoser
Collaborator

> FTR, this is another case for open source algorithm.functional.

Stay tuned my friend 😄

@weiweichen
Contributor

> @weiweichen @npanchen curious to get your thoughts on this PR when you get a chance

I don't think I have a strong opinion on this. You trade compilation time for run time, and sometimes unrolling more gives you more performant code, sometimes it doesn't; it depends on how much additional optimization can be achieved by unrolling more.

@npanchen
Contributor

> @weiweichen @npanchen curious to get your thoughts on this PR when you get a chance

Specifically to this PR, having a partial unroll is definitely better than a full unroll. I'm curious why 64 has been chosen, as that unroll factor seems too large, especially if the HW has a loop-stream-detector-like feature. @msaelices?

But generally, having any unrolling at compile time in a parameterized fill function is a really tricky question. For example, unrolling can be harmful when the dtype is Int8, as it disables the memset recognition that LLVM has: LoopIdiomPass expects a non-unrolled loop, and LLVM has no re-roll (yet).
Obviously, there are cases where a partial unroll is still beneficial. A full unroll when the trip count is small is definitely worth it.
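As a sketch of the shape LoopIdiomPass is looking for (whether it actually fires depends on how the Mojo loop lowers to LLVM IR; `_ptr_to_element` is the same hypothetical helper as above):

```mojo
# With batch_size == 1 the batched constructor degenerates to the plain
# fill loop. For a byte-sized ElementType this is the form LLVM's
# LoopIdiomPass can replace with a single memset; once the body is
# unrolled into several stores per iteration, that recognition no longer
# applies, and LLVM has no pass to re-roll the loop.
for i in range(size):
    _ptr_to_element(self, i).init_pointee_copy(fill)
```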

@owenhilyard
Contributor

owenhilyard commented Mar 12, 2025

> @weiweichen @npanchen curious to get your thoughts on this PR when you get a chance

> Specifically to this PR, having a partial unroll is definitely better than a full unroll. I'm curious why 64 has been chosen, as that unroll factor seems too large, especially if the HW has a loop-stream-detector-like feature. @msaelices?

64 was my choice. It was mostly chosen because 64 bytes is one AVX512 operation, meaning fast creation of byte arrays initialized to a single value without relying on the compiler to unroll the loop further. That's balanced against larger types, where 64 instances may take multiple instructions to fill per loop iteration. In that case, even with 8-byte pointers, one can still expect 8 items per iteration, which is a bit over optimal on most current CPUs since it gets close to bloating the icache, but CPUs are also getting deeper with wider decode. If we could have some kind of TrivialCopy marker trait, it would be more doable to differentiate the case of byte-copy initialization from needing to run a full copy constructor. Right now, I'm leaning towards expecting mostly SIMD sub-types and pointers in inline arrays.
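Rough arithmetic behind that choice: 64 Int8 elements are 64 bytes, i.e. one 512-bit AVX512 store per batch, while 64 8-byte pointers are 512 bytes, i.e. eight 64-byte vector stores per batch (eight pointers per store).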

> But generally, having any unrolling at compile time in a parameterized fill function is a really tricky question. For example, unrolling can be harmful when the dtype is Int8, as it disables the memset recognition that LLVM has: LoopIdiomPass expects a non-unrolled loop, and LLVM has no re-roll (yet). Obviously, there are cases where a partial unroll is still beneficial. A full unroll when the trip count is small is definitely worth it.

Does that pass handle 16-, 32-, and 64-bit values as well? I think the goal here should be good general-purpose perf, since there is an override for the unroll factor as a parameter. If we trip a byte-specific optimization but give up vectorized handling of values of other common widths, then whether this is a performance gain becomes codebase-specific. From some test code in Godbolt, it looks like code in this pattern can trip the memset detection in some other way, at least on clang trunk, but something, either Clang or LLVM, doesn't seem to bother for lower-width operations.

Overall, I'd prefer to have the unroll control in there even if we set it to 1 by default.
