-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
compiler stack usage improvements #6239
Conversation
I found some code in the wild that needs 9mb of stack to lower. It's a pain to even diagnose the problem definitively, because it requires plumbing platform-specific linker flags to grant more stack. This commit: - Reduces peak stack usage of similar code in the repo (the FFT) - Increases the stack size for lowering and codegen to 32mb on all platforms, using stack switching techniques. We started doing this on Windows a while ago and it hasn't bitten us, so let's try on more platforms. - Gives user control over the amount of stack used for lowering and codegen. It shouldn't be necessary except when diagnosing problems like this in future. Using the control I was able to determine that the correctness tests all pass with 500k of stack, and the apps all pass with 1MB, so 32MB ought to be enough for anybody. I found a never-checked-in test for the mux helper which uses 10MB of stack and really shouldn't need to, so I added that (and opened an issue) as an example of how to grant more stack when necessary, even though 10MB is less than our default now. Also fixed an incorrect comment on the Block node.
For the pathological case found in the wild, this reduces stack usage to 3-4 MB, so this PR is not just about increasing the stack size. |
src/AsyncProducers.cpp
Outdated
@@ -457,28 +457,51 @@ class TightenProducerConsumerNodes : public IRMutator { | |||
|
|||
Stmt make_producer_consumer(const string &name, bool is_producer, Stmt body, const Scope<int> &scope) { | |||
if (const LetStmt *let = body.as<LetStmt>()) { | |||
if (expr_uses_vars(let->value, scope)) { | |||
return ProducerConsumer::make(name, is_producer, body); | |||
Stmt orig = body; // Only used to keep a reference to the let chain in scope. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but why is that necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated comment. It's because 'body' may be the only reference-counted Stmt keeping the first LetStmt alive, but we're mutating body to point to its innards. We don't want that first LetStmt to have its refcount hit zero and make our pointer dangle.
Can this not be diagnosed with |
It depends. For the person with the problem it couldn't be done without first messing with system settings to give the initial shell more than 8MB of stack to begin with, because ulimit only goes down. |
On my system at least I start with an 8MB stack, but I can increase it via |
IIRC that did not work for the reporter and me (possibly because their build system complicates the issue by having an already-launched persistent daemon). Either way, it's good to have a mechanism to let the compiler use more stack that works on every platform. |
Sorry, I didn't mean to argue against the PR with that question. It was just for clarification. |
Hmm, looks like we're seeing a real failure on the new test with LLVM 14 on Windows. The macos failure is unrelated |
Maybe 10mb is too tight for windows. I bumped it to 12 and we'll see if it passes. |
src/SkipStages.cpp
Outdated
@@ -78,37 +78,48 @@ class PredicateFinder : public IRVisitor { | |||
op->body.accept(this); | |||
if (should_pop) { | |||
varying.pop(op->name); | |||
//internal_assert(!expr_uses_var(predicate, op->name)); | |||
// internal_assert(!expr_uses_var(predicate, op->name)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should either remove this line or add a comment about why we aren't doing this check.
src/UnrollLoops.cpp
Outdated
@@ -83,6 +83,11 @@ class UnrollLoops : public IRMutator { | |||
Stmt iters; | |||
for (int i = e->value - 1; i >= 0; i--) { | |||
Stmt iter = substitute(for_loop->name, for_loop->min + i, body); | |||
// It's necessary to simplify eagerly this iteration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "eagerly simplify"
src/Util.cpp
Outdated
void *stack = mmap(nullptr, stack_size.size + guard_band, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); | ||
internal_assert(stack); | ||
|
||
mprotect((char *)stack + stack_size.size, guard_band, PROT_NONE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we check the return value?
Looks like we're getting segfaults on osx-arm64 |
On Arm Macos the dubious function pointer casting that caused the deprecation of these routines was indeed invalid - only the bottom 32-bits made it through the variadic call. I switched it to pass args in a thread local instead. Should be safer. |
Looks like correctness_unroll_huge_mux is segfaulting on llvm14-x86-64-windows |
I think it's time to CRD into the windows buildbot and just binary search our way via the env-var to the right stack size for the mux test case. |
Windows is currently passing. |
Oh, good. I saw a Windows failure earlier, but maybe it was re-run? |
I found some code in the wild that needs 9mb of stack to lower. It's a
pain to even diagnose the problem definitively, because it requires
plumbing platform-specific linker flags though the build system to
grant more stack to rule out infinite recursion.
This PR:
platforms, using stack switching techniques. We started doing this on
Windows a while ago and it hasn't bitten us, so let's try on more
platforms.
codegen. It shouldn't be necessary except when diagnosing problems like
this in future.
Using the control I was able to determine that the correctness tests all
pass with 500k of stack, and the apps all pass with 2MB, so 32MB ought
to be enough for anybody.
I found a never-checked-in test for the mux helper which uses 10MB of
stack and really shouldn't need to, so I added that (and opened an
issue) as an example of how to grant more stack when necessary, even
though 10MB is less than our default now.
Also fixed an incorrect comment on the Block node.