Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compiler stack usage improvements #6239

Merged
merged 19 commits into from
Oct 1, 2021
Merged

compiler stack usage improvements #6239

merged 19 commits into from
Oct 1, 2021

Conversation

abadams
Copy link
Member

@abadams abadams commented Sep 10, 2021

I found some code in the wild that needs 9mb of stack to lower. It's a
pain to even diagnose the problem definitively, because it requires
plumbing platform-specific linker flags though the build system to
grant more stack to rule out infinite recursion.

This PR:

  • Reduces peak stack usage of similar code in the open source repo
  • Increases the stack size for lowering and codegen to 32mb on all
    platforms, using stack switching techniques. We started doing this on
    Windows a while ago and it hasn't bitten us, so let's try on more
    platforms.
  • Gives user control over the amount of stack used for lowering and
    codegen. It shouldn't be necessary except when diagnosing problems like
    this in future.

Using the control I was able to determine that the correctness tests all
pass with 500k of stack, and the apps all pass with 2MB, so 32MB ought
to be enough for anybody.

I found a never-checked-in test for the mux helper which uses 10MB of
stack and really shouldn't need to, so I added that (and opened an
issue) as an example of how to grant more stack when necessary, even
though 10MB is less than our default now.

Also fixed an incorrect comment on the Block node.

I found some code in the wild that needs 9mb of stack to lower. It's a
pain to even diagnose the problem definitively, because it requires
plumbing platform-specific linker flags to grant more stack.

This commit:

- Reduces peak stack usage of similar code in the repo (the FFT)
- Increases the stack size for lowering and codegen to 32mb on all
platforms, using stack switching techniques. We started doing this on
Windows a while ago and it hasn't bitten us, so let's try on more
platforms.
- Gives user control over the amount of stack used for lowering and
codegen. It shouldn't be necessary except when diagnosing problems like
this in future.

Using the control I was able to determine that the correctness tests all
pass with 500k of stack, and the apps all pass with 1MB, so 32MB ought
to be enough for anybody.

I found a never-checked-in test for the mux helper which uses 10MB of
stack and really shouldn't need to, so I added that (and opened an
issue) as an example of how to grant more stack when necessary, even
though 10MB is less than our default now.

Also fixed an incorrect comment on the Block node.
@abadams
Copy link
Member Author

abadams commented Sep 10, 2021

For the pathological case found in the wild, this reduces stack usage to 3-4 MB, so this PR is not just about increasing the stack size.

@@ -457,28 +457,51 @@ class TightenProducerConsumerNodes : public IRMutator {

Stmt make_producer_consumer(const string &name, bool is_producer, Stmt body, const Scope<int> &scope) {
if (const LetStmt *let = body.as<LetStmt>()) {
if (expr_uses_vars(let->value, scope)) {
return ProducerConsumer::make(name, is_producer, body);
Stmt orig = body; // Only used to keep a reference to the let chain in scope.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but why is that necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated comment. It's because 'body' may be the only reference-counted Stmt keeping the first LetStmt alive, but we're mutating body to point to its innards. We don't want that first LetStmt to have its refcount hit zero and make our pointer dangle.

src/Util.cpp Outdated Show resolved Hide resolved
@alexreinking
Copy link
Member

I found some code in the wild that needs 9mb of stack to lower. It's a
pain to even diagnose the problem definitively, because it requires
plumbing platform-specific linker flags though the build system to
grant more stack to rule out infinite recursion.

Can this not be diagnosed with ulimit on Linux?

@abadams
Copy link
Member Author

abadams commented Sep 11, 2021

I found some code in the wild that needs 9mb of stack to lower. It's a
pain to even diagnose the problem definitively, because it requires
plumbing platform-specific linker flags though the build system to
grant more stack to rule out infinite recursion.

Can this not be diagnosed with ulimit on Linux?

It depends. For the person with the problem it couldn't be done without first messing with system settings to give the initial shell more than 8MB of stack to begin with, because ulimit only goes down.

@alexreinking
Copy link
Member

It depends. For the person with the problem it couldn't be done without first messing with system settings to give the initial shell more than 8MB of stack to begin with, because ulimit only goes down.

On my system at least I start with an 8MB stack, but I can increase it via ulimit -s unlimited. This was necessary for several of the Haskell benchmarks in the Perceus paper since it would routinely use up too much stack space.

@abadams
Copy link
Member Author

abadams commented Sep 11, 2021

IIRC that did not work for the reporter and me (possibly because their build system complicates the issue by having an already-launched persistent daemon). Either way, it's good to have a mechanism to let the compiler use more stack that works on every platform.

@alexreinking
Copy link
Member

Either way, it's good to have a mechanism to let the compiler use more stack that works on every platform.

Sorry, I didn't mean to argue against the PR with that question. It was just for clarification.

@alexreinking alexreinking added the release_notes For changes that may warrant a note in README for official releases. label Sep 12, 2021
@alexreinking
Copy link
Member

alexreinking commented Sep 14, 2021

Hmm, looks like we're seeing a real failure on the new test with LLVM 14 on Windows.

The macos failure is unrelated

@abadams
Copy link
Member Author

abadams commented Sep 14, 2021

Maybe 10mb is too tight for windows. I bumped it to 12 and we'll see if it passes.

@@ -78,37 +78,48 @@ class PredicateFinder : public IRVisitor {
op->body.accept(this);
if (should_pop) {
varying.pop(op->name);
//internal_assert(!expr_uses_var(predicate, op->name));
// internal_assert(!expr_uses_var(predicate, op->name));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should either remove this line or add a comment about why we aren't doing this check.

@@ -83,6 +83,11 @@ class UnrollLoops : public IRMutator {
Stmt iters;
for (int i = e->value - 1; i >= 0; i--) {
Stmt iter = substitute(for_loop->name, for_loop->min + i, body);
// It's necessary to simplify eagerly this iteration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "eagerly simplify"

src/Util.cpp Outdated
void *stack = mmap(nullptr, stack_size.size + guard_band, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
internal_assert(stack);

mprotect((char *)stack + stack_size.size, guard_band, PROT_NONE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we check the return value?

@steven-johnson
Copy link
Contributor

Looks like we're getting segfaults on osx-arm64

@abadams
Copy link
Member Author

abadams commented Sep 16, 2021

On Arm Macos the dubious function pointer casting that caused the deprecation of these routines was indeed invalid - only the bottom 32-bits made it through the variadic call. I switched it to pass args in a thread local instead. Should be safer.

@steven-johnson
Copy link
Contributor

Looks like correctness_unroll_huge_mux is segfaulting on llvm14-x86-64-windows

@alexreinking
Copy link
Member

I think it's time to CRD into the windows buildbot and just binary search our way via the env-var to the right stack size for the mux test case.

@abadams
Copy link
Member Author

abadams commented Sep 30, 2021

Windows is currently passing.

@alexreinking
Copy link
Member

Windows is currently passing.

Oh, good. I saw a Windows failure earlier, but maybe it was re-run?

@abadams abadams merged commit 81ad45e into master Oct 1, 2021
@abadams abadams deleted the abadams/stack_size branch October 1, 2021 21:41
@alexreinking alexreinking removed the release_notes For changes that may warrant a note in README for official releases. label Apr 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants