
Engine: Dynamically update maximum stack size close to overflow #6052

Merged

Conversation

@sphuber (Contributor) commented Jun 8, 2023

Temporary workaround for #4876

The Python interpreter maintains a stack of frames when executing code, and this stack has a limit. As soon as a frame is added that would exceed this limit, a `RecursionError` is raised. Note that, despite the name, the cause does not necessarily involve recursion, although that is a common trigger; simply creating a deep but non-recursive call stack has the same effect.
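For illustration, a minimal snippet showing the mechanism; recursion is simply the quickest way to build a deep stack in a few lines, but any equally deep call stack trips the same limit:

```python
import sys

def nest(n):
    """Add one frame to the stack per call until n reaches zero."""
    if n:
        nest(n - 1)

sys.setrecursionlimit(100)  # the default is typically 1000

try:
    nest(200)
except RecursionError as exc:
    print(exc)  # "maximum recursion depth exceeded"
```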

This `RecursionError` was routinely hit when submitting large numbers of workflows to the daemon that call one or more process functions. The reason is that the process function is called synchronously in an async context, namely the workchain, which is executed as a task on the event loop of the `Runner` in the daemon worker. To make this possible, the event loop has to be made reentrant, but vanilla `asyncio` does not support this. The restriction is circumvented in `plumpy` through the use of `nest-asyncio`, which makes a running event loop reentrant.

The problem is that when the event loop is reentered, instead of creating a separate stack for that task, it reuses the current one. Consequently, each process function adds frames to the current stack that are not resolved and removed until after its execution has finished. If many process functions are started before they finish, these frames accumulate and can ultimately hit the stack limit. Since the task queue of the event loop is a FIFO, this situation arose very often: all process function tasks would be created first, before any were finalized.
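A toy sketch (not AiiDA code) of how re-entering a running loop with `nest-asyncio` stacks the inner task's frames on top of the outer ones instead of starting a fresh stack:

```python
import asyncio
import inspect

import nest_asyncio  # third-party package, used by plumpy

nest_asyncio.apply()  # patch asyncio so a running loop can be re-entered

async def task(depth):
    print(f'task {depth}: stack is {len(inspect.stack())} frames deep')
    if depth < 3:
        # Synchronous call into the running loop, analogous to a workchain
        # calling a process function: the inner task reuses the current
        # stack, so its frames pile up on top of ours until it completes.
        asyncio.get_event_loop().run_until_complete(task(depth + 1))

asyncio.get_event_loop().run_until_complete(task(0))
```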

Since an actual solution for this problem is not trivial and the current behavior is causing a lot of problems, a temporary workaround is implemented. Each time a process function is executed, the current stack size is compared to the current stack limit. If the stack is more than 80% full, the limit is increased by 1000 and a warning message is logged. This should give the created process function tasks more leeway to be resolved.
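In outline, the workaround amounts to something like the following sketch; the names `ensure_stack_headroom`, `threshold` and `increment` are illustrative, not necessarily those used in the actual change:

```python
import inspect
import logging
import sys

LOGGER = logging.getLogger(__name__)

def ensure_stack_headroom(threshold=0.8, increment=1000):
    """Raise the recursion limit if the stack is more than ``threshold`` full.

    Hypothetical helper mirroring the workaround described above; it would
    be called once at the start of every process function execution.
    """
    limit = sys.getrecursionlimit()
    depth = len(inspect.stack())  # naive; an optimized version is discussed below
    if depth > threshold * limit:
        sys.setrecursionlimit(limit + increment)
        LOGGER.warning(
            'stack %s/%s exceeds %s%% of the limit: limit raised to %s',
            depth, limit, int(threshold * 100), limit + increment,
        )
```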

Note that the workaround will keep increasing the limit as necessary, which can and eventually will lead to an actual stack overflow in the interpreter. When this happens is machine dependent, so it is difficult to impose an absolute limit.

@unkcpz (Member) left a comment


Tested and it works for me!

@unkcpz (Member) commented Jun 9, 2023

When a process finishes, will the recursion limit decrease again? Or does it keep accumulating while the daemon is running and only reset after a restart? Maybe it would be safer to check and set the value back after the process is finished?

@sphuber (Contributor, Author) commented Jun 9, 2023

> When a process finishes, will the recursion limit decrease again? Or does it keep accumulating while the daemon is running and only reset after a restart? Maybe it would be safer to check and set the value back after the process is finished?

No, it will keep the increased recursion limit until the worker restarts. I thought about resetting it, but I don't really see the point: if it can run with a given limit once, it should be fine to keep running with it, so resetting is not really adding anything positive. The only thing it would add is a potential "ping-pong" between increasing and then lowering the limit.

@sphuber (Contributor, Author) commented Jun 15, 2023

One question is the performance hit of calling `len(inspect.stack())`, which will be called once for each process function that is executed. This answer on Stack Overflow (https://stackoverflow.com/questions/34115298) suggests that it costs at least on the order of 1 ms, and that there are manual implementations of more complex algorithms that can reduce the cost by an order of magnitude.

The timings were done at a stack depth of 50, and each implementation scales differently. Since we will typically be running at stack depths considerably larger than 50, it may be best to pick the algorithm that scales best for large stacks.

From the analysis, it seems that `stack_size3a` may provide the best scaling for large stacks:

```python
import sys
from itertools import count

def stack_size3a(size=2):
    """Get stack size for caller's frame."""
    frame = sys._getframe(size)
    try:
        # Walk the stack eight frames at a time until the chain of
        # ``f_back`` attributes runs off the end and raises AttributeError.
        for size in count(size, 8):
            frame = frame.f_back.f_back.f_back.f_back.f_back.f_back.f_back.f_back
    except AttributeError:
        # Count the remaining frames one at a time.
        while frame:
            frame = frame.f_back
            size += 1
        return size - 1
```

Are we willing to use this implementation, which accesses the private `sys._getframe`? The performance increase is significant, but maybe a few milliseconds for the `len(inspect.stack())` call is negligible compared to the actual overhead of a process function anyway, which would make this an unnecessary optimization.
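For reference, a quick and machine-dependent micro-benchmark along these lines could compare the two approaches at the current depth (illustrative, not part of the PR):

```python
import inspect
import timeit

# ``stack_size3a`` is the function defined in the snippet above.
n = 1000
t_naive = timeit.timeit(lambda: len(inspect.stack()), number=n)
t_fast = timeit.timeit(lambda: stack_size3a(), number=n)
print(f'len(inspect.stack()): {t_naive / n * 1e6:8.1f} us/call')
print(f'stack_size3a():       {t_fast / n * 1e6:8.1f} us/call')
```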

@sphuber sphuber added this to the v2.4.0 milestone Jun 15, 2023
@giovannipizzi (Member) commented
Why don't we define our own `get_stack_size` with the optimised implementation above, and explain in the docstring that it's 3 orders of magnitude faster, but that if accessing the private element ever becomes an issue in the future, it can be replaced with the one-line slower option?
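Such a helper might look roughly like the sketch below, reusing the `stack_size3a` implementation above together with the suggested docstring note; it is not necessarily the exact code that was merged:

```python
import sys
from itertools import count

def get_stack_size(size=2):
    """Return the stack size for the caller's frame.

    Roughly three orders of magnitude faster than the one-liner
    ``len(inspect.stack())``, but it relies on the private
    ``sys._getframe``. If accessing that private element ever becomes a
    problem, the body can be replaced with the slower one-liner.
    """
    frame = sys._getframe(size)
    try:
        # Walk the stack eight frames at a time until the chain of
        # ``f_back`` attributes runs off the end and raises AttributeError.
        for size in count(size, 8):
            frame = frame.f_back.f_back.f_back.f_back.f_back.f_back.f_back.f_back
    except AttributeError:
        # Count the remaining frames one at a time.
        while frame:
            frame = frame.f_back
            size += 1
        return size - 1
```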

@sphuber (Contributor, Author) commented Jun 16, 2023

Yeah, works for me. Will add it.

@sphuber sphuber force-pushed the fix/dynamically-increase-recursion-limit branch 3 times, most recently from 0f3c4b3 to 354137e Compare June 18, 2023 12:44
@sphuber sphuber requested a review from unkcpz June 18, 2023 12:45
@sphuber (Contributor, Author) commented Jun 18, 2023

Ok, I have finalized the implementation with the optimized `get_stack_size` function. I have tested it locally and it seems to work. @unkcpz could you have one final look at this version? Then we can merge it and I will prepare the release.

@sphuber sphuber force-pushed the fix/dynamically-increase-recursion-limit branch from 354137e to 6bd178d Compare June 19, 2023 19:09
@unkcpz (Member) commented Jun 20, 2023

I did the test run with 4 daemon workers again, using ~50% of the available daemon worker slots, which previously produced a lot of failing processes. It is now all fine!

@unkcpz (Member) left a comment


Thanks!

@sphuber sphuber merged commit f797b47 into aiidateam:main Jun 20, 2023
@sphuber sphuber deleted the fix/dynamically-increase-recursion-limit branch June 20, 2023 08:45