
Engine: Dynamically update maximum stack size close to overflow #6052

Merged

Conversation

@sphuber (Contributor) commented Jun 8, 2023

Temporary workaround for #4876

The Python interpreter maintains a stack of frames when executing code, and this stack has a limit. As soon as a frame is added that would exceed this limit, a `RecursionError` is raised. Note that, despite the name, the cause does not necessarily involve recursion, although that is a common trigger; simply creating a deep but non-recursive call stack has the same effect.
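For illustration, a minimal snippet showing the mechanism; recursion is simply the quickest way to build a deep stack in a few lines, but any equally deep call stack trips the same limit:

```python
import sys

def nest(n):
    """Add one frame to the stack per call until n reaches zero."""
    if n:
        nest(n - 1)

sys.setrecursionlimit(100)  # the default is typically 1000

try:
    nest(200)
except RecursionError as exc:
    print(exc)  # "maximum recursion depth exceeded"
```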

This `RecursionError` was routinely hit when submitting large numbers of workflows to the daemon that call one or more process functions. The reason is that the process function is called synchronously in an async context, namely the workchain, which is executed as a task on the event loop of the `Runner` in the daemon worker. To make this possible, the event loop has to be made reentrant, but vanilla `asyncio` does not support this. The restriction is circumvented in `plumpy` through the use of `nest-asyncio`, which makes a running event loop reentrant.

The problem is that when the event loop is reentered, instead of creating a separate stack for that task, it reuses the current one. Consequently, each process function adds frames to the current stack that are not resolved and removed until after its execution has finished. If many process functions are started before they finish, these frames accumulate and can ultimately hit the stack limit. Since the task queue of the event loop is a FIFO, this situation arose very often: all process function tasks would be created first, before any were finalized.
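A toy sketch (not AiiDA code) of how re-entering a running loop with `nest-asyncio` stacks the inner task's frames on top of the outer ones instead of starting a fresh stack:

```python
import asyncio
import inspect

import nest_asyncio  # third-party package, used by plumpy

nest_asyncio.apply()  # patch asyncio so a running loop can be re-entered

async def task(depth):
    print(f'task {depth}: stack is {len(inspect.stack())} frames deep')
    if depth < 3:
        # Synchronous call into the running loop, analogous to a workchain
        # calling a process function: the inner task reuses the current
        # stack, so its frames pile up on top of ours until it completes.
        asyncio.get_event_loop().run_until_complete(task(depth + 1))

asyncio.get_event_loop().run_until_complete(task(0))
```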

Since an actual solution for this problem is not trivial and the current behavior is causing a lot of problems, a temporary workaround is implemented. Each time a process function is executed, the current stack size is compared to the current stack limit. If the stack is more than 80% full, the limit is increased by 1000 and a warning message is logged. This should give the created process function tasks more leeway to be resolved.
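In outline, the workaround amounts to something like the following sketch; the names `ensure_stack_headroom`, `threshold` and `increment` are illustrative, not necessarily those used in the actual change:

```python
import inspect
import logging
import sys

LOGGER = logging.getLogger(__name__)

def ensure_stack_headroom(threshold=0.8, increment=1000):
    """Raise the recursion limit if the stack is more than ``threshold`` full.

    Hypothetical helper mirroring the workaround described above; it would
    be called once at the start of every process function execution.
    """
    limit = sys.getrecursionlimit()
    depth = len(inspect.stack())  # naive; an optimized version is discussed below
    if depth > threshold * limit:
        sys.setrecursionlimit(limit + increment)
        LOGGER.warning(
            'stack %s/%s exceeds %s%% of the limit: limit raised to %s',
            depth, limit, int(threshold * 100), limit + increment,
        )
```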

Note that the workaround will keep increasing the limit as necessary, which can and eventually will lead to an actual stack overflow in the interpreter. When this happens is machine dependent, so it is difficult to impose an absolute limit.

@unkcpz (Member) left a comment


Tested and it works for me!

@unkcpz (Member) commented Jun 9, 2023

When a process finishes, will the recursion limit decrease again? Or does it keep accumulating while the daemon is running and only reset after a restart? Maybe it would be safer to check and set the value back after the process is finished?

@sphuber (Contributor, Author) commented Jun 9, 2023

> When a process finishes, will the recursion limit decrease again? Or does it keep accumulating while the daemon is running and only reset after a restart? Maybe it would be safer to check and set the value back after the process is finished?

No, it will keep the increased recursion limit until the worker restarts. I thought about resetting it, but I don't really see the point: if it can run with a given limit once, it should be fine to keep running with it, so resetting is not really adding anything positive. The only thing it would add is a potential "ping-pong" between increasing and then lowering the limit.

@sphuber (Contributor, Author) commented Jun 15, 2023

One question is the performance hit of calling `len(inspect.stack())`, which will be called once for each process function that is executed. This answer on Stack Overflow (https://stackoverflow.com/questions/34115298) suggests that it costs at least on the order of 1 ms, and that there are manual implementations of more complex algorithms that can reduce the cost by an order of magnitude.

The timings were done at a stack depth of 50, and each implementation scales differently. Since we will typically be running at stack depths considerably larger than 50, it may be best to pick the algorithm that scales best for large stacks.

From the analysis, it seems that `stack_size3a` may provide the best scaling for large stacks:

```python
import sys
from itertools import count

def stack_size3a(size=2):
    """Get stack size for caller's frame."""
    frame = sys._getframe(size)
    try:
        # Walk the stack eight frames at a time until the chain of
        # ``f_back`` attributes runs off the end and raises AttributeError.
        for size in count(size, 8):
            frame = frame.f_back.f_back.f_back.f_back.f_back.f_back.f_back.f_back
    except AttributeError:
        # Count the remaining frames one at a time.
        while frame:
            frame = frame.f_back
            size += 1
        return size - 1
```

Are we willing to use this implementation, which accesses the private `sys._getframe`? The performance increase is significant, but maybe a few milliseconds for the `len(inspect.stack())` call is negligible compared to the actual overhead of a process function anyway, which would make this an unnecessary optimization.
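For reference, a quick and machine-dependent micro-benchmark along these lines could compare the two approaches at the current depth (illustrative, not part of the PR):

```python
import inspect
import timeit

# ``stack_size3a`` is the function defined in the snippet above.
n = 1000
t_naive = timeit.timeit(lambda: len(inspect.stack()), number=n)
t_fast = timeit.timeit(lambda: stack_size3a(), number=n)
print(f'len(inspect.stack()): {t_naive / n * 1e6:8.1f} us/call')
print(f'stack_size3a():       {t_fast / n * 1e6:8.1f} us/call')
```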

@sphuber sphuber added this to the v2.4.0 milestone Jun 15, 2023
@giovannipizzi (Member) commented
Why don't we define our own `get_stack_size` with the optimised implementation above, and explain in the docstring that it's 3 orders of magnitude faster, but that if accessing the private element ever becomes an issue in the future, it can be replaced with the one-line slower option?
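Such a helper might look roughly like the sketch below, reusing the `stack_size3a` implementation above together with the suggested docstring note; it is not necessarily the exact code that was merged:

```python
import sys
from itertools import count

def get_stack_size(size=2):
    """Return the stack size for the caller's frame.

    Roughly three orders of magnitude faster than the one-liner
    ``len(inspect.stack())``, but it relies on the private
    ``sys._getframe``. If accessing that private element ever becomes a
    problem, the body can be replaced with the slower one-liner.
    """
    frame = sys._getframe(size)
    try:
        # Walk the stack eight frames at a time until the chain of
        # ``f_back`` attributes runs off the end and raises AttributeError.
        for size in count(size, 8):
            frame = frame.f_back.f_back.f_back.f_back.f_back.f_back.f_back.f_back
    except AttributeError:
        # Count the remaining frames one at a time.
        while frame:
            frame = frame.f_back
            size += 1
        return size - 1
```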

@sphuber (Contributor, Author) commented Jun 16, 2023

Yeah, works for me. Will add it.

@sphuber sphuber force-pushed the fix/dynamically-increase-recursion-limit branch 3 times, most recently from 0f3c4b3 to 354137e Compare June 18, 2023 12:44
@sphuber sphuber requested a review from unkcpz June 18, 2023 12:45
@sphuber (Contributor, Author) commented Jun 18, 2023

Ok, I have finalized the implementation with the optimized `get_stack_size` function. I have tested it locally and it seems to work. @unkcpz could you have one final look at this version? Then we can merge it and I will prepare the release.

@sphuber sphuber force-pushed the fix/dynamically-increase-recursion-limit branch from 354137e to 6bd178d Compare June 19, 2023 19:09
@unkcpz (Member) commented Jun 20, 2023

I did the test run with 4 daemon workers again, using ~50% of the available daemon worker slots, which previously produced a lot of failing processes. It is now all fine!

@unkcpz (Member) left a comment


Thanks!

@sphuber sphuber merged commit f797b47 into aiidateam:main Jun 20, 2023
@sphuber sphuber deleted the fix/dynamically-increase-recursion-limit branch June 20, 2023 08:45