Engine: Dynamically update maximum stack size close to overflow (#6052)
The Python interpreter maintains a stack of frames while executing code, and this stack has a limit. As soon as adding a frame would exceed this limit, a `RecursionError` is raised. Note that, despite what the name suggests, the cause does not necessarily involve recursion, although that is a common trigger; simply creating a deep but non-recursive call stack has the same effect.

This `RecursionError` was routinely hit when submitting large numbers of workflows to the daemon that call one or more process functions. The reason is that a process function is called synchronously in an async context, namely the workchain, which is executed as a task on the event loop of the `Runner` in the daemon worker. To make this possible, the event loop has to be made reentrant, but this is not supported by vanilla `asyncio`. `plumpy` circumvents this limitation through the use of `nest-asyncio`, which makes a running event loop reentrant. The problem is that when the event loop is reentered, instead of creating a separate stack for that task, it reuses the current one. Consequently, each process function adds frames to the current stack that are not resolved and removed until after its execution has finished. If many process functions are started before any finish, these frames accumulate and can ultimately hit the stack limit. Since the task queue of the event loop is a FIFO, this situation arose very often: all process function tasks would be created first, before any were finalized.

Since an actual solution for this problem is not trivial and it is causing a lot of problems, a temporary workaround is implemented. Each time a process function is executed, the current stack size is compared to the current stack limit. If the stack is more than 80% full, the limit is increased by 1000 and a warning message is logged. This should give the created process function tasks some more leeway to be resolved.
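The workaround described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the function name, the `threshold` and `increment` parameters, and the logging call are illustrative, with the 80% threshold and the increment of 1000 taken from the description above.

```python
import logging
import sys

logger = logging.getLogger(__name__)


def ensure_stack_headroom(threshold: float = 0.8, increment: int = 1000) -> None:
    """Raise the recursion limit if the stack is more than ``threshold`` full.

    Illustrative sketch: measure the current stack depth, compare it to the
    interpreter's recursion limit, and bump the limit by ``increment`` with a
    logged warning when the stack is almost exhausted.
    """
    # Measure the current stack depth by walking the frame objects directly.
    frame = sys._getframe()
    depth = 0
    while frame is not None:
        depth += 1
        frame = frame.f_back

    limit = sys.getrecursionlimit()
    if depth > threshold * limit:
        sys.setrecursionlimit(limit + increment)
        logger.warning(
            'Stack %d/%d is over %.0f%% full: recursion limit raised to %d.',
            depth, limit, threshold * 100, limit + increment,
        )
```

Note that `sys.setrecursionlimit` only changes the interpreter's bookkeeping limit: as the text below explains, raising it repeatedly can eventually overflow the real machine stack.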
Note that the workaround will keep increasing the limit as necessary, which can and eventually will lead to an actual stack overflow in the interpreter. When that happens is machine dependent, so it is difficult to impose an absolute limit. The function that determines the stack size uses a custom implementation instead of the naive `len(inspect.stack())`, because its performance is three orders of magnitude better and it scales well for deep stacks, which are typical for AiiDA daemon workers. See https://stackoverflow.com/questions/34115298 for a discussion of the implementation and its performance.
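A simplified version of such a frame-walking stack counter, in the spirit of the Stack Overflow discussion linked above, could look like this (the function name is illustrative; the actual implementation may differ, e.g. by unrolling the `f_back` chain for extra speed):

```python
import sys


def get_stack_size() -> int:
    """Return the depth of the caller's call stack.

    Follows the ``f_back`` links of the frame objects instead of calling
    ``len(inspect.stack())``, which constructs a full ``FrameInfo`` record
    (including source context) for every frame and is therefore far slower
    on deep stacks.
    """
    frame = sys._getframe(1)  # Start at the caller's frame, skipping our own.
    size = 0
    while frame is not None:
        size += 1
        frame = frame.f_back
    return size
```

Since frame objects already exist, this only pays the cost of pointer traversal, which is why it scales well to the deep stacks seen in daemon workers.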