
Prod Incident - Runner and Worker Thread crashing due to Memory Issues #437

Closed
darunrs opened this issue Nov 30, 2023 · 1 comment
Labels: bug, component: Runner

Comments


darunrs commented Nov 30, 2023

@gabehamilton reached out to report that a historical process was not being kicked off despite the coordinator process succeeding. Looking into it, we realized that the worker thread had died at some point, and there may have been others that died as well. While investigating further, we restarted the runner. This led to a larger issue: unprocessed stream messages no longer displayed in Grafana, an indication that the worker threads were not working correctly. The dev environment faced no such issues at any point.

Further investigation led to the discovery that runner was repeatedly crashing and re-initializing due to a fatal error caused by running out of memory, along with similar errors at the worker level. Despite the repeated crashing and reinitializing, social_feed somehow managed to keep running and near.org never fell back to using Near Social (Woohoo!).

The memory errors were strange, as the machine itself never ran out of free memory. Gabe also shared that it's possible blocks have grown larger due to new information being packaged into blocks by a process that started recently. Following this line of thinking, we reduced the block prefetch queue size from 100 down to 10. Unfortunately, this did not resolve the problem.

Another possibility was that the number of worker threads being spun up was the problem. We checked how many streams were present (each of which gets a worker thread, and any backed-up stream would also max out its prefetch queue due to having too many messages). Comparing dev to prod, we found that dev had 79 streams whereas prod had 170. If each worker is allocated its own memory range to use as its heap, this could lead to memory problems.
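As a rough illustration of the math here, a minimal sketch (using Node's worker_threads API, with assumed sizes and a hypothetical worker entry point, not Runner's actual configuration) of why worker count times prefetch queue depth adds up, and how a per-worker heap cap could be set via resourceLimits:

```typescript
// Illustrative only: the sizes and the worker entry point below are assumptions,
// not Runner's actual configuration.
import { Worker } from 'worker_threads';

const STREAM_COUNT = 170;          // streams observed in prod during the incident
const PREFETCH_QUEUE_SIZE = 10;    // reduced from 100 while debugging
const AVG_BLOCK_SIZE_MB = 5;       // assumed average; we still need instrumentation for this
const WORKER_HEAP_LIMIT_MB = 256;  // assumed per-worker cap

// Worst case if every backed-up stream fills its prefetch queue.
const worstCaseMb = STREAM_COUNT * PREFETCH_QUEUE_SIZE * AVG_BLOCK_SIZE_MB;
console.log(`worst-case prefetched block data: ~${worstCaseMb} MB`); // ~8500 MB with these numbers

// Node can cap each worker thread's heap independently; exceeding the cap
// terminates that worker with ERR_WORKER_OUT_OF_MEMORY instead of taking
// down the whole Runner process.
function startStreamWorker(streamKey: string): Worker {
  return new Worker('./worker.js', {  // hypothetical compiled worker.ts entry point
    workerData: { streamKey },
    resourceLimits: { maxOldGenerationSizeMb: WORKER_HEAP_LIMIT_MB },
  });
}
```

With numbers like these, prod's 170 streams would dwarf dev's 79 even before per-worker heap overhead is counted.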

We decided to start by pruning the stream set, removing stream keys from the set as we went. The repeated crashing of runner would help indicate whether we were on the right track: each time runner reinitializes, it polls the stream set and creates workers, so as we whittled down the set we would eventually reach a number low enough for the memory problems to stop. We began by removing all historical streams. By the end of it, the runner itself had stopped crashing; some workers were still crashing without taking down runner, but that too ceased.
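A minimal sketch of the kind of manual pruning we did, assuming an ioredis client, a hypothetical "streams" set key holding the stream keys, and a naming convention where historical stream keys contain "historical" (the real key names in Runner may differ):

```typescript
import Redis from 'ioredis';

const redis = new Redis();

async function pruneHistoricalStreams(): Promise<void> {
  // Fetch every stream key currently registered in the set Runner polls.
  const streamKeys = await redis.smembers('streams');
  for (const key of streamKeys) {
    // Assumed convention: historical stream keys contain "historical".
    if (key.includes('historical')) {
      await redis.srem('streams', key); // drop it from the set
      console.log(`removed ${key}; watch whether Runner keeps crashing`);
    }
  }
}

pruneHistoricalStreams().finally(() => redis.disconnect());
```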

We verified that indexers had begun to reduce their backed-up real-time queues. We then deleted the stream set entirely and restarted runner. Coordinator adds streams to the set when a matching block height is found, ensuring any streams in the set belong to active indexers. We quickly verified that historical processes were being created and consumed correctly. We did notice that the unprocessed message count did not match Redis' actual count, most likely due to a bug in the ID incrementing.
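For reference, a correct incrementId for Redis stream IDs of the form `<millisecondTime>-<sequenceNumber>` would bump only the sequence part. The sketch below shows the expected behavior; it is illustrative, not the actual Runner code:

```typescript
// Redis stream IDs look like "1701300000000-5" (timestamp-sequence).
// If the timestamp part were incremented instead of the sequence, any
// remaining entries sharing that timestamp would be skipped, which would
// explain the unprocessed count drifting from Redis' actual value.
function incrementId(lastId: string): string {
  const [timestamp, sequence] = lastId.split('-');
  return `${timestamp}-${BigInt(sequence) + 1n}`;
}

// Example: reading strictly after the last processed entry.
// incrementId('1701300000000-5') === '1701300000000-6'
```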

Regardless of whether the issue was the number of workers, the data in each worker, a combination of both, or neither, there is a scaling issue that needs to be addressed. It would not be difficult to encounter these issues again as the indexer count grows. There are short-term hot fixes that could be implemented to delay the problem if necessary, but a long-term decision needs to be made on how to scale past whatever bottleneck is causing the memory issue. In addition, if we increase the prefetch queue again to something like 30, we'll also eventually face a storage problem.

While debugging the issue, I produced the below list of problems to investigate and a variety of proposed improvements.

Topics for Investigation:

  • Verify that historical processes succeeding in coordinator but not in runner is a result of a crashed worker thread.
  • Investigate worker threads crashing due to Error [ERR_WORKER_OUT_OF_MEMORY]: worker terminated due to reaching memory limit (Runner stream handlers fail due to OUT_OF_MEMORY #551); see the sketch after this list.
  • Investigate Runner crashing due to FATAL ERROR: NewSpace::Rebalance Allocation failed - JavaScript heap out of memory
  • Verify that skipped messages in processing are due to incrementId and fix the problem
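For the two crash modes above, a supervision hook along these lines (assuming Node's worker_threads API; names are illustrative) could distinguish a worker hitting its memory limit from other failures, and doubles as the natural place to wire up the worker revival idea listed below:

```typescript
import { Worker } from 'worker_threads';

function superviseWorker(worker: Worker, streamKey: string): void {
  worker.on('error', (err: Error) => {
    const code = (err as NodeJS.ErrnoException).code;
    if (code === 'ERR_WORKER_OUT_OF_MEMORY') {
      // The worker hit its resourceLimits cap; Runner itself stays up.
      console.error(`worker for ${streamKey} terminated: memory limit reached`);
    } else {
      console.error(`worker for ${streamKey} crashed:`, err);
    }
  });

  worker.on('exit', (code) => {
    if (code !== 0) {
      // Candidate spot to re-create the worker (revival process).
      console.error(`worker for ${streamKey} exited with code ${code}`);
    }
  });
}
```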

Proposed Improvements to Implement:

  • Investigate and implement solution for worker memory issue (e.g. worker config)
  • Investigate and implement solution for Runner memory issue
  • Log block heights and indexer type (real-time or historical) with any logs output by worker.ts
  • Update incrementId to correctly increment stream message IDs to avoid skipping messages
  • Add instrumentation around average block size and/or queue size per worker thread
  • Remove resources when indexer is deleted (Delete all resources when deleting indexer #346)
  • End worker threads if stream key is no longer present (Maybe part of above)
  • Replace xRange with xLen for getting stream size (see the sketch after this list)
  • Place instrumentation and alarms around whatever resource leads to the memory problems outlined above
  • Add a worker revival process if worker thread crashes for any reason
  • Increase prefetch queue size to a reasonable number
  • Reduce prefetch queue size to a smaller number for failing indexers (reduces overall memory footprint)
  • Consider and implement a solution for failing indexers with backed-up real-time streams (e.g. bucanero.near/nft_v3 with nearly 2.8M messages and growing)
  • Figure out a way to prevent error stack traces from being broken apart in google cloud logs due to parallel log messages
  • Reduce latency for Prod indexers (e.g. flatirons.near/demo_blockheight as a baseline since it's simple)
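For the xRange to xLen swap, a sketch of the difference, assuming an ioredis client:

```typescript
import Redis from 'ioredis';

const redis = new Redis();

// Old approach: XRANGE pulls every entry back just to count it, which is
// exactly what we don't want on a backed-up stream with millions of messages.
async function streamSizeViaRange(streamKey: string): Promise<number> {
  const entries = await redis.xrange(streamKey, '-', '+');
  return entries.length;
}

// Cheaper approach: XLEN returns the count server-side, with no payloads transferred.
async function streamSizeViaLen(streamKey: string): Promise<number> {
  return redis.xlen(streamKey);
}
```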


darunrs commented Jul 26, 2024

All action items ended up completed. I can close this ticket now.

darunrs closed this as completed on Jul 26, 2024
darunrs changed the title from "Runner and Worker Thread crashing due to Memory Issues" to "Prod Incident - Runner and Worker Thread crashing due to Memory Issues" on Aug 1, 2024