Prod Incident - Runner and Worker Thread crashing due to Memory Issues
@gabehamilton reached out that a historical process was not being kicked off despite the coordinator process succeeding. Looking into it, we realized that the worker thread for that indexer had died at some point, and other worker threads may have died as well. While investigating further, we restarted runner. This surfaced a larger issue: unprocessed stream messages no longer displayed in Grafana, an indication that the worker threads were not running correctly. The dev environment faced no such issues at any point.
Further investigation revealed that runner was repeatedly crashing and re-initializing due to a fatal out-of-memory error, along with similar errors at the worker level (the exact errors are quoted at the end of this issue). Despite the repeated crashing and re-initializing, social_feed somehow managed to keep running and near.org never fell back to using NEAR Social (Woohoo!).
The memory errors were strange, as the machine itself never ran out of free memory. Gabe also shared that it's possible blocks have grown larger because some new information is being packaged into them by a process that started recently. Following this lead, we reduced the block prefetch queue size from 100 down to 10. Unfortunately, this did not resolve the problem.
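To make the memory math concrete, here is a minimal, hypothetical sketch of a bounded prefetch loop (none of these names come from the runner codebase). The point is that resident block data is capped at roughly the queue limit times the average block size, so larger blocks push a fixed-size queue toward the heap limit:

```typescript
// Hypothetical sketch of a bounded prefetch loop: the queue never holds more
// than PREFETCH_QUEUE_LIMIT blocks, so memory spent on prefetched blocks is
// roughly the limit multiplied by the average block size.
const PREFETCH_QUEUE_LIMIT = 10; // reduced from 100 during the incident

interface Block {
  height: number;
  payload: Uint8Array;
}

async function prefetchLoop(
  fetchBlock: (height: number) => Promise<Block>,
  startHeight: number,
  queue: Block[],
): Promise<void> {
  let height = startHeight;
  while (true) {
    if (queue.length >= PREFETCH_QUEUE_LIMIT) {
      // Wait for the consumer to drain the queue instead of buffering an
      // unbounded number of blocks in memory.
      await new Promise((resolve) => setTimeout(resolve, 100));
      continue;
    }
    queue.push(await fetchBlock(height));
    height += 1;
  }
}
```

Under that framing, dropping the limit from 100 to 10 cuts the worst-case prefetch footprint per worker by roughly 10x, which is why it was a reasonable first lever even though it turned out not to be the bottleneck.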
Another possibility was that the number of worker threads being spun up was the problem. We checked how many streams were present (each stream gets its own worker thread, and any backed-up stream would also max out its prefetch queue because it has too many messages). Comparing dev to prod, dev had 79 streams whereas prod had 170. If each worker is allocated its own memory range to use as its heap, that many workers could plausibly lead to memory problems.
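For context on how per-worker heaps come into play, the sketch below uses Node's worker_threads API; the worker file name, stream keys, and limit values are illustrative assumptions, not the runner's actual configuration. Each worker thread owns a separate V8 heap, and resourceLimits is the knob that produces ERR_WORKER_OUT_OF_MEMORY when a worker exceeds it:

```typescript
import { Worker } from 'node:worker_threads';

// Hypothetical stream keys; the real set had 170 entries in prod vs 79 in dev.
const streamKeys = [
  'someaccount.near/indexer_a:real_time',
  'someaccount.near/indexer_b:historical',
];

for (const streamKey of streamKeys) {
  // One worker thread per stream. Each worker thread gets its own V8 heap,
  // and resourceLimits caps that heap: exceeding it terminates the worker
  // with ERR_WORKER_OUT_OF_MEMORY without killing the main runner process.
  const worker = new Worker('./stream-worker.js', {
    workerData: { streamKey },
    resourceLimits: {
      maxOldGenerationSizeMb: 256, // illustrative values, not the runner's config
      maxYoungGenerationSizeMb: 64,
    },
  });

  worker.on('error', (err) => console.error(`Worker for ${streamKey} crashed:`, err));
  worker.on('exit', (code) => console.log(`Worker for ${streamKey} exited with code ${code}`));
}
```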
We decided to start by pruning the stream set, removing stream keys as we went. The repeated crashing of runner would help indicate whether we were on the right track: each time runner re-initializes, it polls the stream set and creates workers, so as we whittled the set down we would eventually reach a count low enough for the memory problems to stop. We began by removing all historical streams. By the end of this, runner itself had stopped crashing; some workers were still crashing without taking runner down, but that too eventually ceased.
We verified that indexers had begun to work down their backed-up real-time queues. We then deleted the entire stream set and restarted runner. Coordinator adds streams to the set when a matching block height is found, which ensures that any streams in the set belong to active indexers. We quickly verified that historical processes were being created and consumed correctly. We did notice that the unprocessed message count did not match Redis's actual count, most likely due to a bug in the ID incrementing.
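For reference, the pruning described above boils down to ordinary Redis set and key operations. A sketch with ioredis follows; the set key name 'streams', the ':historical' suffix, and the REDIS_URL variable are assumptions for illustration, not confirmed details of the deployment:

```typescript
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://127.0.0.1:6379');

// Assumed layout: a Redis set (called 'streams' here) holds the stream keys
// that runner polls on startup to decide which workers to spawn.
async function pruneHistoricalStreams(): Promise<void> {
  const streamKeys = await redis.smembers('streams');

  for (const key of streamKeys) {
    if (key.endsWith(':historical')) {   // assumed naming convention for historical streams
      await redis.srem('streams', key);  // stop runner from spawning a worker for it
      await redis.del(key);              // drop the backed-up stream data itself
      console.log(`Pruned ${key}`);
    }
  }
}

pruneHistoricalStreams().finally(() => redis.disconnect());
```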
Regardless of whether the issue was the number of workers, the data held by each worker, a combination of both, or something else entirely, there is a scaling issue that needs to be addressed. It would not take much for these issues to recur as the indexer count grows. There are short-term hot fixes that could delay the problem if necessary, but a long-term decision needs to be made on how to scale past whatever bottleneck is causing the memory issue. In addition, if we increase the prefetch queue again to something like 30, we will also eventually face a storage problem.
While debugging the issue, I produced the list below of problems to investigate and a variety of proposed solutions.
Topics for Investigation and Proposed Improvements:
Verify that historical processes succeeding in coordinator but not in runner is the result of a crashed worker thread.
End worker threads if stream key is no longer present (Maybe part of above)
Replace xRange with xLen for getting stream size (see the sketch after this list)
Place instrumentation and alarms around whatever resource leads to the memory problems outlined above
Add a worker revival process if a worker thread crashes for any reason (sketched after this list)
Increase prefetch queue size to a reasonable number
Reduce prefetch queue size to a smaller number for failing indexers (reduces overall memory footprint)
Consider and implement a solution for failing indexers whose real-time streams keep backing up (e.g. bucanero.near/nft_v3, with nearly 2.8M messages and growing)
Figure out a way to prevent error stack traces from being broken apart in Google Cloud logs due to parallel log messages
Reduce latency for prod indexers (e.g. using flatirons.near/demo_blockheight as a baseline since it's simple)
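On the xRange vs xLen item above, the difference matters because XRANGE materializes every entry just to count them, while XLEN returns the count in O(1). A short ioredis sketch follows (the stream key is a made-up example):

```typescript
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://127.0.0.1:6379');
const streamKey = 'example.near/some_indexer:real_time'; // made-up key for illustration

async function streamSizeViaXRange(): Promise<number> {
  // XRANGE pulls every entry into memory just to count them; on a backed-up
  // stream with millions of messages this alone can strain a worker's heap.
  const entries = await redis.xrange(streamKey, '-', '+');
  return entries.length;
}

async function streamSizeViaXLen(): Promise<number> {
  // XLEN returns only the count, without loading any entries.
  return redis.xlen(streamKey);
}
```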
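And for the worker revival item, the general pattern is to listen for the worker's 'exit' event and respawn after a delay. A minimal sketch follows (the worker file name and backoff value are assumptions, not the runner's actual design):

```typescript
import { Worker } from 'node:worker_threads';

// Respawn a stream's worker whenever it dies (e.g. from ERR_WORKER_OUT_OF_MEMORY),
// waiting a short delay so a persistently failing worker doesn't restart in a hot loop.
function startResilientWorker(streamKey: string, restartDelayMs = 5_000): void {
  const worker = new Worker('./stream-worker.js', { workerData: { streamKey } });

  worker.on('error', (err) => {
    console.error(`Worker for ${streamKey} errored:`, err);
  });

  worker.on('exit', (code) => {
    if (code !== 0) {
      console.warn(`Worker for ${streamKey} exited with code ${code}; restarting`);
      setTimeout(() => startResilientWorker(streamKey, restartDelayMs), restartDelayMs);
    }
  });
}
```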
Observed errors:
Error [ERR_WORKER_OUT_OF_MEMORY]: worker terminated due to reaching memory limit
FATAL ERROR: NewSpace::Rebalance Allocation failed - JavaScript heap out of memory
Related issue: Runner stream handlers fail due to OUT_OF_MEMORY #551