
Prod Incident - Runner and Worker Thread crashing due to Memory Issues #437

Closed
darunrs opened this issue Nov 30, 2023 · 1 comment
Labels: bug, component: Runner

Comments


darunrs commented Nov 30, 2023

@gabehamilton reached out to report that a historical process was not being kicked off despite the coordinator process succeeding. Looking into it, we realized that the worker thread had died at some point, and there may have been others that died as well. While investigating further, we restarted the runner. This led to a larger issue: unprocessed stream messages no longer displayed in Grafana, an indication that the worker threads were not working correctly. The dev environment faced no such issues at any point.

Further investigation led to the discovery that runner was repeatedly crashing and re-initializing due to a fatal error caused by running out of memory, along with similar errors at the worker level. Despite the repeated crashing and reinitializing, social_feed somehow managed to keep running and near.org never fell back to using Near Social (Woohoo!).

The memory errors were strange, as the machine itself never ran out of free memory. Gabe also shared that it's possible blocks have grown larger due to new information being packaged into blocks by a process that started recently. Following this line of thinking, we reduced the block prefetch queue size from 100 down to 10. Unfortunately, this did not resolve the problem.

Another possibility was that the number of worker threads being spun up was the problem. We checked how many streams were present (each of which gets a worker thread, and any backed-up stream would also max out its prefetch queue due to having too many messages). Comparing dev to prod, we found that dev had 79 streams whereas prod had 170. If each worker is allocated its own memory range to use as its heap, this could lead to memory problems.
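As a rough illustration of the math here, a minimal sketch (using Node's worker_threads API, with assumed sizes and a hypothetical worker entry point, not Runner's actual configuration) of why worker count times prefetch queue depth adds up, and how a per-worker heap cap could be set via resourceLimits:

```typescript
// Illustrative only: the sizes and the worker entry point below are assumptions,
// not Runner's actual configuration.
import { Worker } from 'worker_threads';

const STREAM_COUNT = 170;          // streams observed in prod during the incident
const PREFETCH_QUEUE_SIZE = 10;    // reduced from 100 while debugging
const AVG_BLOCK_SIZE_MB = 5;       // assumed average; we still need instrumentation for this
const WORKER_HEAP_LIMIT_MB = 256;  // assumed per-worker cap

// Worst case if every backed-up stream fills its prefetch queue.
const worstCaseMb = STREAM_COUNT * PREFETCH_QUEUE_SIZE * AVG_BLOCK_SIZE_MB;
console.log(`worst-case prefetched block data: ~${worstCaseMb} MB`); // ~8500 MB with these numbers

// Node can cap each worker thread's heap independently; exceeding the cap
// terminates that worker with ERR_WORKER_OUT_OF_MEMORY instead of taking
// down the whole Runner process.
function startStreamWorker(streamKey: string): Worker {
  return new Worker('./worker.js', {  // hypothetical compiled worker.ts entry point
    workerData: { streamKey },
    resourceLimits: { maxOldGenerationSizeMb: WORKER_HEAP_LIMIT_MB },
  });
}
```

With numbers like these, prod's 170 streams would dwarf dev's 79 even before per-worker heap overhead is counted.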

We decided to start by pruning the stream set, removing stream keys from the set as we went. The repeated crashing of runner would help indicate whether we were on the right track: each time runner reinitializes, it polls the stream set and creates workers, so as we whittled down the set we would eventually reach a number low enough for the memory problems to stop. We began by removing all historical streams. By the end of it, the runner itself had stopped crashing; some workers were still crashing without taking down runner, but that too ceased.
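A minimal sketch of the kind of manual pruning we did, assuming an ioredis client, a hypothetical "streams" set key holding the stream keys, and a naming convention where historical stream keys contain "historical" (the real key names in Runner may differ):

```typescript
import Redis from 'ioredis';

const redis = new Redis();

async function pruneHistoricalStreams(): Promise<void> {
  // Fetch every stream key currently registered in the set Runner polls.
  const streamKeys = await redis.smembers('streams');
  for (const key of streamKeys) {
    // Assumed convention: historical stream keys contain "historical".
    if (key.includes('historical')) {
      await redis.srem('streams', key); // drop it from the set
      console.log(`removed ${key}; watch whether Runner keeps crashing`);
    }
  }
}

pruneHistoricalStreams().finally(() => redis.disconnect());
```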

We verified that indexers had begun to reduce their backed-up real-time queues. We then deleted the stream set entirely and restarted runner. Coordinator adds streams to the set when a matching block height is found, ensuring any streams in the set belong to active indexers. We quickly verified that historical processes were being created and consumed correctly. We did notice that the unprocessed message count did not match Redis' actual count, most likely due to a bug in the ID incrementing.
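For reference, a correct incrementId for Redis stream IDs of the form `<millisecondTime>-<sequenceNumber>` would bump only the sequence part. The sketch below shows the expected behavior; it is illustrative, not the actual Runner code:

```typescript
// Redis stream IDs look like "1701300000000-5" (timestamp-sequence).
// If the timestamp part were incremented instead of the sequence, any
// remaining entries sharing that timestamp would be skipped, which would
// explain the unprocessed count drifting from Redis' actual value.
function incrementId(lastId: string): string {
  const [timestamp, sequence] = lastId.split('-');
  return `${timestamp}-${BigInt(sequence) + 1n}`;
}

// Example: reading strictly after the last processed entry.
// incrementId('1701300000000-5') === '1701300000000-6'
```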

Regardless of whether the issue was the number of workers, the data in each worker, a combination of both, or neither, there is a scaling issue that needs to be addressed. It would not be difficult to encounter these issues again as the indexer count grows. There are short-term hot fixes that could be implemented to delay the problem if necessary, but a long-term decision needs to be made on how to scale past whatever bottleneck is causing the memory issue. In addition, if we increase the prefetch queue again to something like 30, we'll also eventually face a storage problem.

While debugging the issue, I produced the below list of problems to investigate and a variety of proposed improvements.

Topics for Investigation:

  • Verify that historical processes succeeding in coordinator but not in runner is a result of a crashed worker thread.
  • Investigate worker threads crashing due to Error [ERR_WORKER_OUT_OF_MEMORY]: worker terminated due to reaching memory limit (Runner stream handlers fail due to OUT_OF_MEMORY #551); see the sketch after this list.
  • Investigate Runner crashing due to FATAL ERROR: NewSpace::Rebalance Allocation failed - JavaScript heap out of memory
  • Verify that skipped messages in processing are due to incrementId and fix the problem
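For the two crash modes above, a supervision hook along these lines (assuming Node's worker_threads API; names are illustrative) could distinguish a worker hitting its memory limit from other failures, and doubles as the natural place to wire up the worker revival idea listed below:

```typescript
import { Worker } from 'worker_threads';

function superviseWorker(worker: Worker, streamKey: string): void {
  worker.on('error', (err: Error) => {
    const code = (err as NodeJS.ErrnoException).code;
    if (code === 'ERR_WORKER_OUT_OF_MEMORY') {
      // The worker hit its resourceLimits cap; Runner itself stays up.
      console.error(`worker for ${streamKey} terminated: memory limit reached`);
    } else {
      console.error(`worker for ${streamKey} crashed:`, err);
    }
  });

  worker.on('exit', (code) => {
    if (code !== 0) {
      // Candidate spot to re-create the worker (revival process).
      console.error(`worker for ${streamKey} exited with code ${code}`);
    }
  });
}
```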

Proposed Improvements to Implement:

  • Investigate and implement solution for worker memory issue (e.g. worker config)
  • Investigate and implement solution for Runner memory issue
  • Log block heights and indexer type (real-time or historical) with any logs output by worker.ts
  • Update incrementId to correctly increment stream message IDs to avoid skipping messages
  • Add instrumentation around average block size and/or queue size per worker thread
  • Remove resources when indexer is deleted (Delete all resources when deleting indexer #346)
  • End worker threads if stream key is no longer present (Maybe part of above)
  • Replace xRange with xLen for getting stream size (see the sketch after this list)
  • Place instrumentation and alarms around whatever resource leads to the memory problems outlined above
  • Add a worker revival process if worker thread crashes for any reason
  • Increase prefetch queue size to a reasonable number
  • Reduce prefetch queue size to a smaller number for failing indexers (reduces overall memory footprint)
  • Consider and implement a solution for failing indexers with backed-up real-time streams (e.g. bucanero.near/nft_v3 with nearly 2.8M messages and growing)
  • Figure out a way to prevent error stack traces from being broken apart in google cloud logs due to parallel log messages
  • Reduce latency for Prod indexers (e.g. flatirons.near/demo_blockheight as a baseline since it's simple)
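For the xRange to xLen swap, a sketch of the difference, assuming an ioredis client:

```typescript
import Redis from 'ioredis';

const redis = new Redis();

// Old approach: XRANGE pulls every entry back just to count it, which is
// exactly what we don't want on a backed-up stream with millions of messages.
async function streamSizeViaRange(streamKey: string): Promise<number> {
  const entries = await redis.xrange(streamKey, '-', '+');
  return entries.length;
}

// Cheaper approach: XLEN returns the count server-side, with no payloads transferred.
async function streamSizeViaLen(streamKey: string): Promise<number> {
  return redis.xlen(streamKey);
}
```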


darunrs commented Jul 26, 2024

All action items ended up completed. I can close this ticket now.

darunrs closed this as completed on Jul 26, 2024
darunrs changed the title from "Runner and Worker Thread crashing due to Memory Issues" to "Prod Incident - Runner and Worker Thread crashing due to Memory Issues" on Aug 1, 2024