Remove Nested BoltDB View Within Update Transaction in SaveHeadBlockRoot #9428
What type of PR is this?
What does this PR do? Why is it needed?
End-to-end tests flake a lot. This has been draining developer productivity and slowing down how fast we can iterate on features. To diagnose it, we decided to gather opentracing spans during end-to-end runs, which can be replayed to a Jaeger collector and visualized in the Jaeger UI.
To accomplish this, we wrote a tool for collecting spans in end-to-end tests, and another tool to replay them to a Jaeger collector in #9341.
The first time we looked at the spans, they showed a slowdown in certain disk operations, which we attributed to not running on an SSD. However, Preston from our team confirmed that the nodes that run CI use SSDs, so we set the issue aside as there was not much to do.
Today, we picked this up again and inspected the failure happening in CI once more by looking at the Jaeger spans. We noticed that most of the issues are in ReceiveBlock, in functions related to checking state summaries, specifically SaveHeadBlockRoot. Upon further inspection, we saw that the function opens a write transaction in bolt and, within that transaction, opens another read transaction. This seems potentially dangerous. A single db.Put operation should not take almost 4 seconds. Next, Kasey from our team brought up some information regarding this from the bolt documentation:
As such, this PR removes the nested read transaction by simplifying the logic with a small refactor. We hope this is the root cause of the end-to-end tests failing this often.
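To illustrate the shape of the refactor, here is a minimal sketch. It is not the actual Prysm code and it does not use bolt: `store`, `tx`, the key names, and both `saveHead...` functions are hypothetical stand-ins that only mimic the `Update`/`View` closure style. The toy store cannot reproduce the slowdown itself. It only counts how many transactions each variant opens, to show that the existence check can reuse the write transaction instead of opening a read transaction inside it.

```go
package main

import (
	"errors"
	"fmt"
)

// store is a toy, in-memory stand-in for a bolt-style database.
// It is NOT bolt; it only counts how many transactions get opened.
type store struct {
	data     map[string][]byte
	txOpened int
}

// tx is a hypothetical transaction handle over the toy store.
type tx struct {
	s        *store
	writable bool
}

func newStore() *store { return &store{data: map[string][]byte{}} }

func (t *tx) Get(key string) []byte { return t.s.data[key] }

func (t *tx) Put(key string, val []byte) error {
	if !t.writable {
		return errors.New("put on read-only transaction")
	}
	t.s.data[key] = val
	return nil
}

// Update opens a (toy) read-write transaction.
func (s *store) Update(fn func(*tx) error) error {
	s.txOpened++
	return fn(&tx{s: s, writable: true})
}

// View opens a (toy) read-only transaction.
func (s *store) View(fn func(*tx) error) error {
	s.txOpened++
	return fn(&tx{s: s, writable: false})
}

var errNoSummary = errors.New("no state summary for head root")

// saveHeadNested mirrors the pattern this PR removes: a read
// transaction is opened while the write transaction is still open.
func saveHeadNested(s *store, root string) error {
	return s.Update(func(wtx *tx) error {
		var has bool
		err := s.View(func(rtx *tx) error { // nested read transaction
			has = rtx.Get("summary/"+root) != nil
			return nil
		})
		if err != nil {
			return err
		}
		if !has {
			return errNoSummary
		}
		return wtx.Put("head", []byte(root))
	})
}

// saveHeadSingleTx does the same check through the write transaction
// that is already open, so no second transaction is needed.
func saveHeadSingleTx(s *store, root string) error {
	return s.Update(func(wtx *tx) error {
		if wtx.Get("summary/"+root) == nil {
			return errNoSummary
		}
		return wtx.Put("head", []byte(root))
	})
}

func main() {
	a, b := newStore(), newStore()
	a.data["summary/abc"] = []byte{1}
	b.data["summary/abc"] = []byte{1}

	_ = saveHeadNested(a, "abc")
	_ = saveHeadSingleTx(b, "abc")
	fmt.Println("nested:", a.txOpened, "single:", b.txOpened)
}
```

Running this prints `nested: 2 single: 1`. Both variants store the same head root; the difference is only that the second never holds a read transaction and a write transaction open at the same time.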