Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Nested BoltDB View Within Update Transaction in SaveHeadBlockRoot #9428

Merged
merged 4 commits into from
Aug 19, 2021

Conversation

rauljordan
Copy link
Contributor

@rauljordan rauljordan commented Aug 19, 2021

What type of PR is this?

Other

What does this PR do? Why is it needed?

End-to-end tests flake a lot. In particular, we noticed it has been draining developer productivity and affecting how fast we can iterate on features. To diagnose this, we decided to gather opentracing spans during endtoend, which can be replayed to a jaeger collector and visualized in the jaeger UI.

To accomplish this, we wrote a tool for collecting spans in endtoend tests, and another tool to replay them to jaeger collector in #9341.

The first time we took a peek at the spans, they showed some slowdown in certain disk operations, which we attributed to not running on an SSD. However, Preston from our team confirmed the nodes that run CI are using SSD, so we ignored the issue as there was not much to do.

Today, we decided to pick this up again and inspect the failure happening in CI once more by looking at jaeger spans. We noticed that most of the issues are in ReceiveBlock in functions related to checking state summaries. Specifically, SaveHeadBlockRoot. Upon further inspection, we see the function opens a write-transaction in bolt, and within that transactions opens another read-transaction. This seems potentially dangerous.

Screen Shot 2021-08-19 at 1 11 20 PM

There is no way a single db.Put operation is taking almost 4 seconds. Next, Kasey from our team brought up some information regarding this from the bolt documentation:

Read-only transactions and read-write transactions should not depend on one another and generally shouldn't be opened simultaneously in the same goroutine. This can cause a deadlock as the read-write transaction needs to periodically re-map the data file but it cannot do so while a read-only transaction is open.

As such, this PR tries to remove that by simplifying the logic with a small refactor. We hope this will be the root cause of end to end failing this much.

@rauljordan rauljordan requested a review from a team as a code owner August 19, 2021 19:22
@rauljordan rauljordan requested review from terencechain, rkapka and nisdas and removed request for a team August 19, 2021 19:22
@rauljordan rauljordan added CI Continuous integration related items Ready For Review labels Aug 19, 2021
@codecov
Copy link

codecov bot commented Aug 19, 2021

Codecov Report

Merging #9428 (b38b9e6) into develop (3764b7d) will increase coverage by 0.01%.
The diff coverage is 57.14%.

@@             Coverage Diff             @@
##           develop    #9428      +/-   ##
===========================================
+ Coverage    60.08%   60.09%   +0.01%     
===========================================
  Files          580      580              
  Lines        42759    42758       -1     
===========================================
+ Hits         25690    25696       +6     
+ Misses       13245    13242       -3     
+ Partials      3824     3820       -4     

terencechain
terencechain previously approved these changes Aug 19, 2021
prestonvanloon
prestonvanloon previously approved these changes Aug 19, 2021
@rauljordan
Copy link
Contributor Author

wow, e2e failed here for the same reason...looks like we're back to square one

@rauljordan rauljordan dismissed stale reviews from prestonvanloon and terencechain via b38b9e6 August 19, 2021 21:02
@rauljordan
Copy link
Contributor Author

Confirmed there were more nested views in update transactions. Resolved them and expect e2e to pass

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Continuous integration related items
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants