Skip to content

Conversation

@cce
Copy link
Contributor

@cce cce commented Sep 8, 2025

Summary

The agreementtest package's TestSimulate very rarely deadlocks, with a recent example here. The stack trace in that failure shows the test goroutine blocked in instant.shutdown() waiting on channel <-i.Z1. instant is a test-only clock implementation used by the simulation.

This fetches the test logs on failure to help us better understand this failure.

@codecov
Copy link

codecov bot commented Sep 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 50.65%. Comparing base (988c7d3) to head (3410cb1).
⚠️ Report is 11 commits behind head on master.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6438      +/-   ##
==========================================
- Coverage   50.83%   50.65%   -0.18%     
==========================================
  Files         664      657       -7     
  Lines      111455   111357      -98     
==========================================
- Hits        56660    56412     -248     
- Misses      51927    52071     +144     
- Partials     2868     2874       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@cce cce requested a review from algorandskiy September 10, 2025 13:06
algorandskiy
algorandskiy previously approved these changes Sep 10, 2025
Copy link
Contributor

@jannotti jannotti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help be understand why we have this complicated shutdown scheme in test code? It feels like this change circumvents the careful shutdown, implying it's not really needed.

@zeldovich
Copy link
Contributor

The rationale for the non-trivial shutdown is that instant.Zero() might get stuck if nothing reads from instant.Z1. So, when the test decides to exit, the agreement service might call stopwatch.Zero(), where stopwatch is the instant instance passed to it by agreementtest.Simulate(), and get stuck because nothing will read from stopwatch.Z1. As far as I can tell, the call to stopwatch.shutdown() waits for the agreement service to make the next call to stopwatch.Zero(), so it might be worth trying to understand why that's not happening in some test runs.

@cce
Copy link
Contributor Author

cce commented Sep 10, 2025

I had thought that the race / flaky scenario was:

  • ledger.Wait(r) returns, indicating the final round in the loop was reached (agreement observed a cert, produced an ensureAction -> ledger received latest block)
  • agreement has not yet processed the rezeroAction, so it hasn't called stopwatch.Zero yet
  • deferred service.Shutdown() is called — agreement stopped before processing rezeroAction
  • deferred stopwatch.shutdown() is called next.
    ^^ but I realize now I had the defer order backwards, service shutdown is after stopwatch shutdown, so the Zero should be called ... I've updated the PR to collect TestSimulate.log instead.

I'm told this test has failed a couple times since we switched to Github Actions, most recently on the MacOS platform runner, so hopefully collecting TestSimulate.log will help understand it better

@algorandskiy algorandskiy changed the title tests: update TestSimulate shutdown to prevent deadlock tests: log TestSimulate to debug the shutdown deadlock Sep 10, 2025
@gmalouf gmalouf requested a review from jannotti September 19, 2025 17:16
@gmalouf gmalouf merged commit da61ec4 into algorand:master Sep 19, 2025
40 checks passed
algorandskiy pushed a commit to algorandskiy/go-algorand that referenced this pull request Sep 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants