trio test flake #545
I believe I've finally pinned down the cause of this. When the test fails, I see the following in Dendrite, but then I see this in softnpu, showing that the
Further debug statements show that right after the transit router is restarted, state is consistent: then when note that
"svccfg -s dendrite setprop config/uds_path = /opt/mnt",
)?;
self.zone.zexec(
if restart_dpd {
Gating dpd restart on this variable is the fix for the flakiness.
// copy is not complete.
println!("waiting 3s for copy of files to zone to complete ...");
sleep(Duration::from_secs(3));
self.zone.zcmd(
New requirement for omicron1 zones.
Thanks for the writeup here Ry, the changes make sense.
Move from using sparse zones to omicron1 zones for testing as these are lightyears faster to stand up.
Yeah, it's significant! I see the trio/quartet/interop tests are all around 16mins faster. 🎉
This is awesome.
Main issue
The root cause of this test flake is restarting dendrite while keeping state in the underlying softnpu ASIC.
The following sequence of commands is sufficient to demonstrate corruption of underlying softnpu state similar to what has been observed in the comments in this PR.
The workaround employed by this PR is to not couple restarting a DDM transit router with restarting dendrite in our test harnesses. Once the following dendrite issue is addressed, the workaround can be reverted.
clear for the softnpu ASIC. dendrite#130
Why was this an intermittent flake?
Whether or not softnpu tables would get corrupted depended on the ordering of table operations. When a transit DDM router was restarted as part of a test, it's not deterministic which of its peers will reestablish a session first. It depends on where each peer is in its discovery state machine at the time the transit router comes back online. For that reason, some orderings would result in successful tests while others would result in softnpu table corruption.
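A toy model may help make the order dependence concrete. The sketch below is not dendrite or softnpu code: `Asic`, `Dpd`, `add_route`, the peer names, and the prefixes are all hypothetical. It assumes only what this writeup describes, namely a control plane that hands out nexthop indices in first-seen order and a data plane whose prefix and index tables survive a control-plane restart. The real failure mode in softnpu may differ in detail.

```rust
use std::collections::HashMap;

/// Toy data plane whose tables survive a control-plane restart.
#[derive(Default)]
struct Asic {
    index_table: HashMap<u16, String>,  // nexthop index -> nexthop
    prefix_table: HashMap<String, u16>, // prefix -> nexthop index
}

impl Asic {
    /// Resolve a prefix to the nexthop the data plane would actually use.
    fn resolve(&self, prefix: &str) -> Option<&String> {
        let idx = self.prefix_table.get(prefix)?;
        self.index_table.get(idx)
    }
}

/// Toy control plane: indices are handed out in first-seen order, and the
/// allocation map lives only in this struct, so a restart forgets it.
#[derive(Default)]
struct Dpd {
    next_index: u16,
    allocated: HashMap<String, u16>,
}

impl Dpd {
    fn add_route(&mut self, asic: &mut Asic, prefix: &str, nexthop: &str) {
        let idx = match self.allocated.get(nexthop) {
            Some(idx) => *idx,
            None => {
                let idx = self.next_index;
                self.next_index += 1;
                self.allocated.insert(nexthop.to_string(), idx);
                asic.index_table.insert(idx, nexthop.to_string());
                idx
            }
        };
        asic.prefix_table.insert(prefix.to_string(), idx);
    }
}

/// Returns what the ASIC resolves for peer A's prefix after the control
/// plane restarts and only `first_peer` has re-established its session.
fn nexthop_for_a_after_restart(first_peer: &str) -> String {
    let mut asic = Asic::default();
    let mut dpd = Dpd::default();
    dpd.add_route(&mut asic, "fd00:a::/64", "peer-a"); // gets index 0
    dpd.add_route(&mut asic, "fd00:b::/64", "peer-b"); // gets index 1

    // Restart: control-plane state is lost, ASIC state is kept.
    let mut dpd = Dpd::default();
    let prefix = if first_peer == "peer-a" { "fd00:a::/64" } else { "fd00:b::/64" };
    dpd.add_route(&mut asic, prefix, first_peer); // index 0 is handed out again

    asic.resolve("fd00:a::/64").cloned().unwrap_or_default()
}

fn main() {
    // Same order as before the restart: the stale entry still lines up.
    println!("a first: {}", nexthop_for_a_after_restart("peer-a"));
    // Different order: index 0 now means peer-b, so A's prefix is misrouted.
    println!("b first: {}", nexthop_for_a_after_restart("peer-b"));
}
```

In this model, the stale prefix entry for peer A keeps pointing at index 0. Whether that index still means peer A depends entirely on which peer the restarted control plane sees first.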
Why is this only happening now?
oxidecomputer/dendrite#110 broke the IPv6 routing table up into two tables, a prefix table and an index table, to facilitate ECMP groups. This change means that nexthop index allocation is kept in dpd and the ordering of allocation is reflected in how things are laid out on the ASIC. If dpd is restarted and loses its state, but the ASIC state remains, nexthop associations can become corrupted as observed in this issue. Prior to oxidecomputer/dendrite#110 the route-to-nexthop relationship was 1:1 and adding/removing routes was not order sensitive.
Various other things in this PR
access_repo metadata for dendrite, as dendrite is now a public repo and this is no longer required.