@rcgoodfellow
Collaborator

@rcgoodfellow rcgoodfellow commented Sep 9, 2025

Main issue

The root cause of this test flake is restarting dendrite while keeping state in the underlying softnpu ASIC.

The following sequence of commands is sufficient to demonstrate corruption of underlying softnpu state similar to what has been observed in the comments in this PR.

swadm route add fd00:2::/64 rear1/0 fe80::b
swadm route add fd00:1::/64 rear0/0 fe80::a

svcadm restart dendrite

swadm route add fd00:1::/64 rear0/0 fe80::a
swadm route del fd00:1::/64
swadm route del fd00:2::/64

The workaround employed by this PR is to not couple restarting a DDM transit router with restarting dendrite in our test harnesses. Once the following dendrite issue is addressed, the workaround can be reverted.

Why was this an intermittent flake?

Whether or not softnpu tables would get corrupted depended on the ordering of table operations. When a DDM transit router is restarted as part of a test, it is not deterministic which of its peers will reestablish a session first. It depends on where each peer is in its discovery state machine at the time the transit router comes back online. For that reason, some orderings would result in successful tests while others would result in softnpu table corruption.

Why is this only happening now?

oxidecomputer/dendrite#110 broke the IPv6 routing table up into two tables, a prefix table and an index table, to facilitate ECMP groups. This change means that nexthop index allocation is kept in dpd, and the ordering of allocation is reflected in how things are laid out on the ASIC. If dpd is restarted and loses its state, but the ASIC state remains, nexthop associations can become corrupted as observed in this issue. Prior to oxidecomputer/dendrite#110, the route-to-nexthop relationship was 1:1 and adding/removing routes was not order-sensitive.
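
To make the failure mode concrete, here is a toy model of the two-table layout replaying the command sequence from the top of this description. Everything in it is an assumption for illustration: the `Asic` and `Dpd` types, the lowest-free-index allocator, and the treatment of deleting an unknown prefix as a no-op are hypothetical and are not taken from dendrite or softnpu.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for the softnpu tables. This state survives a dpd restart.
#[derive(Default, Debug)]
struct Asic {
    v6_idx: BTreeMap<String, usize>,    // prefix -> nexthop index
    v6_routes: BTreeMap<usize, String>, // nexthop index -> gateway
}

/// Toy stand-in for dpd's in-memory bookkeeping. This state is lost on restart.
#[derive(Default, Debug)]
struct Dpd {
    known: BTreeMap<String, usize>, // prefix -> index dpd believes it owns
}

impl Dpd {
    // Hypothetical allocator: lowest index not used by any route dpd knows about.
    fn alloc(&self) -> usize {
        (0..).find(|i| !self.known.values().any(|v| v == i)).unwrap()
    }

    fn add(&mut self, asic: &mut Asic, prefix: &str, gw: &str) {
        let idx = match self.known.get(prefix) {
            Some(&i) => i,
            None => self.alloc(),
        };
        self.known.insert(prefix.to_string(), idx);
        asic.v6_routes.insert(idx, gw.to_string());
        asic.v6_idx.insert(prefix.to_string(), idx);
    }

    // Assumption: deleting a prefix dpd does not know about is a no-op.
    fn del(&mut self, asic: &mut Asic, prefix: &str) {
        if let Some(idx) = self.known.remove(prefix) {
            asic.v6_idx.remove(prefix);
            asic.v6_routes.remove(&idx);
        }
    }
}

/// Replays the add / restart / add / del / del sequence against the toy model.
fn replay() -> Asic {
    let mut asic = Asic::default();
    let mut dpd = Dpd::default();
    dpd.add(&mut asic, "fd00:2::/64", "fe80::b"); // allocated index 0
    dpd.add(&mut asic, "fd00:1::/64", "fe80::a"); // allocated index 1

    dpd = Dpd::default(); // models `svcadm restart dendrite`: dpd forgets, ASIC keeps

    dpd.add(&mut asic, "fd00:1::/64", "fe80::a"); // reallocates index 0, clobbering fd00:2's nexthop
    dpd.del(&mut asic, "fd00:1::/64"); // removes index 0 from the ASIC
    dpd.del(&mut asic, "fd00:2::/64"); // unknown to the restarted dpd, so nothing happens
    asic
}

fn main() {
    println!("{:#?}", replay());
}
```

In this toy model the replay ends with fd00:2::/64 still pointing at index 0, which no longer exists, while a stale nexthop entry is left behind at index 1. The real allocator and table semantics may differ; the sketch only illustrates why losing the index bookkeeping makes the operations order-sensitive.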

Various other things in this PR

  • Remove buildomat access_repo metadata for dendrite, as dendrite is now a public repo and this is no longer required.
  • Move from using sparse zones to omicron1 zones for testing as these are lightyears faster to stand up.
  • Add more diagnostics and logging to the trio tests.

@rcgoodfellow rcgoodfellow changed the title debug ... trio test flake Sep 13, 2025
@rcgoodfellow rcgoodfellow linked an issue Sep 13, 2025 that may be closed by this pull request
@rcgoodfellow
Collaborator Author

I believe I've finally pinned down the cause of this.

When the test fails, I see the following in Dendrite.

root@t1:~# /opt/oxide/dendrite/bin/swadm route ls
Subnet       Port   Link  Gateway                   Vlan
fd00:1::/64  rear0  0     fe80::34b0:bff:fedb:a871
fd00:2::/64  rear1  0     fe80::f05e:dff:fe5f:bf6

but then I see this in softnpu, showing that the fd00:2::/64 nexthop does not line up: both prefixes point to index 1, and both entries in v6_routes hold the fd00:1::/64 gateway.

root@t1:~# /tmp/scadm standalone -c /opt/mnt/client -s /opt/mnt/server dump-state
[snip]
router v6_idx:
fd00:2::/64 -> 1
fd00:1::/64 -> 1
router v6_routes:
0 -> fe80::34b0:bff:fedb:a871 (1)
1 -> fe80::34b0:bff:fedb:a871 (1)
[snip]
resolver v6:
fe80::305f:72ff:fedf:6b36 -> 32:5f:72:df:6b:36
fe80::34b0:bff:fedb:a871 -> 36:b0:0b:db:a8:71
fe80::6087:81ff:fe1f:2b1e -> 62:87:81:1f:2b:1e
fe80::f05e:dff:fe5f:bf6 -> f2:5e:0d:5f:0b:f6
[snip]

@rcgoodfellow
Collaborator Author

rcgoodfellow commented Sep 14, 2025

Further debug statements show that right after the transit router is restarted, state is consistent:

transit router restart passed
[sidecar.trio] /opt/scadm standalone --client /opt/mnt/client --server /opt/mnt/server dump-state
local v6:
fe80::54e8:80ff:fe1c:80a0
fe80::7028:28ff:fe98:7834
local v4:
router v6_idx:
fd00:1::/64 -> 0
fd00:2::/64 -> 1
router v6_routes:
1 -> fe80::6010:beff:fe82:5eab (2)
0 -> fe80::c855:3ff:fe32:8242 (1)

then when s1 is restarted, causing a withdraw to hit dendrite, state becomes inconsistent:

[s1.trio] pkill ddmd

[sidecar.trio] /opt/scadm standalone --client /opt/mnt/client --server /opt/mnt/server dump-state
local v6:
fe80::54e8:80ff:fe1c:80a0
fe80::7028:28ff:fe98:7834
local v4:
router v6_idx:
fd00:2::/64 -> 1
router v6_routes:
0 -> fe80::c855:3ff:fe32:8242 (1)

Note that fd00:2::/64 points to index 1, but that index does not exist in the v6_routes table.
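
A check like this could be automated in the test diagnostics. The following is a minimal sketch, assuming the `lhs -> rhs` line format seen in the dump-state output in this thread and assuming the caller has already split the output into its `v6_idx` and `v6_routes` sections. The function names `parse_table` and `dangling_prefixes` are hypothetical and are not part of scadm or the test harness.

```rust
use std::collections::BTreeMap;

/// Parse "lhs -> rhs" lines from one table section of a dump.
/// Lines without " -> " (for example "[snip]") are ignored.
fn parse_table(section: &str) -> BTreeMap<String, String> {
    section
        .lines()
        .filter_map(|l| l.split_once(" -> "))
        .map(|(k, v)| (k.trim().to_string(), v.trim().to_string()))
        .collect()
}

/// Return the prefixes whose nexthop index has no entry in the v6_routes section.
fn dangling_prefixes(v6_idx: &str, v6_routes: &str) -> Vec<String> {
    let idx = parse_table(v6_idx);
    let routes = parse_table(v6_routes);
    idx.into_iter()
        .filter(|(_, i)| !routes.contains_key(i))
        .map(|(prefix, _)| prefix)
        .collect()
}

fn main() {
    // Sections copied from the inconsistent dump above.
    let v6_idx = "fd00:2::/64 -> 1";
    let v6_routes = "0 -> fe80::c855:3ff:fe32:8242 (1)";
    println!("{:?}", dangling_prefixes(v6_idx, v6_routes));
}
```

This only catches indices that are missing outright. It would not flag a case where the index exists but holds the wrong gateway; catching that would require comparing against the gateways reported by swadm route ls.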

@rcgoodfellow rcgoodfellow force-pushed the trio-flake branch 4 times, most recently from 43a6023 to 0393a1d Compare September 14, 2025 22:45
"svccfg -s dendrite setprop config/uds_path = /opt/mnt",
)?;
self.zone.zexec(
if restart_dpd {
Collaborator Author

Gating dpd restart on this variable is the fix for the flakiness.

// copy is not complete.
println!("waiting 3s for copy of files to zone to complete ...");
sleep(Duration::from_secs(3));
self.zone.zcmd(
Collaborator Author

New requirement for omicron1 zones.

Contributor

@FelixMcFelix FelixMcFelix left a comment

Thanks for the writeup here Ry, the changes make sense.

Move from using sparse zones to omicron1 zones for testing as these are lightyears faster to stand up.

Yeah, it's significant! I see the trio/quartet/interop tests are all around 16mins faster. 🎉

@Nieuwejaar
Contributor

Thanks for the writeup here Ry, the changes make sense.

Move from using sparse zones to omicron1 zones for testing as these are lightyears faster to stand up.

Yeah, it's significant! I see the trio/quartet/interop tests are all around 16mins faster. 🎉

This is awesome.

@rcgoodfellow rcgoodfellow merged commit d10cc15 into main Sep 15, 2025
14 checks passed
@rcgoodfellow rcgoodfellow deleted the trio-flake branch September 15, 2025 15:31
Development

Successfully merging this pull request may close these issues.

trio test is flaky

4 participants