trio test flake #545

rcgoodfellow · 2025-09-09T18:13:11Z

Main issue

The root cause of this test flake is restarting dendrite while keeping state in the underlying softnpu ASIC.

The following sequence of commands is sufficient to demonstrate corruption of underlying softnpu state similar to what has been observed in the comments in this PR.

swadm route add fd00:2::/64 rear1/0 fe80::b
swadm route add fd00:1::/64 rear0/0 fe80::a

svcadm restart dendrite

swadm route add fd00:1::/64 rear0/0 fe80::a
swadm route del fd00:1::/64
swadm route del fd00:2::/64

The workaround employed by this PR is to not couple restarting a DDM transit router with restarting dendrite in our test harnesses. Once the following dendrite issue is addressed, the workaround can be reverted.

Implement clear for the softnpu ASIC. dendrite#130

Why was this an intermittent flake?

Whether or not softnpu tables would get corrupted depended on the ordering of table operations. When a transit DDM router is restarted as part of a test, it's not deterministic which of its peers will reestablish a session first. It depends on where each peer is in its discovery state machine at the time the transit router comes back online. For that reason, some orderings would result in successful tests while others would result in softnpu table corruption.

Why is this only happening now?

oxidecomputer/dendrite#110 broke the IPv6 routing table up into two tables, a prefix table and an index table, to facilitate ECMP groups. This change means that nexthop index allocation is kept in dpd and the ordering of allocation is reflected in how things are laid out on the ASIC. If dpd is restarted and loses its state, but the ASIC state remains, nexthop associations can become corrupted as observed in this issue. Prior to oxidecomputer/dendrite#110 the route-to-nexthop relationship was 1:1 and adding/removing routes was not order sensitive.
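To make the order sensitivity concrete, here is a toy model of the two-table layout. It is only a sketch under stated assumptions: `Asic`, `ControlPlane`, `add_route`, `gateway_for`, and `state_after_restart` are hypothetical names, and the lowest-free-index allocator is an assumption for illustration, not dpd's actual allocation code.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for the ASIC tables. This state survives a control-plane restart.
#[derive(Default, Debug)]
struct Asic {
    v6_idx: BTreeMap<String, u16>,    // prefix -> nexthop index
    v6_routes: BTreeMap<u16, String>, // nexthop index -> gateway
}

/// Toy stand-in for the control plane, which owns index allocation.
#[derive(Default)]
struct ControlPlane {
    allocated: BTreeMap<String, u16>, // the control plane's own view of used indices
}

impl ControlPlane {
    fn add_route(&mut self, asic: &mut Asic, prefix: &str, gw: &str) {
        // Assumption: hand out the lowest index the control plane believes is free.
        let idx = (0u16..)
            .find(|i| !self.allocated.values().any(|v| v == i))
            .unwrap();
        self.allocated.insert(prefix.to_string(), idx);
        asic.v6_routes.insert(idx, gw.to_string());
        asic.v6_idx.insert(prefix.to_string(), idx);
    }
}

/// Resolve a prefix to its gateway the way the data plane would: via the index.
fn gateway_for(asic: &Asic, prefix: &str) -> Option<String> {
    let idx = asic.v6_idx.get(prefix)?;
    asic.v6_routes.get(idx).cloned()
}

/// Install two routes, "restart" the control plane, then re-add one route first.
fn state_after_restart(first_prefix: &str, first_gw: &str) -> Asic {
    let mut asic = Asic::default();
    let mut cp = ControlPlane::default();
    cp.add_route(&mut asic, "fd00:2::/64", "fe80::b");
    cp.add_route(&mut asic, "fd00:1::/64", "fe80::a");

    // Restart: control-plane state is dropped, ASIC state is kept.
    let mut cp = ControlPlane::default();
    cp.add_route(&mut asic, first_prefix, first_gw);
    asic
}
```

In this model, if fd00:1::/64 is the first route re-added after the restart, it is handed index 0 and overwrites the nexthop that fd00:2::/64 still points at. If fd00:2::/64 comes back first, the tables happen to stay consistent, which is the kind of ordering dependence described above.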

Various other things in this PR

Remove buildomat access_repo metadata for dendrite, as dendrite is now a public repo and this is no longer required.
Move from using sparse zones to omicron1 zones for testing as these are lightyears faster to stand up.
Add more diagnostics and logging to the trio tests.

rcgoodfellow · 2025-09-13T23:17:05Z

I believe I've finally pinned down the cause of this.

When the test fails, I see the following in Dendrite.

root@t1:~# /opt/oxide/dendrite/bin/swadm route ls
Subnet       Port   Link  Gateway                   Vlan
fd00:1::/64  rear0  0     fe80::34b0:bff:fedb:a871
fd00:2::/64  rear1  0     fe80::f05e:dff:fe5f:bf6

but then I see this in softnpu, showing that the fd00:2::/64 nexthop does not line up.

root@t1:~# /tmp/scadm standalone -c /opt/mnt/client -s /opt/mnt/server dump-state
[snip]
router v6_idx:
fd00:2::/64 -> 1
fd00:1::/64 -> 1
router v6_routes:
0 -> fe80::34b0:bff:fedb:a871 (1)
1 -> fe80::34b0:bff:fedb:a871 (1)
[snip]
resolver v6:
fe80::305f:72ff:fedf:6b36 -> 32:5f:72:df:6b:36
fe80::34b0:bff:fedb:a871 -> 36:b0:0b:db:a8:71
fe80::6087:81ff:fe1f:2b1e -> 62:87:81:1f:2b:1e
fe80::f05e:dff:fe5f:bf6 -> f2:5e:0d:5f:0b:f6
[snip]

rcgoodfellow · 2025-09-14T06:03:16Z

Further debug statements show that right after the transit router is restarted, state is consistent:

transit router restart passed
[sidecar.trio] /opt/scadm standalone --client /opt/mnt/client --server /opt/mnt/server dump-state
local v6:
fe80::54e8:80ff:fe1c:80a0
fe80::7028:28ff:fe98:7834
local v4:
router v6_idx:
fd00:1::/64 -> 0
fd00:2::/64 -> 1
router v6_routes:
1 -> fe80::6010:beff:fe82:5eab (2)
0 -> fe80::c855:3ff:fe32:8242 (1)

then when s1 is restarted, causing a withdraw to hit dendrite, state becomes inconsistent:

[s1.trio] pkill ddmd

[sidecar.trio] /opt/scadm standalone --client /opt/mnt/client --server /opt/mnt/server dump-state
local v6:
fe80::54e8:80ff:fe1c:80a0
fe80::7028:28ff:fe98:7834
local v4:
router v6_idx:
fd00:2::/64 -> 1
router v6_routes:
0 -> fe80::c855:3ff:fe32:8242 (1)

Note that fd00:2::/64 points to index 1, but that index no longer exists in the v6_routes table.
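This kind of dangling reference can be checked mechanically. A minimal sketch, assuming the two tables from the dump have already been parsed into maps; `dangling_prefixes` and `observed_state` are hypothetical helpers, not part of scadm or the test harness.

```rust
use std::collections::BTreeMap;

/// Return every prefix whose nexthop index has no entry in the routes table.
fn dangling_prefixes(
    v6_idx: &BTreeMap<String, u16>,
    v6_routes: &BTreeMap<u16, String>,
) -> Vec<String> {
    let mut dangling = Vec::new();
    for (prefix, idx) in v6_idx {
        if !v6_routes.contains_key(idx) {
            dangling.push(prefix.clone());
        }
    }
    dangling
}

/// The inconsistent state from the second dump, entered by hand.
fn observed_state() -> (BTreeMap<String, u16>, BTreeMap<u16, String>) {
    let v6_idx = BTreeMap::from([("fd00:2::/64".to_string(), 1)]);
    let v6_routes = BTreeMap::from([(0, "fe80::c855:3ff:fe32:8242".to_string())]);
    (v6_idx, v6_routes)
}
```

Running `dangling_prefixes` over `observed_state()` reports fd00:2::/64, since index 1 is absent from the routes map. A check like this could serve as an assertion after each dump-state call, if the harness grows a parser for that output.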

rcgoodfellow · 2025-09-15T02:36:35Z

tests/src/ddm.rs

-                "svccfg -s dendrite setprop config/uds_path = /opt/mnt",
-            )?;
-            self.zone.zexec(
+            if restart_dpd {


Gating dpd restart on this variable is the fix for the flakiness.
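The shape of the gate, reduced to a sketch. Everything here is hypothetical: `restart_transit_router` is not the harness function, the strings are placeholder labels rather than real commands, and the ordering of the two steps is an assumption.

```rust
/// Restart the router, and only restart dpd when the caller asks for it.
/// `exec` stands in for whatever runs a command inside the test zone.
fn restart_transit_router<F: FnMut(&str)>(mut exec: F, restart_dpd: bool) {
    // Placeholder label, not a real command line.
    exec("restart ddm");
    if restart_dpd {
        // Placeholder label. Skipping this step keeps dpd's index
        // allocation state in sync with what is on the softnpu ASIC.
        exec("restart dendrite");
    }
}
```

With `restart_dpd` set to false, only the router step runs, so a router restart in a test no longer implies a dpd restart.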

rcgoodfellow · 2025-09-15T02:37:03Z

tests/src/ddm.rs

            // copy is not complete.
            println!("waiting 3s for copy of files to zone to complete ...");
            sleep(Duration::from_secs(3));
+            self.zone.zcmd(


New requirement for omicron1 zones.

FelixMcFelix

Thanks for the writeup here Ry, the changes make sense.

Move from using sparse zones to omicron1 zones for testing as these are lightyears faster to stand up.

Yeah, it's significant! I see the trio/quartet/interop tests are all around 16mins faster. 🎉

Nieuwejaar · 2025-09-15T14:40:19Z

Thanks for the writeup here Ry, the changes make sense.

Move from using sparse zones to omicron1 zones for testing as these are lightyears faster to stand up.

Yeah, it's significant! I see the trio/quartet/interop tests are all around 16mins faster. 🎉

This is awesome.

rcgoodfellow added 12 commits September 9, 2025 11:12

debug ...

8cc16d5

trigger-ci

f39d7b8

🤦

91fb5ce

trigger-ci

fd72aa9

...

bfe2134

...

ff8da14

trigger ci

8aaa1f5

...

01b2fbb

trigger ci

f834012

...

2b55bd9

trigger ci

3c0edd2

...

108ed28

rcgoodfellow force-pushed the trio-flake branch from 37451cc to 108ed28 Compare September 12, 2025 20:18

rcgoodfellow added 5 commits September 12, 2025 13:58

...

c7f0f6a

...

cdfad58

is it source address selection!?

e101e40

.........

ec936d2

... .........

3dfa71b

rcgoodfellow changed the title ~~debug ...~~ trio test flake Sep 13, 2025

rcgoodfellow linked an issue Sep 13, 2025 that may be closed by this pull request

trio test is flaky #544

Closed

dump softnpu state

99831d7

rcgoodfellow added 6 commits September 13, 2025 16:34

trigger ci

d23f131

... ...

1da86e3

trigger ci

95bca9e

trigger ci

96d3e7a

trigger ci

f203cdf

trigger ci

c626d6e

rcgoodfellow mentioned this pull request Sep 14, 2025

softnpu: fix router v6 keyset data oxidecomputer/dendrite#129

Draft

rcgoodfellow force-pushed the trio-flake branch 4 times, most recently from 43a6023 to 0393a1d Compare September 14, 2025 22:45

bump dendrite, use omicron1 zone for tests

a6afd70

rcgoodfellow force-pushed the trio-flake branch from 0393a1d to a6afd70 Compare September 14, 2025 23:03

rcgoodfellow added 4 commits September 14, 2025 17:45

tests: router restart should not imply dpd restart

8683723

less derpy

0242534

ugh

5b9b3b1

revert dendrite bump

5a0a9b2

rcgoodfellow mentioned this pull request Sep 15, 2025

Implement clear for the softnpu ASIC. oxidecomputer/dendrite#130

Open

rcgoodfellow marked this pull request as ready for review September 15, 2025 02:28

rcgoodfellow commented Sep 15, 2025

cleanup

57a0fcb

FelixMcFelix approved these changes Sep 15, 2025

rcgoodfellow merged commit d10cc15 into main Sep 15, 2025
14 checks passed

rcgoodfellow deleted the trio-flake branch September 15, 2025 15:31
