E2E Flake Report - main branch (Nov 27 – Jan 28) #2750

@snormore

25 failures out of 200 runs (12.5% failure rate, excluding 18 cancelled).
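
The headline rate can be reproduced from the counts above. This is a minimal sketch, assuming the 200 runs are the completed runs that remain after the 18 cancelled ones are excluded; the function name is hypothetical.

```python
def failure_rate(failed: int, completed: int) -> float:
    """Return failures as a percentage of completed (non-cancelled) runs."""
    if completed <= 0:
        raise ValueError("completed must be positive")
    return 100.0 * failed / completed


# Counts from this report: 25 failures out of 200 runs (cancelled runs excluded).
print(f"{failure_rate(25, 200):.1f}%")  # 12.5%
```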

Breakdown by error type

1. failed to execute command from device: command failed with exit code 1 — 7 runs

Primarily affects TestE2E_DeviceTelemetry (6x). The underlying error is typically TWAMP sender bind: cannot assign requested address. Has been happening since at least Dec 18.

  • Jan 27: DeviceTelemetry
  • Jan 20: DeviceTelemetry, Multicast_Publisher
  • Jan 20: DeviceTelemetry
  • Jan 20: DeviceTelemetry
  • Jan 9: DeviceTelemetry
  • Jan 8: DeviceTelemetry, IBRL_WithAllocatedIP
  • Dec 18: DeviceTelemetry
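
"Cannot assign requested address" is the EADDRNOTAVAIL error, which bind returns when the requested source address is not available on the local host. If the cause here is a race between address assignment and sender startup (an unverified assumption), one possible mitigation is to retry the bind for a short window. The sketch below is in Python rather than the project's own code and assumes the sender binds a UDP socket; the function name, attempt count, and delay are illustrative.

```python
import errno
import socket
import time


def bind_udp_with_retry(addr: str, port: int, attempts: int = 5,
                        delay: float = 0.2) -> socket.socket:
    """Bind a UDP socket, retrying while the address is not yet assigned locally."""
    for attempt in range(1, attempts + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            sock.bind((addr, port))
            return sock
        except OSError as exc:
            sock.close()
            # Only EADDRNOTAVAIL ("cannot assign requested address") is retried;
            # any other error, or the final attempt, is raised to the caller.
            if exc.errno != errno.EADDRNOTAVAIL or attempt == attempts:
                raise
            time.sleep(delay)
    raise ValueError("attempts must be at least 1")
```

A retry like this only hides the symptom; logging each failed attempt would still help confirm whether the address is late or never assigned.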

2. failed to wait for client tunnel status BGP Session Up: polling cancelled or timed out — 5 runs

Only affects TestE2E_DeviceMaxusersRollover. Client correctly targets the second device after max-users is set to 0 on device1, but BGP session never establishes on device2 within the 90s timeout. Started Jan 22 — zero occurrences in any test before that date.

3. Condition never satisfied (route convergence timeout) — 3 runs

Affects TestE2E_MultiClient/ibrl_with_allocated_ip. Route polling times out waiting for expected routes to appear.
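
Both this timeout and the BGP one in error 2 report only that polling gave up. A minimal sketch, in Python and with hypothetical names, of a poll helper that keeps the last observed value so the timeout message shows how far the system got:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class PollTimeout(Exception):
    """Raised when the condition is not met before the deadline."""


def poll_until(fetch: Callable[[], T], done: Callable[[T], bool],
               timeout: float, interval: float = 1.0) -> T:
    """Call fetch() until done(value) is true or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        value = fetch()
        if done(value):
            return value
        if time.monotonic() >= deadline:
            # Include the last observation so the failure is diagnosable.
            raise PollTimeout(f"timed out after {timeout}s; last observed: {value!r}")
        time.sleep(interval)


# Hypothetical usage: wait until a fake session status reaches "up".
states = iter(["idle", "connect", "up"])
print(poll_until(lambda: next(states), lambda s: s == "up", timeout=5, interval=0.01))  # up
```

With the last state in the message, a route or BGP timeout would show whether the session was stuck early or nearly established.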

4. failed to start ledger: ... context deadline exceeded — 4 runs

Ledger (Solana) container fails to start in time. Affects different tests each time — likely CI resource exhaustion.

5. failed to execute command from client: command failed with exit code 1 — 2 runs

Affects TestE2E_IBRL_WithAllocatedIP.

6. TestE2E_IBRL_WithAllocatedIP/doublezero_user_list — 3 runs

Test fails at doublezero_user_list check but no clear error extracted (assertion failures without detailed message).
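
One way to make such failures self-describing is to have the check report which rows differ. The sketch below is in Python and entirely hypothetical: the helper name and the row strings are illustrative and do not reflect the real doublezero_user_list output or the project's test code.

```python
def assert_same_rows(expected: list[str], actual: list[str], context: str) -> None:
    """Fail with a message naming the rows that are missing or unexpected."""
    missing = sorted(set(expected) - set(actual))
    unexpected = sorted(set(actual) - set(expected))
    if missing or unexpected:
        raise AssertionError(
            f"{context}: missing rows {missing}, unexpected rows {unexpected}"
        )


# Hypothetical rows; a mismatch names the offending entries in the error text.
try:
    assert_same_rows(["user-a", "user-b"], ["user-a"], "user list check")
except AssertionError as exc:
    print(exc)  # user list check: missing rows ['user-b'], unexpected rows []
```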

7. Docker infra failure — 1 run

Massive failure: pull access denied for dz-local/client and multiple network not found errors. Almost every test failed. Likely a CI build/infra issue.

  • Dec 3 — 8 tests failed

Key observations

  • The first three error types (#1, #2, #3) account for 15 of the 25 failures and all look like timing/resource issues: commands failing, BGP sessions not establishing, routes not converging.
  • Error #2 (BGP Session Up timeout) started on Jan 22 and had never occurred before that date in any test. It only hits DeviceMaxusersRollover. The test itself wasn't changed around that time.
  • Error #4 (ledger startup timeout) is a clear CI resource exhaustion signal — spans the full date range.
  • TestE2E_IBRL_WithAllocatedIP has been a long-running flake (since Nov 27) with various error types, mostly at the doublezero_user_list / ban_user post-connect checks.
  • Some runs have multiple tests failing simultaneously (e.g. Jan 8, Jan 20, Dec 3), reinforcing CI resource contention as a contributing factor.
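
The per-type counts can be checked against the headline total. A short Python sketch using only the run counts listed in the breakdown above (the variable names are illustrative):

```python
# Runs per error type, copied from the breakdown in this report.
runs_by_error = {1: 7, 2: 5, 3: 3, 4: 4, 5: 2, 6: 3, 7: 1}

total = sum(runs_by_error.values())
first_three = sum(runs_by_error[n] for n in (1, 2, 3))

print(total)                         # 25
print(first_three)                   # 15
print(f"{first_three / total:.0%}")  # 60%
```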
