E2E Flake Report - main branch (Nov 27 – Jan 28) #2750

25 failures out of 200 runs (**12.5% failure rate**, excluding 18 cancelled).

## Breakdown by error type

### 1. `failed to execute command from device: command failed with exit code 1` — 7 runs
Primarily affects `TestE2E_DeviceTelemetry` (6x). The underlying error is typically the TWAMP sender failing with `bind: cannot assign requested address`. It has been occurring since at least Dec 18.
- [Jan 27](https://github.com/malbeclabs/doublezero/actions/runs/21409406400) — `DeviceTelemetry`
- [Jan 20](https://github.com/malbeclabs/doublezero/actions/runs/21186096768) — `DeviceTelemetry`, `Multicast_Publisher`
- [Jan 20](https://github.com/malbeclabs/doublezero/actions/runs/21183071990) — `DeviceTelemetry`
- [Jan 20](https://github.com/malbeclabs/doublezero/actions/runs/21181857995) — `DeviceTelemetry`
- [Jan 9](https://github.com/malbeclabs/doublezero/actions/runs/20856293277) — `DeviceTelemetry`
- [Jan 8](https://github.com/malbeclabs/doublezero/actions/runs/20828937097) — `DeviceTelemetry`, `IBRL_WithAllocatedIP`
- [Dec 18](https://github.com/malbeclabs/doublezero/actions/runs/20350673582) — `DeviceTelemetry`

### 2. `failed to wait for client tunnel status BGP Session Up: polling cancelled or timed out` — 5 runs
Only affects `TestE2E_DeviceMaxusersRollover`. The client correctly targets the second device after max-users is set to 0 on device1, but the BGP session never establishes on device2 within the 90s timeout. **Started Jan 22** — zero occurrences in any test before that date.
- [Jan 28](https://github.com/malbeclabs/doublezero/actions/runs/21451841753)
- [Jan 27](https://github.com/malbeclabs/doublezero/actions/runs/21411973634)
- [Jan 26](https://github.com/malbeclabs/doublezero/actions/runs/21372175525)
- [Jan 23](https://github.com/malbeclabs/doublezero/actions/runs/21301248885)
- [Jan 23](https://github.com/malbeclabs/doublezero/actions/runs/21298657678)

### 3. `Condition never satisfied` (route convergence timeout) — 3 runs
Affects `TestE2E_MultiClient/ibrl_with_allocated_ip`. Route polling times out waiting for expected routes to appear.
- [Jan 28](https://github.com/malbeclabs/doublezero/actions/runs/21443433270)
- [Jan 27](https://github.com/malbeclabs/doublezero/actions/runs/21405023619)
- [Jan 22](https://github.com/malbeclabs/doublezero/actions/runs/21252489519)

### 4. `failed to start ledger: ... context deadline exceeded` — 4 runs
Ledger (Solana) container fails to start in time. Affects different tests each time — likely CI resource exhaustion.
- [Jan 21](https://github.com/malbeclabs/doublezero/actions/runs/21229958101) — `MultiClient`
- [Jan 20](https://github.com/malbeclabs/doublezero/actions/runs/21186096768) — `Multicast_Publisher` (co-occurring with #1)
- [Jan 15](https://github.com/malbeclabs/doublezero/actions/runs/21017101185) — `SDK_Serviceability`
- [Nov 27](https://github.com/malbeclabs/doublezero/actions/runs/19749500137) — `SDK_Telemetry_InternetLatencySamples`

### 5. `failed to execute command from client: command failed with exit code 1` — 2 runs
Affects `TestE2E_IBRL_WithAllocatedIP`.
- [Jan 14](https://github.com/malbeclabs/doublezero/actions/runs/21005400281)
- [Dec 16](https://github.com/malbeclabs/doublezero/actions/runs/20286488133)

### 6. `TestE2E_IBRL_WithAllocatedIP/doublezero_user_list` — 3 runs
The test fails at the `doublezero_user_list` check, but no clear error could be extracted (the assertion failures carry no detailed message).
- [Dec 19](https://github.com/malbeclabs/doublezero/actions/runs/20378424787)
- [Dec 2](https://github.com/malbeclabs/doublezero/actions/runs/19868582375)
- [Nov 27](https://github.com/malbeclabs/doublezero/actions/runs/19749941014)

### 7. Docker infra failure — 1 run
Massive failure: `pull access denied for dz-local/client` and multiple `network not found` errors. Almost every test failed. Likely a CI build/infra issue.
- [Dec 3](https://github.com/malbeclabs/doublezero/actions/runs/19900150824) — 8 tests failed

## Key observations

- Error types #1, #2, and #3 account for 15 of the 25 failures, and all look like timing/resource issues: device commands failing, BGP sessions not establishing, routes not converging.
- **Error #2 (BGP Session Up timeout) started on Jan 22 and did not occur before that date in any test.** It only hits `DeviceMaxusersRollover`. The test itself wasn't changed around that time.
- Error #4 (ledger startup timeout) likely signals CI resource exhaustion: it hits a different test each time and spans most of the date range (Nov 27 to Jan 21).
- `TestE2E_IBRL_WithAllocatedIP` has been a long-running flake (since Nov 27) with various error types, mostly at the `doublezero_user_list` / `ban_user` post-connect checks.
- Some runs have multiple tests failing simultaneously (e.g. Jan 8, Jan 20, Dec 3), reinforcing CI resource contention as a contributing factor.
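
The headline numbers can be cross-checked against the per-category counts. The sketch below only re-adds the figures already listed in this report; the `category` type and the helper names are illustrative.

```go
package main

import "fmt"

// category pairs an error type from the breakdown with its run count.
type category struct {
	name string
	runs int
}

// total sums the run counts of the given categories.
func total(cats []category) int {
	sum := 0
	for _, c := range cats {
		sum += c.runs
	}
	return sum
}

// failureRate returns failures as a percentage of runs.
func failureRate(failures, runs int) float64 {
	return 100 * float64(failures) / float64(runs)
}

func main() {
	// Counts copied from the section headings above.
	cats := []category{
		{"#1 device command exit code 1", 7},
		{"#2 BGP Session Up timeout", 5},
		{"#3 route convergence timeout", 3},
		{"#4 ledger startup timeout", 4},
		{"#5 client command exit code 1", 2},
		{"#6 doublezero_user_list", 3},
		{"#7 Docker infra failure", 1},
	}

	n := total(cats)
	fmt.Printf("total failures: %d\n", n)                      // 25
	fmt.Printf("failure rate: %.1f%%\n", failureRate(n, 200)) // 12.5%
	fmt.Printf("#1-#3 share: %d of %d\n", total(cats[:3]), n) // 15 of 25
}
```

Note that the Jan 20 run 21186096768 is listed under both #1 and #4, so the per-category sum counts that run twice.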
