# Results

## Test environment

NGINX Plus: false

NGINX Gateway Fabric:

- Commit: e4eed2dad213387e6493e76100d285483ccbf261
- Date: 2025-10-17T14:41:02Z
- Dirty: false

GKE Cluster:

- Node count: 3
- k8s version: v1.33.5-gke.1080000
- vCPUs per node: 2
- RAM per node: 4015668Ki
- Max pods per node: 110
- Zone: europe-west2-a
- Instance Type: e2-medium
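
For reproducibility, the node details above can be pulled straight from the cluster; a minimal sketch using standard kubectl fields and well-known node labels (the exact columns shown depend on the kubectl version):

```text
# Kubernetes version, instance type, and zone per node
kubectl get nodes -L node.kubernetes.io/instance-type,topology.kubernetes.io/zone

# CPU, memory, and max-pod capacity per node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.cpu}{"\t"}{.status.capacity.memory}{"\t"}{.status.capacity.pods}{"\n"}{end}'
```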

## Summary

- Still a lot of non-2xx or 3xx responses, but vastly improved on the last test run.
- This indicates that while most of the Agent-to-control-plane connection issues have been resolved, some issues remain.
- All of the observed 502s happened within a single window of time, which at least indicates the system was able to recover, although it is unclear what triggered the Agent's config re-apply.
- The increase in memory usage for NGF seen in the previous test run appears to have been resolved.
- We observe a steady increase in NGINX memory usage over time, which could indicate a memory leak.
- CPU usage remained consistent with past results.
- Errors seem to be related to a cluster upgrade or some other external factor (excluding the resolved InferencePool status error).

## Traffic

HTTP:

```text
Running 5760m test @ http://cafe.example.com/coffee
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   202.19ms  150.51ms   2.00s    83.62%
    Req/Sec   272.67    178.26     2.59k    63.98%
  183598293 requests in 5760.00m, 62.80GB read
  Socket errors: connect 0, read 338604, write 82770, timeout 57938
  Non-2xx or 3xx responses: 33893
Requests/sec:    531.24
Transfer/sec:    190.54KB
```

HTTPS:

```text
Running 5760m test @ https://cafe.example.com/tea
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   189.21ms  108.25ms   2.00s    66.82%
    Req/Sec   271.64    178.03     1.96k    63.33%
  182905321 requests in 5760.00m, 61.55GB read
  Socket errors: connect 10168, read 332301, write 0, timeout 96
Requests/sec:    529.24
Transfer/sec:    186.76KB
```
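
For reference, output in this format comes from wrk; invocations along these lines would produce the numbers above (the exact wrapper used by the longevity test is not shown here, so treat the flags as an assumption):

```text
# 4-day (5760 minute) run against each listener, 2 threads and 100 connections
wrk -t2 -c100 -d5760m http://cafe.example.com/coffee
wrk -t2 -c100 -d5760m https://cafe.example.com/tea
```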

## Key Metrics

### Containers memory


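
The memory graph above can be reproduced with a query over the standard cAdvisor working-set metric; a sketch, where the namespace and container label values are assumptions about this deployment:

```text
container_memory_working_set_bytes{namespace="nginx-gateway", container=~"nginx|nginx-gateway"}
```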

### Containers CPU


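
Similarly, per-container CPU usage is a rate over the cumulative CPU counter (same label assumptions as the memory query above):

```text
rate(container_cpu_usage_seconds_total{namespace="nginx-gateway", container=~"nginx|nginx-gateway"}[5m])
```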

## Error Logs

### nginx-gateway

- msg: Config apply failed, rolling back config; error: error getting file data for name:"/etc/nginx/conf.d/http.conf" hash:"Luqynx2dkxqzXH21wmiV0nj5bHyGiIq7/2gOoM6aKew=" permissions:"0644" size:5430: rpc error: code = NotFound desc = file not found -> happened twice over the 4 days, related to Agent reconciliation during token rotation
  - {hashFound: jmeyy1p+6W1icH2x2YGYffH1XtooWxvizqUVd+WdzQ4=, hashWanted: Luqynx2dkxqzXH21wmiV0nj5bHyGiIq7/2gOoM6aKew=, level: debug, logger: nginxUpdater.fileService, msg: File found had wrong hash, ts: 2025-10-18T18:11:24Z}
  - The error indicates the Agent requested a file that had since changed (see the hash-check sketch after this list)

- msg: Failed to update lock optimistically: the server was unable to return a response in the time allotted, but may still be processing the request (put leases.coordination.k8s.io ngf-longevity-nginx-gateway-fabric-leader-election), falling back to slow path -> same leader election error as seen on Plus; appears to be out of scope for our product

- msg: no matches for kind "InferencePool" in version "inference.networking.k8s.io/v1" -> thousands of these, but fixed in PR 4104
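
One way to dig into the config-apply error above is to hash the file on disk in the NGINX container and compare it with the hashFound/hashWanted values from the debug log; a sketch, assuming the hash is a base64-encoded SHA-256 digest, that openssl is available in the image, and with the namespace and pod name as placeholders:

```text
kubectl exec -n nginx-gateway <nginx-pod> -c nginx -- \
  sh -c 'openssl dgst -sha256 -binary /etc/nginx/conf.d/http.conf | base64'
# compare the output with hashFound/hashWanted from the debug log above
```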

### nginx

Traffic: nearly 34,000 502 responses

- These all happened in the same window of less than a minute (approx. 2025-10-18T18:11:11 - 2025-10-18T18:11:50) and resolved once NGINX restarted
- It is unclear what triggered NGINX to restart, though a memory spike was observed around this time
- The outage correlates with the config apply error seen in the control plane logs
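
To confirm the 502s are confined to that window, the NGINX access logs can be filtered by time and status; a rough sketch with the namespace and pod name as placeholders, assuming an access log format where the status code follows the quoted request line:

```text
kubectl logs -n nginx-gateway <nginx-pod> -c nginx --since-time=2025-10-18T18:11:00Z \
  | grep -c '" 502 '
```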