# Results

## Test environment

NGINX Plus: false

NGINX Gateway Fabric:

- Commit: e4eed2dad213387e6493e76100d285483ccbf261
- Date: 2025-10-17T14:41:02Z
- Dirty: false

GKE Cluster:

- Node count: 3
- k8s version: v1.33.5-gke.1080000
- vCPUs per node: 2
- RAM per node: 4015668Ki
- Max pods per node: 110
- Zone: europe-west2-a
- Instance Type: e2-medium
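
For reproducibility, the node details above can be pulled straight from the cluster; a minimal sketch using standard kubectl fields and well-known node labels (the exact columns shown depend on the kubectl version):

```text
# Kubernetes version, instance type, and zone per node
kubectl get nodes -L node.kubernetes.io/instance-type,topology.kubernetes.io/zone

# CPU, memory, and max-pod capacity per node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.cpu}{"\t"}{.status.capacity.memory}{"\t"}{.status.capacity.pods}{"\n"}{end}'
```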

## Summary

- Still a lot of non-2xx or 3xx responses, but vastly improved on the last test run.
- This indicates that while most of the Agent-to-control-plane connection issues have been resolved, some issues remain.
- All of the observed 502s happened within a single window of time, which at least indicates the system was able to recover, although it is unclear what triggered the Agent's config re-apply.
- The increase in memory usage for NGF seen in the previous test run appears to have been resolved.
- We observe a steady increase in NGINX memory usage over time, which could indicate a memory leak.
- CPU usage remained consistent with past results.
- Errors seem to be related to a cluster upgrade or some other external factor (excluding the resolved InferencePool status error).

## Traffic

HTTP:

```text
Running 5760m test @ http://cafe.example.com/coffee
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   202.19ms  150.51ms   2.00s    83.62%
    Req/Sec   272.67    178.26     2.59k    63.98%
  183598293 requests in 5760.00m, 62.80GB read
  Socket errors: connect 0, read 338604, write 82770, timeout 57938
  Non-2xx or 3xx responses: 33893
Requests/sec:    531.24
Transfer/sec:    190.54KB
```

HTTPS:

```text
Running 5760m test @ https://cafe.example.com/tea
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   189.21ms  108.25ms   2.00s    66.82%
    Req/Sec   271.64    178.03     1.96k    63.33%
  182905321 requests in 5760.00m, 61.55GB read
  Socket errors: connect 10168, read 332301, write 0, timeout 96
Requests/sec:    529.24
Transfer/sec:    186.76KB
```
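
For reference, output in this format comes from wrk; invocations along these lines would produce the numbers above (the exact wrapper used by the longevity test is not shown here, so treat the flags as an assumption):

```text
# 4-day (5760 minute) run against each listener, 2 threads and 100 connections
wrk -t2 -c100 -d5760m http://cafe.example.com/coffee
wrk -t2 -c100 -d5760m https://cafe.example.com/tea
```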

## Key Metrics

### Containers memory


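
The memory graph above can be reproduced with a query over the standard cAdvisor working-set metric; a sketch, where the namespace and container label values are assumptions about this deployment:

```text
container_memory_working_set_bytes{namespace="nginx-gateway", container=~"nginx|nginx-gateway"}
```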

### Containers CPU


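
Similarly, per-container CPU usage is a rate over the cumulative CPU counter (same label assumptions as the memory query above):

```text
rate(container_cpu_usage_seconds_total{namespace="nginx-gateway", container=~"nginx|nginx-gateway"}[5m])
```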

## Error Logs

### nginx-gateway

- msg: Config apply failed, rolling back config; error: error getting file data for name:"/etc/nginx/conf.d/http.conf" hash:"Luqynx2dkxqzXH21wmiV0nj5bHyGiIq7/2gOoM6aKew=" permissions:"0644" size:5430: rpc error: code = NotFound desc = file not found -> happened twice over the 4 days, related to Agent reconciliation during token rotation
  - {hashFound: jmeyy1p+6W1icH2x2YGYffH1XtooWxvizqUVd+WdzQ4=, hashWanted: Luqynx2dkxqzXH21wmiV0nj5bHyGiIq7/2gOoM6aKew=, level: debug, logger: nginxUpdater.fileService, msg: File found had wrong hash, ts: 2025-10-18T18:11:24Z}
  - The error indicates the Agent requested a file that had since changed (see the hash-check sketch after this list)

- msg: Failed to update lock optimistically: the server was unable to return a response in the time allotted, but may still be processing the request (put leases.coordination.k8s.io ngf-longevity-nginx-gateway-fabric-leader-election), falling back to slow path -> same leader election error as seen on Plus; appears to be out of scope for our product

- msg: no matches for kind "InferencePool" in version "inference.networking.k8s.io/v1" -> thousands of these, but fixed in PR 4104
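
One way to dig into the config-apply error above is to hash the file on disk in the NGINX container and compare it with the hashFound/hashWanted values from the debug log; a sketch, assuming the hash is a base64-encoded SHA-256 digest, that openssl is available in the image, and with the namespace and pod name as placeholders:

```text
kubectl exec -n nginx-gateway <nginx-pod> -c nginx -- \
  sh -c 'openssl dgst -sha256 -binary /etc/nginx/conf.d/http.conf | base64'
# compare the output with hashFound/hashWanted from the debug log above
```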

### nginx

Traffic: nearly 34,000 502 responses

- These all happened in the same window of less than a minute (approx. 2025-10-18T18:11:11 - 2025-10-18T18:11:50) and resolved once NGINX restarted
- It is unclear what triggered NGINX to restart, though a memory spike was observed around this time
- The outage correlates with the config apply error seen in the control plane logs
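
To confirm the 502s are confined to that window, the NGINX access logs can be filtered by time and status; a rough sketch with the namespace and pod name as placeholders, assuming an access log format where the status code follows the quoted request line:

```text
kubectl logs -n nginx-gateway <nginx-pod> -c nginx --since-time=2025-10-18T18:11:00Z \
  | grep -c '" 502 '
```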