NFR Scale test doesn't count all reloads and batches #2009

pleshakov · 2024-05-22T18:33:44Z

NFR Scale test doesn't count all reloads and batches.

Example of a run:

## Test TestScale_UpstreamServers

### Reloads

- Total: 61
- Total Errors: 0
- Average Time: 125ms
- Reload distribution:
	- 500ms: 61
	- 1000ms: 61
	- 5000ms: 61
	- 10000ms: 61
	- 30000ms: 61
	- +Infms: 61

### Event Batch Processing

- Total: 64
- Average Time: 232ms
- Event Batch Processing distribution:
	- 500ms: 63
	- 1000ms: 64
	- 5000ms: 64
	- 10000ms: 64
	- 30000ms: 64
	- +Infms: 64

However, looking at the NGF logs, there are 109 reloads and the last batchID is 114:

{"level":"info","ts":"2024-05-22T16:32:46Z","logger":"eventLoop.eventHandler","msg":"Finished handling the batch","batchID":64}
. . .
{"level":"info","ts":"2024-05-22T16:33:01Z","logger":"eventLoop.eventHandler","msg":"NGINX configuration was successfully updated","batchID":114}

Full log https://github.com/nginxinc/nginx-gateway-fabric/blob/d443d6a55988eae8091d230dbbd90ceef1de2c8f/tests/results/scale/edge/TestScale_UpstreamServers/ngf.log

Acceptance criteria:

Make sure reloads and batches are counted correctly

The text was updated successfully, but these errors were encountered:

kate-osborn · 2024-05-22T18:38:57Z

@pleshakov did other tests run against this instance of NGF before this one?

pleshakov · 2024-05-22T18:43:00Z

@kate-osborn
nope. in fact, each scale tests re-installs NGF so that its logs are fresh (no carried over errors)

kate-osborn · 2024-05-22T18:48:31Z

@kate-osborn nope. in fact, each scale tests re-installs NGF so that its logs are fresh (no carried over errors)

That's good. Well, in that case, we may be able to simplify the Prometheus queries and remove the start and end times.

pleshakov · 2024-05-22T18:56:47Z

@kate-osborn

the queries that migrated from test instructions to the scale_test.go don't use end time. But they do use start time, so that any NGF reconfiguration prior to the test is not counted.

nginx-gateway-fabric/tests/suite/scale_test.go

Lines 382 to 390 in 5f9ec2d

    
           reloadBuckets := getBuckets( 
        
           	fmt.Sprintf( 
        
           		`nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{pod="%[1]s"}`+ 
        
           			` - `+ 
        
           			`nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{pod="%[1]s"} @ %d`, 
        
           		ngfPodName, 
        
           		startTime.Unix(), 
        
           	), 
        
           )

kate-osborn · 2024-05-22T19:48:09Z

@kate-osborn

the queries that migrated from test instructions to the scale_test.go don't use end time. But they do use start time, so that any NGF reconfiguration prior to the test is not counted.

nginx-gateway-fabric/tests/suite/scale_test.go

Lines 382 to 390 in 5f9ec2d

reloadBuckets := getBuckets(

fmt.Sprintf(

`nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{pod="%[1]s"}`+

` - `+

`nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{pod="%[1]s"} @ %d`,

ngfPodName,

startTime.Unix(),

),

)

Ok, so NGF is installed before each test, but the prometheus instance is shared among all the tests. I wonder if the start time is what's causing this issue.

pleshakov · 2024-05-22T19:56:52Z

Ok, so NGF is installed before each test, but the prometheus instance is shared among all the tests. I wonder if the start time is what's causing this issue.

I see. we do use pod name though in each query, which should be unique per installation

kate-osborn · 2024-05-22T20:09:59Z

Ok, so NGF is installed before each test, but the prometheus instance is shared among all the tests. I wonder if the start time is what's causing this issue.

I see. we do use pod name though in each query, which should be unique per installation

I'm thinking it has more to do with when prometheus has scraped the metrics from NGF. Prometheus is configured to scrape every 10 seconds, right? So maybe it's not the start time, but the time we send the query. Is it possible that we send the query before prometheus has scraped the latest metrics from NGF?

pleshakov · 2024-05-22T20:38:10Z

I'm thinking it has more to do with when prometheus has scraped the metrics from NGF. Prometheus is configured to scrape every 10 seconds, right? So maybe it's not the start time, but the time we send the query. Is it possible that we send the query before prometheus has scraped the latest metrics from NGF?

quite possible. there is a 2 times scrapes interval timeout after the test finishes. after that, we also check that mertics exist at the end time

nginx-gateway-fabric/tests/suite/scale_test.go

Line 270 in 5f9ec2d

time.Sleep(2 * scrapeInterval)

nginx-gateway-fabric/tests/suite/scale_test.go

Lines 290 to 304 in 5f9ec2d

    
           queries = []string{ 
        
           	fmt.Sprintf(`container_memory_usage_bytes{pod="%s",container="nginx-gateway"}`, ngfPodName), 
        
           	// We don't need to check all nginx_gateway_fabric_* metrics, as they are collected at the same time 
        
           	fmt.Sprintf(`nginx_gateway_fabric_nginx_reloads_total{pod="%s"}`, ngfPodName), 
        
           } 
        
           for _, q := range queries { 
        
           	Eventually( 
        
           		createMetricExistChecker( 
        
           			q, 
        
           			getEndTime, 
        
           			noOpModifier, 
        
           		), 
        
           	).WithTimeout(metricExistTimeout).WithPolling(metricExistPolling).Should(Succeed()) 
        
           }

that should have prevented not having recent metrics in Prometheus. but it somehow doesn't work :(

Problem: Scale test is not part of Github actions pipeline Solution: - Add NFR scale test to GitHub actions pipeline along other NFR tests. - Increase the size of the cluster used for NFR tests, as the scale test requires bigger size. Testing: - Successfully run with NGINX -- #2002 - Successfully run with NGINX Plus -- #2017 Some scale test issues were discovered: - #2023 - #2009 Closes #1927

pleshakov mentioned this issue May 23, 2024

Add NFR scale test to GitHub actions pipeline #2024

Merged

6 tasks

mpstefan added tests Pull requests that update tests bug Something isn't working labels Jun 3, 2024

mpstefan added this to the v1.4.0 milestone Jun 3, 2024

mpstefan added size/medium Estimated to be completed within a week refined Requirements are refined and the issue is ready to be implemented. and removed bug Something isn't working labels Jul 1, 2024

mpstefan modified the milestones: v1.4.0, v2.3.0 Jul 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NFR Scale test doesn't count all reloads and batches #2009

NFR Scale test doesn't count all reloads and batches #2009

pleshakov commented May 22, 2024 •

edited

Loading

kate-osborn commented May 22, 2024

pleshakov commented May 22, 2024

kate-osborn commented May 22, 2024

pleshakov commented May 22, 2024

kate-osborn commented May 22, 2024

pleshakov commented May 22, 2024

kate-osborn commented May 22, 2024

pleshakov commented May 22, 2024 •

edited

Loading

NFR Scale test doesn't count all reloads and batches #2009

NFR Scale test doesn't count all reloads and batches #2009

Comments

pleshakov commented May 22, 2024 • edited Loading

kate-osborn commented May 22, 2024

pleshakov commented May 22, 2024

kate-osborn commented May 22, 2024

pleshakov commented May 22, 2024

kate-osborn commented May 22, 2024

pleshakov commented May 22, 2024

kate-osborn commented May 22, 2024

pleshakov commented May 22, 2024 • edited Loading

pleshakov commented May 22, 2024 •

edited

Loading

pleshakov commented May 22, 2024 •

edited

Loading