Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NFR Scale test doesn't count all reloads and batches #2009

Open
pleshakov opened this issue May 22, 2024 · 8 comments
Open

NFR Scale test doesn't count all reloads and batches #2009

pleshakov opened this issue May 22, 2024 · 8 comments
Labels
refined Requirements are refined and the issue is ready to be implemented. size/medium Estimated to be completed within a week tests Pull requests that update tests
Milestone

Comments

@pleshakov
Copy link
Contributor

pleshakov commented May 22, 2024

NFR Scale test doesn't count all reloads and batches.

Example of a run:

## Test TestScale_UpstreamServers

### Reloads

- Total: 61
- Total Errors: 0
- Average Time: 125ms
- Reload distribution:
	- 500ms: 61
	- 1000ms: 61
	- 5000ms: 61
	- 10000ms: 61
	- 30000ms: 61
	- +Infms: 61

### Event Batch Processing

- Total: 64
- Average Time: 232ms
- Event Batch Processing distribution:
	- 500ms: 63
	- 1000ms: 64
	- 5000ms: 64
	- 10000ms: 64
	- 30000ms: 64
	- +Infms: 64

However, looking at the NGF logs, there are 109 reloads and the last batchID is 114:

{"level":"info","ts":"2024-05-22T16:32:46Z","logger":"eventLoop.eventHandler","msg":"Finished handling the batch","batchID":64}
. . .
{"level":"info","ts":"2024-05-22T16:33:01Z","logger":"eventLoop.eventHandler","msg":"NGINX configuration was successfully updated","batchID":114}

Full log https://github.com/nginxinc/nginx-gateway-fabric/blob/d443d6a55988eae8091d230dbbd90ceef1de2c8f/tests/results/scale/edge/TestScale_UpstreamServers/ngf.log

Acceptance criteria:

  • Make sure reloads and batches are counted correctly
@kate-osborn
Copy link
Contributor

@pleshakov did other tests run against this instance of NGF before this one?

@pleshakov
Copy link
Contributor Author

@kate-osborn
nope. in fact, each scale tests re-installs NGF so that its logs are fresh (no carried over errors)

@kate-osborn
Copy link
Contributor

@kate-osborn nope. in fact, each scale tests re-installs NGF so that its logs are fresh (no carried over errors)

That's good. Well, in that case, we may be able to simplify the Prometheus queries and remove the start and end times.

@pleshakov
Copy link
Contributor Author

@kate-osborn

the queries that migrated from test instructions to the scale_test.go don't use end time. But they do use start time, so that any NGF reconfiguration prior to the test is not counted.

reloadBuckets := getBuckets(
fmt.Sprintf(
`nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{pod="%[1]s"}`+
` - `+
`nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{pod="%[1]s"} @ %d`,
ngfPodName,
startTime.Unix(),
),
)

@kate-osborn
Copy link
Contributor

@kate-osborn

the queries that migrated from test instructions to the scale_test.go don't use end time. But they do use start time, so that any NGF reconfiguration prior to the test is not counted.

reloadBuckets := getBuckets(
fmt.Sprintf(
`nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{pod="%[1]s"}`+
` - `+
`nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{pod="%[1]s"} @ %d`,
ngfPodName,
startTime.Unix(),
),
)

Ok, so NGF is installed before each test, but the prometheus instance is shared among all the tests. I wonder if the start time is what's causing this issue.

@pleshakov
Copy link
Contributor Author

Ok, so NGF is installed before each test, but the prometheus instance is shared among all the tests. I wonder if the start time is what's causing this issue.

I see. we do use pod name though in each query, which should be unique per installation

@kate-osborn
Copy link
Contributor

Ok, so NGF is installed before each test, but the prometheus instance is shared among all the tests. I wonder if the start time is what's causing this issue.

I see. we do use pod name though in each query, which should be unique per installation

I'm thinking it has more to do with when prometheus has scraped the metrics from NGF. Prometheus is configured to scrape every 10 seconds, right? So maybe it's not the start time, but the time we send the query. Is it possible that we send the query before prometheus has scraped the latest metrics from NGF?

@pleshakov
Copy link
Contributor Author

pleshakov commented May 22, 2024

I'm thinking it has more to do with when prometheus has scraped the metrics from NGF. Prometheus is configured to scrape every 10 seconds, right? So maybe it's not the start time, but the time we send the query. Is it possible that we send the query before prometheus has scraped the latest metrics from NGF?

quite possible. there is a 2 times scrapes interval timeout after the test finishes. after that, we also check that mertics exist at the end time

time.Sleep(2 * scrapeInterval)

queries = []string{
fmt.Sprintf(`container_memory_usage_bytes{pod="%s",container="nginx-gateway"}`, ngfPodName),
// We don't need to check all nginx_gateway_fabric_* metrics, as they are collected at the same time
fmt.Sprintf(`nginx_gateway_fabric_nginx_reloads_total{pod="%s"}`, ngfPodName),
}
for _, q := range queries {
Eventually(
createMetricExistChecker(
q,
getEndTime,
noOpModifier,
),
).WithTimeout(metricExistTimeout).WithPolling(metricExistPolling).Should(Succeed())
}

that should have prevented not having recent metrics in Prometheus. but it somehow doesn't work :(

pleshakov added a commit that referenced this issue May 23, 2024
Problem:
Scale test is not part of Github actions pipeline

Solution:
- Add NFR scale test to GitHub actions pipeline along other NFR tests.
- Increase the size of the cluster used for NFR tests, as the scale
  test requires bigger size.

Testing:
- Successfully run with NGINX -- #2002
- Successfully run with NGINX Plus -- #2017

Some scale test issues were discovered:
- #2023
- #2009

Closes #1927
pleshakov added a commit that referenced this issue May 23, 2024
Problem:
Scale test is not part of Github actions pipeline

Solution:
- Add NFR scale test to GitHub actions pipeline along other NFR tests.
- Increase the size of the cluster used for NFR tests, as the scale
  test requires bigger size.

Testing:
- Successfully run with NGINX -- #2002
- Successfully run with NGINX Plus -- #2017

Some scale test issues were discovered:
- #2023
- #2009

Closes #1927
@mpstefan mpstefan added tests Pull requests that update tests bug Something isn't working labels Jun 3, 2024
@mpstefan mpstefan added this to the v1.4.0 milestone Jun 3, 2024
@mpstefan mpstefan added size/medium Estimated to be completed within a week refined Requirements are refined and the issue is ready to be implemented. and removed bug Something isn't working labels Jul 1, 2024
@mpstefan mpstefan modified the milestones: v1.4.0, v2.3.0 Jul 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refined Requirements are refined and the issue is ready to be implemented. size/medium Estimated to be completed within a week tests Pull requests that update tests
Projects
Status: 🆕 New
Development

No branches or pull requests

3 participants