Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

admin: remove throw on unknown self test type in admin_server #21370

Merged

Conversation

WillemKauf
Copy link
Contributor

@WillemKauf WillemKauf commented Jul 12, 2024

To help with compatibility for clusters with mixed versions of redpanda, the throw statement in admin_server::self_test_start_handler has been removed.

Previously, in a cluster with mixed versions including 24.1.x and 24.2.x, this throw would prevent the self-test from running, due to the addition of cloudcheck in 24.2.x, resulting in a Bad Request, 400 code.

unable to start self test: request POST http://127.0.0.1:9644/v1/debug/self_test/start failed: Bad Request, body: "{\"message\": \"Unknown self_test 'type', valid options are 'disk' or 'network'\", \"code\": 400}"

We now push_back() unknown test types to self_test_request::unknown_checks, and include the unknown test in rpk cluster self-test status with a generic error message:

NODE ID: 0 | STATUS: IDLE
=========================
NAME          pandatest
TYPE          pandatest
TEST ID       a1100807-84e1-426c-a5e7-6e1b4d685b5e
TIMEOUTS      0
START TIME    Mon Jul 15 20:22:02 UTC 2024
END TIME      Mon Jul 15 20:22:02 UTC 2024
AVG DURATION  0ms
ERROR         Unknown test type pandatest requested on node 0

If only an unknown test is specified, the server will (expectedly) throw due to no tests being run.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

Improvements

  • Allow rpk cluster self-test start to run, even in a cluster with mixed versions of redpanda (before and after cloudcheck addition in 24.2.x).

Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the issue is that in a mixed-version cluster, the rpk command targets the older version admin server which doesn't support the requested command, so an error is returned?

It kinda seems like we should return an error in this case (at the very least--we could run the other tests and and still return an error).

Another option might be to create a better behavior at the rpk layer.

Is this problem common? For example, if mixed version state is rare, we could request rpk to direct the request to a specific broker?

@WillemKauf
Copy link
Contributor Author

So the issue is that in a mixed-version cluster, the rpk command targets the older version admin server which doesn't support the requested command, so an error is returned?

Regardless of which admin server in a cluster the request to run a self-test is invoked upon (though, this defaults to the controller leader), if node IDs are not specified in the request made, the default behavior is to start the self-test on all nodes via RPC. The same self-test options will get fanned out to all nodes.

So, if any nodes in a mixed cluster are of a version before 24.2.x (in which cloudcheck is landing), this throw will result in blocking the self-test start request every time (when using an updated rpk, which will append cloudcheck related properties to the request json).

It kinda seems like we should return an error in this case (at the very least--we could run the other tests and and still return an error).

Do you think that running the tests that are valid (known) to a node, while ignoring and logging a statement about the unknown test types seems like valid behavior?

Another option might be to create a better behavior at the rpk layer.

What behavior would you want to see? Could rpk configure individual self-test requests for each node in a cluster?

Is this problem common? For example, if mixed version state is rare, we could request rpk to direct the request to a specific broker?

Likely not common, but something that we discovered while preparing docs for 24.2.1

@dotnwat
Copy link
Member

dotnwat commented Jul 12, 2024

What behavior would you want to see?

I'm not sure what behavior makes the most sense. But silently dropping at the api level what was requested doesn't strike me as the most desirable.

Is this issue being discussed some where like a ticket capturing the issue?

@WillemKauf
Copy link
Contributor Author

Is this issue being discussed some where like a ticket capturing the issue?

No, it was brought to my attention in a Slack thread.

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Jul 12, 2024

skipped ducktape retry in https://buildkite.com/redpanda/redpanda/builds/51442#0190a8a9-1c5e-4108-859b-d2f781f4ee79:
pandatriage cache was not found

skipped ducktape retry in https://buildkite.com/redpanda/redpanda/builds/51442#0190a8aa-202c-4001-a234-988e48345821:
pandatriage cache was not found

@dotnwat
Copy link
Member

dotnwat commented Jul 15, 2024

Summary here from offline meeting:

  1. rest api runs the tests that do exist, but return contextual error information indicating if a test did not run for some reason
  2. user experience is that error reporting includes this contextual information "on the following nodes, the following tests did not run" (or something like that). more information could be in the logs if it's hard to return it to rpk.

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Jul 15, 2024

@WillemKauf
Copy link
Contributor Author

CI failures are unrelated (and closed)

Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there compatibility concerns with say older rpk talking to newer redpanda if we are now returning a different style of error messaging that reports tests that weren't able to run for some reason?

@WillemKauf
Copy link
Contributor Author

are there compatibility concerns with say older rpk talking to newer redpanda if we are now returning a different style of error messaging that reports tests that weren't able to run for some reason?

Nope, we haven't changed anything about the rpk types for the response, so this isn't a concern. We use the same type as before to report tests that were not ran.

@WillemKauf WillemKauf requested a review from andrwng July 29, 2024 21:52
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Jul 29, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/51543#01910076-b628-4903-a98a-b30894ffab0a:

"rptest.tests.cloud_storage_chunk_read_path_test.CloudStorageChunkReadTest.test_read_chunks"

new failures in https://buildkite.com/redpanda/redpanda/builds/51543#01910108-3264-4ff3-83b0-8af934322a6b:

"rptest.tests.cloud_storage_chunk_read_path_test.CloudStorageChunkReadTest.test_read_chunks"

@WillemKauf WillemKauf force-pushed the mixed_node_cluster_self_test branch from e93a874 to cc4c589 Compare July 30, 2024 02:15
@WillemKauf
Copy link
Contributor Author

Force push to bump serde::envelope version in start_test_request.

@andrwng, I have a solution for case 2 that I will include in a follow up PR (I won't want to backport it to previous versions beyond v24.2.x).

For logging purposes, when an unrecognized test type is requested
through the `self_test_start_handler`, it will be abstracted by
the `unknown_check` struct.
This commit removes the `throw` statement in
`admin_server::self_test_start_handler()` and replaces it instead
with a `push_back()` to `self_test_request::unknown_checks`.

Eventually, these unknown checks will have a result displayed
when `rpk cluster self-test status` is invoked.

For the self-test, any unrecognized tests will be appended to
`start_test_request::unknown_checks`, so a future result from
`rpk cluster self-test status` will return a message indicating
an unknown test was skipped.
For testing of the previous change in which unknown test types
are added to `start_test_request::unknown_checks`, instead of
resulting in a `throw`.
`test_type` was not being set in early exit cases for `cloudcheck`.
This would result in longer than expected output (with `IOPS`,
`THROUGHPUT`, `LATENCY` in `rpk cluster self-test status`).
@WillemKauf WillemKauf force-pushed the mixed_node_cluster_self_test branch from cc4c589 to 147d5e9 Compare July 30, 2024 03:29
@WillemKauf
Copy link
Contributor Author

Last force push to include some work that will require backporting. Future PR will build on this one.

Copy link
Contributor

@andrwng andrwng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +249 to +251

#Attempt to run with an unknown test type "pandatest"
#and possibly unknown "cloud" test.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would be good to be consistent about spacing, even in comments (#comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix in follow up.

@@ -64,6 +65,7 @@ cloudcheck::run(cloudcheck_opts opts) {
"Cloud storage is not enabled, exiting cloud storage self-test.");
auto result = self_test_result{
.name = _opts.name,
.test_type = "cloud",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't feel strongly that it needs to be in this PR, but it'd have been nice to see a test that ran the cloud check via RPK and parsed the output to assert that we don't print "IOPS" etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add in follow up PR.

@andrwng
Copy link
Contributor

andrwng commented Aug 1, 2024

Also just noting that we chatted offline, and Willem has some follow-up changes to have the self test controller pass through unknown tests to workers, in case of mixed versions

@WillemKauf WillemKauf merged commit 9fe41c5 into redpanda-data:dev Aug 2, 2024
21 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v24.2.x

@vbotbuildovich
Copy link
Collaborator

/backport v24.1.x

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v24.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-21370-v24.1.x-290 remotes/upstream/v24.1.x
git cherry-pick -x aa93f3f48341cb9e9966051a9759f5d08f686ec9 93b1901e4f532bcc0aabe8746b81e0d8534c388f 4e99539bd1a82857882c7a14201dc07cdcb53980 3bac4dc6f40287aaab2b6deae6f76c3434ccafd6 147d5e96ffb320d7b6f1212f53435b2cdb9aef4f

Workflow run logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants