admin: remove `throw` on unknown self test type in `admin_server` #21370

WillemKauf · 2024-07-12T19:25:07Z

To help with compatibility for clusters with mixed versions of redpanda, the throw statement in admin_server::self_test_start_handler has been removed.

Previously, in a cluster with mixed versions including 24.1.x and 24.2.x, this throw would prevent the self-test from running, due to the addition of cloudcheck in 24.2.x, resulting in a Bad Request, 400 code.

unable to start self test: request POST http://127.0.0.1:9644/v1/debug/self_test/start failed: Bad Request, body: "{\"message\": \"Unknown self_test 'type', valid options are 'disk' or 'network'\", \"code\": 400}"

We now push_back() unknown test types to self_test_request::unknown_checks, and include the unknown test in rpk cluster self-test status with a generic error message:

NODE ID: 0 | STATUS: IDLE
=========================
NAME          pandatest
TYPE          pandatest
TEST ID       a1100807-84e1-426c-a5e7-6e1b4d685b5e
TIMEOUTS      0
START TIME    Mon Jul 15 20:22:02 UTC 2024
END TIME      Mon Jul 15 20:22:02 UTC 2024
AVG DURATION  0ms
ERROR         Unknown test type pandatest requested on node 0

If only an unknown test is specified, the server will (expectedly) throw due to no tests being run.

Backports Required

Release Notes

Improvements

Allow rpk cluster self-test start to run, even in a cluster with mixed versions of redpanda (before and after cloudcheck addition in 24.2.x).

dotnwat

So the issue is that in a mixed-version cluster, the rpk command targets the older version admin server which doesn't support the requested command, so an error is returned?

It kinda seems like we should return an error in this case (at the very least--we could run the other tests and and still return an error).

Another option might be to create a better behavior at the rpk layer.

Is this problem common? For example, if mixed version state is rare, we could request rpk to direct the request to a specific broker?

WillemKauf · 2024-07-12T19:53:33Z

So the issue is that in a mixed-version cluster, the rpk command targets the older version admin server which doesn't support the requested command, so an error is returned?

Regardless of which admin server in a cluster the request to run a self-test is invoked upon (though, this defaults to the controller leader), if node IDs are not specified in the request made, the default behavior is to start the self-test on all nodes via RPC. The same self-test options will get fanned out to all nodes.

So, if any nodes in a mixed cluster are of a version before 24.2.x (in which cloudcheck is landing), this throw will result in blocking the self-test start request every time (when using an updated rpk, which will append cloudcheck related properties to the request json).

It kinda seems like we should return an error in this case (at the very least--we could run the other tests and and still return an error).

Do you think that running the tests that are valid (known) to a node, while ignoring and logging a statement about the unknown test types seems like valid behavior?

Another option might be to create a better behavior at the rpk layer.

What behavior would you want to see? Could rpk configure individual self-test requests for each node in a cluster?

Is this problem common? For example, if mixed version state is rare, we could request rpk to direct the request to a specific broker?

Likely not common, but something that we discovered while preparing docs for 24.2.1

dotnwat · 2024-07-12T20:56:47Z

What behavior would you want to see?

I'm not sure what behavior makes the most sense. But silently dropping at the api level what was requested doesn't strike me as the most desirable.

Is this issue being discussed some where like a ticket capturing the issue?

WillemKauf · 2024-07-12T21:23:26Z

Is this issue being discussed some where like a ticket capturing the issue?

No, it was brought to my attention in a Slack thread.

src/v/redpanda/admin/server.cc

vbotbuildovich · 2024-07-12T21:35:40Z

skipped ducktape retry in https://buildkite.com/redpanda/redpanda/builds/51442#0190a8a9-1c5e-4108-859b-d2f781f4ee79:
pandatriage cache was not found

skipped ducktape retry in https://buildkite.com/redpanda/redpanda/builds/51442#0190a8aa-202c-4001-a234-988e48345821:
pandatriage cache was not found

dotnwat · 2024-07-15T18:38:22Z

Summary here from offline meeting:

rest api runs the tests that do exist, but return contextual error information indicating if a test did not run for some reason
user experience is that error reporting includes this contextual information "on the following nodes, the following tests did not run" (or something like that). more information could be in the logs if it's hard to return it to rpk.

vbotbuildovich · 2024-07-15T22:18:47Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/51543#0190b846-d481-43bf-92cc-4a0849b07d5c

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/51543#0190b848-a629-41cf-96c9-f52e9e4ece43

WillemKauf · 2024-07-18T17:43:05Z

CI failures are unrelated (and closed)

src/v/cluster/self_test_rpc_types.h

tests/rptest/tests/self_test_test.py

dotnwat

are there compatibility concerns with say older rpk talking to newer redpanda if we are now returning a different style of error messaging that reports tests that weren't able to run for some reason?

WillemKauf · 2024-07-27T01:35:51Z

are there compatibility concerns with say older rpk talking to newer redpanda if we are now returning a different style of error messaging that reports tests that weren't able to run for some reason?

Nope, we haven't changed anything about the rpk types for the response, so this isn't a concern. We use the same type as before to report tests that were not ran.

vbotbuildovich · 2024-07-29T23:11:45Z

new failures in https://buildkite.com/redpanda/redpanda/builds/51543#01910076-b628-4903-a98a-b30894ffab0a:

"rptest.tests.cloud_storage_chunk_read_path_test.CloudStorageChunkReadTest.test_read_chunks"

new failures in https://buildkite.com/redpanda/redpanda/builds/51543#01910108-3264-4ff3-83b0-8af934322a6b:

"rptest.tests.cloud_storage_chunk_read_path_test.CloudStorageChunkReadTest.test_read_chunks"

WillemKauf · 2024-07-30T02:21:35Z

Force push to bump serde::envelope version in start_test_request.

@andrwng, I have a solution for case 2 that I will include in a follow up PR (I won't want to backport it to previous versions beyond v24.2.x).

For logging purposes, when an unrecognized test type is requested through the `self_test_start_handler`, it will be abstracted by the `unknown_check` struct.

This commit removes the `throw` statement in `admin_server::self_test_start_handler()` and replaces it instead with a `push_back()` to `self_test_request::unknown_checks`. Eventually, these unknown checks will have a result displayed when `rpk cluster self-test status` is invoked. For the self-test, any unrecognized tests will be appended to `start_test_request::unknown_checks`, so a future result from `rpk cluster self-test status` will return a message indicating an unknown test was skipped.

For testing of the previous change in which unknown test types are added to `start_test_request::unknown_checks`, instead of resulting in a `throw`.

`test_type` was not being set in early exit cases for `cloudcheck`. This would result in longer than expected output (with `IOPS`, `THROUGHPUT`, `LATENCY` in `rpk cluster self-test status`).

WillemKauf · 2024-07-30T03:30:30Z

Last force push to include some work that will require backporting. Future PR will build on this one.

andrwng

LGTM

andrwng · 2024-08-01T20:55:26Z

tests/rptest/tests/self_test_test.py

+
+        #Attempt to run with an unknown test type "pandatest"
+        #and possibly unknown "cloud" test.


nit: would be good to be consistent about spacing, even in comments (#comment)

Will fix in follow up.

andrwng · 2024-08-01T20:56:57Z

src/v/cluster/self_test/cloudcheck.cc

@@ -64,6 +65,7 @@ cloudcheck::run(cloudcheck_opts opts) {
          "Cloud storage is not enabled, exiting cloud storage self-test.");
        auto result = self_test_result{
          .name = _opts.name,
+          .test_type = "cloud",


nit: I don't feel strongly that it needs to be in this PR, but it'd have been nice to see a test that ran the cloud check via RPK and parsed the output to assert that we don't print "IOPS" etc

Will add in follow up PR.

andrwng · 2024-08-01T20:58:08Z

Also just noting that we chatted offline, and Willem has some follow-up changes to have the self test controller pass through unknown tests to workers, in case of mixed versions

vbotbuildovich · 2024-08-02T02:04:49Z

/backport v24.2.x

vbotbuildovich · 2024-08-02T02:04:49Z

/backport v24.1.x

vbotbuildovich · 2024-08-02T02:06:00Z

Failed to create a backport PR to v24.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-21370-v24.1.x-290 remotes/upstream/v24.1.x
git cherry-pick -x aa93f3f48341cb9e9966051a9759f5d08f686ec9 93b1901e4f532bcc0aabe8746b81e0d8534c388f 4e99539bd1a82857882c7a14201dc07cdcb53980 3bac4dc6f40287aaab2b6deae6f76c3434ccafd6 147d5e96ffb320d7b6f1212f53435b2cdb9aef4f

Workflow run logs.

WillemKauf requested a review from andrwng July 12, 2024 19:25

github-actions bot added the area/redpanda label Jul 12, 2024

dotnwat reviewed Jul 12, 2024

View reviewed changes

andrwng reviewed Jul 12, 2024

View reviewed changes

src/v/redpanda/admin/server.cc Outdated Show resolved Hide resolved

WillemKauf force-pushed the mixed_node_cluster_self_test branch from f01f540 to e93a874 Compare July 15, 2024 20:11

WillemKauf requested review from dotnwat and andrwng July 15, 2024 20:13

andrwng reviewed Jul 22, 2024

View reviewed changes

src/v/cluster/self_test_rpc_types.h Show resolved Hide resolved

tests/rptest/tests/self_test_test.py Show resolved Hide resolved

tests/rptest/tests/self_test_test.py Show resolved Hide resolved

dotnwat reviewed Jul 27, 2024

View reviewed changes

WillemKauf requested a review from andrwng July 29, 2024 21:52

WillemKauf force-pushed the mixed_node_cluster_self_test branch from e93a874 to cc4c589 Compare July 30, 2024 02:15

WillemKauf added 5 commits July 29, 2024 23:14

cluster/self_test: add unknown_check

aa93f3f

For logging purposes, when an unrecognized test type is requested through the `self_test_start_handler`, it will be abstracted by the `unknown_check` struct.

cluster/self_test: use unknown_checks in backend/frontend

93b1901

rptest: add test_self_test_unknown_test_type

3bac4dc

For testing of the previous change in which unknown test types are added to `start_test_request::unknown_checks`, instead of resulting in a `throw`.

cluster/self_test: add test_type to cloudcheck cases

147d5e9

`test_type` was not being set in early exit cases for `cloudcheck`. This would result in longer than expected output (with `IOPS`, `THROUGHPUT`, `LATENCY` in `rpk cluster self-test status`).

WillemKauf force-pushed the mixed_node_cluster_self_test branch from cc4c589 to 147d5e9 Compare July 30, 2024 03:29

andrwng approved these changes Aug 1, 2024

View reviewed changes

WillemKauf merged commit 9fe41c5 into redpanda-data:dev Aug 2, 2024
21 checks passed

vbotbuildovich mentioned this pull request Aug 2, 2024

[v24.2.x] admin: remove throw on unknown self test type in admin_server #22698

Merged

vbotbuildovich mentioned this pull request Aug 2, 2024

[v24.1.x] admin: remove throw on unknown self test type in admin_server #22699

Closed

WillemKauf mentioned this pull request Aug 2, 2024

[v24.1.x] admin: remove throw on unknown self test type in admin_server (manual backport) #22700

Merged

7 tasks

WillemKauf mentioned this pull request Aug 10, 2024

cluster: Parse unknown self-test checks #22831

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

admin: remove `throw` on unknown self test type in `admin_server` #21370

admin: remove `throw` on unknown self test type in `admin_server` #21370

WillemKauf commented Jul 12, 2024 •

edited

Loading

dotnwat left a comment

WillemKauf commented Jul 12, 2024

dotnwat commented Jul 12, 2024

WillemKauf commented Jul 12, 2024

vbotbuildovich commented Jul 12, 2024 •

edited

Loading

dotnwat commented Jul 15, 2024

vbotbuildovich commented Jul 15, 2024 •

edited

Loading

WillemKauf commented Jul 18, 2024

dotnwat left a comment

WillemKauf commented Jul 27, 2024

vbotbuildovich commented Jul 29, 2024 •

edited

Loading

WillemKauf commented Jul 30, 2024

WillemKauf commented Jul 30, 2024

andrwng left a comment

andrwng Aug 1, 2024

WillemKauf Aug 2, 2024

andrwng Aug 1, 2024

WillemKauf Aug 2, 2024

andrwng commented Aug 1, 2024

vbotbuildovich commented Aug 2, 2024

vbotbuildovich commented Aug 2, 2024

vbotbuildovich commented Aug 2, 2024


		#Attempt to run with an unknown test type "pandatest"
		#and possibly unknown "cloud" test.

admin: remove throw on unknown self test type in admin_server #21370

admin: remove throw on unknown self test type in admin_server #21370

Conversation

WillemKauf commented Jul 12, 2024 • edited Loading

Backports Required

Release Notes

Improvements

dotnwat left a comment

Choose a reason for hiding this comment

WillemKauf commented Jul 12, 2024

dotnwat commented Jul 12, 2024

WillemKauf commented Jul 12, 2024

vbotbuildovich commented Jul 12, 2024 • edited Loading

dotnwat commented Jul 15, 2024

vbotbuildovich commented Jul 15, 2024 • edited Loading

WillemKauf commented Jul 18, 2024

dotnwat left a comment

Choose a reason for hiding this comment

WillemKauf commented Jul 27, 2024

vbotbuildovich commented Jul 29, 2024 • edited Loading

WillemKauf commented Jul 30, 2024

WillemKauf commented Jul 30, 2024

andrwng left a comment

Choose a reason for hiding this comment

andrwng Aug 1, 2024

Choose a reason for hiding this comment

WillemKauf Aug 2, 2024

Choose a reason for hiding this comment

andrwng Aug 1, 2024

Choose a reason for hiding this comment

WillemKauf Aug 2, 2024

Choose a reason for hiding this comment

andrwng commented Aug 1, 2024

vbotbuildovich commented Aug 2, 2024

vbotbuildovich commented Aug 2, 2024

vbotbuildovich commented Aug 2, 2024

admin: remove `throw` on unknown self test type in `admin_server` #21370

admin: remove `throw` on unknown self test type in `admin_server` #21370

WillemKauf commented Jul 12, 2024 •

edited

Loading

vbotbuildovich commented Jul 12, 2024 •

edited

Loading

vbotbuildovich commented Jul 15, 2024 •

edited

Loading

vbotbuildovich commented Jul 29, 2024 •

edited

Loading