-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(clustering): report config reload errors to Konnect #12282
Conversation
2ad3fc3
to
8121f28
Compare
8121f28
to
4f04276
Compare
9ff7513
to
8ea7371
Compare
"control plane: ", json_err, ", error: ", inspect(err_t), log_suffix) | ||
|
||
payload = cjson_encode({ | ||
type = "error", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest making this more descriptive, like "configuration_validation_error" (a better name and more condensed is good).
This keeps the protocol well-documented and makes it trivial to introduce other error types. You can also choose to have a subtype if you prefer. Assigning 'error' to mean configuration validation only errors is also possible in the future, so this is optional. Nevertheless, it's a good code of hygiene.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, "error" here is just a top-level classifier, and the .error.name
attribute holds one of several well-defined, descriptive error labels that I've defined in the kong.constants.CLUSTERING_DATA_PLANE_ERROR
Lua table. Example:
{
"type": "error",
"error": {
"name": "declarative configuration parse failure",
"message": "invalid entity 'foo' blah blah blah blah..."
}
}
This is just what "felt right" to me when writing the code, so happy to entertain changes.
@flrgh To help clarifying, are the following assertions correct?
|
@GGabriele @flrgh - Do you guys need any help here? We are 1 week from feature freeze. |
@RobSerafini aside from needing gateway team code review (which can be sorted with tomorrows PR meeting), I need a product/manager decision on whether or not this is a pure OSS feature or a Konnect-only feature. |
@GGabriele I think I haven't done the best job of laying out specifications with examples, and "error" as a term is pretty generic, so allow me to expand a little and see if this answers your questions... Let's call the final JSON-encoded, binary WebSocket message that is sent to the CP the An {
"type": "error",
"error": {
"name": "my error name",
"config_hash": "588b0c0baa505a71be9369608e8b75f1",
...other attributes specific to error.name also live here
}
} ...where + CLUSTERING_DATA_PLANE_ERROR = {
+ CONFIG_PARSE = "declarative configuration parse failure",
+ RELOAD = "configuration reload failed",
+ GENERIC = "generic or unknown error",
+ }, Additionally, Upon receiving a In the case of {
"type": "error",
"error": {
"name": "declarative configuration parse failure",
"config_hash": "ba5fa5346a9f48062895911e5ebcee03",
"message": "declarative config is invalid: {}",
"source": "kong.db.declarative.parse_table",
"code": 14,
"fields": {},
"flattened_errors": [
{
"entity_id": "3923d481-b21c-4c0e-a91e-e4eb5d63a8f7",
"entity_name": "my-other-service",
"entity_tags": [
"tag-1",
"tag-2"
],
"entity_type": "service",
"errors": [
{
"field": "host",
"message": "required field missing",
"type": "field"
},
{
"field": "extra_field",
"message": "unknown field",
"type": "field"
}
]
},
{
"entity_id": "908e55c2-0b83-4c47-b291-eb9c8776fef3",
"entity_name": "my-service",
"entity_tags": [
"tag-1",
"tag-2"
],
"entity_type": "service",
"errors": [
{
"field": "host",
"message": "required field missing",
"type": "field"
}
]
}
]
}
} * |
6672c5b
to
d7b1604
Compare
@flrgh thanks a lot for the clarification, this is super helpful!! |
ecfd412
to
c6335bf
Compare
c6335bf
to
7e6478d
Compare
c6adbc5
to
8f7810c
Compare
Added an additional commit with better test coverage (assert more fields and cover the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just leaving some comments, I want to take another look before approving.
The custom plugin was a smart idea, way better than a huge refactor.
spec/fixtures/custom_plugins/kong/plugins/cluster-error-reporting/handler.lua
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🚀
My change suggestion is related to flakiness I saw in the past, I don't know if it is still relevant. Feel free to ignore.
bc46bd5
to
f97b8c1
Compare
f97b8c1
to
5c0e3f2
Compare
Data-plane nodes running will now report config reload failures such as invalid configuration or transient errors to the control-plane.
5c0e3f2
to
3ad0280
Compare
@GGabriele - I would like a final approval from you on behalf of Konnect. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cherry-pick failed for Please cherry-pick the changes locally. git remote add upstream https://github.com/kong/kong-ee
git fetch upstream master
git worktree add -d .worktree/cherry-pick-12282-to-master-to-upstream upstream/master
cd .worktree/cherry-pick-12282-to-master-to-upstream
git checkout -b cherry-pick-12282-to-master-to-upstream
ancref=$(git merge-base ccfac55b965c9818955c3422d7cfc4e509dcf922 3ad0280b6e5368c4f4e25a1a014aabd3ea02c931)
git cherry-pick -x $ancref..3ad0280b6e5368c4f4e25a1a014aabd3ea02c931 |
summary
This adds new error-reporting for data planes running in Konnect.
When config update fails due to invalid configuration or more transient reasons (e.g. LMDB map size exceeded), the data plane will collect the error into a JSON payload and send this back to the control-plane as a binary WebSocket message.
I have refactored some of the code in
kong.clustering.config_helper.update()
to make error-handling more ergonomic and robust (swapping inxpcall
to get better exception data).On the Konnect mode limitation
In its initial form, this feature is being added for Konnect-only deployments. Konnect is only supported by Enterprise versions of Kong, but this feature requires changes to files in the
kong/clustering/*
tree that would be very frustrating to maintain if committed directly to the EE codebase. I think refactoring could help with this obstacle, but with the current state ofkong.clustering
it would need to be a pretty large refactor.For these reasons, the code has been added to OSS. Because the
konnect_mode
flag is only present in EE, I am using a custom plugin to monkey-patchkong.clustering.init_dp_worker()
such that error reporting is always enabled. Otherwise we would not be able to test this code in OSS, making maintenance even more of a headache.For OSS and non-Konnect EE deployments, the side-effect is that there is that the data plane's code path for
reconfigure
payloads does some unnecessary work preparing an error payload that will ultimately be discarded, but this is only in the error-handling path and should hopefully not become a performance concern.NOTE: Because OSS data planes are not yet supported by Konnect, there is no changelog entry for this feature. I will add one for the EE pull request.
Example Payloads
Invalid Declarative Config
Exception Thrown During Config Reload
LMDB Map Full
See also
KAG-3249