feat(clustering): report config reload errors to Konnect #12282

Merged 1 commit into master from feat/clustering-report-dp-errors on Jan 17, 2024

Contributor

@flrgh flrgh commented Jan 3, 2024

Summary

This adds new error-reporting for data planes running in Konnect.

When a config update fails due to invalid configuration or more transient reasons (e.g. LMDB map size exceeded), the data plane collects the error into a JSON payload and sends it back to the control plane as a binary WebSocket message.

I have refactored some of the code in kong.clustering.config_helper.update() to make error-handling more ergonomic and robust (swapping in xpcall to get better exception data).
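
A minimal sketch of the xpcall pattern mentioned above, not the actual config_helper code. The helper name `try` and the shape of the returned table are assumptions made for illustration; only `xpcall` and `debug.traceback` are standard Lua. The point is that the message handler runs before the stack unwinds, so it can still capture a traceback, which plain `pcall` cannot provide.

```lua
-- Hypothetical helper (not from the PR): run `fn` and, on failure, return the
-- exception message together with a traceback captured at the error site.
local function try(fn)
  local function handler(err)
    -- xpcall invokes the handler before the stack unwinds, so
    -- debug.traceback still sees the frames that led to the error.
    local msg = tostring(err)
    return {
      exception = msg,
      traceback = debug.traceback(msg, 2),
    }
  end

  local ok, res = xpcall(fn, handler)
  if ok then
    return true, res
  end
  return false, res
end

-- Usage: a function that raises, similar to the "oh no!" example payload.
local ok, err_t = try(function()
  error("oh no!")
end)
```

Here `err_t.exception` holds the raised message and `err_t.traceback` holds the same message followed by a "stack traceback:" section, which is roughly the pair of fields shown in the "Exception Thrown During Config Reload" payload.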

On the Konnect mode limitation

In its initial form, this feature is being added for Konnect-only deployments. Konnect is only supported by Enterprise versions of Kong, but this feature requires changes to files in the kong/clustering/* tree that would be very frustrating to maintain if committed directly to the EE codebase. I think refactoring could help with this obstacle, but with the current state of kong.clustering it would need to be a pretty large refactor.

For these reasons, the code has been added to OSS. Because the konnect_mode flag is only present in EE, I am using a custom plugin to monkey-patch kong.clustering.init_dp_worker() such that error reporting is always enabled. Otherwise we would not be able to test this code in OSS, making maintenance even more of a headache.

For OSS and non-Konnect EE deployments, the side effect is that the data plane's code path for reconfigure payloads does some unnecessary work preparing an error payload that will ultimately be discarded, but this is only in the error-handling path and should hopefully not become a performance concern.

NOTE: Because OSS data planes are not yet supported by Konnect, there is no changelog entry for this feature. I will add one for the EE pull request.

Example Payloads

Invalid Declarative Config

{
  "type": "error",
  "error": {
    "name": "declarative configuration parse failure",
    "config_hash": "706f787e686e32a565e2a578aba1b347",
    "source": "kong.db.declarative.parse_table",
    "message": "declarative config is invalid: {extra_top_level_field=\"unknown field\"}",
    "code": 14,
    "fields": {
      "extra_top_level_field": "unknown field"
    },
    "flattened_errors": [
      {
        "entity_id": "2e1373fa-874a-426c-b639-b9e20284ebbc",
        "entity_name": "my-service",
        "entity_tags": [
          "tag-1",
          "tag-2"
        ],
        "entity_type": "service",
        "errors": [
          {
            "field": "host",
            "message": "required field missing",
            "type": "field"
          },
          {
            "field": "extra_field",
            "message": "unknown field",
            "type": "field"
          }
        ]
      }
    ]
  }
}

Exception Thrown During Config Reload

{
  "type": "error",
  "error": {
    "name": "configuration reload failed",
    "config_hash": "8cb8c00863336fee14e4364ab3e8819c",
    "exception": "...plugins/kong/plugins/cluster-error-reporting/handler.lua:19: oh no!",
    "message": "an exception was raised while updating the configuration",
    "source": "kong.clustering.config_helper.update",
    "traceback": "...plugins/kong/plugins/cluster-error-reporting/handler.lua:19: oh no!\nstack traceback:\n\t...michaelm/git/kong/kong/kong/clustering/config_helper.lua:278: in function <...michaelm/git/kong/kong/kong/clustering/config_helper.lua:272>\n\t[C]: in function 'error'\n\t...plugins/kong/plugins/cluster-error-reporting/handler.lua:19: in function 'load_into_cache_with_events'\n\t...michaelm/git/kong/kong/kong/clustering/config_helper.lua:346: in function <...michaelm/git/kong/kong/kong/clustering/config_helper.lua:297>\n\t[C]: in function 'xpcall'\n\t...michaelm/git/kong/kong/kong/clustering/config_helper.lua:374: in function 'update'\n\t/home/michaelm/git/kong/kong/kong/clustering/data_plane.lua:255: in function </home/michaelm/git/kong/kong/kong/clustering/data_plane.lua:224>"
  }
}

LMDB Map Full

{
  "type": "error",
  "error": {
    "name": "configuration reload failed",
    "config_hash": "d78ed6c82b8aedf5323f249b3f0a363d",
    "message": "map full",
    "source": "kong.db.declarative.load_into_cache_with_events"
  }
}

See also

KAG-3249

@flrgh flrgh force-pushed the feat/clustering-report-dp-errors branch from 2ad3fc3 to 8121f28 Compare January 3, 2024 20:07
@flrgh flrgh changed the title from "[WIP] feat(clustering): report dp config update errors to cp" to "feat(clustering): report config reload errors to Konnect" Jan 3, 2024
@flrgh flrgh force-pushed the feat/clustering-report-dp-errors branch from 8121f28 to 4f04276 Compare January 3, 2024 20:28
@flrgh flrgh force-pushed the feat/clustering-report-dp-errors branch 2 times, most recently from 9ff7513 to 8ea7371 Compare January 3, 2024 20:44
@flrgh flrgh marked this pull request as ready for review January 3, 2024 20:45
"control plane: ", json_err, ", error: ", inspect(err_t), log_suffix)

payload = cjson_encode({
type = "error",
Member

I suggest making this more descriptive, like "configuration_validation_error" (a better, more condensed name would be good).
This keeps the protocol well-documented and makes it trivial to introduce other error types. You can also choose to have a subtype if you prefer. Narrowing 'error' to mean only configuration validation errors is also possible in the future, so this is optional. Nevertheless, it's good code hygiene.

Contributor Author

To clarify, "error" here is just a top-level classifier, and the .error.name attribute holds one of several well-defined, descriptive error labels that I've defined in the kong.constants.CLUSTERING_DATA_PLANE_ERROR Lua table. Example:

{
  "type": "error",
  "error": {
    "name": "declarative configuration parse failure",
    "message": "invalid entity 'foo' blah blah blah blah..."
  }
}

This is just what "felt right" to me when writing the code, so happy to entertain changes.

@GGabriele
Contributor

@flrgh To help clarify, are the following assertions correct?

  • in case of multiple errors, we expect the DP to send multiple error messages
  • at any time, a payload would generate at most one error per type (e.g. one invalid declarative configuration, one transient error and one exception)

@RobSerafini
Contributor

@GGabriele @flrgh - Do you guys need any help here? We are 1 week from feature freeze.

@flrgh
Contributor Author

flrgh commented Jan 8, 2024

Do you guys need any help here? We are 1 week from feature freeze.

@RobSerafini aside from needing gateway team code review (which can be sorted out at tomorrow's PR meeting), I need a product/manager decision on whether this is a pure OSS feature or a Konnect-only feature.

@flrgh
Contributor Author

flrgh commented Jan 8, 2024

* in case of multiple errors, we expect the DP to send multiple error messages

* at any time, a payload would generate at most one error per type (e.g. one `invalid declarative configuration`, one `transient error` and one `exception`)

@GGabriele I think I haven't done the best job of laying out specifications with examples, and "error" as a term is pretty generic, so allow me to expand a little and see if this answers your questions...

Let's call the final JSON-encoded, binary WebSocket message that is sent to the CP the error payload.

An error payload is a JSON object with a top-level classification of type: "error" and an object in the error attribute:

{
  "type": "error",
  "error": {
    "name":        "my error name",
    "config_hash": "588b0c0baa505a71be9369608e8b75f1",

    ...other attributes specific to error.name also live here
  }
}

...where error.name describes the type of error that was recorded and is exactly one of (currently 3) strings that are being added to kong/constants.lua:

+  CLUSTERING_DATA_PLANE_ERROR = {
+    CONFIG_PARSE     = "declarative configuration parse failure",
+    RELOAD           = "configuration reload failed",
+    GENERIC          = "generic or unknown error",
+  },

Additionally, error.config_hash represents the value of config_hash within the reconfigure message that was received from the CP.

Upon receiving a reconfigure payload from the CP, the DP enacts a serialized, atomic(ish) procedure for handling the payload: decode it, validate it, and then reload runtime state from it. If any one of those steps fails, the process is immediately aborted, and the DP sends an error payload to the CP, the contents of which depend on which step yielded an error. Because this is a direct consequence of the DP's receipt of a reconfigure payload from the CP, one message from the CP will result in at most one error payload being sent from the DP.
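
The decode/validate/reload procedure described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `handle_reconfigure`, the `steps` list, and the stub step functions are hypothetical, and mapping a decode failure to GENERIC is a guess. Only the three error-name strings are taken from the constants quoted earlier.

```lua
-- Error names copied from the CLUSTERING_DATA_PLANE_ERROR table quoted above.
local ERRORS = {
  CONFIG_PARSE = "declarative configuration parse failure",
  RELOAD       = "configuration reload failed",
  GENERIC      = "generic or unknown error",
}

-- Hypothetical: `steps` is an ordered list of { error_name = ..., run = fn },
-- where each `run` returns true on success, or nil plus an error string.
local function handle_reconfigure(msg, steps)
  for _, step in ipairs(steps) do
    local ok, err = step.run(msg)
    if not ok then
      -- Abort on the first failure, so one reconfigure message
      -- produces at most one error payload.
      return nil, {
        type = "error",
        error = {
          name        = step.error_name,
          config_hash = msg.config_hash,
          message     = err,
        },
      }
    end
  end
  return true
end

-- Usage with stub steps: validation fails, so reload never runs.
local reloaded = false
local steps = {
  -- decode (assumed to map to GENERIC for this sketch)
  { error_name = ERRORS.GENERIC, run = function() return true end },
  -- validate
  { error_name = ERRORS.CONFIG_PARSE,
    run = function() return nil, "declarative config is invalid" end },
  -- reload
  { error_name = ERRORS.RELOAD,
    run = function() reloaded = true; return true end },
}

local applied, payload = handle_reconfigure(
  { config_hash = "588b0c0baa505a71be9369608e8b75f1" }, steps)
```

In this run `applied` is nil, `payload.error.name` is the CONFIG_PARSE string, and `reloaded` stays false because the loop returned before reaching the reload step.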

In the case of CONFIG_PARSE, there will be an array called error.flattened_errors* that contains individual error structures, because a single declarative config may contain more than one invalid entity. So by definition, yes, an error payload can contain more than one "error", but that is just an implementation detail of the declarative validation code and the error structure it returns. Each error contained within error.flattened_errors is to be considered part of the entire CONFIG_PARSE error payload as a unit. Here is a complete example of a CONFIG_PARSE error payload:

{
  "type": "error",
  "error": {
    "name": "declarative configuration parse failure",
    "config_hash": "ba5fa5346a9f48062895911e5ebcee03",
    "message": "declarative config is invalid: {}",
    "source": "kong.db.declarative.parse_table",
    "code": 14,
    "fields": {},
    "flattened_errors": [
      {
        "entity_id": "3923d481-b21c-4c0e-a91e-e4eb5d63a8f7",
        "entity_name": "my-other-service",
        "entity_tags": [
          "tag-1",
          "tag-2"
        ],
        "entity_type": "service",
        "errors": [
          {
            "field": "host",
            "message": "required field missing",
            "type": "field"
          },
          {
            "field": "extra_field",
            "message": "unknown field",
            "type": "field"
          }
        ]
      },
      {
        "entity_id": "908e55c2-0b83-4c47-b291-eb9c8776fef3",
        "entity_name": "my-service",
        "entity_tags": [
          "tag-1",
          "tag-2"
        ],
        "entity_type": "service",
        "errors": [
          {
            "field": "host",
            "message": "required field missing",
            "type": "field"
          }
        ]
      }
    ]
  }
}

* flattened_errors is not something introduced by this PR. See this older PR for more context on where it comes from
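
For anyone consuming these payloads, a rough sketch of walking error.flattened_errors as a single unit might look like this. The `summarize` helper and its output format are hypothetical, and the payload is assumed to have already been JSON-decoded into a Lua table (trimmed here from the example above).

```lua
-- Hypothetical consumer-side helper: turn a decoded CONFIG_PARSE payload
-- into one human-readable line per field error.
local function summarize(payload)
  local lines = {}
  for _, entity in ipairs(payload.error.flattened_errors or {}) do
    for _, e in ipairs(entity.errors or {}) do
      lines[#lines + 1] = string.format("%s '%s': %s: %s",
        tostring(entity.entity_type), tostring(entity.entity_name),
        tostring(e.field), tostring(e.message))
    end
  end
  return lines
end

-- Trimmed-down table equivalent of the CONFIG_PARSE example payload.
local payload = {
  type = "error",
  error = {
    name = "declarative configuration parse failure",
    flattened_errors = {
      {
        entity_name = "my-other-service",
        entity_type = "service",
        errors = {
          { field = "host", message = "required field missing", type = "field" },
          { field = "extra_field", message = "unknown field", type = "field" },
        },
      },
      {
        entity_name = "my-service",
        entity_type = "service",
        errors = {
          { field = "host", message = "required field missing", type = "field" },
        },
      },
    },
  },
}

local lines = summarize(payload)
for _, line in ipairs(lines) do
  print(line)
end
```

All three lines come from one payload, which matches the point above: the nested errors are details of a single CONFIG_PARSE error, not separate messages.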

@flrgh flrgh force-pushed the feat/clustering-report-dp-errors branch from 6672c5b to d7b1604 Compare January 8, 2024 18:28
@GGabriele
Contributor

@flrgh thanks a lot for the clarification, this is super helpful!!

@flrgh flrgh added the "cherry-pick kong-ee" label (schedule this PR for cherry-picking to kong/kong-ee) Jan 8, 2024
@flrgh flrgh force-pushed the feat/clustering-report-dp-errors branch 2 times, most recently from ecfd412 to c6335bf Compare January 9, 2024 00:50
@flrgh flrgh force-pushed the feat/clustering-report-dp-errors branch from c6335bf to 7e6478d Compare January 9, 2024 01:12
@flrgh flrgh added this to the 3.6.0 milestone Jan 9, 2024
@flrgh flrgh force-pushed the feat/clustering-report-dp-errors branch 2 times, most recently from c6adbc5 to 8f7810c Compare January 9, 2024 16:09
@flrgh flrgh requested review from chronolaw and locao January 9, 2024 17:46
@flrgh
Contributor Author

flrgh commented Jan 9, 2024

Added an additional commit with better test coverage (assert more fields and cover the error() case). Please squash this on merge.

Contributor

@locao locao left a comment

Just leaving some comments, I want to take another look before approving.

The custom plugin was a smart idea, way better than a huge refactor.

Contributor

@locao locao left a comment

LGTM 🚀

My change suggestion is related to flakiness I saw in the past, I don't know if it is still relevant. Feel free to ignore.

@locao locao force-pushed the feat/clustering-report-dp-errors branch from bc46bd5 to f97b8c1 Compare January 10, 2024 15:10
@flrgh flrgh force-pushed the feat/clustering-report-dp-errors branch from f97b8c1 to 5c0e3f2 Compare January 10, 2024 17:48
Data-plane nodes running in Konnect will now report config reload failures such as
invalid configuration or transient errors to the control-plane.
@flrgh flrgh force-pushed the feat/clustering-report-dp-errors branch from 5c0e3f2 to 3ad0280 Compare January 16, 2024 22:58
@RobSerafini
Contributor

@GGabriele - I would like a final approval from you on behalf of Konnect.

Contributor

@GGabriele GGabriele left a comment

:shipit:

@flrgh flrgh merged commit 0f95ffc into master Jan 17, 2024
23 checks passed
@flrgh flrgh deleted the feat/clustering-report-dp-errors branch January 17, 2024 17:47
@team-gateway-bot
Collaborator

Cherry-pick failed for master, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git remote add upstream https://github.com/kong/kong-ee
git fetch upstream master
git worktree add -d .worktree/cherry-pick-12282-to-master-to-upstream upstream/master
cd .worktree/cherry-pick-12282-to-master-to-upstream
git checkout -b cherry-pick-12282-to-master-to-upstream
ancref=$(git merge-base ccfac55b965c9818955c3422d7cfc4e509dcf922 3ad0280b6e5368c4f4e25a1a014aabd3ea02c931)
git cherry-pick -x $ancref..3ad0280b6e5368c4f4e25a1a014aabd3ea02c931

flrgh added a commit that referenced this pull request Jan 17, 2024
This branch of logic was mistakenly removed in
0f95ffc / #12282.
locao pushed a commit that referenced this pull request Jan 17, 2024