feat(api) add a config readiness endpoint #8256
Conversation
force-pushed from c2cd1e3 to 40350c7
force-pushed from a15ceeb to 0d947c5
blocked on #8224
@mflendrich - I believe LMDB will not ship in 2.8 as the default db_cache, which would remove the block you raised. Can you confirm?
@bungle assuming Rob's comment is correct, can you or another Gateway team member review this for the current DB-less implementation? The external interface would be consistent regardless of the backend, so I'm not concerned that we'd need a breaking change to it in a future version.
kong/api/routes/health.lua (Outdated)
end
-- unintuitively, "true" is uninitialized. we do always initialize the shdict key
-- after a config loads, this returns the hash string
if not declarative.has_config() then
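For context, a minimal sketch of how a handler around this check might look (illustrative only: the route path, response bodies, and surrounding table layout are assumptions, not the PR's final code):

local declarative = require "kong.db.declarative"

return {
  ["/status/ready"] = {
    -- hypothetical readiness handler: 503 until a config has been loaded
    GET = function(self, db, helpers)
      if not declarative.has_config() then
        return kong.response.exit(503, { message = "no configuration available" })
      end
      return kong.response.exit(200, { message = "ready" })
    end,
  },
}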
Question: does loading the config mean that the router and run loops have been updated to use the config? Is there a delay? Could the user be surprised by seeing 404s when this endpoint returns a 200?
It means declarative.load_into_cache() has completed. If you're running strict consistency, the router and such will be updated, since they'll rebuild at request time. If eventual, it's rebuilt whenever that fires (maybe? not sure if that system behaves differently in DB-less).
For the controller's purposes that's fine, since we mainly just care about knowing when we need to push it again.
For general purpose config availability, I don't know if there's a good approach for knowing when the config is available, since it's built per-worker--I can set an additional SHM key, but it won't be guaranteed that all workers have rebuilt. Is there anything we can use to inform us when all workers have completed processing of the declarative flip_config event?
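One conceivable shape for that, sketched under the assumption that each worker can write its own SHM entry when it handles the declarative flip_config event (the shdict key prefix here is made up):

-- hypothetical: each worker acknowledges the flip by writing its own key
kong.worker_events.register(function(data)
  -- runs in every worker once it has processed the new config
  ngx.shared.kong:set("declarative:ready:" .. ngx.worker.id(), true)
end, "declarative", "flip_config")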
c3fef049d868fd0d1dc98847ea8db6b090e6561a sorta does this, but it only means that at least one worker has rebuilt using the new config.
I don't think there is any plumbing for what I'm requesting, is there @bungle?
I confirm that what @rainest says is correct: his flag indicates that the declarative config might have been loaded into the node's memory, but it can take a while until all the nodes get their routers (and plugin iterators) updated. With strict consistency set, requests "will wait" if a router is not up to date, so for all intents and purposes the newly exposed flag will indicate that "Kong will use the new config, even if it means waiting".
With eventual consistency, the gateway itself "does not care" about exact availability. If something else cares, then I think the answer to "the router and run loops have been updated to use the config" can't be a single bit. Consider that some customers update their declarative configurations very rapidly. Workers are on config A. Config B arrives, gets processed by 1 worker, which sets the flag. Workers gradually load config B. While this is happening, config C arrives. And so on. In that scenario with frequent updates, a single bit will always be false: a number of workers will always be out of date.
Options:
- Change the question to "are the router and run loops updated for this specific config (or a more recent one)?". We could answer a bit for that one.
- Change the question to "how many workers are updated to this specific config?". The answer to that one would be a number, or a couple of numbers (2 out of 16).
- Change the question to "which config does every worker have?". The answer would be a list of n configuration identifiers, probably their hashes (a rough sketch follows after this comment).
I don't think we have any of the plumbing for that at the moment.
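For illustration, the third option could in principle be answered from per-worker SHM entries, assuming each worker stored its current hash under a per-worker key (no such keys exist in Kong today, as noted; the key name is made up):

-- hypothetical: report the config hash each worker last applied
local function worker_config_hashes()
  local hashes = {}
  for id = 0, ngx.worker.count() - 1 do
    hashes[id] = ngx.shared.kong:get("declarative:worker_hash:" .. id)
  end
  return hashes
end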
Do we have something that provides the worker count/a list of their PIDs? It's more complicated, but it should be viable to restructure the value of DECLARATIVE_CONFIG_READY_KEY from a bit to N bits, and then have the endpoint loop over N to confirm that all are set.
We're not really concerned with the latest config, just that configuration is not null, so the update stampede case isn't a problem. You just always set your worker entry to true on update (there's probably some way to make this happen only once, using local worker memory to avoid the unnecessary SHM locks after the first update--is there a Lua equivalent of https://pkg.go.dev/sync#Once?) and don't worry about what specific hash you've loaded.
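There's no direct sync.Once, but module-level locals in OpenResty are per-worker (each worker runs its own Lua VM), so a plain flag gives the same once-per-worker effect; a sketch, with an illustrative key name:

-- per-worker "once" guard: this module local is private to each worker's
-- Lua VM, so it serves as the worker-local memory mentioned above
local already_marked = false

local function mark_worker_ready()
  if already_marked then
    return -- skip the SHM write (and its lock) after the first update
  end
  ngx.shared.kong:set("declarative:ready:" .. ngx.worker.id(), true)
  already_marked = true
end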
ed5d5d736 updates the previous system to add a worker PID onto the end, so we set and then check a key per worker, along with a better test to confirm that it works as expected. We can't test actual startup conditions easily (at all?) AFAIK, but most of the moving parts are in the bit where workers mark their status.
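The checking side then becomes a loop over per-worker keys, roughly like the sketch below (worker ids are used here because they are trivially enumerable, whereas the commit itself keys on PIDs; key names are illustrative):

-- hypothetical: ready only once every worker has set its own key
local function all_workers_ready()
  for id = 0, ngx.worker.count() - 1 do
    if not ngx.shared.kong:get("declarative:ready:" .. id) then
      return false
    end
  end
  return true
end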
I'm not the SME here, so I'll defer the review to someone within the Gateway team.
Health endpoints are invoked by schedulers, orchestrators, load balancers, external health checkers, etc., meaning they should be very performant. The shm:gets acquire a lock across all workers, which can drastically affect performance.
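Since this readiness only ever transitions from not-ready to ready (the thread above only cares that some config has loaded), one mitigation would be to cache a positive answer in worker-local memory, so repeated probes skip the shm:gets entirely; a sketch building on the all_workers_ready helper sketched above:

-- hypothetical: remember a positive result per worker; health checkers
-- hitting this endpoint then stop paying for SHM locks once we're ready
local cached_ready = false

local function is_ready()
  if not cached_ready and all_workers_ready() then
    cached_ready = true
  end
  return cached_ready
end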
force-pushed from 36db71c to ba2980c
force-pushed from 53da411 to f0d5103
force-pushed from e3141e9 to 9b2cd7a
I was about to say this approach looks good, but then I realized another issue (maybe an irrelevant corner case).
If a pod restarts for any reason, Kong will load the configuration that it has cached on disk, and this endpoint will return a 200. I think this doesn't solve the problem you originally set out to solve.
This is a fine use case for this patch, and I think we should accept it. Now, how to solve the KIC issue? Thanks to #8214, the Admin API now exposes the hash of the currently loaded configuration.
We do want this to return a 200 if config is loaded from disk; the config is available then. Is this specific to hybrid data planes, or should it work on full DB-less also? Container restarts do not wipe out the tempdir we use for the prefix, so if the config does get saved there, it should be reloaded on restart. In practice, if we assassinate a DB-less container, its replacement comes online with an empty configuration.
This is part of why I'd have preferred to use this to solve that issue: using the configuration hash ends up being a leaky abstraction, where you need to reimplement Kong internals outside to fully understand what's going on. We can sorta get partway there by comparing against the known hash of the default blank configuration.
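The comparison in question, sketched in Lua for consistency with the rest of the thread, assuming declarative.get_current_hash() and an all-zero sentinel hash for the default blank configuration (the sentinel value is an assumption here):

-- hypothetical: distinguish a real user config from the default blank one
local declarative = require "kong.db.declarative"

local EMPTY_CONFIG_HASH = string.rep("0", 32) -- assumed sentinel value

local function has_user_config()
  local hash = declarative.get_current_hash()
  return hash ~= nil and hash ~= EMPTY_CONFIG_HASH
end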
@rainest : Yes, in the https://github.com/Kong/kong/blob/master/kong/api/routes/kong.lua#L71-L87
force-pushed from a74a5f0 to ed5d5d7
I'm not sure. @bungle or @dndx, can either of you confirm this?
While they try to solve the same problem, their use is quite different.
@hbagdi That is one possibility, but wouldn't the user actually want to check for the reachability of backend services in this case? Just having any config loaded is a pretty weak guarantee that Kong is actually "healthy"; there are a lot of things that could still have gone wrong. I am considering the original problem @rainest needs to solve per this PR's description, which is KIC's config readiness detection, and it seems that #8214 already solves it. Maybe we should keep the problem scope confined for now.
They could, and they are free to do that; they should do that as an end-to-end monitoring check.
It is a stronger guarantee than not having any configuration at all.
It was originally designed for full DB-less. For this purpose they're functionally the same, since the code that actually loads the config is shared, and only the method for receiving new configs changes. This is handled differently from #8124 for two reasons: …
I concur with Harry that this isn't a perfect metric for readiness (there is no such thing), but it's taking us past the …
Add a config readiness endpoint. When db="off", it returns a 200 if Kong has loaded or received a configuration and applied it, or 503 if not. Compute hashes when loading configuration from the environment, to distinguish between user-supplied configuration and the default blank configuration. Test the config readiness endpoint. This requires actually using the database strategy for Kong route tests and adding a utility server that allows tests to directly set SHM values, to circumvent the lack of normal config load behavior in test environments.
force-pushed from ed5d5d7 to 20c9dea
Summary
Adds a config readiness endpoint. This is designed to allow the Ingress controller to detect when Kong has restarted and needs to receive configuration again. It should also be useful for load balancers that want to avoid sending requests through an instance before it is ready to process them.
Full changelog