From aabddcf44ce7a1398b0defce6c062f3aa7b07fb3 Mon Sep 17 00:00:00 2001
From: Shane Utt
Date: Thu, 11 May 2023 14:45:23 -0400
Subject: [PATCH 1/7] feat: initial declarative policy GEP

---
 geps/x.md | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)
 create mode 100644 geps/x.md

diff --git a/geps/x.md b/geps/x.md
new file mode 100644
index 0000000000..4c8a3cab8b
--- /dev/null
+++ b/geps/x.md
@@ -0,0 +1,32 @@
+# GEP-X: Declarative Policy
+
+* Issue: TODO
+* Status: Provisional
+
+## Definitions
+
+In this document we'll use `Policy` to refer to things that are specifically called policies
+as well as other "MetaResources" that follow similar patterns.
+
+## TLDR
+
+This proposal is a follow-up to [GEP-713 Metaresources and Policy Attachment][713] to recommend
+that we specifically remove the "attachment" part of "policy attachment" in favor of something
+that is declarative at the affected resource level.
+
+[713]:https://gateway-api.sigs.k8s.io/geps/gep-713/
+
+## Goals
+
+- Remove "attachment" from `Policy` resources and related documentation.
+- Retain `Policy` resource structure other than "attachment" semantics.
+- Provide new semantics to incorporate `Policy` resources at the level of the `Resource` that
+  will be affected.
+
+## Introduction
+
+TODO
+
+## API
+
+TODO: future iteration

From 9ebda33b6b21996eb5f8554e7eca30e742d2c382 Mon Sep 17 00:00:00 2001
From: Flynn
Date: Thu, 11 May 2023 18:12:42 -0400
Subject: [PATCH 2/7] Add the parable, the proposal, and the questions.

Signed-off-by: Flynn
---
 geps/x.md | 251 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 249 insertions(+), 2 deletions(-)

diff --git a/geps/x.md b/geps/x.md
index 4c8a3cab8b..27ab5176d6 100644
--- a/geps/x.md
+++ b/geps/x.md
@@ -23,10 +23,257 @@ that is declarative at the affected resource level.
 - Provide new semantics to incorporate `Policy` resources at the level of the `Resource` that
   will be affected.
 
-## Introduction
+## The Problem: A Parable of Jane
 
-TODO
+It's a sunny Wednesday afternoon, and the lead microservices developer for
+Evil Genius Cupcakes is windsurfing. Work has been eating Jane alive for the
+past two and a half weeks, but after successfully deploying version 3.6.0 of
+the `baker` service that morning, she left early to try to unwind a bit.
+
+Her shoulders are just starting to unknot when her phone pings with a text
+from Julian, down in the NOC. Waterproof phones are a blessing, but also a
+curse.
+
+**Julian**: _Hey Jane. Things are still running, more or less, but latencies
+on everything in the baker namespace are crazy high after your last rollout,
+and baker itself has a weirdly high load. Sorry to interrupt you on the lake
+but can you take a look? Thanks!!_
+
+Jane stares at the phone for a long moment, then slumps and heads back to
+shore to dry off and grab her laptop.
+
+What she finds is strange. `baker` is taking a _lot_ of load, almost 4x what’s
+being reported by its usual clients, and its clients report that calls are
+taking much longer than they’d expect them to. She doublechecks the
+Deployment, the Service, and all the HTTPRoutes around `baker`; everything
+looks good. `baker`’s logs show her mostly failed requests... with a lot of
+duplicate requests? Jane checks her HTTPRoute again, though she's pretty sure
+you can't configure retries there, and finds nothing. But it definitely looks
+like a client is retrying when it shouldn’t be.
+
+She pings Julian.
+
+**Jane**: _Hey Julian. Something weird is up, looks like requests to `baker`
+are failing but getting retried??_
+
+A minute later he answers.
+
+**Julian**: 🤷 _Did you configure retries?_
+
+**Jane**: _Dude. I don’t even know how to._ 😂
+
+**Julian**: _You attach a RetryPolicy attached to your HTTPRoute?_
+
+**Jane**: _Nope. Definitely didn’t do that._
+
+She types `kubectl get retrypolicy -n baker` and gets a permission error.
+
+**Jane**: _Huh, I actually don’t have permissions for RetryPolicy._ 🤔
+
+**Julian**: 🤷 _Feels like you should but OK, guess that can’t be it._
+
+Minutes pass while both look at logs.
+
+**Jane**: _OK, it’s definitely retrying. Nearly every request fails the first
+few times, gets retried, and then finally succeeds?_
+
+**Julian**: _Are you sure? I don’t see the `mixer` client making duplicate requests…_
+
+**Jane**: _Check both logs for request ID
+6E69546E-3CD8-4BED-9CE7-45CD3BF4B889. `mixer` sends that once, but `baker`
+shows it arriving four times in quick succession. Only the fourth one
+succeeds. That has to be retries._
+
+Another pause.
+
+**Julian**: _I’m an idiot. There’s a RetryPolicy for the whole namespace –
+sorry, too many policies in the dashboard and I missed it. Deleting that since
+you don’t want retries._
+
+**Jane**: _Are you sure that’s a good–_
+
+Jane’s phone shrills while she’s typing, and she drops it. When she picks it
+up again she sees a stack of alerts. Quickly flipping through them, she feels
+the blood drain from her face: there’s one for every single service in the
+`baker` namespace.
+
+**Jane**: _PUT IT BACK!!_
+
+**Julian**: _Just did. Be glad you couldn't hear all the alarms here._ 😕
+
+**Jane**: _What the hell just happened??_
+
+**Julian**: _At a guess, all the workloads in the `baker` namespace actually
+fail a lot, but they seem OK because there are retries across the whole
+namespace?_ 🤔
+
+Jane’s jaw drops.
+
+**Jane**: _You’re saying that ALL of our services are broken??!_
+
+**Julian**: _That’s what it looks like. Guessing your `baker` rollout would
+have failed without retries turned on._
+
+There is a pause while Jane thinks through increasingly unpleasant possibilities.
+
+**Jane**: _I don't even know where to start here. How long did that
+RetryPolicy go in? Is it the only thing like it?_
+
+**Julian**: _I didn’t look closely before deleting it, but I think it said a
+few months ago. And there are lots of different kinds of policy and lots of
+individual policies, hang on a minute…_
+
+**Julian**: _Looks like about 47 for your chunk of the world, a couple hundred
+system-wide._
+
+**Jane**: 😱 _Can you tell me what they’re doing for each of our services? I
+can’t even_ look _at these things._ 😕
+
+**Julian**: _That's gonna take awhile. Our tooling to show us which policies
+bind to a given workload doesn't go the other direction._
+
+**Jane**: _…Wait. You have to_ build tools _to figure out basic configuration??_
+
+Pause.
+
+**Julian**: _Policy attachment is more complex than we’d like, yeah._ 😐
+_Look, how ‘bout roll back your `baker` change for now? We can get together in
+the morning and start sorting this out._
+
+Jane shakes her head and rolls back her edits to the `baker` Deployment, then
+sits looking out over the lake as the deployment progresses.
+
+**Jane**: _Done. Are things happier now?_
+
+**Julian**: _Looks like, thanks. Reckon you can get back to your sailboard._ 🙂
+
+Jane sighs.
+
+**Jane**: _Wish I could. Wind’s died down, though, and the sun is almost gone.
+May as well head home._
+
+One more look out at the lake.
+
+**Jane**: _Thanks for the help. Wish we’d found better answers._ 😢
+
+## The Proposal
+
+The fundamental problem with policy attachment is that it **breaks the core
+premise of Kubernetes as a declarative system**, because it’s not declarative:
+it sets the world up for a sort of spooky action at a distance, to borrow
+Einstein’s phrase. We acknowledge that policy attachement is not the only
+place where we see this in Kubernetes, of course! but we submit that we should
+probably not be adding more such places.
+
+Given that the fundamental problem is that policy attachement isn't
+declarative as written and should be made declarative, there is only one
+fundamental answer: we need to modify the Kubernetes core resources to include
+extension points where a given object refers to its modifier, rather than
+having the modifying resource try to attach to its source. This is an ugly
+job, but it’s the only way to deal with this situation.
+
+This GEP proposes to start this process with the Gateway API resources.
 
 ## API
 
 TODO: future iteration
+
+## Questions and Answers
+
+**Q**: _Why are you implying that there’s a problem with policy attachment?
+Isn’t your parable really just showing us that Jane and Julian work for a
+dysfunctional organization?_
+
+**A**: As written, Evil Genius Cupcakes is far from the most dysfunctional
+organization I’ve seen. Jane and Julian support each other, neither casts
+blame, both are clearly trying to do their best by the organization and their
+customers even to their own cost. So the organization isn't really the
+problem.
+
+**Q**: _No organization would actually install a namespace-wide retry policy
+and then forget about it, though._
+
+**A**: I literally cannot even begin to count the number of times I’ve seen
+something like this happen.
+
+The most common scenario goes like this: it’s 8PM on a Friday and something
+goes wrong. There is much screaming, wailing, and gnashing of teeth as the
+on-call staff try to figure out what’s up. Inevitably, the SME is on vacation.
+Someone suggests retries and they hastily slap in the CRD to enable them. The
+post-mortem gets rescheduled a few times, and/or the person writing up the
+timeline mistakenly notes that the retries were enabled for a given workload
+rather than for the entire namespace, and no one ever figures out that error.
+It creates an action item of “fix this workload to not need retries”, that
+goes into the backlog, and it gets pushed down by more critical items.
+
+**Q**: _Okay, but in the real world, removing the RetryPolicy wouldn’t affect
+every workload._
+
+**A**: As soon as the namespace-wide RetryPolicy goes in, Jane’s team largely
+loses the backstop of progressive rollout. As long as their workloads don’t
+fail 100% of the time, progressive rollout will likely succeed; after a few
+months, it’s not even close to unlikely that every service will actually be
+failing pretty often.
+
+**Q**: _Fine. But in the real world, Jane would be able to see all the policy
+objects herself, and this would be a non-issue._
+
+**A**: Quick, write me a kubectl query to fetch every policy CRD that’s
+attached to an arbitrary object. Go ahead. I’ll wait. Make sure you get policy
+CRDs attached to the enclosing namespace, too.
+
+…
+
+There’s a big difference between “having permission to see” and “being able to
+effectively query and understand”. As policy attachment currently stands, you
+need to be able to query many different kinds of CRDs _and_ filter them in a
+couple of different ways that existing tooling isn't very good at.
+
+**Q**: _Well then, in the real world, Jane would have access to higher-level
+tools that know how to do that._
+
+**A**: Those tools need to be written, and Jane and her team need to be taught
+that the tools exist and how to use them. From Jane’s point of view, those
+tools are adding friction to her job, and honestly she’s right: why should she
+need to learn funky new tools instead of just putting the right thing in her
+HTTPRoutes?
+
+**Q**: _What if we give Julian those tools? He could cope with them._
+
+**A**: Sure, but now you’re back to a world in which Jane isn’t
+self-sufficient and has to bottleneck on Julian. Neither of them will like
+that.
+
+**Q**: _Doesn't direct policy attachment make things better?_
+
+**A**: Not really, no. The only real effect is that if you use direct policy
+attachment, you can’t land in a scenario that I considered but didn’t write
+about: in that one, Julian tries to tweak the RetryPolicy to disable the
+retries for `baker` alone, but runs afoul of an override installed by Jasmine
+from the cluster-ops team, which Julian doesn’t have permission to change… so
+he literally can’t even turn them off.
+
+**Q**: _OK, so isn’t this really just a retry thing? It’s not like all
+policies can affect things so broadly._
+
+**A**: Stating the obvious here: the whole point of policy attachment is to
+set policy. By definition, policy has very broad capabilities. Retry is
+actually a fairly narrow function: suppose the attached policy was a WAF which
+was intentionally applied on every namespace (gotta protect everything!), and
+Jasmine mistakenly changed its configuration? That could affect everything in
+the entire cluster – possibly only a week after Jasmine made the change, when
+the WAF gets an update that interacts poorly with the configuration change.
+
+**Q**: _Dude, c’mon. That’s Jasmine and the WAF shooting themselves in the
+foot, not a problem with policy attachment._
+
+**A**: You’re right that policy attachment didn’t cause the retry issue we
+looked at first, nor would it cause the WAF problem above. But it does make it
+much harder for Jane (the human directly affected) to understand what’s
+happening so she can fix it. That’s the problem that I’m concerned about.
+
+**Q**: _So you’re saying this is just impossible then, and you’re not
+listening to anything I ask._
+
+**A**: Well, most of your questions aren’t questions! But more importantly,
+see the next section.

From 7ac75d1350dde701f8744bdb54f4ebb4dac77a01 Mon Sep 17 00:00:00 2001
From: Flynn
Date: Thu, 11 May 2023 21:03:47 -0400
Subject: [PATCH 3/7] Wordsmithing.

Signed-off-by: Flynn
---
 geps/x.md | 121 +++++++++++++++++++++++++++++-------------------------
 1 file changed, 66 insertions(+), 55 deletions(-)

diff --git a/geps/x.md b/geps/x.md
index 27ab5176d6..1f54c982c7 100644
--- a/geps/x.md
+++ b/geps/x.md
@@ -28,28 +28,28 @@ that is declarative at the affected resource level.
 It's a sunny Wednesday afternoon, and the lead microservices developer for
 Evil Genius Cupcakes is windsurfing. Work has been eating Jane alive for the
 past two and a half weeks, but after successfully deploying version 3.6.0 of
-the `baker` service that morning, she left early to try to unwind a bit.
+the `baker` service that morning, she escaped early to try to unwind a bit.
 
 Her shoulders are just starting to unknot when her phone pings with a text
 from Julian, down in the NOC. Waterproof phones are a blessing, but also a
 curse.
 
 **Julian**: _Hey Jane. Things are still running, more or less, but latencies
-on everything in the baker namespace are crazy high after your last rollout,
-and baker itself has a weirdly high load. Sorry to interrupt you on the lake
+on everything in the `baker` namespace are crazy high after your last rollout,
+and `baker` itself has a weirdly high load. Sorry to interrupt you on the lake
 but can you take a look? Thanks!!_
 
-Jane stares at the phone for a long moment, then slumps and heads back to
-shore to dry off and grab her laptop.
+Jane stares at the phone for a long moment, heart sinking, then slowly tacks
+back to shore to dry off and grab her laptop.
 
-What she finds is strange. `baker` is taking a _lot_ of load, almost 4x what’s
-being reported by its usual clients, and its clients report that calls are
-taking much longer than they’d expect them to. She doublechecks the
-Deployment, the Service, and all the HTTPRoutes around `baker`; everything
+What she finds when she logs in is strange. `baker` is taking a _lot_ of load,
+almost 4x what’s being reported by its usual clients, and its clients report
+that calls are taking much longer than they’d expect them to. She doublechecks
+the Deployment, the Service, and all the HTTPRoutes around `baker`; everything
 looks good. `baker`’s logs show her mostly failed requests... with a lot of
-duplicate requests? Jane checks her HTTPRoute again, though she's pretty sure
-you can't configure retries there, and finds nothing. But it definitely looks
-like a client is retrying when it shouldn’t be.
+duplicates? Jane checks her HTTPRoute again, though she's pretty sure you
+can't configure retries there, and finds nothing. But it definitely looks like
+a client is retrying when it shouldn’t be.
 
 She pings Julian.
 
@@ -62,7 +62,7 @@ A minute later he answers.
 
 **Jane**: _Dude. I don’t even know how to._ 😂
 
-**Julian**: _You attach a RetryPolicy attached to your HTTPRoute?_
+**Julian**: _You just attach a RetryPolicy to your HTTPRoute._
 
 **Jane**: _Nope. Definitely didn’t do that._
 
@@ -77,7 +77,7 @@ Minutes pass while both look at logs.
 **Jane**: _OK, it’s definitely retrying. Nearly every request fails the first
 few times, gets retried, and then finally succeeds?_
 
-**Julian**: _Are you sure? I don’t see the `mixer` client making duplicate requests…_
+**Julian**: _Are you sure? I don’t see the `mixer` client making duplicate requests..._
 
 **Jane**: _Check both logs for request ID
 6E69546E-3CD8-4BED-9CE7-45CD3BF4B889. `mixer` sends that once, but `baker`
@@ -121,7 +121,7 @@ RetryPolicy go in? Is it the only thing like it?_
 
 **Julian**: _I didn’t look closely before deleting it, but I think it said a
 few months ago. And there are lots of different kinds of policy and lots of
-individual policies, hang on a minute…_
+individual policies, hang on a minute..._
 
 **Julian**: _Looks like about 47 for your chunk of the world, a couple hundred
 system-wide._
@@ -132,7 +132,7 @@ can’t even_ look _at these things._ 😕
 **Julian**: _That's gonna take awhile. Our tooling to show us which policies
 bind to a given workload doesn't go the other direction._
 
-**Jane**: _…Wait. You have to_ build tools _to figure out basic configuration??_
+**Jane**: _...wait. You have to_ build tools _to figure out basic configuration??_
 
 Pause.
 
@@ -161,9 +161,9 @@ One more look out at the lake.
 The fundamental problem with policy attachment is that it **breaks the core
 premise of Kubernetes as a declarative system**, because it’s not declarative:
 it sets the world up for a sort of spooky action at a distance, to borrow
-Einstein’s phrase. We acknowledge that policy attachement is not the only
-place where we see this in Kubernetes, of course! but we submit that we should
-probably not be adding more such places.
+Einstein’s phrase. Policy attachment is not the only place where we see this
+in Kubernetes, of course! but we submit that we shouldn't be adding any more
+such places.
 
 Given that the fundamental problem is that policy attachement isn't
 declarative as written and should be made declarative, there is only one
@@ -184,7 +184,7 @@ TODO: future iteration
 Isn’t your parable really just showing us that Jane and Julian work for a
 dysfunctional organization?_
 
-**A**: As written, Evil Genius Cupcakes is far from the most dysfunctional
+**A**: As written, Evil Genius Cupcakes is _far_ from the most dysfunctional
 organization I’ve seen. Jane and Julian support each other, neither casts
 blame, both are clearly trying to do their best by the organization and their
 customers even to their own cost. So the organization isn't really the
@@ -203,26 +203,29 @@ Someone suggests retries and they hastily slap in the CRD to enable them. The
 post-mortem gets rescheduled a few times, and/or the person writing up the
 timeline mistakenly notes that the retries were enabled for a given workload
 rather than for the entire namespace, and no one ever figures out that error.
-It creates an action item of “fix this workload to not need retries”, that
-goes into the backlog, and it gets pushed down by more critical items.
+The post-mortem results in an action item of “fix this workload to not need
+retries so we can turn retries off”, that goes into the backlog, and it gets
+pushed down by more critical items.
+
+That is a process problem for sure! but it's a sadly realistic one.
 
 **Q**: _Okay, but in the real world, removing the RetryPolicy wouldn’t affect
 every workload._
 
 **A**: As soon as the namespace-wide RetryPolicy goes in, Jane’s team largely
-loses the backstop of progressive rollout. As long as their workloads don’t
-fail 100% of the time, progressive rollout will likely succeed; after a few
-months, it’s not even close to unlikely that every service will actually be
-failing pretty often.
+loses the backstop of progressive rollout. As long as their workloads succeed
+sometimes, progressive rollout has a good chance to succeed. After the few
+months posited above, it’s not at all unlikely that every service will
+actually be failing pretty often.
 
 **Q**: _Fine. But in the real world, Jane would be able to see all the policy
 objects herself, and this would be a non-issue._
 
-**A**: Quick, write me a kubectl query to fetch every policy CRD that’s
-attached to an arbitrary object. Go ahead. I’ll wait. Make sure you get policy
-CRDs attached to the enclosing namespace, too.
+**A**: Assuming permission to see everything necessary, please write me a
+`kubectl` query to fetch every policy CRD that’s attached to an arbitrary
+object. Remember to get policy CRDs attached to the enclosing namespace, too.
 
-…
+Challenging, no?
 
 There’s a big difference between “having permission to see” and “being able to
 effectively query and understand”. As policy attachment currently stands, you
@@ -232,11 +235,11 @@ couple of different ways that existing tooling isn't very good at.
 **Q**: _Well then, in the real world, Jane would have access to higher-level
 tools that know how to do that._
 
-**A**: Those tools need to be written, and Jane and her team need to be taught
-that the tools exist and how to use them. From Jane’s point of view, those
-tools are adding friction to her job, and honestly she’s right: why should she
-need to learn funky new tools instead of just putting the right thing in her
-HTTPRoutes?
+**A**: Those tools have yet to be written. Once they are, Jane and her team
+will need to be taught that the tools exist and how to use them. From Jane’s
+point of view, it's simpler not to need those tools: she'd rather just put the
+right thing in her HTTPRoutes, and then be able to see them all when she reads
+her HTTPRoutes.
 
 **Q**: _What if we give Julian those tools? He could cope with them._
 
@@ -246,34 +249,42 @@ that.
 
 **Q**: _Doesn't direct policy attachment make things better?_
 
-**A**: Not really, no. The only real effect is that if you use direct policy
-attachment, you can’t land in a scenario that I considered but didn’t write
-about: in that one, Julian tries to tweak the RetryPolicy to disable the
-retries for `baker` alone, but runs afoul of an override installed by Jasmine
-from the cluster-ops team, which Julian doesn’t have permission to change… so
-he literally can’t even turn them off.
+**A**: Not really, no. Direct policy attachment is still spooky action at a
+distance, so it doesn't really make things markedly better.
+
+(That said, direct policy attachment _does_ sidestep a specific very
+unpleasant scenario that I considered but didn’t write about. In that one,
+Julian tries to tweak the RetryPolicy to disable the retries for just the
+`baker` workload, but runs afoul of an override installed by Jasmine from the
+cluster-ops team, which Julian doesn’t have permission to even see... so he
+has to infer the existence of the override he can't see, and he can't do
+anything about it.)
 
 **Q**: _OK, so isn’t this really just a retry thing? It’s not like all
 policies can affect things so broadly._
 
-**A**: Stating the obvious here: the whole point of policy attachment is to
-set policy. By definition, policy has very broad capabilities. Retry is
-actually a fairly narrow function: suppose the attached policy was a WAF which
-was intentionally applied on every namespace (gotta protect everything!), and
-Jasmine mistakenly changed its configuration? That could affect everything in
-the entire cluster – possibly only a week after Jasmine made the change, when
-the WAF gets an update that interacts poorly with the configuration change.
+**A**: To state the obvious, the whole point of policy attachment is to set
+policy -- and by definition, policy has very broad capabilities. Retry is
+actually a fairly _narrow_ function: suppose the attached policy was instead a
+WAF which was intentionally applied on every namespace (gotta protect
+everything!), and Jasmine mistakenly changed its configuration? That could
+affect everything in the entire cluster – possibly only a week after Jasmine
+made the change, when the WAF gets an update that interacts poorly with the
+configuration change.
 
 **Q**: _Dude, c’mon. That’s Jasmine and the WAF shooting themselves in the
 foot, not a problem with policy attachment._
 
 **A**: You’re right that policy attachment didn’t cause the retry issue we
-looked at first, nor would it cause the WAF problem above. But it does make it
-much harder for Jane (the human directly affected) to understand what’s
-happening so she can fix it. That’s the problem that I’m concerned about.
+looked at first, nor would it cause the WAF problem above. What we're
+concerned about is that policy attachment _does_ make it much harder for Jane
+to understand what's happening so that she can fix it. That will have a real
+impact on real people.
+
+**Q**: _So you're just saying that everything is impossible and you're not
+listening to my questions._
 
-**Q**: _So you’re saying this is just impossible then, and you’re not
-listening to anything I ask._
+**A**: Well, most of your "questions" aren't questions! 🙂
 
-**A**: Well, most of your questions aren’t questions! But more importantly,
-see the next section.
+And we definitely think it's possible to do something about the situation;
+that's what this proposal is all about.

From 7b67238cda0e2323ec4ce8bbe948935502e93cff Mon Sep 17 00:00:00 2001
From: Flynn
Date: Fri, 12 May 2023 11:03:27 -0400
Subject: [PATCH 4/7] Moar wordsmithing.

Signed-off-by: Flynn
---
 geps/x.md | 56 ++++++++++++++++++++++---------------------------------
 1 file changed, 22 insertions(+), 34 deletions(-)

diff --git a/geps/x.md b/geps/x.md
index 1f54c982c7..257bcc253d 100644
--- a/geps/x.md
+++ b/geps/x.md
@@ -28,7 +28,7 @@ that is declarative at the affected resource level.
 It's a sunny Wednesday afternoon, and the lead microservices developer for
 Evil Genius Cupcakes is windsurfing. Work has been eating Jane alive for the
 past two and a half weeks, but after successfully deploying version 3.6.0 of
-the `baker` service that morning, she escaped early to try to unwind a bit.
+the `baker` service this morning, she's escaped early to try to unwind a bit.
 
 Her shoulders are just starting to unknot when her phone pings with a text
 from Julian, down in the NOC. Waterproof phones are a blessing, but also a
@@ -39,17 +39,17 @@ on everything in the `baker` namespace are crazy high after your last rollout,
 and `baker` itself has a weirdly high load. Sorry to interrupt you on the lake
 but can you take a look? Thanks!!_
 
-Jane stares at the phone for a long moment, heart sinking, then slowly tacks
-back to shore to dry off and grab her laptop.
+Jane stares at the phone for a long moment, heart sinking, then sighs and
+turns back to shore.
 
-What she finds when she logs in is strange. `baker` is taking a _lot_ of load,
-almost 4x what’s being reported by its usual clients, and its clients report
-that calls are taking much longer than they’d expect them to. She doublechecks
-the Deployment, the Service, and all the HTTPRoutes around `baker`; everything
+What she finds when she dries off and grabs her laptop is strange. `baker` does
+seem to be taking much more load than its clients are sending, and its clients
+report much higher latencies than they’d expect. She doublechecks the
+Deployment, the Service, and all the HTTPRoutes around `baker`; everything
 looks good. `baker`’s logs show her mostly failed requests... with a lot of
 duplicates? Jane checks her HTTPRoute again, though she's pretty sure you
 can't configure retries there, and finds nothing. But it definitely looks like
-a client is retrying when it shouldn’t be.
+clients are retrying when they shouldn’t be.
 
 She pings Julian.
 
@@ -74,18 +74,6 @@ She types `kubectl get retrypolicy -n baker` and gets a permission error.
 
 Minutes pass while both look at logs.
 
-**Jane**: _OK, it’s definitely retrying. Nearly every request fails the first
-few times, gets retried, and then finally succeeds?_
-
-**Julian**: _Are you sure? I don’t see the `mixer` client making duplicate requests..._
-
-**Jane**: _Check both logs for request ID
-6E69546E-3CD8-4BED-9CE7-45CD3BF4B889. `mixer` sends that once, but `baker`
-shows it arriving four times in quick succession. Only the fourth one
-succeeds. That has to be retries._
-
-Another pause.
-
 **Julian**: _I’m an idiot. There’s a RetryPolicy for the whole namespace –
 sorry, too many policies in the dashboard and I missed it. Deleting that since
 you don’t want retries._
@@ -93,9 +81,8 @@ you don’t want retries._
 **Jane**: _Are you sure that’s a good–_
 
 Jane’s phone shrills while she’s typing, and she drops it. When she picks it
-up again she sees a stack of alerts. Quickly flipping through them, she feels
-the blood drain from her face: there’s one for every single service in the
-`baker` namespace.
+up again she sees a stack of alerts. She goes pale as she quickly flips
+through them: there’s one for every single service in the `baker` namespace.
 
 **Jane**: _PUT IT BACK!!_
 
@@ -107,20 +94,19 @@ the blood drain from her face: there’s one for every single service in the
 fail a lot, but they seem OK because there are retries across the whole
 namespace?_ 🤔
 
-Jane’s jaw drops.
-
-**Jane**: _You’re saying that ALL of our services are broken??!_
+Jane's blood runs cold.
 
-**Julian**: _That’s what it looks like. Guessing your `baker` rollout would
-have failed without retries turned on._
+**Julian**: _Yeah. Looking a little closer, I think your `baker` rollout this
+morning would have failed without those retries._ 😕
 
-There is a pause while Jane thinks through increasingly unpleasant possibilities.
+There is a pause while Jane's mind races through increasingly unpleasant
+possibilities.
 
 **Jane**: _I don't even know where to start here. How long did that
 RetryPolicy go in? Is it the only thing like it?_
 
-**Julian**: _I didn’t look closely before deleting it, but I think it said a
-few months ago. And there are lots of different kinds of policy and lots of
+**Julian**: _Didn’t look closely before deleting it, but I think it said a few
+months ago. And there are lots of different kinds of policy and lots of
 individual policies, hang on a minute..._
 
 **Julian**: _Looks like about 47 for your chunk of the world, a couple hundred
@@ -132,7 +118,7 @@ can’t even_ look _at these things._ 😕
 **Julian**: _That's gonna take awhile. Our tooling to show us which policies
 bind to a given workload doesn't go the other direction._
 
-**Jane**: _...wait. You have to_ build tools _to figure out basic configuration??_
+**Jane**: _...wait. You have to_ build tools _to know if retries are turned on??_
 
 Pause.
 
@@ -149,8 +135,10 @@ sits looking out over the lake as the deployment progresses.
 
 Jane sighs.
 
-**Jane**: _Wish I could. Wind’s died down, though, and the sun is almost gone.
-May as well head home._
+**Jane**: _Wish I could. Wind’s died down, though, and it'll be dark soon.
+Just gonna head home._
+
+**Julian**: _Ouch. Sorry to hear that._ 😐
 
 One more look out at the lake.
 

From 001f3e1887b87f8d9540f4c95c67776c5dd99135 Mon Sep 17 00:00:00 2001
From: Flynn
Date: Fri, 12 May 2023 12:41:43 -0400
Subject: [PATCH 5/7] Real number and last round of wordsmithing.

Signed-off-by: Flynn
---
 geps/{x.md => gep-2014.md} | 40 +++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)
 rename geps/{x.md => gep-2014.md} (88%)

diff --git a/geps/x.md b/geps/gep-2014.md
similarity index 88%
rename from geps/x.md
rename to geps/gep-2014.md
index 257bcc253d..bd4d207a82 100644
--- a/geps/x.md
+++ b/geps/gep-2014.md
@@ -1,20 +1,24 @@
 # GEP-X: Declarative Policy
 
-* Issue: TODO
+* Issue: [2014](https://github.com/kubernetes-sigs/gateway-api/issues/2014)
 * Status: Provisional
+* Authors: [Flynn](mailto:flynn@buoyant.io); [Shane Utt](mailto:shane@konghq.com)
 
 ## Definitions
 
-In this document we'll use `Policy` to refer to things that are specifically called policies
-as well as other "MetaResources" that follow similar patterns.
+In this document we'll use `policy` to refer to any resource whose purpose is
+setting policy around other resources. Notably, this could include either
+"policies" or "metaresources" as used in other documents: we're intentionally
+using the broader scope here.
 
-## TLDR +## tl;dr: -This proposal is a follow-up to [GEP-713 Metaresources and Policy Attachment][713] to recommend -that we specifically remove the "attachment" part of "policy attachment" in favor of something -that is declarative at the affected resource level. +This proposal is a follow-up to [GEP-713 Metaresources and Policy Attachment] +to recommend that we specifically remove the "attachment" part of "policy +attachment" in favor of something that is declarative at the affected resource +level. -[713]:https://gateway-api.sigs.k8s.io/geps/gep-713/ +[GEP-713 Metaresources and Policy Attachment]:https://gateway-api.sigs.k8s.io/geps/gep-713/ ## Goals @@ -23,6 +27,10 @@ that is declarative at the affected resource level. - Provide new semantics to incorporate `Policy` resources at the level of the `Resource` that will be affected. +## Non-Goals + +- To be clarified + ## The Problem: A Parable of Jane It's a sunny Wednesday afternoon, and the lead microservices developer for @@ -157,20 +165,26 @@ Given that the fundamental problem is that policy attachement isn't declarative as written and should be made declarative, there is only one fundamental answer: we need to modify the Kubernetes core resources to include extension points where a given object refers to its modifier, rather than -having the modifying resource try to attach to its source. This is an ugly -job, but it’s the only way to deal with this situation. +having the modifying resource try to attach to its source. (For the record, we +take no joy in this statement, but we do feel that it's the correct answer.) This GEP proposes to start this process with the Gateway API resources. +A final note: while it's important to acknowledge that policy attachment is +**not** the root cause of the application problems that Jane and Julian have +in the parable above, it's also important to recognize that policy attachment +makes understanding and fixing the problem much more difficult. 
That's the +primary concern behind this GEP. + ## API TODO: future iteration ## Questions and Answers -**Q**: _Why are you implying that there’s a problem with policy attachment? -Isn’t your parable really just showing us that Jane and Julian work for a -dysfunctional organization?_ +**Q**: _Isn’t your parable really just showing us that Jane and Julian work +for a dysfunctional organization, rather than showing anything wrong with +policy attachment?_ **A**: As written, Evil Genius Cupcakes is _far_ from the most dysfunctional organization I’ve seen. Jane and Julian support each other, neither casts From 40f2fb41b37854581a70325573738757dd8bc119 Mon Sep 17 00:00:00 2001 From: Flynn Date: Fri, 12 May 2023 12:46:16 -0400 Subject: [PATCH 6/7] Update mkdocs.yml as well Signed-off-by: Flynn --- mkdocs.yml | 1 + 1 file changed, 1 insertion(+) diff --git a/mkdocs.yml b/mkdocs.yml index 355d36836c..274c3bd38b 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -84,6 +84,7 @@ nav: - geps/gep-1324.md - geps/gep-1282.md - geps/gep-1897.md + - geps/gep-2014.md - Prototyping: - geps/gep-1709.md - Experimental: From 03731e331a6c566cc93a4c6c0020cc7d01f01eb9 Mon Sep 17 00:00:00 2001 From: Flynn Date: Fri, 12 May 2023 12:49:06 -0400 Subject: [PATCH 7/7] Fix title; switch to GitHub profiles for authors Signed-off-by: Flynn --- geps/gep-2014.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/geps/gep-2014.md b/geps/gep-2014.md index bd4d207a82..189cf36401 100644 --- a/geps/gep-2014.md +++ b/geps/gep-2014.md @@ -1,8 +1,8 @@ -# GEP-X: Declarative Policy +# GEP-2014: Declarative Policy * Issue: [2014](https://github.com/kubernetes-sigs/gateway-api/issues/2014) * Status: Provisional -* Authors: [Flynn](mailto:flynn@buoyant.io); [Shane Utt](mailto:shane@konghq.com) +* Authors: [Flynn](https://github.com/kflynn); [Shane Utt](https://github.com/shaneutt) ## Definitions