
Add resource constraints for CEL #3144

Merged

Commits (20):
- `ae09cbb`: Add initial info on CEL resource constraints. (DangerOnTheRanger, Jan 13, 2022)
- `16283e1`: Reword to imply depth limiting is just an option. (DangerOnTheRanger, Jan 13, 2022)
- `11be065`: Move resource constraint timeout to the top. (DangerOnTheRanger, Jan 13, 2022)
- `eea4fd8`: Add link to CEL cost subsystem code. (DangerOnTheRanger, Jan 13, 2022)
- `ff38644`: Change timeout from per-expression to per-request. (DangerOnTheRanger, Jan 13, 2022)
- `efda13f`: Move paragraph on cost subsystem to the end. (DangerOnTheRanger, Jan 13, 2022)
- `fa0b979`: Change CEL expression timeout to 5 seconds. (DangerOnTheRanger, Jan 13, 2022)
- `bf2b434`: Change text that was out-of-date after reformatting. (DangerOnTheRanger, Jan 13, 2022)
- `76e4605`: Note that the timeout message will include evaluation time. (DangerOnTheRanger, Jan 13, 2022)
- `e4b49ba`: Specifcy that Beta timeout will be 100ms-1s. (DangerOnTheRanger, Jan 14, 2022)
- `8bd777a`: Run hack/update-toc.sh. (DangerOnTheRanger, Jan 14, 2022)
- `e541ae3`: Clarify that Beta performance will be tested against benchmarks. (DangerOnTheRanger, Jan 14, 2022)
- `6a550cb`: Prepare for final review. (DangerOnTheRanger, Feb 1, 2022)
- `ac9e0e7`: Run hack/update-toc.sh (DangerOnTheRanger, Feb 1, 2022)
- `4e440b0`: Update in response to feedback. (DangerOnTheRanger, Feb 3, 2022)
- `3e36b22`: Remove per-expression timeout, make per-request timeout sourced from … (DangerOnTheRanger, Feb 3, 2022)
- `8736077`: Add runtime cost check. (DangerOnTheRanger, Feb 3, 2022)
- `49fe807`: Run hack/update-toc.sh (DangerOnTheRanger, Feb 3, 2022)
- `5f96b1c`: Minor fixes with APF wording. (DangerOnTheRanger, Feb 3, 2022)
- `5f44379`: Keep cost bounds as future work. (DangerOnTheRanger, Feb 3, 2022)
163 changes: 143 additions & 20 deletions keps/sig-api-machinery/2876-crd-validation-expression-language/README.md
@@ -592,26 +592,149 @@ associativeList.all(e, e.val == 100)

### Resource constraints

For Beta, per-request execution time will be constrained via a context-passed timeout. For Beta, we are looking at a range between 100 milliseconds and 1 second for the timeout. We want to see how
CEL performs in benchmarks once our Beta optimizations are in place before deciding what specific
duration to set the timeout to. If the timeout is exceeded, the error message will include how
long each expression took to evaluate.

We will provide time and space benchmarks to show what can be expected in typical and worst-case
scenarios for CEL expressions, and that for most use cases, the timeout alone will be sufficient.
The alternatives listed below would be more aimed at malformed/malicious expressions.

If the context timeout proves to be insufficient, we have other alternatives to explore, such as
limiting nested list comprehensions to a depth of 2. According to the performance section of
[the CEL spec](https://github.com/google/cel-spec/blob/master/doc/langdef.md#performance), the
time complexity of these operations are all O(n), so by limiting nesting to 2 deep, we limit
expression complexity to O(n<sup>2</sup>). We can limit regular expression complexity
as well.

CEL also provides a [cost subsystem](https://github.com/google/cel-go/blob/dfef54b359b05532fb9695bc88937aa8530ab055/cel/program.go#L309) that could be used in the future,
but the cost subsystem would need to know the length of any relevant lists in order to be useful.
That information can be supplied using `maxLength`, but this is an optional field, and if not
passed, CEL would not be able to provide a useful figure.
CEL expressions have the potential to consume unacceptable amounts of API server resources. We intend to constrain the resource utilization in two ways:

- Proactively by static analysis of the CEL expressions to compute a "cost", and limiting that cost
- Reactively by enforcing CEL expression evaluation time limits using go context cancelation

We intend to enforce the limits both on a per-expression and per-request basis.

Per-expression limits are primarily targeted at CEL expression authors and are intended to provide actionable feedback about expression cost. They are insufficient for ensuring that requests will be processed in a bounded amount of time. For example, a validation rule for a field within a list of objects might be well within bounds each time it is evaluated, but the sum of the costs across the list of objects may be unacceptable. This is why we also have a per-request limit.

#### Cost limits

We will use CEL's [cost subsystem](https://github.com/google/cel-go/blob/dfef54b359b05532fb9695bc88937aa8530ab055/cel/program.go#L309) to provide our proactive limits. This protects against both
accidental and malicious misuse. If an expression is analyzed and has a cost greater than
a predefined limit, it will not be allowed. If an expression is rejected for this reason, the error
message will include by how much the limit was exceeded.

A major problem with the cost system is assigning the cost of list iteration. The cost of a CEL expression is
computed statically without any knowledge about the data that will be validated. Only the CEL expression is
> **Member:** And don't forget boolean short-circuiting; it basically becomes an n-SAT problem, which is NP-complete.

> **@jpbetz (Contributor), Feb 3, 2022:** Right. CEL is very constrained, though. We're using a CEL runtime with a limit of 32 levels of expression nesting, a limit of 32 terms separated by `||` in a row, the expression can't define its own functions, and it has to fit within a CRD. I'm curious what the worst-case CEL expression would be here? It might still be pretty expensive? If so, we might want to tighten some bounds. It seems containable to me, but I'd love to see someone prove me wrong (or right).

available when computing cost. This means that the sizes of variable length data types (lists, maps, sets,
strings, byte arrays) are entirely unknown. This limits our options for computing the cost of CEL expressions
that iterate across these list types. Setting high costs for iteration is a major problem because, in
practice, lists are short. For example, a pod has a list of containers, but the list is usually quite short.

There are a couple ways to address this problem:

- We require CRD authors to provide `maxLength` on all variable length data types that they iterate across in CEL expressions so we know exactly how to compute cost
- We do not attempt to statically compute a cost and instead rely entirely on time based limits
- We prove that, in practice, cost limits can be found that are operationally safe while still not being overly restrictive

We have explored the cost system and realized (thanks @liggitt!) that O(n) iteration is bounded by custom
resource size limits (specifically the 3MB request limit) and we can use this to establish more manageable
worst case sizes.

If a list contains small elements (e.g. an int64), then the worst case list size can be 3MB / 4 bytes ≈ 786k
> **Member:** Nit: 2 bytes per item, right? `0,0,0,0,0,...`

> **Contributor:** Yes, that's true; CRDs are stored in JSON. (Kermit and I were thinking about the in-memory representation when writing this...)

elements, but as the element size grows, the worst case list size quickly becomes manageable. We ran a [series of experiments](https://docs.google.com/document/d/1yR746Rf-rw-_zoq36Ypzu8LTqOa-LaCk1pv3e0kjF3A/edit?usp=sharing) to test the resource utilization of CEL expressions that iterate across lists.
We found that:

- O(n) iteration of a worst case list (list length == 786k) is <150ms
> **@liggitt (Member), Feb 2, 2022:** Does that mean an iteration of a list of integers would not fit in the proposed 100ms limit? Edit: oh, I guess an actual list of integers in a 3MB CR...
>
> Also, I think the worst case is actually ~1572k (see the above comment about items being 2 bytes).

> **Contributor (author):** Right, there are some corner cases that wouldn't fit within the wall-clock timeout, at least on the hardware tested. For instance, the 3-million-long list scenario in the performance doc. Though in that case iteration took nearly a full 0.8 seconds, hence the timeouts/cost limits being built at least in part around what would provide a good experience request-wise (as opposed to covering all CEL cases).

> **Contributor:** Yes, the integer case is a bit problematic that way (and you're correct that it's 2x what we had calculated, so its cost estimate would be closer to a worst case of 300ms). I suspect we can get the expected latency down for this once we optimize some of our CEL type integration code. If we get this under 100ms, then I think we can get to the point where all O(n) ops can be within our cost limits. But there is some risk that this specific case won't, in which case we'd either need to make a special exception to allow it anyway (since most lists in practice won't be nearly this bad), or require users to set `maxLength` on the field in order to use CEL iteration operations on the list.

> **Member:** I think I'd be OK with basically any CEL expression touching things within lists requiring a `maxLength` be specified on the list.

- O(n) iteration of 10 byte data elements (list length == 300k) is < 70ms
- O(n) iteration of 100 byte data elements (list length == 30k) is < 7ms
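The worst-case lengths in the bullets above follow directly from dividing the 3MB request limit by the minimum serialized element size. A minimal sketch of that arithmetic (the helper name `worstCaseLength` and the per-element byte sizes are illustrative assumptions, not the actual implementation):

```go
package main

import "fmt"

// maxRequestBytes is the custom resource request size limit cited above (3MB).
const maxRequestBytes = 3 * 1024 * 1024

// worstCaseLength estimates the largest number of elements a list could
// hold, given a statically derived minimum size (in bytes) per element.
func worstCaseLength(minElementBytes int) int {
	if minElementBytes < 1 {
		minElementBytes = 1 // guard against a zero divisor
	}
	return maxRequestBytes / minElementBytes
}

func main() {
	fmt.Println(worstCaseLength(4))   // ~786k for int64-sized elements
	fmt.Println(worstCaseLength(10))  // ~300k for 10-byte elements
	fmt.Println(worstCaseLength(100)) // ~30k for 100-byte elements
}
```
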
> **Member:** The concrete times are heavily dependent on the system running the test. What safety factor are we building in for our estimated translation of abstract CEL cost to bounded run time, to account for lower-powered servers and server load?

> **Contributor (author):** If I'm understanding correctly (what happens when the cost doesn't account for a significantly underpowered server?), then the wall-clock timeout should catch those kinds of events.

> **@jpbetz (Contributor), Feb 3, 2022:** Let's add the safety factor. I like the idea of having a convincing calculation of our "worst case". I think it helps ground the following calculations, like showing that we can allow O(n) in most cases without requiring a `maxLength`.
>
> How about 5x as the safety factor? I.e., CEL is getting 20% of the available compute, or the machine is 1/5 as powerful? (Since there is no bound to how little CPU a process might be granted, we can't calculate a finite worst case, so we just need to pick something and document it.)


Consider an example:

```
list.all(element, element.startsWith("prefix:"))
```

The cost of this expression is:

```
cost('startsWith()' function) * worst_case_length(list)
```

If the `worst_case_length()` is 30k for the list (e.g. the list elements are objects where the minimum size of
each object can be computed statically to be 100 bytes) and the cost of `startsWith()` is, say, 3, we know that
this is < 21ms to evaluate.
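A rough sketch of that calculation (the helper name and constants are illustrative assumptions, not the real cost estimator):

```go
package main

import "fmt"

// exprCost mirrors the formula above: the per-element cost of the function
// applied in the comprehension, multiplied by the worst-case list length.
func exprCost(perElementCost, worstCaseLength int) int {
	return perElementCost * worstCaseLength
}

func main() {
	// cost('startsWith()') = 3, worst_case_length(list) = 30k
	fmt.Println(exprCost(3, 30000)) // 90000 cost units
}
```
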

Note that we could also have a much lower `worst_case_length()` if the CRD author provided a `maxLength` limit on their list size.

This implies that we can use cost to:

- Set limits that ensure CEL expressions evaluate within known time bounds
- Allow O(n) operations so long as the per-element cost is moderate (e.g. 100 cost)
- And again, allow more expensive per-element cost when `maxLength` is set
- Allow many (100s) CEL expressions in a CRD that are within these complexity bounds
- Disallow O(n<sup>2</sup>) or worse operations based purely on their cost
- But still allow O(n<sup>2</sup>) or worse so long as the CRD author adds a `maxLength` limit to the lists and the resulting cost is within cost bounds (e.g. O(n<sup>3</sup>) is still fine for a list size of 3).

Also, not requiring `maxLength` on most O(n) CEL expressions keeps the cost system low friction; the majority
of CEL expressions can be written and used without bumping into cost limits or needing to set `maxLength`.

For the cases where a `maxLength` is needed to ensure cost is acceptable, encouraging CRD authors to include
> **Member:** If the incoming object is determined to be invalid according to the OpenAPI schema (violates `maxLength`, or omits required fields in items that informed our calculated minimum size, etc.), do we still evaluate CEL expressions anyway? I think we currently evaluate CEL in addition to OpenAPI validation (@jpbetz can confirm). We may need to change that if we're counting on `maxLength` or required fields in our cost calculations.

> **Contributor:** This is a good point. We do run both OpenAPI validation and CEL validation. So at a minimum we would need to change the behavior and skip CEL expression validation when we fail `maxLength` validations that are relevant to that CEL expression. That needs to be stated here somewhere.

> **Member:** (This comment is non-blocking for the KEP update, but we'll need to resolve what we do here for the impl... I'm OK with seeing if the runtime cost bounds protect us sufficiently before committing to run or not run CEL rules in that scenario.)

`maxLength` on variable length data elements is generally considered good hygiene anyway, so having a cost
system that incentivizes developers to set this field has other benefits.

We will use the minimum possible size of list elements to estimate the size (e.g. objects with many fields
> **Member:** A couple of edge cases to pay attention to:
>
> - Don't consider optional fields when determining the minimum item size (the request can omit them).
> - Don't consider required fields with a default specified when determining the minimum item size (the request can omit them).

> **Contributor (author):** Good thinking, thanks!

> **Member:** (This doesn't need to be reflected in the KEP, but should be remembered for impl time.)

will have much larger minimum sizes). This helps us pick smaller worst case list sizes than we would have to pick if we had to assume all lists contain 1 byte elements.

For example, consider the element size of the list `x` in this schema:

```
spec:
  type: object
  properties:
    x:
      x-kubernetes-validation:
      - rule: "self.all(element, element.y.startsWith('prefix:'))"
      type: array
      items:
        type: object
        properties:
          y:
            type: string
          z:
            type: string
```

The worst case list size of `x` would be based on the size of an object with two strings (`y` and `z`).
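To illustrate how such a minimum could be derived, here is a hedged sketch that lower-bounds the serialized JSON size of an object with string fields. The helper name and the decision to count only structural characters are assumptions for illustration, and, per the review note above, only required fields should be counted:

```go
package main

import "fmt"

// minObjectBytes lower-bounds the serialized JSON size of an object whose
// required fields are all strings, counting braces, quoted keys, colons,
// commas, and minimal (empty) string values. Optional fields are
// deliberately excluded, since a request can omit them.
func minObjectBytes(requiredStringKeys ...string) int {
	size := 2 // enclosing "{}"
	for i, key := range requiredStringKeys {
		if i > 0 {
			size++ // ',' between members
		}
		size += len(key) + 2 // quoted key
		size++               // ':'
		size += 2            // minimal value: ""
	}
	return size
}

func main() {
	// An element of `x` with two required string fields, y and z.
	fmt.Println(minObjectBytes("y", "z")) // 15 bytes: {"y":"","z":""}
}
```
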

We will also account for where in a schema a CEL expression is placed when calculating its cost. For example:

```
spec:
  type: object
  properties:
    x:
      type: array
      items:
        type: object
        properties:
          y:
            type: string
            x-kubernetes-validation:
            - rule: "self.startsWith('prefix:')"
```

In this example the validation rule is contained within a list, so its expression cost would be `cost(startsWith) * worst_case_length(x)`. Note that this means some of the cost calculation will
happen outside of the CEL cost function (which would only calculate the `startsWith` cost).
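A sketch of that placement-aware scaling, where a rule's base cost is multiplied by the worst-case length of each enclosing array (the function name is hypothetical):

```go
package main

import "fmt"

// scaledCost multiplies a rule's base cost by the worst-case length of
// every array enclosing the schema node the rule is attached to, since
// the rule runs once per element at each level of nesting.
func scaledCost(baseCost int, enclosingArrayLengths ...int) int {
	cost := baseCost
	for _, n := range enclosingArrayLengths {
		cost *= n
	}
	return cost
}

func main() {
	// A cost-1 rule nested inside two arrays, each with worst-case length 1000.
	fmt.Println(scaledCost(1, 1000, 1000)) // 1000000
}
```
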

During CRD validation, when an expression exceeds the cost limit, the error message will include both the limit and the cost of the expression to facilitate debugging. The cost calculation rules will be documented.
> **Member:** How confident are we in the stability of the cost calculations (the ones done in CEL and what we're planning to layer on top)? How will we ensure that we will not accept a CRD in one release and reject it in the future due to a tweak in the cost heuristics?

> **Member:** Also, will publishing the limit encourage construction of expressions that squeak in just under the limit?

> **@jpbetz (Contributor), Feb 3, 2022:** I think we can do some things to boost our confidence. I would like to write a fuzzer that tries to find the most expensive CEL expression for a given "cost" limit. If we could get something like that working and show that we don't have any gaping discrepancies between the cost heuristic and reality, I think we could start to feel a bit more confident. That said, regex matching worries me.
>
> Re: expressions that squeak in just under the limit. Maybe? I'd like to see how developers respond to the ability to set `maxLength`; I suspect that the vast majority of cost problems will be most trivially solved by setting it to a value that is higher than they ever expect to actually hit, but still much lower than the max the cost heuristics would pick.


#### Time limits

We believe CEL cost will be sufficient to bound the resource utilization of the vast majority of CEL
expressions. As a backstop, we will use timeouts. We will set these timeouts high enough that they are
disruptive only to the most extremely resource-starved API servers. We don't believe this will
> **Member:** I dunno... I've seen some pretty resource-starved servers... If a server is starved and is context-switching between requests, is artificially failing the CEL evaluation at 100 ms, or all expressions at 1 second, helpful?

> **Member:** It would be slightly better if this were counting CPU time and not wall time, but off the top of my head I'm not sure how to do that easily.

cause problems in practice because there are other timeouts that will trigger under those conditions.

Per-request and per-expression execution time will be constrained via Go context cancellation, using a context-passed timeout.
> **Member:** At minimum, we should definitely constrain it to the request lifetime using the request context. I'm less certain about constraining it to 100ms/1s targets without knowledge of the server power and load.

> **Contributor:** I think I'll be OK with just using the request context timeout, so long as we can build enough confidence that the cost heuristic works reasonably well, which I think we need to do anyway.

> **Member:** Do we have an idea of how CEL is going to implement cancellation? Does it check whether the context is cancelled on every step of a loop, every few steps, or something else?


> **Member:** I don't like that the time limit could make this inconsistent depending on how busy the apiserver is. Does CEL offer a feature for counting up and limiting actual runtime cost? I'd really prefer a limit that is reproducible given the input object(s).

> **@liggitt (Member), Feb 3, 2022:**
>
> > I don't like that the time limit could make this inconsistent depending on how busy the apiserver is.
>
> Agreed.

> **@jpbetz (Contributor), Feb 3, 2022:** This seems doable to me. At compile time we can compute a cost heuristic with incomplete information. Running a cost counter at evaluation time seems no harder, and maybe quite a bit easier. I'd be interested in exploring this. If we can get it working, I'd be in favor of using it instead of wall-clock time.

> **Member:** I would strongly prefer this to a time-based approach if it is doable.

> **Member:** Oh, hrmm... I wasn't thinking about calculating cost per request. The worst-case pre-calculation at CRD edit time seems more palatable.

> **Member:** I don't think you should compute cost per request; I think you should give CEL an expression budget and let it halt if the expression goes over its budget.

> **@jpbetz (Contributor), Feb 3, 2022:** A case that seems relevant to this discussion:
>
> ```
> spec:
>   type: object
>   properties:
>     x:
>       type: array
>       items:
>         type: object
>         properties:
>           y:
>             type: int
>             x-kubernetes-validation:
>             - rule: "self < 100"
> ```
>
> Here the rule is cheap (I'm pretty sure the cost is 1). But it's contained in a list which might be huge, so it might be evaluated many times.
>
> (Edit: if we pre-compute this cost at CRD update time, we were planning to multiply the expression cost by the worst-case length of the array. If we instead give the expression a budget, should the budget cover all evaluations of that expression that happen during a request?)

> **@jpbetz (Contributor), Feb 3, 2022:** The current alternatives I'm seeing:
>
> 1. Proactive: we pre-compute the worst-case cost of each expression when validation rules are written to a CRD. Expressions that are O(n) or worse will usually need to set `maxLength`s to bring their worst-case cost down into range.
> 2. Reactive: we count cost at CEL expression runtime, accumulating all the cost an expression has consumed and exiting early if the cost limit is hit (because this is at runtime, APF can be used to pick limits?).
> 3. We do both proactive and reactive (but with different limits?).
> 4. We do one of the above, but also with a per-request limit (to handle the case where a CRD has many expensive CEL expressions).
> 5. We do one of the above, but limit the total number of expressions allowed in a CRD (to handle the same situation as 4).

> **Member:** I think you should keep two budgets, overall and per-expression.
>
> When running an expression, its budget is the minimum of the remaining overall budget and the per-expression budget.
>
> After running an expression, you should decrement whatever it used from the remaining overall budget.
>
> (Assuming you can do this with CEL.)


If the per-expression timeout is exceeded, the error message will include how long the expression in question took to execute. If the per-request timeout is exceeded, the error message will instead include how long every expression took to execute up to and including the one where the timeout was exceeded.
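If runtime cost accounting is adopted as discussed in the review thread above, the two-budget bookkeeping could be sketched as follows. The type and method names are hypothetical; the constants are the per-expression and per-request cost limits proposed later in this section:

```go
package main

import (
	"errors"
	"fmt"
)

// costBudget tracks a shared per-request allowance alongside a fixed
// per-expression cap, per the two-budget scheme suggested in review.
type costBudget struct {
	remainingRequest int // decremented as expressions evaluate
	perExpression    int // fixed cap for any single expression
}

// charge records the cost of one expression evaluation, rejecting it if
// it exceeds the per-expression cap or the remaining request budget.
func (b *costBudget) charge(cost int) error {
	limit := b.perExpression
	if b.remainingRequest < limit {
		limit = b.remainingRequest
	}
	if cost > limit {
		return errors.New("CEL cost budget exceeded")
	}
	b.remainingRequest -= cost
	return nil
}

func main() {
	b := &costBudget{remainingRequest: 12_000_000, perExpression: 120_000}
	fmt.Println(b.charge(100_000)) // within both budgets: <nil>
	fmt.Println(b.charge(200_000)) // over the per-expression cap: error
}
```
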

We will continue to refine and improve this cost data. For example, optimizations to how CEL integrates with Kubernetes data may allow us to increase our cost bounds.
> **Member:** What are the compatibility implications of changing permitted cost bounds over time? If we tighten these in the future, we'll break existing CRDs. If we relax them, new CRDs that use the expanded capacity won't work on older versions.

> **Contributor:** That's my understanding.
>
> This portability issue applies to pretty much anything we change about CEL.


Based on our current data we believe the limits should be:
> **Member:** In a write request with both CEL expressions and webhooks, how many times will the CEL expressions be evaluated?

> **@liggitt (Member), Feb 3, 2022:** It's run with API server validation, so once per internal write attempt to etcd (if there's an etcd conflict that results in an internal retry, `ValidateUpdate(new, old)` runs again).

> **Member:** We don't validate before sending to webhooks?

> **@liggitt (Member), Feb 3, 2022:** Mutating admission (including webhooks) runs first, then validation, then validating admission (including webhooks), then we write to etcd. If there's an etcd conflict and the request can be retried (a patch request, or an unconditional update with no resourceVersion precondition), the whole chain runs again.

> **Member:** Thanks -- I cleared my entire cache while OOO :)


- per-expression cost: 120,000
> **Member:** Just curious, do we know the unit of this cost?

> **Contributor (author):** I believe it maps roughly to how many calls to `Interpretable.Eval` are made. Of course, different `Eval` implementations will take up slightly different amounts of time (even on the same hardware), so I don't think a completely exact one-to-one mapping between cost units and time can be made.

- per-request cost: 12,000,000
> **Member:** I think we should feed this information into APF to throttle requests accordingly; this will incentivize authors to be stingy, but also not prevent them from doing something expensive if it's truly necessary. May be worth mentioning as follow-up work.

> **Contributor (author):** Gotcha, I've amended the PR to note that APF will be taken into account for the finalized/refined bounds. I think we can definitely include APF as follow-up work.

- per-expression timeout limit: 100 milliseconds
- per-request timeout limit: 1 second
> **Member:** Should these be fixed, or tunable by cluster admins? If tunable, should we define and exercise a minimum allowable expression and request cost in conformance, to give CRD authors a portable target?

> **Contributor (author):** Portability, I think, is a big reason to keep the timeouts/cost limits fixed. Maybe in the future we could make these tunable if it turns out that sort of flexibility is something cluster admins need? It'll probably be easier to go from fixed to flexible than the other way around.

> **Contributor:** I'm inclined not to make them configurable since:
>
> - My intuition is that if a cost limit is hit, we want to incentivize developers to set `maxLength`s, not bump limits.
> - Portability is something we're already worried about, and this would potentially make it a lot worse.
> - If we peg the timeout to the request timeout (as discussed in the above comments), I don't think we have any other need to make it configurable.

> **Member:** If we use time limits, I don't think they should be tunable; it will make a CRD behave differently on different clusters.

> **Member:** Unlike time spent on webhooks, time spent evaluating CEL expressions is likely to be very CPU-intensive, so this sounds pretty high. This may require changes in e.g. APF configurations.

> **@jpbetz (Contributor), Feb 3, 2022:** If we do a runtime cost calculation in CEL like you suggested in https://github.com/kubernetes/enhancements/pull/3144/files#r798120466 and use that to bound CPU, maybe this will be less bad?


This is based on our evaluation of cost. The limits allow O(n) operations with a loop body cost of 4 (enough for a basic regex). Since basic CEL expressions have a cost of 1, this should be plenty in most cases, but `maxLength` can be set on any lists where more computation needs to be done.


### Test Plan
