update DNS programming latency SLI #7756
Conversation
@@ -37,6 +39,10 @@ The reason for doing it this way is feasibility for efficiently computing that:
in 99% of programmers (e.g. iptables). That requires tracking metrics on
per-change base (which we can't do efficiently).

- The SLI is expected to remain constant independently of the number of records, per
What's the implication of this? The scheduler has finite throughput. Nodes have finite bandwidth.
If I start a 5-pod headless service, it's reasonable to expect that DNS for the 5th pod lands very soon after DNS for the 1st:
t0: scale RS to 5
t1: RS controller creates pod 1
t2: scheduler schedules pod 1
t3: kubelet downloads image
t4: kubelet runs pod
t5: runtime assigns an IP
t6: kubelet reports the IP
t7: endpointslice controller observes IP and updates EPSlices
t8: DNS observes EPSlices and updates DNS
t1 through t8 happen 5 times, roughly concurrently, and the total is likely bounded by image download time.
Change that to 5000 and now you are bounded by the scheduler's throughput. Is DNS not allowed to publish the 1st IP until the last pod is started? How does it know which one is last?
Edit:
Or did you mean something like "The time between pod-started-and-IP-assigned and available-in-DNS should not be significantly different for the 1st vs. last pod"? That must be what you meant...
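(For concreteness, a minimal sketch of how that per-pod publish latency could be probed from a client; the service name `workers.default.svc.cluster.local`, the pod IP, and the polling interval are all hypothetical and not part of the proposal.)

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// pollUntilPublished polls a headless service's DNS name until wantIP shows
// up among its A/AAAA records, and returns how long publication took.
func pollUntilPublished(host, wantIP string, interval time.Duration) time.Duration {
	start := time.Now()
	for {
		ips, err := net.LookupHost(host) // resolves all A/AAAA records for host
		if err == nil {
			for _, ip := range ips {
				if ip == wantIP {
					return time.Since(start)
				}
			}
		}
		time.Sleep(interval)
	}
}

func main() {
	// Measure "pod-has-IP" to "IP visible in DNS" for one pod; under the
	// clarified SLI wording, running this for the 1st and the 5000th pod of
	// the same headless service should yield comparable numbers.
	lat := pollUntilPublished("workers.default.svc.cluster.local", "10.0.0.42", 250*time.Millisecond)
	fmt.Println("DNS publish latency:", lat)
}
```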
"The time between pod-started-and-IP-assigned and availble-in-DNS should not be significantly different for the 1st vs. last pod" ? That must be what you meant...
this, if you give me the right sentence in english so this is more clear please add it as a suggestion
Co-authored-by: Tim Hockin <thockin@google.com>
LGTM modulo the nit that is failing presubmit
@@ -37,6 +39,12 @@ The reason for doing it this way is feasibility for efficiently computing that:
in 99% of programmers (e.g. iptables). That requires tracking metrics on
per-change base (which we can't do efficiently).

- The SLI for DNS publishing should remain constant independent of the number of records.
For example, in a headless service with thousands of pods the time between the pod being
assigned an IP and the time DNS makes that IP availabe in the service's A/AAAA record(s)
nit: available
[but it's failing presubmit]
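(As an aside for readers new to the term: a headless service is one whose ClusterIP is "None", so DNS answers with one A/AAAA record per ready endpoint rather than a single virtual IP. A minimal sketch using the client-go API types, with purely illustrative names:)

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Headless: ClusterIP "None" means no virtual IP is allocated, so the
	// DNS record set for workers.default.svc grows with the number of pods
	// matching the selector.
	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "workers", Namespace: "default"},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone,
			Selector:  map[string]string{"app": "workers"},
			Ports:     []corev1.ServicePort{{Name: "http", Port: 80}},
		},
	}
	fmt.Printf("headless=%v selector=%v\n", svc.Spec.ClusterIP == corev1.ClusterIPNone, svc.Spec.Selector)
}
```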
/lgtm
/approve
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: aojea, thockin, wojtek-t. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Update the SLI to reflect the DNS latency expectations for headless services, which have a high impact on AI/ML workloads that make heavy use of headless services and DNS. See kubeflow/mpi-operator#611 (comment).