
Add stable hostname to Indexed job #2630

Merged

Conversation

alculquicondor (Member)

As part of beta graduation, to support tightly coupled parallel jobs.

As presented in kubernetes/kubernetes#99497 (comment)

/sig apps

This PR builds on top of #2616 (beta graduation update with no extra features).

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Apr 15, 2021
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 15, 2021
alculquicondor (Member Author)

/assign @soltysh @wojtek-t

alculquicondor (Member Author)

cc @ahg-g @johnbelamaric

@alculquicondor alculquicondor force-pushed the indexed-job-stable-name branch from 190f20b to 5c47995 Compare April 15, 2021 16:07
@alculquicondor alculquicondor force-pushed the indexed-job-stable-name branch from 5c47995 to e15a4bc Compare April 15, 2021 16:33
patterns: (1) fronting each index with a Service or (2) creating Pods with
stable hostnames based on their index.

The problem with using a Service per index is twofold:
Member

If using a headless service, those aren't really a problem, right?

  1. you get the pod IP directly from DNS (this is what headless does in DNS)
  2. IIRC we don't program kube-proxies for headless services

Yes, a headless service with a selector creates an Endpoints object, so there is some overhead, but that doesn't seem like a very strong argument.

Member Author

Right, it depends on what type of service the user chooses.
The second point still applies though: there is one service and one endpoint object for each index, instead of one for the whole job.

But one of the intentions of these paragraphs is to show that this fragmentation leads to inefficiencies due to a poor choice of APIs.
Still, maybe I can remove the first point. WDYT?

Member

I wasn't trying to imply that having a "service per index" is a better option. I was just pointing out that the arguments below aren't fully true:

  1. Yes - I would remove the first bullet point
  2. For the second bullet, headless services aren't programmed on all nodes IIRC, so there is some overhead in creating those and programming DNS etc., but not quite as much as you describe.

Member Author

Updated

keps/sig-apps/2214-indexed-job/README.md (resolved)
We call this new Job pattern an *Indexed Job*, because each Pod of the Job
specializes to work on a particular index, as if the Pods were elements of an
array.
With the addition of a headless Service, Pods can address another Pod with a
specific index with a DNS lookup, because the index is part of the hostname.
Member

Wait - we don't program DNS based on pod hostnames, but rather based on the service name, right?
I think I'm not fully following this.


Contributor

If this is already a built-in mechanism, what are the expected changes to the job controller?

Member Author

Just adding the hostname based on the index.
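(For context, a rough sketch of the shape of that change; the helper name here is hypothetical, but Hostname is a real core/v1 PodSpec field. In the actual design, the user sets the subdomain in the pod template and creates the matching headless Service.)

```go
// Hypothetical sketch of "adding the hostname based on the index"; this is
// not the actual job controller code.
import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// setStableHostname derives the pod's hostname from the job name and the
// completion index. With a headless Service whose name matches the pod's
// subdomain, the pod resolves in DNS as $(job-name)-$(index).$(subdomain).
func setStableHostname(pod *v1.Pod, jobName string, index int) {
	pod.Spec.Hostname = fmt.Sprintf("%s-%d", jobName, index)
}
```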


Just like for existing Job patterns, workloads have to handle duplicates at the
application level.
- Scalability and latency of DNS programming.
Member

Are we going to create (headless) services for jobs or not?
If not - I don't really understand this whole point...

Member Author

We leave it to the user. But this is going to be the suggested pattern in the tutorial.
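(To make the suggested pattern concrete, a hedged sketch of a worker resolving a peer by index; the job and Service names below are made up, and the DNS name shape follows the standard hostname/subdomain pod DNS rules.)

```go
// Sketch: resolve the peer pod with completion index 3, assuming a Job named
// "myjob" whose pod template sets subdomain "myjob-svc", plus a matching
// headless Service "myjob-svc" in the same namespace. Names are illustrative.
package main

import (
	"fmt"
	"net"
)

func main() {
	peer := fmt.Sprintf("%s-%d.%s", "myjob", 3, "myjob-svc") // myjob-3.myjob-svc
	addrs, err := net.LookupHost(peer)
	if err != nil {
		// Records may not be programmed yet; callers should retry.
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("peer IPs:", addrs)
}
```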

keps/sig-apps/2214-indexed-job/README.md (resolved)
In particular, neither ordering between pods nor gang scheduling is supported.
Here, parallel means multiple pods per Job. Jobs can be:
- Embarrassingly parallel, where the pods have no dependencies between each other.
- Tightly coupled, where the Pods communicate among themselves to make progress.
Contributor

Can you link to kubernetes/kubernetes#99497 since the PR description will get lost and there's more details in that issue.

Member Author

Done

keps/sig-apps/2214-indexed-job/README.md (outdated, resolved)

replacement Pod.
<UNRESOLVED>
The recommendation for applications is to request a new DNS resolution until
the DNS server returns one IP.
Member

While we should point to the fact that a hostname may resolve to more than one IP, I am not sure about this recommendation, because I don't think applications are built that way; selecting the IP is typically handled in the Linux network stack.

@johnbelamaric wdyt?

Member Author

The OS network stack returns a set of addresses: https://golang.org/pkg/net/#LookupHost

It is the network libraries which decide what to do. In the case of the golang net package, it tries all of them until it finds one that doesn't fail: https://github.com/golang/go/blob/109d7580a5c507b1e4f460445a5c4cd7313e4aa6/src/net/dial.go#L524

I think this is enough to handle the case "two IPs returned, one for the failed pod and one for the new pod".
For the other case (more than one pod created per index), I don't have a proper alternative. But recall that the job controller will try to remove the pod that was created/started later first, so the net package's algorithm seems like a reasonable solution for this too.

In conclusion, the strategy of using the first IP that doesn't fail seems reasonable to me. And if someone needs stronger guarantees, they can write their own Dialer that fails if it sees more than one IP. WDYT?
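(A hedged sketch of that stricter option — a dialer that refuses to connect unless the name currently resolves to exactly one IP. This is illustrative only and not part of the KEP.)

```go
// dialSingle connects only if the hostname resolves to exactly one IP,
// failing otherwise so the caller can back off and retry. Illustrative.
import (
	"context"
	"fmt"
	"net"
)

func dialSingle(ctx context.Context, host, port string) (net.Conn, error) {
	addrs, err := net.DefaultResolver.LookupHost(ctx, host)
	if err != nil {
		return nil, err
	}
	if len(addrs) != 1 {
		return nil, fmt.Errorf("want exactly 1 IP for %s, got %d", host, len(addrs))
	}
	var d net.Dialer
	return d.DialContext(ctx, "tcp", net.JoinHostPort(addrs[0], port))
}
```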


alculquicondor (Member Author) — Apr 20, 2021

Something I didn't touch on is DNS caches. I don't see any indication of the net package caching DNS resolutions. So as long as users don't set up caches external to the application, they should be fine. And we can add a recommendation to not set up DNS caches for indexed jobs.

Do you have any other recommendations in this regard? I imagine users of Services face similar problems.

Member

That sounds reasonable to me; I think the key here is proper documentation to set expectations.

Member

Do you have any other recommendations in this regard? I imagine users of Services face similar problems.

Same thing: proper documentation, and stressing the point that pods are ephemeral and may change IPs when getting recreated; this can happen for StatefulSets as well.

Member

In the case of the golang net package,

You have zero guarantee that user applications are written in Go (in fact, they often aren't). Do we even know how this is solved in other languages?
[Anyway - I think that there will always be some corner cases - I'm not convinced about this recommendation.]

ahg-g (Member) — Apr 21, 2021

I read Aldo's golang comment as just an example for the sake of this discussion; the general pattern in other languages is that there are two functions, one to resolve the hostname and obtain the addresses, and one that accepts one of those addresses to establish a connection (this is the same in C/C++ socket programming). We certainly shouldn't be relying on the semantics of any specific language. But yeah, as I mentioned above, I don't think we should make the recommendation mentioned in the text; we should simply stress that pods are ephemeral and may change IPs when getting recreated, and so applications need to be tolerant of that.

Member

Exactly - that was what I was trying to push us towards. Discuss what you mentioned above but don't provide a concrete recommendation.

Member Author

The SIG Apps leads were asking for a specific recommendation.

I've changed the text to describe what happens when there are 2 pods per index. I added an explicit note that DNS caches shouldn't be used. I hope this is good enough.

Also note that I'm not sure of the circumstances that would lead to more than one pod per index. At a high level, this happens when the job controller misses a Pod creation event. But:

  • we keep in-memory tracking of the creations we issued (which we call expectations; sketched below) and we don't create or delete pods until those creations are observed.
  • let's say the controller restarts due to a crash or reboot. The new controller waits for an informer cache sync before processing jobs.

Perhaps the only scenario would be where an API request's connection drops even though the apiserver successfully created the object. In this case, the controller would consider the request a failure and wouldn't add the creation to the expectations. Are there other scenarios?
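(A simplified, illustrative sketch of the expectations mechanism described in the first bullet above; the real controller uses a shared expectations utility keyed by controller UID, so all names here are assumptions.)

```go
// Simplified sketch of the "expectations" pattern: the controller records how
// many pod creations it issued and refuses to act again until the informer
// has observed all of them. Illustrative only.
import "sync"

type expectations struct {
	mu      sync.Mutex
	pending map[string]int // job key -> issued-but-unobserved creations
}

func (e *expectations) expectCreations(jobKey string, n int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.pending[jobKey] += n
}

// creationObserved is called from the pod informer's add handler.
func (e *expectations) creationObserved(jobKey string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.pending[jobKey] > 0 {
		e.pending[jobKey]--
	}
}

// satisfied gates the sync loop: no creations or deletions happen until all
// previously issued creations have been observed.
func (e *expectations) satisfied(jobKey string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.pending[jobKey] == 0
}
```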


keps/sig-apps/2214-indexed-job/README.md (resolved)
- Scalability and latency of DNS programming.

DNS programming requires the update of EndpointSlices by the endpoint
controller and updating DNS records by the DNS provider.
Member

nit1: DNS is still not migrated to EndpointSlices IIRC
nit2: there is a separate controller for EndpointSlices:
endpoints -> endpoint controller
endpointslices -> endpoint slice controller

Member Author

I left this paragraph a bit more open-ended.

- Handle more than one IP for the CNAME. This might happen temporarily when:
- the job controller creates more than one pod per index or
- the job controller creates a replacement of a failed Pod before the DNS
provider clears the record for the failed pod. This will be uncommon
Member

Can this happen?
When we create a replacement, the previous pod should already be at least in a not-ready state, which means it won't have a corresponding entry in the EndpointSlice ready addresses (the ones for which we publish DNS records). So, given that the DNS records should be consistent (potentially stale, but consistent), it shouldn't happen, right?

Member

I think the race here is that the new pod was observed by the endpoint controller before it observed the failed one. Is that possible?

Member

We're processing incoming watch events in order, so it shouldn't happen.

Member

[Well - the handlers are processed in order - in theory a handler could be asynchronous and do something strange, but that's not the case here.]

Member Author

This possibility was raised during the SIG Apps meeting. But I agree that it shouldn't happen if the endpoint(slice) controllers are processing events in order. Removing.


the DNS server returns one IP.
</UNRESOLVED>

However, network programming is opt-in (users need to create a matching
Member

What exactly do you mean by "network programming" here? kube-proxies aren't even watching headless services. So what you mean is DNS, right?

Member

Whatever network programming a headless service triggers, which is creating and populating the endpoint object + creating DNS records. Do we exclusively use the term "network programming" to refer to kube-proxy programming iptables?

Member

Do we exclusively use the term "network programming" to refer to kube-proxy programming iptables?

I've seen different people understanding it differently, so being very explicit would help here.

Member Author

Changed to DNS programming. But note that a user could always use a ClusterIP Service, thus needing kube-proxy programming too.

@@ -259,6 +353,15 @@ The Job controller doesn't add the environment variable if there is a name
conflict with an existing environment variable. Users can specify other
environment variables for the same annotation.

<<[UNRESOLVED this deviates from the rest of the controllers ]>>
The Pod name takes the form `$(job-name)-$(index)-$(random-string)`,
Member

Why do we need to change the pod name itself (given that we're going to set the hostname)?

[Also - it contains a random suffix, so I don't see any place where it helps.
If people really want to fetch a pod for a given index, we should be doing this by label selector.]

Member

I think this is useful, especially when looking at logs, and has no downsides. So I am in favor of this change.

Member Author

Yes, this is just for the purpose of debugging. Added the notes.

Contributor

What is the plan with this unresolved item? Do we want to leave it as is or remove it? I'd prefer not to merge a KEP with an unresolved element.

Member

I personally think we can just resolve it.

I'm not a huge fan of it, but I don't think there is any strong enough reason for not doing this.

Member Author

The plan was to wait for feedback before merging. Since there don't seem to be any opinions against it, I removed the unresolved tags.
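(For illustration, a sketch of producing a `$(job-name)-$(index)-$(random-string)` name; the 5-character suffix and its alphabet mimic generate-name-style suffixes but are assumptions here.)

```go
// podName builds $(job-name)-$(index)-$(random-string). The suffix length
// and alphabet are illustrative, mimicking generateName-style suffixes.
import (
	"fmt"
	"math/rand"
)

const alphanums = "bcdfghjklmnpqrstvwxz2456789"

func podName(jobName string, index int) string {
	b := make([]byte, 5)
	for i := range b {
		b[i] = alphanums[rand.Intn(len(alphanums))]
	}
	return fmt.Sprintf("%s-%d-%s", jobName, index, b) // e.g. myjob-3-k7b2x
}
```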

keps/sig-apps/2214-indexed-job/README.md (outdated, resolved)
@@ -424,64 +528,60 @@ _This section must be completed when targeting beta graduation to a release._

* **What specific metrics should inform a rollback?**

- job_sync_duration_seconds shows significantly more latency for Indexed Jobs.
Member

Do we have a label representing job type? If so, can you make it explicit (also provide label name).

Member Author

This was already updated in the parent PR #2616 :)

Rebased.

- 99.9% of /health requests per day finish with 200 code

- per-day percentage of job_sync_total with label result=error <= 1%
- 99% percentile over day for job_sync_duration_seconds is:
Member

What exactly is "job_sync_duration_seconds" reporting?

Is it reporting the duration of a single sync in the controller? Or processing a whole job?
If the former, it should be fairly fast, right? [I'm assuming we're creating the pods themselves asynchronously?]


@alculquicondor alculquicondor force-pushed the indexed-job-stable-name branch from d855b1f to ad6308c Compare April 21, 2021 15:46
keps/sig-apps/2214-indexed-job/README.md (outdated, resolved)

Moreoever, Pods need to be prepared to:
- Retry lookups in the case were the records didn't have time to update.
- The IP for a CNAME to change in the case of a Pod failure.
Member

this sentence doesn't read well and needs to start with a verb.

Member Author

updated

Moreoever, Pods need to be prepared to:
- Retry lookups in the case were the records didn't have time to update.
- The IP for a CNAME to change in the case of a Pod failure.
- Handle more than one IP for the CNAME. This might temporarily when the job
Member

Suggested change:
- Handle more than one IP for the CNAME. This might temporarily when the job
+ Handle more than one IP for the CNAME. This might happen temporarily when the job

Member Author

Done

@alculquicondor alculquicondor force-pushed the indexed-job-stable-name branch from ad6308c to 3bd6507 Compare April 26, 2021 14:22
alculquicondor (Member Author) left a comment

Squashed



keps/sig-apps/2214-indexed-job/README.md (outdated, resolved)
ahg-g (Member) commented Apr 26, 2021

Latest changes look good to me.

soltysh (Contributor) left a comment

/approve
There are some minor comments which we can resolve and get this merged by EOW.

Creating Pods with stable hostnames mitigates this problem. The control plane
requires only one Service and one Endpoint (or a few EndpointSlices) to inform
the DNS programming. Pods can address each other with a DNS lookup and
communicate directly using Pod IPs.
Contributor

I really like this argument; it wasn't here the last time I read this.

Member Author

it was not as clear before :)


@@ -15,12 +15,12 @@ approvers:
- "@kow3ns"
Contributor

You can add me now here ;)

Member Author

:D

wojtek-t (Member) left a comment

Please rebase


@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 27, 2021
as part of Beta graduation. Also adds the index as part of the host name, to ease debugging.

Signed-off-by: Aldo Culquicondor <acondor@google.com>
@alculquicondor alculquicondor force-pushed the indexed-job-stable-name branch from 3bd6507 to b851fe4 Compare April 27, 2021 17:28
k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 27, 2021
Copy link
Member Author

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also changed the metric's labels from mode to completion_mode as suggested in the parent PR that was already merged.


ahg-g (Member) commented Apr 27, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Apr 27, 2021
@k8s-ci-robot k8s-ci-robot merged commit 0406326 into kubernetes:master Apr 27, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.22 milestone Apr 27, 2021
wojtek-t (Member)

LGTM (for posterity) [for both the KEP itself and the PRR]
