Add stable hostname to Indexed job #2630
Conversation
patterns: (1) fronting each index with a Service or (2) creating Pods with
stable hostnames based on their index.

The problem with using a Service per index is twofold:
If using a headless service, those aren't really a problem, right?
- you can get the pod IP directly from DNS (this is what headless does in DNS)
- IIRC we don't program kube-proxies for headless services
Yes, a headless service with a selector creates an Endpoints object, so there is some overhead, but that doesn't seem to be a very important argument.
Right, it depends on what type of service the user chooses.
The second point still applies though: there is one service and one endpoint object for each index, instead of one for the job.
But one of the intentions of these paragraphs is to show that the fragmentation leads to inefficiencies due to a poor choice of APIs.
Still, maybe I can remove the first point. WDYT?
I wasn't trying to imply that having a "service per index" is a better option. I was just pointing out that the arguments below aren't fully true:
- Yes - I would remove the first bullet point
- For the second bullet, headless services aren't programmed on all nodes IIRC, so there is some overhead in creating those and programming DNS etc., but not exactly as you describe.
Updated
We call this new Job pattern an *Indexed Job*, because each Pod of the Job
specializes to work on a particular index, as if the Pods where elements of an
array.
With the addition of a headless Service, Pods can address another Pod with a
specific index with a DNS lookup, because the index is part of the hostname.
Wait - we don't program DNS based on Pod hostnames, but rather based on the Service name, right?
I think I'm not fully following this.
If this is already a built-in mechanism, what are the expected changes to the job controller?
just adding the hostname based on the index.
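As an illustration only, here is a minimal Go sketch of what "adding the hostname based on the index" could look like when a controller builds a Pod from the Job's template. The `Hostname`/`Subdomain` fields are from the core/v1 API; `my-job` and `my-headless-svc` are made-up placeholder names, not part of the proposal.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podForIndex sketches how a controller could derive a stable hostname from a
// completion index when building a Pod from the Job's template.
func podForIndex(jobName, serviceName string, index int, template corev1.PodTemplateSpec) *corev1.Pod {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			// GenerateName keeps the usual random suffix for uniqueness.
			GenerateName: fmt.Sprintf("%s-%d-", jobName, index),
			Labels:       template.Labels,
			Annotations:  template.Annotations,
		},
		Spec: *template.Spec.DeepCopy(),
	}
	// The stable hostname encodes the index. With Subdomain set to a matching
	// headless Service, the Pod becomes resolvable as
	// <hostname>.<subdomain>.<namespace>.svc.<cluster-domain>.
	pod.Spec.Hostname = fmt.Sprintf("%s-%d", jobName, index)
	pod.Spec.Subdomain = serviceName
	return pod
}

func main() {
	pod := podForIndex("my-job", "my-headless-svc", 3, corev1.PodTemplateSpec{})
	fmt.Println(pod.Spec.Hostname) // my-job-3
}
```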
Just like for existing Job patterns, workloads have to handle duplicates at the
application level.

- Scalability and latency of DNS programming.
Are we going to create (headless) services for jobs or not?
If not - I don't really understand this whole point...
We leave it to the user. But this is going to be the suggested pattern in the tutorial.
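A rough sketch of the user-created half of that suggested pattern, assuming a single headless Service selecting the Job's Pods; the names and the `job-name=my-job` selector are illustrative placeholders.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// One headless Service (ClusterIP: None) for the whole Job; per-index
	// Services are not needed because DNS records are derived from the Pods'
	// stable hostnames plus this single Service acting as the subdomain.
	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "my-headless-svc"},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone,
			Selector:  map[string]string{"job-name": "my-job"},
		},
	}
	fmt.Printf("%s selects %v\n", svc.Name, svc.Spec.Selector)
}
```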
In particular, neither ordering between pods nor gang scheduling are supported.
Here, parallel means multiple pods per Job. Jobs can be:
- Embarrassingly parallel, where the pods have no dependencies between each other.
- Tightly coupled, where the Pods communicate among themselves to make progress.
Can you link to kubernetes/kubernetes#99497, since the PR description will get lost and there are more details in that issue.
Done
replacement Pod.
<UNRESOLVED>
The recommendation for applications is to request a new DNS resolution until
the DNS server returns one IP.
While we should point to the fact that a hostname may resolve to more than one IP, I am not sure about this recommendation, because I don't think applications are built that way; selecting the IP is typically handled in the Linux network stack.
@johnbelamaric wdyt?
The OS network stack returns a set of addresses (https://golang.org/pkg/net/#LookupHost).
It is the network libraries that decide what to do. In the case of the golang net
package, it tries all of them until it finds one that doesn't fail: https://github.com/golang/go/blob/109d7580a5c507b1e4f460445a5c4cd7313e4aa6/src/net/dial.go#L524
I think this is enough to handle the case "two IPs returned, one for a failed pod and one for the new pod".
For the other case (more than one pod created per index), I don't have a proper alternative. But recall that the job controller will try to remove the pod that was created/started later first. So the net package's algorithm seems like a reasonable solution for this too.
In conclusion, the strategy of using the first IP that doesn't fail seems reasonable to me. And if someone needs stronger guarantees, they can write their own Dialer that fails if it sees more than one IP. WDYT?
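To make the strategy discussed here concrete, a small sketch (the DNS name and port are placeholders) of resolving a per-index hostname and trying the returned addresses in order, similar in spirit to what the net package's dialer does internally:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialFirstHealthy resolves the hostname and tries each returned address in
// order, returning the first connection that succeeds. This mirrors the
// "use the first IP that doesn't fail" strategy discussed above.
func dialFirstHealthy(host, port string) (net.Conn, error) {
	addrs, err := net.LookupHost(host)
	if err != nil {
		return nil, err
	}
	var lastErr error
	for _, addr := range addrs {
		conn, err := net.DialTimeout("tcp", net.JoinHostPort(addr, port), 2*time.Second)
		if err == nil {
			return conn, nil
		}
		lastErr = err
	}
	return nil, fmt.Errorf("no reachable address for %s: %w", host, lastErr)
}

func main() {
	// "my-job-3.my-headless-svc" is a placeholder DNS name for the index-3 Pod.
	conn, err := dialFirstHealthy("my-job-3.my-headless-svc", "8080")
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```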
Something I didn't touch on is DNS caches. I don't see any indication of the net
package caching DNS resolutions. So as long as users don't set up caches external to the application, they should be fine. And we can add a recommendation not to set up DNS caches for indexed jobs.
Do you have any other recommendations in this regard? I imagine users of Services face similar problems.
That sounds reasonable to me; I think the key here is proper documentation to set expectations.
> Do you have any other recommendations in this regard? I imagine users of Services face similar problems.

Same thing: proper documentation and stressing the point that pods are ephemeral and may change IPs when getting recreated; this could happen for StatefulSets as well.
> In the case of the golang net package,

You have zero guarantee that user applications are written in Go (in fact they often aren't). Do we even know how this is solved in other languages?
[Anyway - I think that there will always be some corner cases - I'm not convinced about this recommendation.]
I read Aldo's golang comment as just an example for the sake of this discussion; the general pattern in other languages is that there are two functions, one to resolve the hostname and obtain the addresses, and one that accepts one of those addresses to establish a connection (this is the same in C/C++ socket programming). We certainly shouldn't be relying on the semantics of any specific language. But yeah, as I mentioned above, I don't think we should make the recommendation mentioned in the text; we should simply stress that pods are ephemeral and may change IPs when getting recreated, so applications need to be tolerant of that.
Exactly - that was what I was trying to push us towards. Discuss what you mentioned above but don't provide a concrete recommendation.
The SIG Apps leads were asking for a specific recommendation.
I've changed the text to describe what happens when there are 2 pods per index. I added the explicit note that DNS caches shouldn't be used. I hope this is good enough.
Also note that I'm not sure of the circumstances that would lead to more than one pod per index. At a high level, this happens when the job controller misses a Pod creation event. But:
- we keep in-memory tracking of the creations issued (which we call expectations) and we don't create or delete pods until those creations are observed.
- let's say the controller restarts due to a crash or reboot. The new controller waits for an informer cache sync before processing jobs.
Perhaps the only scenario would be where an API request's connection drops, but the apiserver successfully created the object. In this case, the controller would consider this a failure and wouldn't add the creation to the expectations. Are there other scenarios?
- Scalability and latency of DNS programming.

DNS programming requires the update of EndpointSlices by the endpoint
controller and updating DNS records by the DNS provider.
nit1: DNS is still not migrated to EndpointSlices IIRC
nit2: there is a separate controller for EndpointSlices:
endpoints -> endpoint controller
endpointslices -> endpoint slice controller
I left this paragraph a bit more open-ended.
- Handle more than one IP for the CNAME. This might happen temporarily when:
  - the job controller creates more than one pod per index or
  - the job controller creates a replacement of a failed Pod before the DNS
    provider clears the record for the failed pod. This will be uncommon
Can this happen?
When we create the replacement, the previous pod should already be at least in a not-ready state, which means it won't have a corresponding entry in the EndpointSlice ready addresses (the ones for which we publish DNS records). So given that the DNS records should be consistent (potentially stale, but consistent), it shouldn't happen, right?
I think the race here is that the new pod was observed by the endpoint controller before it observed the failed one. Is that possible?
We're processing incoming watch events in order, so it shouldn't happen.
[Well - the handlers are processed in order - in theory a handler can be asynchronous and do something strange, but that's not the case here.]
This possibility was raised during the SIG apps meeting. But I agree that it shouldn't happen if the endpoint(slice) controllers are processing events in order. Removing.
the DNS server returns one IP.
</UNRESOLVED>

However, network programming is opt-in (users need to create a matching
What exactly do you mean by "network programming" here? kube-proxies aren't even watching headless services. So what you mean is DNS, right?
Whatever network programming a headless service triggers, which is creating and populating the endpoint object + creating DNS records. Do we exclusively use the term "network programming" to refer to kube-proxy programming iptables?
> Do we exclusively use the term "network programming" to refer to kube-proxy programming iptables?

I've seen different people understand it differently, so being very explicit would help here.
changed to DNS programming. But note that a user could always use a clusterIP service, thus needing kube-proxy programming too.
@@ -259,6 +353,15 @@ The Job controller doesn't add the environment variable if there is a name
conflict with an existing environment variable. Users can specify other
environment variables for the same annotation.

<<[UNRESOLVED this deviates from the rest of the controllers ]>>
The Pod name takes the form `$(job-name)-$(index)-$(random-string)`,
Why do we need to change the pod name itself (given that we're going to set the hostname)?
[Also - it contains a random suffix, so I don't see any place where it helps.
If people really want to fetch a pod for a given index, we should be doing this by label selector.]
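For reference, a minimal client-go sketch of the label-selector approach; the kubeconfig loading, namespace, and `job-name=my-job` selector are illustrative assumptions, not part of the proposal.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig in the default location; error handling kept minimal.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List the Job's Pods by label instead of guessing Pod names; the index can
	// then be read from each Pod (for example, from its hostname).
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "job-name=my-job"})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Println(p.Name, p.Spec.Hostname)
	}
}
```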
I think this is useful, especially when looking at logs, and has no downsides. So I am in favor of this change.
Yes, this is just for the purpose of debugging. Added the notes.
What is the plan with this unresolved item? Do we want to leave it as is or remove it? I'd prefer not to merge a KEP with an unresolved element.
I personally think we can just resolve it.
I'm not a huge fan of it, but I don't think there is a strong enough reason for not doing this.
The plan was to wait for feedback before merging. Since there doesn't seem to be opinions against it, I removed the unresolved tags.
@@ -424,64 +528,60 @@ _This section must be completed when targeting beta graduation to a release._

* **What specific metrics should inform a rollback?**

- job_sync_duration_seconds shows significantly more latency for Indexed Jobs.
Do we have a label representing the job type? If so, can you make it explicit (and also provide the label name)?
this was already updated in the parent PR #2616 :)
Rebased.
- 99,9% of /health requests per day finish with 200 code

- per-day percentage of job_sync_total with label result=error <= 1%
- 99% percentile over day for job_sync_duration_seconds is:
What exactly "job_sync_duration_seconds" is reporting?
Is it reporting the duration of a single "sync in a controller"? Or processing a job?
If the former, it should be fairly fast right? [I'm assuming we're creating pods themselves asynchronously?]
let's discuss this in https://github.com/kubernetes/enhancements/pull/2616/files#r615840979
Moreoever, Pods need to be prepared to:
- Retry lookups in the case were the records didn't have time to update.
- The IP for a CNAME to change in the case of a Pod failure.
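As a hedged illustration of the "retry lookups" item quoted above, a small sketch in Go; the DNS name, attempt count, and backoff are arbitrary choices for the example.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// lookupWithRetry retries a DNS lookup until records appear, to tolerate the
// window between Pod creation and the DNS provider publishing its record.
func lookupWithRetry(host string, attempts int, backoff time.Duration) ([]string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		addrs, err := net.LookupHost(host)
		if err == nil && len(addrs) > 0 {
			return addrs, nil
		}
		lastErr = err
		time.Sleep(backoff)
	}
	return nil, fmt.Errorf("no records for %s after %d attempts: %v", host, attempts, lastErr)
}

func main() {
	// "my-job-0.my-headless-svc" is a placeholder DNS name for the index-0 Pod.
	addrs, err := lookupWithRetry("my-job-0.my-headless-svc", 10, 2*time.Second)
	fmt.Println(addrs, err)
}
```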
this sentence doesn't read well and needs to start with a verb.
updated
Moreoever, Pods need to be prepared to:
- Retry lookups in the case were the records didn't have time to update.
- The IP for a CNAME to change in the case of a Pod failure.
- Handle more than one IP for the CNAME. This might temporarily when the job
- Handle more than one IP for the CNAME. This might temporarily when the job
- Handle more than one IP for the CNAME. This might happen temporarily when the job |
Done
Squashed
Latest changes look good to me.
/approve
There are some minor comments which we can solve and get this merged by eow.
Creating Pods with stable hostnames mitigates this problem. The control plane
requires only one Service and one Endpoint (or a few EndpointSlices) to inform
the DNS programming. Pods can address each other with a DNS lookup and
communicate directly using Pod IPs.
I really like this argument; it wasn't here the last time I read this.
it was not as clear before :)
@@ -15,12 +15,12 @@ approvers:
- "@kow3ns"
You can add me now here ;)
:D
Please rebase
as part of Beta graduation. Also adds the index as part of the host name, to ease debugging. Signed-off-by: Aldo Culquicondor <acondor@google.com>
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, soltysh
I also changed the metric's label from `mode` to `completion_mode`, as suggested in the parent PR that was already merged.
/lgtm
LGTM (for posterity) [for both the KEP itself and the PRR]
As part of beta graduation, to support tightly coupled parallel jobs.
As presented in kubernetes/kubernetes#99497 (comment)
/sig apps
This PR builds on top of #2616 (beta graduation update with no extra features).