Add stable hostname to Indexed job #2630
Conversation
patterns: (1) fronting each index with a Service or (2) creating Pods with
stable hostnames based on their index.

The problem with using a Service per index is twofold:
If using a headless service, those aren't really a problem, right?
- you can get the pod IP directly from DNS (this is what headless does in DNS)
- IIRC we don't program kube-proxies for headless services
Yes, a headless service with a selector creates an Endpoints object, so there is some overhead, but that doesn't seem to be a very important argument.
Right, it depends on what type of service the user chooses.
The second point still applies though: there is one service and one endpoint object for each index, instead of one for the job.
But one of the intentions of these paragraphs is to show that the fragmentation leads to inefficiencies due to a poor choice of APIs.
Still, maybe I can remove the first point. WDYT?
I wasn't trying to imply that having a "service per index" is a better option. I was just pointing out that the arguments below aren't fully true:
- Yes - I would remove the first bullet point
- For the second bullet, headless services aren't programmed on all nodes IIRC, so there is some overhead in creating those and programming DNS etc., but not exactly as you describe.
Updated
We call this new Job pattern an *Indexed Job*, because each Pod of the Job
specializes to work on a particular index, as if the Pods where elements of an
array.
With the addition of a headless Service, Pods can address another Pod with a
specific index with a DNS lookup, because the index is part of the hostname.
Wait - we don't program DNS based on Pod hostnames, but rather based on the Service name, right?
I think I'm not fully following this.
If this is already a built-in mechanism, what are the expected changes to the job controller?
just adding the hostname based on the index.
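As an illustration only, here is a minimal Go sketch of what "adding the hostname based on the index" could look like when a controller builds a Pod from the Job's template. The `Hostname`/`Subdomain` fields are from the core/v1 API; `my-job` and `my-headless-svc` are made-up placeholder names, not part of the proposal.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podForIndex sketches how a controller could derive a stable hostname from a
// completion index when building a Pod from the Job's template.
func podForIndex(jobName, serviceName string, index int, template corev1.PodTemplateSpec) *corev1.Pod {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			// GenerateName keeps the usual random suffix for uniqueness.
			GenerateName: fmt.Sprintf("%s-%d-", jobName, index),
			Labels:       template.Labels,
			Annotations:  template.Annotations,
		},
		Spec: *template.Spec.DeepCopy(),
	}
	// The stable hostname encodes the index. With Subdomain set to a matching
	// headless Service, the Pod becomes resolvable as
	// <hostname>.<subdomain>.<namespace>.svc.<cluster-domain>.
	pod.Spec.Hostname = fmt.Sprintf("%s-%d", jobName, index)
	pod.Spec.Subdomain = serviceName
	return pod
}

func main() {
	pod := podForIndex("my-job", "my-headless-svc", 3, corev1.PodTemplateSpec{})
	fmt.Println(pod.Spec.Hostname) // my-job-3
}
```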
Just like for existing Job patterns, workloads have to handle duplicates at the
application level.

- Scalability and latency of DNS programming.
Are we going to create (headless) services for jobs or not?
If not - I don't really understand this whole point...
We leave it to the user. But this is going to be the suggested pattern in the tutorial.
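A rough sketch of the user-created half of that suggested pattern, assuming a single headless Service selecting the Job's Pods; the names and the `job-name=my-job` selector are illustrative placeholders.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// One headless Service (ClusterIP: None) for the whole Job; per-index
	// Services are not needed because DNS records are derived from the Pods'
	// stable hostnames plus this single Service acting as the subdomain.
	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "my-headless-svc"},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone,
			Selector:  map[string]string{"job-name": "my-job"},
		},
	}
	fmt.Printf("%s selects %v\n", svc.Name, svc.Spec.Selector)
}
```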
In particular, neither ordering between pods nor gang scheduling are supported.
Here, parallel means multiple pods per Job. Jobs can be:
- Embarrassingly parallel, where the pods have no dependencies between each other.
- Tightly coupled, where the Pods communicate among themselves to make progress.
Can you link to kubernetes/kubernetes#99497, since the PR description will get lost and there are more details in that issue.
Done
replacement Pod.
<UNRESOLVED>
The recommendation for applications is to request a new DNS resolution until
the DNS server returns one IP.
While we should point to the fact that a hostname may resolve to more than one IP, I am not sure about this recommendation, because I don't think applications are built that way; selecting the IP is typically handled in the Linux network stack.
@johnbelamaric wdyt?
The OS network stack returns a set of addresses (https://golang.org/pkg/net/#LookupHost).
It is the network libraries that decide what to do. In the case of the golang net
package, it tries all of them until it finds one that doesn't fail: https://github.com/golang/go/blob/109d7580a5c507b1e4f460445a5c4cd7313e4aa6/src/net/dial.go#L524
I think this is enough to handle the case "two IPs returned, one for a failed pod and one for the new pod".
For the other case (more than one pod created per index), I don't have a proper alternative. But recall that the job controller will try to remove the pod that was created/started later first. So the net package's algorithm seems like a reasonable solution for this too.
In conclusion, the strategy of using the first IP that doesn't fail seems reasonable to me. And if someone needs stronger guarantees, they can write their own Dialer that fails if it sees more than one IP. WDYT?
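To make the strategy discussed here concrete, a small sketch (the DNS name and port are placeholders) of resolving a per-index hostname and trying the returned addresses in order, similar in spirit to what the net package's dialer does internally:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialFirstHealthy resolves the hostname and tries each returned address in
// order, returning the first connection that succeeds. This mirrors the
// "use the first IP that doesn't fail" strategy discussed above.
func dialFirstHealthy(host, port string) (net.Conn, error) {
	addrs, err := net.LookupHost(host)
	if err != nil {
		return nil, err
	}
	var lastErr error
	for _, addr := range addrs {
		conn, err := net.DialTimeout("tcp", net.JoinHostPort(addr, port), 2*time.Second)
		if err == nil {
			return conn, nil
		}
		lastErr = err
	}
	return nil, fmt.Errorf("no reachable address for %s: %w", host, lastErr)
}

func main() {
	// "my-job-3.my-headless-svc" is a placeholder DNS name for the index-3 Pod.
	conn, err := dialFirstHealthy("my-job-3.my-headless-svc", "8080")
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```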
Something I didn't touch on is DNS caches. I don't see any indication of the net
package caching DNS resolutions. So as long as users don't set up caches external to the application, they should be fine. And we can add a recommendation not to set up DNS caches for indexed jobs.
Do you have any other recommendations in this regard? I imagine users of Services face similar problems.
That sounds reasonable to me; I think the key here is proper documentation to set expectations.
> Do you have any other recommendations in this regard? I imagine users of Services face similar problems.

Same thing: proper documentation and stressing the point that pods are ephemeral and may change IPs when getting recreated; this could happen for StatefulSets as well.
> In the case of the golang net package,

You have zero guarantee that user applications are written in Go (in fact they often aren't). Do we even know how this is solved in other languages?
[Anyway - I think that there will always be some corner cases - I'm not convinced about this recommendation.]
I read Aldo's golang comment as just an example for the sake of this discussion; the general pattern in other languages is that there are two functions, one to resolve the hostname and obtain the addresses, and one that accepts one of those addresses to establish a connection (this is the same in C/C++ socket programming). We certainly shouldn't be relying on the semantics of any specific language. But yeah, as I mentioned above, I don't think we should make the recommendation mentioned in the text; we should simply stress that pods are ephemeral and may change IPs when getting recreated, so applications need to be tolerant of that.
Exactly - that was what I was trying to push us towards. Discuss what you mentioned above but don't provide a concrete recommendation.
The SIG Apps leads were asking for a specific recommendation.
I've changed the text to describe what happens when there are 2 pods per index. I added the explicit note that DNS caches shouldn't be used. I hope this is good enough.
Also note that I'm not sure of the circumstances that would lead to more than one pod per index. At a high level, this happens when the job controller misses a Pod creation event. But:
- we keep in-memory tracking of the creations issued (which we call expectations) and we don't create or delete pods until those creations are observed.
- let's say the controller restarts due to a crash or reboot. The new controller waits for an informer cache sync before processing jobs.
Perhaps the only scenario would be where an API request's connection drops, but the apiserver successfully created the object. In this case, the controller would consider this a failure and wouldn't add the creation to the expectations. Are there other scenarios?
- Scalability and latency of DNS programming.

DNS programming requires the update of EndpointSlices by the endpoint
controller and updating DNS records by the DNS provider.
nit1: DNS is still not migrated to EndpointSlices IIRC
nit2: there is a separate controller for EndpointSlices:
endpoints -> endpoint controller
endpointslices -> endpoint slice controller
I left this paragraph a bit more open-ended.
- Handle more than one IP for the CNAME. This might happen temporarily when:
  - the job controller creates more than one pod per index or
  - the job controller creates a replacement of a failed Pod before the DNS
    provider clears the record for the failed pod. This will be uncommon
Can this happen?
When we create the replacement, the previous pod should already be at least in a not-ready state, which means it won't have a corresponding entry in the EndpointSlice ready addresses (the ones for which we publish DNS records). So given that the DNS records should be consistent (potentially stale, but consistent), it shouldn't happen, right?
I think the race here is that the new pod was observed by the endpoint controller before it observed the failed one. Is that possible?
We're processing incoming watch events in order, so it shouldn't happen.
[Well - the handlers are processed in order - in theory a handler can be asynchronous and do something strange, but that's not the case here.]
This possibility was raised during the SIG apps meeting. But I agree that it shouldn't happen if the endpoint(slice) controllers are processing events in order. Removing.
the DNS server returns one IP.
</UNRESOLVED>

However, network programming is opt-in (users need to create a matching
What exactly do you mean by "network programming" here? kube-proxies aren't even watching headless services. So what you mean is DNS, right?
Whatever network programming a headless service triggers, which is creating and populating the endpoint object + creating DNS records. Do we exclusively use the term "network programming" to refer to kube-proxy programming iptables?
> Do we exclusively use the term "network programming" to refer to kube-proxy programming iptables?

I've seen different people understand it differently, so being very explicit would help here.
changed to DNS programming. But note that a user could always use a clusterIP service, thus needing kube-proxy programming too.
@@ -259,6 +353,15 @@ The Job controller doesn't add the environment variable if there is a name
conflict with an existing environment variable. Users can specify other
environment variables for the same annotation.

<<[UNRESOLVED this deviates from the rest of the controllers ]>>
The Pod name takes the form `$(job-name)-$(index)-$(random-string)`,
Why do we need to change the pod name itself (given that we're going to set the hostname)?
[Also - it contains a random suffix, so I don't see any place where it helps.
If people really want to fetch a pod for a given index, we should be doing this by label selector.]
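For reference, a minimal client-go sketch of the label-selector approach; the kubeconfig loading, namespace, and `job-name=my-job` selector are illustrative assumptions, not part of the proposal.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig in the default location; error handling kept minimal.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List the Job's Pods by label instead of guessing Pod names; the index can
	// then be read from each Pod (for example, from its hostname).
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "job-name=my-job"})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Println(p.Name, p.Spec.Hostname)
	}
}
```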
I think this is useful, especially when looking at logs, and has no downsides. So I am in favor of this change.
Yes, this is just for the purpose of debugging. Added the notes.
What is the plan with this unresolved item? Do we want to leave it as is or remove it? I'd prefer not to merge a KEP with an unresolved element.
I personally think we can just resolve it.
I'm not a huge fan of it, but I don't think there is a strong enough reason for not doing this.
The plan was to wait for feedback before merging. Since there doesn't seem to be opinions against it, I removed the unresolved tags.
@@ -424,64 +528,60 @@ _This section must be completed when targeting beta graduation to a release._

* **What specific metrics should inform a rollback?**

- job_sync_duration_seconds shows significantly more latency for Indexed Jobs.
Do we have a label representing the job type? If so, can you make it explicit (and also provide the label name)?
this was already updated in the parent PR #2616 :)
Rebased.
- 99,9% of /health requests per day finish with 200 code

- per-day percentage of job_sync_total with label result=error <= 1%
- 99% percentile over day for job_sync_duration_seconds is:
What exactly "job_sync_duration_seconds" is reporting?
Is it reporting the duration of a single "sync in a controller"? Or processing a job?
If the former, it should be fairly fast right? [I'm assuming we're creating pods themselves asynchronously?]
let's discuss this in https://github.com/kubernetes/enhancements/pull/2616/files#r615840979
Moreoever, Pods need to be prepared to:
- Retry lookups in the case were the records didn't have time to update.
- The IP for a CNAME to change in the case of a Pod failure.
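As a hedged illustration of the "retry lookups" item quoted above, a small sketch in Go; the DNS name, attempt count, and backoff are arbitrary choices for the example.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// lookupWithRetry retries a DNS lookup until records appear, to tolerate the
// window between Pod creation and the DNS provider publishing its record.
func lookupWithRetry(host string, attempts int, backoff time.Duration) ([]string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		addrs, err := net.LookupHost(host)
		if err == nil && len(addrs) > 0 {
			return addrs, nil
		}
		lastErr = err
		time.Sleep(backoff)
	}
	return nil, fmt.Errorf("no records for %s after %d attempts: %v", host, attempts, lastErr)
}

func main() {
	// "my-job-0.my-headless-svc" is a placeholder DNS name for the index-0 Pod.
	addrs, err := lookupWithRetry("my-job-0.my-headless-svc", 10, 2*time.Second)
	fmt.Println(addrs, err)
}
```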
this sentence doesn't read well and needs to start with a verb.
updated
Moreoever, Pods need to be prepared to:
- Retry lookups in the case were the records didn't have time to update.
- The IP for a CNAME to change in the case of a Pod failure.
- Handle more than one IP for the CNAME. This might temporarily when the job
- Handle more than one IP for the CNAME. This might temporarily when the job
- Handle more than one IP for the CNAME. This might happen temporarily when the job |
Done
Squashed
Latest changes look good to me.
/approve
There are some minor comments which we can solve and get this merged by eow.
Creating Pods with stable hostnames mitigates this problem. The control plane
requires only one Service and one Endpoint (or a few EndpointSlices) to inform
the DNS programming. Pods can address each other with a DNS lookup and
communicate directly using Pod IPs.
I really like this argument; it wasn't here the last time I read this.
it was not as clear before :)
@@ -15,12 +15,12 @@ approvers:
- "@kow3ns"
You can add me now here ;)
:D
Please rebase
as part of Beta graduation. Also adds the index as part of the host name, to ease debugging. Signed-off-by: Aldo Culquicondor <acondor@google.com>
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, soltysh
I also changed the metric's label from `mode` to `completion_mode`, as suggested in the parent PR that was already merged.
/lgtm
LGTM (for posterity) [for both the KEP itself and the PRR]
As part of beta graduation, to support tightly coupled parallel jobs.
As presented in kubernetes/kubernetes#99497 (comment)
/sig apps
This PR builds on top of #2616 (beta graduation update with no extra features).