Ensure k8s pod names/labels are RFC 1123 compliant #3639

Open · rjmello wants to merge 4 commits into master from rjmello-kube-pod-names

Conversation

@rjmello (Member) commented on Oct 17, 2024:

Description

  • Modified Kubernetes pod names and labels to conform to RFC 1123 for DNS subdomain names and labels, ensuring compliance with Kubernetes naming conventions.
  • Modified KubernetesProvider.submit() to return an eight-character hex value as the job ID instead of the pod name.
  • Replaced the trailing timestamp in the pod name with the job ID to improve collision avoidance (see the sketch after this list).
  • Replaced app pod label with parsl-job-id.
  • Updated container name to use job ID.
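
A minimal sketch of the naming scheme described above, as one way to read these bullets; the `parsl-auto` prefix, the separator, and the helper-free composition are assumptions for illustration, not the PR's actual code:

```
import uuid

# Hypothetical illustration only: submit() hands back a short hex job ID,
# and the pod name ends with that ID instead of a timestamp.
job_id = uuid.uuid4().hex[:8]       # e.g. "1a2b3c4d"
pod_name = f"parsl-auto-{job_id}"   # must remain a valid RFC 1123 DNS name
print(job_id, pod_name)
```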

Type of change

  • Bug fix
  • New feature

@rjmello force-pushed the rjmello-kube-pod-names branch 4 times, most recently from 2fb44a7 to 1987f00, on October 17, 2024 at 22:28
@benclifford (Collaborator) commented:

This has failed CI in the drain test a few times on Python 3.12, in what should be an unrelated area. I'll dig into that a bit more and try to understand whether it's a drain bug that has just started to show up by unrelated chance, or whether there's something going on with this PR (I suspect not).

@benclifford (Collaborator) commented:

> This has failed CI in the drain test a few times on Python 3.12, in what should be an unrelated area. I'll dig into that a bit more and try to understand whether it's a drain bug that has just started to show up by unrelated chance, or whether there's something going on with this PR (I suspect not).

I have recreated this a few times in PR #3640 independent of this PR #3639, so that test failure should not count against this PR.

@rjmello force-pushed the rjmello-kube-pod-names branch 3 times, most recently from 5f0cb77 to 4d099d6, on October 18, 2024 at 13:34
benclifford added a commit that referenced this pull request Oct 19, 2024
…es (#3640)

This PR reduces the load placed on the interchange, and on the whole test environment, by repeated queries to the interchange for connected managers. It does this by increasing the period between such requests from the default of every 20 ms to every 100 ms.

In the last few days, test_drain.py began failing often. I have seen it occasionally fail before. This was initially a problem in PR #3639 which is unrelated, but I recreated the problem in CI against master as of #3627.

I investigated and found this behaviour causing the failure:

* test_drain configures the drain period to be 1 second
* startup of a worker pool was taking more than 1 second
* the worker pool enters the drain state, drains, and terminates immediately once it is fully started up.
* test_drain.py fails its assumption that there is a worker pool to inspect, even after waiting for a worker pool to appear. This is the race condition: the assertion at line 57 is true, but line 59 returns an empty managers list.

```
57    try_assert(lambda: len(htex.connected_managers()) == 1, check_period_ms=CONNECTED_MANAGERS_POLL_MS)
58
59    managers = htex.connected_managers()
60    assert managers[0]['active'], "The manager should be active"
```

Looking at the CI logs for a failing case, I saw direct evidence that the worker pool takes more than 1 second to start up in `manager.log`:

```
2024-10-18 10:31:16.007 parsl:914 29414 MainThread [INFO]  Python version: 3.12.7 (main, Oct  1 2024, 15:17:50) [GCC 9.4.0]
[...]
2024-10-18 10:31:16.008 parsl:151 29414 MainThread [INFO]  Manager initializing
[this is where the worker start time for drain purposes is measured]
[...]
2024-10-18 10:31:16.011 parsl:183 29414 MainThread [INFO]  Manager connected to interchange
2024-10-18 10:31:17.058 parsl:243 29414 MainThread [INFO]  Will request drain at 1729247477.0087547
[...]
2024-10-18 10:31:17.073 parsl:336 29414 Task-Puller [DEBUG]  ready workers: 0, pending tasks: 0
```

There's more than a second of delay between "... connected to interchange" and the subsequent "Will request drain" message. Not much happens between these lines, but it does include things like multiprocessing initialization, which starts a new process.

It looks like this bit of code is slow even in the successful case: rerunning until success, I see this timing in CI:

```
2024-10-18 12:11:17.475 parsl:183 23062 MainThread [INFO]  Manager connected to interchange
2024-10-18 12:11:18.181 parsl:243 23062 MainThread [INFO]  Will request drain at 1729253478.4731379
```

which is still a large fraction of a second (but sufficiently less than a second for the test to pass).

I haven't investigated what is taking that time. I haven't investigated if I also see that on my laptop.

I hypothesised that a lot of these test failures come from the test environment being quite loaded. I'm especially suspicious of using `try_assert` with its default timings, which are very tight (20 ms): the connected-managers RPC here would be expected to run much less frequently, more like every 5 seconds, in regular Parsl use.

So I lengthened the period of the try_asserts in this test, to try to reduce the load caused there.

That makes the test pass repeatedly again.
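
As a hedged sketch of that change (reusing the constant and calls from the snippet quoted above; the 20 ms and 100 ms figures come from this description, but this is not necessarily the exact diff):

```
# Poll the interchange for connected managers every 100 ms rather than
# try_assert's 20 ms default, to reduce load on the test environment.
CONNECTED_MANAGERS_POLL_MS = 100

try_assert(lambda: len(htex.connected_managers()) == 1,
           check_period_ms=CONNECTED_MANAGERS_POLL_MS)
```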

Things not investigated/debugged:

* why this is taking >0.5 seconds even in the successful case - it's possible that this is a reasonable startup time, and so the test's timings might simply need to be lengthened by a few seconds
* how to write this test without relying on timing - draining is, by its very nature, reliant on the passage of "real time". For example, you might mock (at the libc level if not at the Python level) the system time.
* what other loads are present on the system - one of the points of slowly-ongoing PR #3397 shutdown tidyup is to reduce thread load left behind by earlier tests
```
@@ -322,7 +321,7 @@ def _create_pod(self,
                                                           claim_name=volume[0])))

         metadata = client.V1ObjectMeta(name=pod_name,
-                                       labels={"app": job_name},
+                                       labels={"job_id": job_id},
```
@benclifford (Collaborator) commented:

might be more informative to get the word parsl in this label name somehow, so that users looking at a pod can see that the job ID comes from parsl rather than from some part of kubernetes - for example, so as to not confuse tech support people when they are told the "job_id" of a pod without the relevant context.

@rjmello (Member, Author) replied:

Good idea; I'll change the label key to parsl-job-id.
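
As a hedged illustration of the benefit (not code from this PR), someone debugging a cluster could then find the pod for a given Parsl job with the official Kubernetes Python client; the namespace and job ID below are placeholders:

```
from kubernetes import client, config

# Look up the pod Parsl launched for a given job via the "parsl-job-id" label.
# "default" and "1a2b3c4d" are placeholder values for this sketch.
config.load_kube_config()
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod("default", label_selector="parsl-job-id=1a2b3c4d")
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
```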

parsl/utils.py Outdated
```
    # DNS label cannot exceed 63 characters
    sanitized = sanitized[:63]

    return _strip_non_alphanumeric(sanitized)
```
@benclifford (Collaborator) commented:

trim to len 63 after removing non-alphanumeric to allow more content to be preserved?

@rjmello (Member, Author) replied:

Trimming after will not work with strings like `s = "a" * 62 + "-a"`, in which `s[:63]` would end in a hyphen. I capture this scenario in the unit tests, but will add a comment for clarity.
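
For readers following along, a minimal sketch of the trim-then-strip ordering being discussed, reusing the names from the diff; this is an illustration, not the PR's exact implementation:

```
import re

def _strip_non_alphanumeric(s: str) -> str:
    # Assumed helper: drop leading/trailing characters that are not
    # alphanumeric, since an RFC 1123 label must start and end with one.
    return re.sub(r"^[^a-z0-9]+|[^a-z0-9]+$", "", s)

def sanitize_dns_label_rfc1123(raw: str) -> str:
    # Lowercase and replace disallowed characters with hyphens.
    sanitized = re.sub(r"[^a-z0-9-]+", "-", raw.lower())
    # DNS label cannot exceed 63 characters
    sanitized = sanitized[:63]
    # Strip after trimming: for s = "a" * 62 + "-a", s[:63] ends in a hyphen.
    return _strip_non_alphanumeric(sanitized)

assert sanitize_dns_label_rfc1123("a" * 62 + "-a") == "a" * 62
```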

parsl/utils.py Outdated
```
    # DNS subdomain cannot exceed 253 characters
    sanitized = sanitized[:253]

    return _strip_non_alphanumeric(sanitized)
```
@benclifford (Collaborator) commented:

`sanitize_dns_label_rfc1123` should have removed all non-alphanumeric characters here, right? So this should always be a no-op?

@rjmello (Member, Author) replied:

Similar to `sanitize_dns_label_rfc1123()`, we could end up with a trailing dot or hyphen after trimming. I included a test for this specific scenario.
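
In the same hedged spirit, a sketch of how the subdomain variant (building on the label sanitizer sketched above) can still end with a dot or hyphen after the 253-character trim, which is why the final strip is not a no-op:

```
def sanitize_dns_subdomain_rfc1123(raw: str) -> str:
    # Assumed approach: sanitize each dot-separated label, then rejoin.
    labels = [sanitize_dns_label_rfc1123(part) for part in raw.split(".")]
    sanitized = ".".join(label for label in labels if label)
    # DNS subdomain cannot exceed 253 characters; the trim can leave a
    # trailing "." or "-", so strip once more at the end.
    sanitized = sanitized[:253]
    return _strip_non_alphanumeric(sanitized)
```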

These convert any string to a valid RFC 1123 DNS subdomain or label.
- Modified Kubernetes pod names and labels to conform to RFC 1123 for
DNS subdomain names and labels, ensuring compliance with Kubernetes
naming conventions.

- Replaced the trailing timestamp in the job name with an eight-character
hex string (job ID) to improve collision avoidance.

- Replaced `app` pod label with `parsl-job-id`.

- Updated container name to use job ID.