|
38 | 38 | - [API changes](#api-changes) |
39 | 39 | - [Resize Restart Policy](#resize-restart-policy) |
40 | 40 | - [Implementation Details](#implementation-details) |
41 | | - - [[Scoped for GA] Memory Manager](#scoped-for-ga-memory-manager) |
42 | | - - [[Scoped for GA] CPU Manager](#scoped-for-ga-cpu-manager) |
43 | | - - [[Scoped for GA] Topology Manager](#scoped-for-ga-topology-manager) |
44 | | - - [[Scoped for GA] User Experience Survey](#scoped-for-ga-user-experience-survey) |
45 | 41 | - [[Scoped for Beta] Surfacing Pod Resource Requirements](#scoped-for-beta-surfacing-pod-resource-requirements) |
46 | 42 | - [The Challenge of Determining Effective Pod Resource Requirements](#the-challenge-of-determining-effective-pod-resource-requirements) |
47 | 43 | - [Goals of surfacing Pod Resource Requirements](#goals-of-surfacing-pod-resource-requirements) |
48 | 44 | - [Implementation Details](#implementation-details-1) |
49 | 45 | - [Notes for implementation](#notes-for-implementation) |
50 | | - - [[Scoped for Beta] VPA](#scoped-for-beta-vpa) |
51 | | - - [[Scoped for Beta] Cluster Autoscaler](#scoped-for-beta-cluster-autoscaler) |
52 | | - - [[Scoped for Beta] Support for Windows](#scoped-for-beta-support-for-windows) |
| 46 | + - [[Scoped for Beta] HPA](#scoped-for-beta-hpa) |
| 47 | + - [Cluster Autoscaler](#cluster-autoscaler) |
| 48 | + - [VPA](#vpa) |
| 49 | + - [[Future KEP Consideration in 1.35] Support for Windows](#future-kep-consideration-in-135-support-for-windows) |
| 50 | + - [[Future KEP Consideration in 1.35] Memory Manager](#future-kep-consideration-in-135-memory-manager) |
| 51 | + - [[Future KEP Consideration] CPU Manager](#future-kep-consideration-cpu-manager) |
| 52 | + - [[Future KEP Consideration] Topology Manager](#future-kep-consideration-topology-manager) |
| 53 | + - [[Scoped for GA] User Experience Survey](#scoped-for-ga-user-experience-survey) |
53 | 54 | - [Test Plan](#test-plan) |
54 | 55 | - [Unit tests](#unit-tests) |
55 | 56 | - [e2e tests](#e2e-tests) |
|
71 | 72 | - [Implementation History](#implementation-history) |
72 | 73 | - [Drawbacks](#drawbacks) |
73 | 74 | - [Alternatives](#alternatives) |
74 | | - - [VPA](#vpa) |
| 75 | + - [VPA](#vpa-1) |
75 | 76 | <!-- /toc --> |
76 | 77 |
|
77 | 78 |
|
@@ -1359,93 +1360,6 @@ either modify the pod-level resources to accommodate ephemeral containers or |
1359 | 1360 | supply resources at the container level for ephemeral containers and Kubernetes will |
1360 | 1361 | resize the pod to accommodate the ephemeral containers. |
1361 | 1362 |
|
1362 | | - |
1363 | | -#### [Scoped for GA] Memory Manager |
1364 | | - |
1365 | | -The Memory Manager currently allocates memory resources at |
1366 | | -the container level through its |
1367 | | -[Allocate](https://github.com/kubernetes/kubernetes/blob/849a82b727b1cc1e77b58149b3cacbfa5ada30fd/pkg/kubelet/cm/memorymanager/memory_manager.go#L261) |
1368 | | -method. The [Topology Manager](https://github.com/kubernetes/kubernetes/blob/fd53f7292c7d5899135fddd928c0dc3844126820/pkg/kubelet/cm/topologymanager/scope.go#L150) calls this Allocate method as part of its hint provider integration. |
1369 | | - |
1370 | | - |
1371 | | -With the introduction of Pod Level Resources, the following modifications are needed: |
1372 | | - |
1373 | | -1. Memory Manager Interface Extension: |
1374 | | -Add a new AllocatePodLevel method to the Memory Manager interface to handle |
1375 | | -resource allocation at the pod level. This method will complement the existing container-level Allocate method. |
1376 | | - |
1377 | | -2. Topology Manager Integration: Modify the (Topology Manager)[https://github.com/kubernetes/kubernetes/blob/fd53f7292c7d5899135fddd928c0dc3844126820/pkg/kubelet/cm/topologymanager/scope.go#L150] to conditionally |
1378 | | -call AllocatePodLevel when pod-level resources are configured. Maintain |
1379 | | -backward compatibility by continuing to use the existing Allocate method for |
1380 | | -container-level allocation scenarios |
1381 | | - |
1382 | | -Note: The BestEffort policy (Windows-only) is explicitly out of scope for this |
1383 | | -change, as Windows implementation is not covered by the Pod Level Resources KEP. |
1384 | | - |
1385 | | -#### [Scoped for GA] CPU Manager |
1386 | | - |
1387 | | -The Memory Manager currently allocates memory resources at |
1388 | | -the container level through its |
1389 | | -[Allocate](https://github.com/kubernetes/kubernetes/blob/fd53f7292c7d5899135fddd928c0dc3844126820/pkg/kubelet/cm/cpumanager/cpu_manager.go#L255) |
1390 | | -method. The [Topology Manager](https://github.com/kubernetes/kubernetes/blob/fd53f7292c7d5899135fddd928c0dc3844126820/pkg/kubelet/cm/topologymanager/scope.go#L150) calls this Allocate method as part of its hint provider integration. |
1391 | | - |
1392 | | -With the introduction of Pod Level Resources, the following modifications are required: |
1393 | | - |
1394 | | -1. CPU Manager Interface Extension: Add a new AllocatePodLevel method to the CPU |
1395 | | -Manager interface to handle resource allocation at the pod level. This method |
1396 | | -will complement the existing container-level Allocate method. |
1397 | | - |
1398 | | -2. Topology Manager Integration: Modify the (Topology |
1399 | | -Manager)[https://github.com/kubernetes/kubernetes/blob/fd53f7292c7d5899135fddd928c0dc3844126820/pkg/kubelet/cm/topologymanager/scope.go#L150] |
1400 | | -to conditionally call AllocatePodLevel when pod-level resources are |
1401 | | -configured. Backward compatibility will be maintained by continuing to use the |
1402 | | -existing Allocate method for container-level allocation scenarios. |
1403 | | - |
1404 | | -3. Policy-Specific Modifications: Not all existing CPU Manager policies remain |
1405 | | - compatible with Pod Level Resources. Following are policy-specific |
1406 | | - adaptations: |
1407 | | - |
1408 | | -* distribute-cpus-across-numa: This policy is incompatible with pod-level |
1409 | | - resources. Distributing CPUs across NUMA nodes requires detailed knowledge of |
1410 | | - bandwidth-intensive containers, which is explicitly abstracted away by |
1411 | | - pod-level resources. Without workload-specific information, the system cannot |
1412 | | - optimally distribute containers across NUMA nodes, and incorrect placement |
1413 | | - could degrade performance (How to distribute M containers across N NUMA |
1414 | | - nodes). |
1415 | | - |
1416 | | -* distribute-cpus-across-cores: Similarly, this policy is incompatible. Users focused on core-level optimization for individual containers would likely not opt for pod-level resources in the first place. |
1417 | | - |
1418 | | -* full-pcpus-only: This policy is compatible and highly beneficial for multi-tenant pods requiring inter-pod isolation, as it helps prevent hyperthread contention. The CPU Manager will be extended to allocate full physical cores at the pod level and implement a shared CPU pool within pod boundaries. |
1419 | | - |
1420 | | -* align-by-socket: This policy is compatible. It ensures all a pod's CPUs remain on the same socket when possible, reducing inter-socket latencies and benefiting containers that share L3 cache or communicate frequently. The socket alignment logic will be extended to work with pod-level CPU pools. |
1421 | | - |
1422 | | -* strict-cpu-reservation: This policy is compatible and crucial for guaranteed workloads, preventing interference from burstable and best-effort pods. We'll update the CPU reservation logic to consider pod-level requests and limits. |
1423 | | - |
1424 | | -* prefer-align-cpus-by-uncorecache: This policy is compatible. It optimizes CPU allocation across uncore cache groups, enhancing shared cache locality for containers within the pod. The allocation logic will be updated to consider pod-level requests and limits. |
1425 | | - |
1426 | | -Note: This is a prelimnary analysis, and we might have real usecases to support |
1427 | | - distribute-cpus-across-numa and distribute-cpus-across-cores with pod-level |
1428 | | - resources. We can re-visit this again during the GA planning cycle. |
1429 | | - |
1430 | | -#### [Scoped for GA] Topology Manager |
1431 | | - |
1432 | | -Currently, scope=pod aggregates resource requirements from a pod's individual |
1433 | | -containers to determine overall pod-level needs. With the introduction of Pod |
1434 | | -Level Resources, scope=pod will directly use the pod-level resource values |
1435 | | -specified in the Pod object for topology alignment. |
1436 | | - |
1437 | | -Besides, scope=container won't be supported for pods with Pod Level Resources. This is because these pods lack per-container resource specifications, leaving the Topology Manager without the granular information needed to make informed container-level topology decisions. If a user attempts to configure scope=container for such a pod, the Topology Manager will explicitly disallow it and provide an informative message. This message will guide the user to use scope=pod or to configure per-container resources if fine-grained container-level topology is truly desired. |
1438 | | - |
1439 | | -#### [Scoped for GA] User Experience Survey |
1440 | | - |
1441 | | -Before promoting the feature to GA, we plan to conduct a UX survey to |
1442 | | -understand user expectations for setting various combinations of requests and |
1443 | | -limits at both the pod and container levels. This will help us gather use cases |
1444 | | -for different combinations, enabling us to enhance the feature's usability. If we |
1445 | | -identify the need for significant changes to the defaulting logic based on this |
1446 | | -feedback, we'll release another Beta version of Pod-Level Resources to |
1447 | | -incorporate those adjustments. |
1448 | | - |
1449 | 1363 | #### [Scoped for Beta] Surfacing Pod Resource Requirements |
1450 | 1364 |
|
1451 | 1365 | ##### The Challenge of Determining Effective Pod Resource Requirements |
@@ -1558,32 +1472,149 @@ KEPs. The first change doesn’t present any user visible change, and if |
1558 | 1472 | implemented, will in a small way reduce the effort for both of those KEPs by |
1559 | 1473 | providing a single place to update the pod resource calculation. |
1560 | 1474 |
|
1561 | | -#### [Scoped for Beta] VPA |
1562 | | -
|
1563 | | -TBD. Do not review for the alpha stage. |
1564 | | -
|
1565 | | -#### [Scoped for Beta] Cluster Autoscaler |
| 1475 | +#### [Scoped for Beta] HPA |
| 1476 | +For accurate scaling decisions, HPA must be able to correctly calculate the |
| 1477 | +resources requested by a pod, regardless of whether those requests are defined |
| 1478 | +at the pod or container level. Currently, HPA calculates pod requests by simply |
| 1479 | +aggregating the requests of all containers within a pod, which ignores pod-level |
| 1480 | +values. To address this, HPA should leverage the helper method found at |
| 1481 | +https://github.com/kubernetes/kubernetes/blob/988cf21f0975cf95444a619481c13d2503d8ec6a/staging/src/k8s.io/component-helpers/resource/helpers.go |
| 1482 | +for more precise pod request computations. The changes are being worked on by |
| 1483 | +sig-autoscaling: [#132237](https://github.com/kubernetes/kubernetes/issues/132237) |
| 1484 | +
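A minimal sketch of the computation this enables, assuming the `PodRequests` helper in the `k8s.io/component-helpers/resource` package linked above; the wiring shown here is illustrative, not the final HPA implementation tracked in #132237:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	resourcehelper "k8s.io/component-helpers/resource"
)

func main() {
	// A pod that sets only pod-level requests; its container carries none.
	pod := &v1.Pod{
		Spec: v1.PodSpec{
			Resources: &v1.ResourceRequirements{
				Requests: v1.ResourceList{v1.ResourceCPU: resource.MustParse("2")},
			},
			Containers: []v1.Container{{Name: "app"}},
		},
	}

	// Naive per-container aggregation would see 0 CPU for this pod. The helper
	// accounts for pod-level requests when they are set, so HPA sees "2".
	reqs := resourcehelper.PodRequests(pod, resourcehelper.PodResourcesOptions{})
	fmt.Println(reqs.Cpu().String())
}
```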
|
| 1485 | +#### Cluster Autoscaler |
| 1486 | +
|
| 1487 | +The Cluster Autoscaler uses resourcehelper.PodRequests to calculate Pod resource |
| 1488 | +requirements for scaling decisions since version 1.4.0. This automatically |
| 1489 | +includes Pod-level resource requests when the PodLevelResources feature gate is |
| 1490 | +enabled, ensuring accurate node scaling and utilization calculations. |
| 1491 | +
|
| 1492 | +#### VPA |
| 1493 | +
|
| 1494 | +Collaboration with sig-autoscaling has been established to integrate support for |
| 1495 | +VPA with Pod-level resources, slated for VPA 1.34. The changes to support pod-level |
| 1496 | +resources in VPA will be worked on in two phases: |
| 1497 | +* [Scoped for Beta] Phase 1: Necessary changes |
| 1498 | +  The necessary changes include augmenting the recommendation algorithm to |
| 1499 | +  provide pod-level resource recommendations within RecommendedPodResources, in |
| 1500 | +  addition to existing per-container recommendations, when pod-level resources are |
| 1501 | +  set (a populated example follows this list). |
| 1502 | + ```go |
| 1503 | + type RecommendedPodResources struct { |
| 1504 | + ContainerRecommendations []RecommendedContainerResources |
| 1505 | + // NEW: Pod-level resources |
| 1506 | + PodLevelResources *ResourceList |
| 1507 | + } |
| 1508 | + ``` |
| 1509 | + Note: Detailed KEP design is owned and being worked on by |
| 1510 | + sig-autoscaling: [#7571](https://github.com/kubernetes/autoscaler/issues/7571) |
1566 | 1511 |
|
1567 | | -Cluster Autoscaler won't work as expected with pod-level resources in alpha since |
1568 | | -it relies on container-level values to be specified. If a user specifies only |
1569 | | -pod-level resources, the CA will assume that the pod requires no resources since |
1570 | | -container-level values are not set. As a result, the CA won't scale the number of |
1571 | | -nodes to accommodate this pod. Meanwhile, the scheduler will evaluate the |
1572 | | -pod-level resource requests but may be unable to find a suitable node to fit the |
1573 | | -pod. Consequently, the pod will not be scheduled. While this behavior is |
1574 | | -acceptable for the alpha implementation, it is anticipated that Cluster |
1575 | | -Autoscaler support will be addressed in the Beta phase with pod resource |
1576 | | -requirements surfaced in a helper library/function that autoscalers can use to |
1577 | | -make autoscaling decisions. |
| 1512 | +* [Scoped for GA] Phase 2: Improving the recommendation algorithm |
| 1513 | +  Pod-Level Resources permits pod-level limits to exceed the aggregated |
| 1514 | +  container limits so that containers can share idle resources with each other. |
| 1515 | +  Integrating this functionality with VPA necessitates the development of a |
| 1516 | +  complex new recommendation algorithm. Concepts such as proportionate pod- |
| 1517 | +  and container-level recommendations have been proposed and |
| 1518 | +  require further discussion. |
1578 | 1519 |
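Illustrative only: a Phase 1 recommendation populating the proposed PodLevelResources field alongside today's per-container recommendations. The types are re-declared here from the sketch above for self-containment; the final VPA API may differ (see #7571).

```go
package recommender

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Re-declared from the sketch above; names are proposals, not the final API.
type RecommendedContainerResources struct {
	ContainerName string
	Target        v1.ResourceList
}

type RecommendedPodResources struct {
	ContainerRecommendations []RecommendedContainerResources
	PodLevelResources        *v1.ResourceList // NEW: pod-level recommendation
}

// A pod-level target of 1 CPU / 1Gi, surfaced together with a per-container
// target for the pod's lone container.
var sample = RecommendedPodResources{
	ContainerRecommendations: []RecommendedContainerResources{{
		ContainerName: "app",
		Target:        v1.ResourceList{v1.ResourceCPU: resource.MustParse("500m")},
	}},
	PodLevelResources: &v1.ResourceList{
		v1.ResourceCPU:    resource.MustParse("1"),
		v1.ResourceMemory: resource.MustParse("1Gi"),
	},
}
```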
|
1579 | | -#### [Scoped for Beta] Support for Windows |
| 1520 | +#### [Future KEP Consideration in 1.35] Support for Windows |
1580 | 1521 |
|
1581 | 1522 | Pod-level resource specifications are a natural extension of Kubernetes' existing |
1582 | 1523 | resource management model. Although this new feature is expected to function with |
1583 | 1524 | Windows containers, careful testing and consideration are required due to |
1584 | 1525 | platform-specific differences. As the introduction of pod-level resources is a |
1585 | 1526 | major change in itself, full support for Windows will be addressed in future |
1586 | | -stages, beyond the initial alpha release. |
| 1527 | +KEPs, beyond the scope of this one. |
| 1528 | +
|
| 1529 | +#### [Future KEP Consideration in 1.35] Memory Manager |
| 1530 | +
|
| 1531 | +The Memory Manager currently allocates memory resources at |
| 1532 | +the container level through its |
| 1533 | +[Allocate](https://github.com/kubernetes/kubernetes/blob/849a82b727b1cc1e77b58149b3cacbfa5ada30fd/pkg/kubelet/cm/memorymanager/memory_manager.go#L261) |
| 1534 | +method. The [Topology Manager](https://github.com/kubernetes/kubernetes/blob/fd53f7292c7d5899135fddd928c0dc3844126820/pkg/kubelet/cm/topologymanager/scope.go#L150) calls this Allocate method as part of its hint provider integration. |
| 1535 | +
|
| 1536 | +
|
| 1537 | +With the introduction of Pod Level Resources, the following modifications are needed: |
| 1538 | +
|
| 1539 | +1. Memory Manager Interface Extension: |
| 1540 | +Add a new AllocatePodLevel method to the Memory Manager interface to handle |
| 1541 | +resource allocation at the pod level. This method will complement the existing container-level Allocate method (see the interface sketch after this list). |
| 1542 | +
|
| 1543 | +2. Topology Manager Integration: Modify the [Topology Manager](https://github.com/kubernetes/kubernetes/blob/fd53f7292c7d5899135fddd928c0dc3844126820/pkg/kubelet/cm/topologymanager/scope.go#L150) to conditionally |
| 1544 | +call AllocatePodLevel when pod-level resources are configured. Maintain |
| 1545 | +backward compatibility by continuing to use the existing Allocate method for |
| 1546 | +container-level allocation scenarios. |
| 1547 | +
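A minimal sketch of the extension described in item 1, assuming the current Allocate signature in pkg/kubelet/cm/memorymanager; AllocatePodLevel is a name proposed by this KEP, not a merged API:

```go
package memorymanager

import v1 "k8s.io/api/core/v1"

// Manager is a trimmed view of the existing interface; unrelated methods
// (Start, state accessors, hint generation, ...) are elided.
type Manager interface {
	// Allocate reserves memory for a single container; the Topology Manager
	// calls this today via its hint-provider integration.
	Allocate(pod *v1.Pod, container *v1.Container) error

	// AllocatePodLevel (NEW, proposed) reserves memory once for the whole pod
	// when pod-level resources are set, complementing Allocate.
	AllocatePodLevel(pod *v1.Pod) error
}
```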
|
| 1548 | +#### [Future KEP Consideration] CPU Manager |
| 1549 | +
|
| 1550 | +The CPU Manager currently allocates CPU resources at |
| 1551 | +the container level through its |
| 1552 | +[Allocate](https://github.com/kubernetes/kubernetes/blob/fd53f7292c7d5899135fddd928c0dc3844126820/pkg/kubelet/cm/cpumanager/cpu_manager.go#L255) |
| 1553 | +method. The [Topology Manager](https://github.com/kubernetes/kubernetes/blob/fd53f7292c7d5899135fddd928c0dc3844126820/pkg/kubelet/cm/topologymanager/scope.go#L150) calls this Allocate method as part of its hint provider integration. |
| 1554 | +
|
| 1555 | +With the introduction of Pod Level Resources, the following modifications are required: |
| 1556 | +
|
| 1557 | +1. CPU Manager Interface Extension: Add a new AllocatePodLevel method to the CPU |
| 1558 | +Manager interface to handle resource allocation at the pod level. This method |
| 1559 | +will complement the existing container-level Allocate method. |
| 1560 | +
|
| 1561 | +2. Topology Manager Integration: Modify the [Topology |
| 1562 | +Manager](https://github.com/kubernetes/kubernetes/blob/fd53f7292c7d5899135fddd928c0dc3844126820/pkg/kubelet/cm/topologymanager/scope.go#L150) |
| 1563 | +to conditionally call AllocatePodLevel when pod-level resources are |
| 1564 | +configured (see the dispatch sketch at the end of this section). Backward |
| 1565 | +compatibility will be maintained by continuing to use the existing Allocate method for container-level allocation scenarios. |
| 1566 | +
|
| 1567 | +3. Policy-Specific Modifications: Not all existing CPU Manager policies remain |
| 1568 | + compatible with Pod Level Resources. The policy-specific |
| 1569 | + adaptations are as follows: |
| 1570 | + |
| 1571 | +* distribute-cpus-across-numa: This policy is incompatible with pod-level |
| 1572 | + resources. Distributing CPUs across NUMA nodes requires detailed knowledge of |
| 1573 | + bandwidth-intensive containers, which is explicitly abstracted away by |
| 1574 | + pod-level resources. Without workload-specific information, the system cannot |
| 1575 | + optimally distribute containers across NUMA nodes, and incorrect placement |
| 1576 | + could degrade performance (there is no clear answer to how M containers |
| 1577 | + should be distributed across N NUMA nodes). |
| 1578 | +
|
| 1579 | +* distribute-cpus-across-cores: Similarly, this policy is incompatible. Users focused on core-level optimization for individual containers would likely not opt for pod-level resources in the first place. |
| 1580 | +
|
| 1581 | +* full-pcpus-only: This policy is compatible and highly beneficial for multi-tenant pods requiring inter-pod isolation, as it helps prevent hyperthread contention. The CPU Manager will be extended to allocate full physical cores at the pod level and implement a shared CPU pool within pod boundaries. |
| 1582 | +
|
| 1583 | +* align-by-socket: This policy is compatible. It ensures all of a pod's CPUs remain on the same socket when possible, reducing inter-socket latencies and benefiting containers that share L3 cache or communicate frequently. The socket alignment logic will be extended to work with pod-level CPU pools. |
| 1584 | +
|
| 1585 | +* strict-cpu-reservation: This policy is compatible and crucial for guaranteed workloads, preventing interference from burstable and best-effort pods. We'll update the CPU reservation logic to consider pod-level requests and limits. |
| 1586 | +
|
| 1587 | +* prefer-align-cpus-by-uncorecache: This policy is compatible. It optimizes CPU allocation across uncore cache groups, enhancing shared cache locality for containers within the pod. The allocation logic will be updated to consider pod-level requests and limits. |
| 1588 | +
|
| 1589 | +Note: This is a preliminary analysis, and we might have real use cases to support |
| 1590 | + distribute-cpus-across-numa and distribute-cpus-across-cores with pod-level |
| 1591 | + resources. We can revisit this during the GA planning cycle. |
| 1592 | +
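A hypothetical sketch of the conditional dispatch described in item 2 above. The HintProvider shape mirrors today's Topology Manager contract, but PodLevelAllocator and isPodLevelResourcesSet are illustrative names, not existing kubelet APIs:

```go
package topologymanager

import v1 "k8s.io/api/core/v1"

// HintProvider mirrors the per-container contract the Topology Manager uses today.
type HintProvider interface {
	Allocate(pod *v1.Pod, container *v1.Container) error
}

// PodLevelAllocator is the proposed optional extension a provider (CPU or
// Memory Manager) would implement for pod-level allocation.
type PodLevelAllocator interface {
	AllocatePodLevel(pod *v1.Pod) error
}

func allocateAlignedResources(pod *v1.Pod, container *v1.Container, providers []HintProvider) error {
	for _, p := range providers {
		// Pod-level resources configured: allocate once per pod via the new path.
		if pla, ok := p.(PodLevelAllocator); ok && isPodLevelResourcesSet(pod) {
			if err := pla.AllocatePodLevel(pod); err != nil {
				return err
			}
			continue
		}
		// Otherwise keep the existing per-container behavior.
		if err := p.Allocate(pod, container); err != nil {
			return err
		}
	}
	return nil
}

func isPodLevelResourcesSet(pod *v1.Pod) bool {
	return pod.Spec.Resources != nil &&
		(len(pod.Spec.Resources.Requests) > 0 || len(pod.Spec.Resources.Limits) > 0)
}
```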
|
| 1593 | +#### [Future KEP Consideration] Topology Manager |
| 1594 | +
|
| 1595 | +Currently, scope=pod aggregates resource requirements from a pod's individual |
| 1596 | +containers to determine overall pod-level needs. With the introduction of Pod |
| 1597 | +Level Resources, scope=pod will directly use the pod-level resource values |
| 1598 | +specified in the Pod object for topology alignment. |
| 1599 | +
|
| 1600 | +In addition, scope=container won't be supported for pods with Pod Level Resources. This |
| 1601 | +is because these pods lack per-container resource specifications, leaving the |
| 1602 | +Topology Manager without the granular information needed to make informed |
| 1603 | +container-level topology decisions. If a user creates a pod with pod-level |
| 1604 | +resources and the pod is scheduled on a node where the Kubelet's Topology Manager |
| 1605 | +is configured with scope=container, the Topology Manager will not perform |
| 1606 | +resource alignment for that pod and will explicitly raise an error with an |
| 1607 | +informative message. This message will guide the user to use scope=pod or to configure per-container resources if fine-grained container-level topology is truly desired. |
| 1608 | +
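A self-contained sketch of the scope=container guard described above, assuming PodSpec.Resources carries the pod-level values; the function name and error text are illustrative, not merged kubelet code:

```go
package topologymanager

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// rejectPodLevelResourcesInContainerScope returns the error the container
// scope could raise at admission for pods that set pod-level resources.
func rejectPodLevelResourcesInContainerScope(pod *v1.Pod) error {
	// No pod-level resources: the existing per-container path applies.
	if pod.Spec.Resources == nil {
		return nil
	}
	return fmt.Errorf(
		"topology manager scope=container cannot align pod %q: pod-level resources are set; "+
			"use scope=pod or specify per-container resources", pod.Name)
}
```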
|
| 1609 | +#### [Scoped for GA] User Experience Survey |
| 1610 | +
|
| 1611 | +Before promoting the feature to GA, we plan to conduct a UX survey to |
| 1612 | +understand user expectations for setting various combinations of requests and |
| 1613 | +limits at both the pod and container levels. This will help us gather use cases |
| 1614 | +for different combinations, enabling us to enhance the feature's usability. If we |
| 1615 | +identify the need for significant changes to the defaulting logic based on this |
| 1616 | +feedback, we'll release another Beta version of Pod-Level Resources to |
| 1617 | +incorporate those adjustments. |
1587 | 1618 |
|
1588 | 1619 | ### Test Plan |
1589 | 1620 |
|
|