NodeResourcesHandler takes into account failed pods #2185

YevheniiSemendiak · 2024-07-20T18:49:48Z

Observed this error on the dev cluster; it might be caused by custodian changing the state of the cluster.

When the cluster is recreated, k8s schedules all service pods, but not all of them fit on the nodes, which roll out sequentially.

Some of those pods are scheduled onto the first available node, do not fit, and end up in the "Failed" state with an OutOfcpu error status.
In this case k8s does not remove the failed pod, and it stays bound to the node.

This caused a resource leak in our NodeResourcesHandler, since we did not handle such an event.
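
For illustration, here is a minimal sketch of the idea, not the actual change in this PR. `NodeResourcesTracker`, its method names, and the millicore accounting are hypothetical stand-ins; the only facts assumed about Kubernetes are the standard pod phases (`Pending`, `Running`, `Succeeded`, `Failed`, `Unknown`). The point is that a pod in a terminal phase should stop counting against its node even though it is still bound to it.

```python
from dataclasses import dataclass, field

# Pod phases in which a pod no longer consumes node resources.
TERMINAL_PHASES = frozenset({"Succeeded", "Failed"})


@dataclass
class NodeResourcesTracker:
    """Hypothetical stand-in for NodeResourcesHandler.

    Tracks requested CPU (in millicores) per node based on pod events.
    """

    # pod name -> (node name, requested CPU in millicores)
    _pods: dict = field(default_factory=dict)

    def handle_pod_event(self, name: str, node: str, phase: str, cpu_m: int) -> None:
        if phase in TERMINAL_PHASES:
            # A failed (e.g. OutOfcpu) or succeeded pod may stay bound to the
            # node, but its request must be released to avoid leaking resources.
            self._pods.pop(name, None)
        elif node:
            # Only pods bound to a node count against that node.
            self._pods[name] = (node, cpu_m)

    def handle_pod_deleted(self, name: str) -> None:
        self._pods.pop(name, None)

    def requested_cpu(self, node: str) -> int:
        return sum(cpu for n, cpu in self._pods.values() if n == node)


tracker = NodeResourcesTracker()
tracker.handle_pod_event("svc-a", "node-1", "Running", 500)
# Rejected by the kubelet with OutOfcpu: phase is Failed, pod is still bound.
tracker.handle_pod_event("svc-b", "node-1", "Failed", 1000)
print(tracker.requested_cpu("node-1"))
```

Without the terminal-phase branch, `svc-b` would keep holding 1000 millicores on `node-1` until the pod object is deleted, which is the leak described above.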

codecov · 2024-07-20T18:51:05Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.67%. Comparing base (891a963) to head (0e798e5).
Report is 1 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (891a963) and HEAD (0e798e5).

HEAD has 2 uploads less than BASE

Flag	BASE (891a963)	HEAD (0e798e5)
integration	2	0

Additional details and impacted files

@@             Coverage Diff             @@
##           master    #2185       +/-   ##
===========================================
- Coverage   84.33%   69.67%   -14.66%     
===========================================
  Files          41       41               
  Lines        7411     7420        +9     
  Branches     1128     1128               
===========================================
- Hits         6250     5170     -1080     
- Misses        896     2125     +1229     
+ Partials      265      125      -140

Flag	Coverage Δ
integration	`?`
unit	`69.67% <100.00%> (?)`

Flags with carried forward coverage won't be shown.

Files	Coverage Δ
platform_api/orchestrator/kube_client.py	`69.45% <100.00%> (-15.51%)`	⬇️
...rm_api/orchestrator/kube_orchestrator_scheduler.py	`68.96% <100.00%> (-20.20%)`	⬇️

... and 35 files with indirect coverage changes

zubenkoivan

Thanks!

YevheniiSemendiak added 2 commits July 20, 2024 21:42

NodeResourcesHandler takes into account failed pods

fce42e4

handle also init state

0e798e5

YevheniiSemendiak requested a review from zubenkoivan July 20, 2024 18:49

cleanup

43d4dc5

zubenkoivan approved these changes Jul 21, 2024

YevheniiSemendiak merged commit bca303e into master Jul 21, 2024
14 checks passed

YevheniiSemendiak deleted the ys/handle-failed-pods branch July 21, 2024 10:52
