fix(trainer): filter duplicate Pods in get_job() API #160
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
When Kubernetes recreates Pods due to restart policies, multiple Pods with the same role can exist simultaneously. This causes get_job() to return duplicate TrainJob components with different statuses, creating confusion for users. This change groups Pods by their component role and selects only the most recently created Pod for each component based on creation_timestamp. This ensures users see the current state of their TrainJob after any Pod restarts. Changes: - Group Pods by role identifier (initializer name or node+index) - Select most recent Pod from each group using creation_timestamp - Add comprehensive test for Pod restart scenarios Fixes kubeflow#25 Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
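A minimal sketch of the grouping step this commit message describes; the label values and the helper name are illustrative assumptions, not the SDK's exact code:

```python
from collections import defaultdict

# Illustrative label values; the SDK defines its own constants for these.
JOBSET_RJOB_NAME_LABEL = "jobset.sigs.k8s.io/replicatedjob-name"
JOB_INDEX_LABEL = "batch.kubernetes.io/job-completion-index"

def group_pods_by_role(pods):
    """Group Pods so that restarts of the same TrainJob component land in one bucket."""
    pod_groups = defaultdict(list)
    for pod in pods:
        labels = pod.metadata.labels or {}
        role = labels.get(JOBSET_RJOB_NAME_LABEL, "")
        index = labels.get(JOB_INDEX_LABEL)  # only set on training-node Pods
        # Initializers are keyed by role alone; training nodes also carry their index.
        key = (role, index) if index is not None else (role,)
        pod_groups[key].append(pod)
    return pod_groups
```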
Force-pushed from 978b209 to faf96a5.
pod_groups[key] = []
pod_groups[key].append(pod)

# Select the most recently created Pod from each group.
I think to make it more robust we could select the pod based on the status as well as the timestamp, something like this: Fiona-Waters@b48277f. wdyt?
It will return a pod that actually reflects the true state of each TrainJob component, rather than just the newest pod.
Absolutely agree! I've actually implemented exactly that approach from your commit b48277f. The current implementation now:
- Prioritizes by status first: Running (4) > Succeeded (3) > Failed (2) > Pending (1) > Unknown (0)
- Uses timestamp as a tiebreaker: among pods with the same status, selects the most recent one

This ensures we return a pod that reflects the true state of the TrainJob component (preferring Running/Succeeded pods over Failed ones), rather than blindly picking the newest pod regardless of its state.

For example, if we have:
- Pod A: Failed (created at 11:00)
- Pod B: Running (created at 10:00)

the old logic would return Pod A (newest), but the new logic correctly returns Pod B (Running status is higher priority).
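A minimal sketch of that selection logic; the STATUS_PRIORITY mapping and the select_best_pod helper are illustrative names, not the PR's exact code:

```python
import datetime
from datetime import timezone

# Illustrative status priorities; higher values win.
STATUS_PRIORITY = {"Running": 4, "Succeeded": 3, "Failed": 2, "Pending": 1, "Unknown": 0}

def select_best_pod(pods):
    """Pick the Pod that best reflects the component's current state:
    highest status priority first, then the most recent creation time."""
    def sort_key(pod):
        phase = (pod.status.phase if pod.status else None) or "Unknown"
        created = pod.metadata.creation_timestamp or datetime.datetime.min.replace(
            tzinfo=timezone.utc
        )
        return (STATUS_PRIORITY.get(phase, 0), created)

    return max(pods, key=sort_key)
```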
…e safety

Apply code quality improvements based on review feedback:
- Use status-based priority for pod selection (Running > Succeeded > Failed > Pending > Unknown)
- Add datetime.min fallback for safer timestamp sorting (prevents TypeError)
- Add precise type hints to internal dicts for better type checking
- Use consistent .get() access for JOB_INDEX_LABEL with default fallback
- Add pod phase constants (POD_RUNNING, POD_FAILED, POD_PENDING, POD_UNKNOWN)

These changes improve robustness, type safety, and maintainability while maintaining the same behavior of selecting the best pod for each role.

Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
# Sort by creation timestamp (most recent first)
candidate_pods.sort(
    key=lambda p: p.metadata.creation_timestamp or datetime.datetime.min, reverse=True
)
This could cause an issue in newer Python versions - do we want it to be timezone-naive or set to UTC?

Suggested change:
-    key=lambda p: p.metadata.creation_timestamp or datetime.datetime.min, reverse=True
+    key=lambda p: (p.metadata.creation_timestamp or datetime.datetime.min.replace(tzinfo=timezone.utc))
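For context, a small self-contained illustration of the failure mode this suggestion avoids; the datetime values are made up, and the premise is that the Kubernetes client returns creation_timestamp as timezone-aware UTC datetimes:

```python
import datetime
from datetime import timezone

aware = datetime.datetime(2024, 1, 1, tzinfo=timezone.utc)  # like a creation_timestamp
naive_fallback = datetime.datetime.min                       # timezone-naive fallback

try:
    sorted([aware, naive_fallback])  # mixing aware and naive datetimes
except TypeError as err:
    print(err)  # can't compare offset-naive and offset-aware datetimes

# A timezone-aware fallback keeps the comparison valid:
aware_fallback = datetime.datetime.min.replace(tzinfo=timezone.utc)
print(min([aware, aware_fallback]))  # comparison now succeeds
```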
Good catch! I've applied your suggestion to use datetime.datetime.min.replace(tzinfo=timezone.utc) instead of the timezone-naive datetime.datetime.min.
Thanks for the review!
Great. Don't forget to import timezone too
from datetime import timezone
Done! Added the timezone import. Thanks for catching that! 👍
@HKanoje left one more comment but otherwise it looks good to me.
Use datetime.datetime.min.replace(tzinfo=timezone.utc) instead of datetime.datetime.min to prevent TypeError when comparing timezone-aware and timezone-naive datetimes in Python 3.9+. The Kubernetes API returns creation_timestamp as timezone-aware datetime objects in UTC, so the fallback should also be timezone-aware for safe comparison. Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
Import timezone from datetime module to use timezone.utc directly instead of datetime.timezone.utc for better readability. Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
/lgtm Thanks @HKanoje @Fiona-Waters!
/assign @kubeflow/kubeflow-sdk-team

/ok-to-test
Pull Request Test Coverage Report for Build 19430062171

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

💛 - Coveralls
Description
Problem:

The get_job() API currently returns multiple Pods for the same TrainJob component (e.g., dataset-initializer, trainer-node-0) when Kubernetes recreates Pods based on Batch/Job restart policies.

This causes users to see duplicate components with conflicting statuses (for example, one Pod may show "Failed" while another shows "Running"), leading to confusion about the actual state of the training job.
Solution:

This PR improves the get_job() API to filter duplicate Pods and display only the most recently created Pod for each TrainJob component.

Key Improvements
1. Groups Pods by role
- JOBSET_RJOB_NAME_LABEL for initializer Pods
- JOBSET_RJOB_NAME_LABEL + JOB_INDEX_LABEL for training-node Pods (ensures correct grouping across multi-node trainer replicas)
2. Selects the most recent Pod
- Picks the Pod with the latest creation_timestamp in each group
- Drops stale Pods (e.g., old Pods in Failed state)

3. Maintains backward compatibility
This ensures users see clean, de-duplicated component statuses that accurately represent the current state of their training job.
Example Impact:

Before this fix, get_job() could report the same component several times with conflicting statuses; after this fix, each component is reported once with its current status.
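An illustrative sketch of the difference; the component names and statuses here are examples, not actual SDK output:

Before: [("trainer-node-0", "Failed"), ("trainer-node-0", "Running"), ("dataset-initializer", "Succeeded")]
After:  [("trainer-node-0", "Running"), ("dataset-initializer", "Succeeded")]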
Changes Made
Modified Files
- backend.py: updated the __get_trainjob_from_cr() method to implement the Pod de-duplication and filtering logic
  - Groups initializer Pods by JOBSET_RJOB_NAME_LABEL and training-node Pods by JOBSET_RJOB_NAME_LABEL + JOB_INDEX_LABEL
  - Selects the Pod to report for each group using creation_timestamp
- backend_test.py: added test_get_job_with_pod_restarts()

Testing
All tests passing:
- make verify: PASSED (lint + format checks)
- test_get_job_with_pod_restarts: PASSED (new test for Pod restart filtering)
- test_get_job: PASSED (existing behavior remains compatible)

Test Coverage
- The new test covers Pods with different creation_timestamp values
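A hedged sketch of the restart scenario such a test exercises, reusing the illustrative group_pods_by_role and select_best_pod helpers sketched earlier; this is not the PR's actual test, which runs against the SDK backend:

```python
import datetime
from datetime import timezone
from types import SimpleNamespace

def make_pod(role, index, phase, created):
    """Build a minimal Pod-like object with just the fields the logic reads."""
    return SimpleNamespace(
        metadata=SimpleNamespace(
            labels={JOBSET_RJOB_NAME_LABEL: role, JOB_INDEX_LABEL: index},
            creation_timestamp=created,
        ),
        status=SimpleNamespace(phase=phase),
    )

def test_pod_restart_dedup():
    old = make_pod("trainer-node", "0", "Failed",
                   datetime.datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc))
    new = make_pod("trainer-node", "0", "Running",
                   datetime.datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc))
    groups = group_pods_by_role([old, new])
    assert len(groups) == 1                   # both Pods share one role key
    best = select_best_pod(next(iter(groups.values())))
    assert best.status.phase == "Running"     # the restarted Pod wins
```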
Checklist

- Lint and format checks pass (make verify passes)
- Unit tests pass (make test-python)

Related Issues
Fixes #25