[STRMHELP-315] Rollback on Failed Job Monitoring 🐛 #291
/PTAL @maghamravi @anandswaminathan
}
return updateJobAndReturn(ctx, job, s, allVerticesRunning, app, hash)
logger.Info(ctx, "Monitoring job vertices with timeout ", flinkJobVertexTimeout)
jobStarted, err := monitorJobStart(job, flinkJobVertexTimeout)
nit: would be good to call the method monitorJobSubmission and rename jobStarted to status
I agree with monitorJobSubmission. Cleaner.
I don't fully understand why jobStarted should be renamed to status. I believe jobStarted is more intuitive as to what monitorJobStart actually returns: if not all vertices are running yet, jobStarted is false; if all vertices are running, jobStarted is true; if any vertex has failed, it returns an error.
Unless you're suggesting that status should be a string rather than a bool and correspond to something like "NOT_STARTED", "STARTED". Can you clarify what status means and why the variable should be named status?
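For readers following the naming discussion, here is a minimal sketch of the three outcomes described above (false, true, or an error). The names VertexStatus, checkVertices, and errVertexFailed are hypothetical and are not taken from the PR; the real monitorJobStart works on a client.FlinkJobOverview and a timeout, which this sketch leaves out.

```go
package main

import (
	"errors"
	"fmt"
)

// VertexStatus is a hypothetical stand-in for the per-vertex state of a job.
type VertexStatus string

const (
	VertexRunning   VertexStatus = "RUNNING"
	VertexScheduled VertexStatus = "SCHEDULED"
	VertexFailed    VertexStatus = "FAILED"
)

// errVertexFailed is a hypothetical sentinel error for a failed vertex.
var errVertexFailed = errors.New("vertex failed")

// checkVertices reports whether the job has started, meaning every vertex
// is running. It returns an error as soon as it finds a failed vertex.
// An empty vertex list is treated as "not started" (an assumption).
func checkVertices(vertices []VertexStatus) (bool, error) {
	allRunning := len(vertices) > 0
	for i, v := range vertices {
		switch v {
		case VertexFailed:
			return false, fmt.Errorf("vertex %d: %w", i, errVertexFailed)
		case VertexRunning:
			// This vertex is fine; keep checking the rest.
		default:
			allRunning = false
		}
	}
	return allRunning, nil
}

func main() {
	jobStarted, err := checkVertices([]VertexStatus{VertexRunning, VertexScheduled})
	fmt.Println(jobStarted, err) // false <nil>
}
```

With this shape a bool named jobStarted reads naturally at the call site, which is the argument made above for keeping the name.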
There are only two hard things in Computer Science: cache invalidation and naming things :)
My rationale for renaming jobStarted to status was primarily due to my other suggestion to rename the method to monitorJobSubmission. Given the method was returning a boolean, a status felt more natural. Okay to keep it as jobStarted.
// wait until all vertices have been scheduled and running
hasFailure := false
failedVertexIndex := -1
func monitorJobStart(job *client.FlinkJobOverview, timeout config2.Duration) (bool, error) {
Nice to see this method being succinct now!
overview
In the job monitoring PR we introduced a bug: when job monitoring fails due to a timeout or a failed vertex, the state DeployFailed is reached instead of attempting to roll back. This PR simplifies the job submission and job monitoring logic, and makes the job attempt a rollback on such failures.
additional info
Errors returned by a state in the state machine are added to the status as the last error. The shouldRollback check at the beginning of these states tests whether that error is retryable and moves to rolling back if it is not. Thus, the change is to return an error if monitoring results in a failed vertex or a vertex timeout.
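The flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the operator's actual code: appStatus, phase, retryableError, step, and errVertexTimeout are hypothetical names, the phase names other than the rollback idea are invented for the example, and shouldRollback here is a simplified stand-in for the real check.

```go
package main

import (
	"errors"
	"fmt"
)

// phase is a hypothetical name for a state in the state machine.
type phase string

const (
	phaseSubmittingJob phase = "SubmittingJob"
	phaseRunning       phase = "Running"
	phaseRollingBack   phase = "RollingBack"
)

// retryableError is a hypothetical wrapper marking an error as retryable.
type retryableError struct{ err error }

func (e retryableError) Error() string { return e.err.Error() }
func (e retryableError) Unwrap() error { return e.err }

// errVertexTimeout is a hypothetical non-retryable monitoring error.
var errVertexTimeout = errors.New("vertex timeout")

// appStatus is a hypothetical status record that keeps the last error.
type appStatus struct {
	Phase     phase
	LastError error
}

// shouldRollback is a simplified stand-in: roll back when the last error
// exists and is not marked retryable.
func shouldRollback(s appStatus) bool {
	var r retryableError
	return s.LastError != nil && !errors.As(s.LastError, &r)
}

// step runs one pass of the job submission state. An error returned by
// monitor is recorded as the last error; the next pass then sees it in
// shouldRollback and moves to rolling back if it is not retryable.
func step(s appStatus, monitor func() (bool, error)) appStatus {
	if shouldRollback(s) {
		s.Phase = phaseRollingBack
		return s
	}
	started, err := monitor()
	if err != nil {
		s.LastError = err
		return s
	}
	s.LastError = nil
	if started {
		s.Phase = phaseRunning
	}
	return s
}

func main() {
	failing := func() (bool, error) { return false, errVertexTimeout }
	s := appStatus{Phase: phaseSubmittingJob}
	s = step(s, failing)
	fmt.Println(s.Phase) // SubmittingJob
	s = step(s, failing)
	fmt.Println(s.Phase) // RollingBack
}
```

The point of the sketch is that the monitoring code does not need to choose the next state itself: returning the error is enough, because the check at the top of the state decides whether to retry or roll back.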