-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WFLY-19706 More gracefully handle the job execution status during a server crash. #590
base: main
Are you sure you want to change the base?
Conversation
@jamezp I'm working on this direction by first adding a These methods can be used to filter out suspicious crashed jobs and provide a method to force-stop them. For the force-stop process, the batch status in these tables need to be updated:
|
public class ForceStopJobOperatorImpl extends DefaultJobOperatorImpl { | ||
public void forceStop(final long executionId) { | ||
final JobExecutionImpl jobExecution = getJobExecutionImpl(executionId); | ||
jobExecution.setBatchStatus(BatchStatus.STOPPED); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can set the exitstatus
into CRASHED
.
And we need to update the endtime
to the current time.
We also need to update the other tables here:
partition_execution
step_execution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't thread safe. You'd really need all this to happen in something similar to a transaction which either all or none are updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I should put them into a single tx like the implementation in the current stop()
method in JdbcRepository
:
@Override
public void stopJobExecution(final JobExecutionImpl jobExecution) {
super.stopJobExecution(jobExecution);
final String[] stopExecutionSqls = {
sqls.getProperty(STOP_JOB_EXECUTION),
sqls.getProperty(STOP_STEP_EXECUTION),
sqls.getProperty(STOP_PARTITION_EXECUTION)
};
final String jobExecutionIdString = String.valueOf(jobExecution.getExecutionId());
final String newBatchStatus = BatchStatus.STOPPING.toString();
final Connection connection = getConnection();
Statement stmt = null;
try {
stmt = connection.createStatement();
for (String sql : stopExecutionSqls) {
stmt.addBatch(sql.replace("?", jobExecutionIdString));
}
stmt.executeBatch();
} catch (Exception e) {
throw BatchMessages.MESSAGES.failToRunQuery(e, Arrays.toString(stopExecutionSqls));
} finally {
close(connection, stmt, null, null);
}
}
@@ -183,6 +183,12 @@ public List<JobExecution> getJobExecutions(final JobInstance jobInstance) { | |||
return result; | |||
} | |||
|
|||
// todo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I can move the InfinispanRepostiory
and the MongoRepository
out of jberet-core
to reduce the dependencies. That's another task though.
Is the idea that you'd query jobs that are in say a |
I plan to query the And maybe it also needs a query for a single job execution to see if its timeout or not. I'll add it too. |
There are several ways to use these APIs after they are implemented:
In addition, the WildFly Admin CLI and GUI can also use these APIs to provide functions like:
I'll work on the above task discuss it on the WildFly side, and write the RFE doc according to the discussion result. That's my work plan on this topic. I plan to complete this PR as the first step by the end of this year. @jamezp Thanks for reviewing this PR! It's an experimental feature and we can change the implementation if it doesn't work. Please let me know if you have any more suggestions and opinions. |
provide a force stop method to job execution.
My personal opinion is My understanding is the issue attempting to be solved is that a job is in an invalid state and we want to update the state. Nothing is really being stopped as much as the state is being updated. |
@jamezp Thanks for the review! I'll rename the method name. In this method, there are three tables changed accordingly:
Because they all have the
I plan to put these update actions into one transaction. In the future, if we do need to have a |
No description provided.