WFLY-19706 More gracefully handle the job execution status during a server crash. #590

liweinan · 2024-11-21T15:18:22Z

No description provided.

liweinan · 2024-11-21T15:36:43Z

@jamezp I'm approaching this by first adding a ForceStopJobOperatorImpl class that provides a forceStop() method, and then adding a findAllTimoutJobExecutions() method.

Together, these methods make it possible to find job executions that appear to have crashed and to force-stop them.

For the force-stop process, the batch status in these tables needs to be updated:

batch_db=# \dt
                 List of relations
 Schema |        Name         | Type  |   Owner
--------+---------------------+-------+------------
 public | job_execution       | table | batch_user
 public | job_instance        | table | batch_user
 public | partition_execution | table | batch_user
 public | step_execution      | table | batch_user
(4 rows)

liweinan · 2024-11-21T16:21:20Z

jberet-core/src/main/java/org/jberet/operations/ForceStopJobOperatorImpl.java

+public class ForceStopJobOperatorImpl extends DefaultJobOperatorImpl {
+    public void forceStop(final long executionId) {
+        final JobExecutionImpl jobExecution = getJobExecutionImpl(executionId);
+        jobExecution.setBatchStatus(BatchStatus.STOPPED);


Maybe we can set the exitstatus to CRASHED.

We also need to update the endtime to the current time.

We also need to update the other tables here:

partition_execution

step_execution

This isn't thread safe. You'd really need all of this to happen in something like a transaction, so that either all of the tables are updated or none are.

Yes, I should put them into a single transaction, like the current stopJobExecution() implementation in JdbcRepository:

@Override
public void stopJobExecution(final JobExecutionImpl jobExecution) {
    super.stopJobExecution(jobExecution);
    final String[] stopExecutionSqls = {
            sqls.getProperty(STOP_JOB_EXECUTION),
            sqls.getProperty(STOP_STEP_EXECUTION),
            sqls.getProperty(STOP_PARTITION_EXECUTION)
    };
    final String jobExecutionIdString = String.valueOf(jobExecution.getExecutionId());
    final String newBatchStatus = BatchStatus.STOPPING.toString();
    final Connection connection = getConnection();
    Statement stmt = null;
    try {
        stmt = connection.createStatement();
        for (String sql : stopExecutionSqls) {
            stmt.addBatch(sql.replace("?", jobExecutionIdString));
        }
        stmt.executeBatch();
    } catch (Exception e) {
        throw BatchMessages.MESSAGES.failToRunQuery(e, Arrays.toString(stopExecutionSqls));
    } finally {
        close(connection, stmt, null, null);
    }
}
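For reference, here is a minimal sketch of how the all-or-nothing requirement could be made explicit with plain JDBC. The batch in the method above runs in whatever auto-commit mode the connection happens to have, and with auto-commit enabled the statements may be committed individually. `TxSketch`, `SqlWork` and `inTransaction` are hypothetical names used only for illustration; they are not part of JBeret.

```java
import java.sql.Connection;
import java.sql.SQLException;

// Hypothetical helper for illustration only; not part of JBeret.
final class TxSketch {

    // A unit of JDBC work that may fail with SQLException.
    interface SqlWork {
        void run(Connection connection) throws SQLException;
    }

    // Runs the work so that either every statement is committed or none is.
    static void inTransaction(final Connection connection, final SqlWork work) {
        boolean previousAutoCommit = true;
        try {
            previousAutoCommit = connection.getAutoCommit();
            connection.setAutoCommit(false);   // open an explicit transaction
            work.run(connection);
            connection.commit();               // all updates become visible together
        } catch (SQLException | RuntimeException e) {
            try {
                connection.rollback();         // discard any partial updates
            } catch (SQLException rollbackFailure) {
                e.addSuppressed(rollbackFailure);
            }
            throw new IllegalStateException("Status update failed and was rolled back", e);
        } finally {
            try {
                connection.setAutoCommit(previousAutoCommit); // restore the caller's mode
            } catch (SQLException ignored) {
                // Nothing sensible to do here in a sketch.
            }
        }
    }
}
```

A caller would pass the three status updates as one `SqlWork`, so a failure in the step or partition update also undoes the job-level change.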

liweinan · 2024-11-21T16:23:28Z

jberet-core/src/main/java/org/jberet/repository/InfinispanRepository.java

@@ -183,6 +183,12 @@ public List<JobExecution> getJobExecutions(final JobInstance jobInstance) {
        return result;
    }

+    // todo


Maybe I can move the InfinispanRepository and the MongoRepository out of jberet-core to reduce its dependencies. That's a separate task, though.

jamezp · 2024-11-21T17:21:31Z

Is the idea that you'd query jobs that are in say a STOPPING state, then execute forceStop() on them to change the state?

liweinan · 2024-11-22T02:05:35Z

> Is the idea that you'd query jobs that are in say a STOPPING state, then execute forceStop() on them to change the state?

I plan to query for the STOPPING and STARTING states in the findAllTimoutJobExecutions() method. We can adjust the logic of this method later, once the detailed requirements are clear.

It may also need a query for a single job execution, to check whether it has timed out. I'll add that too.
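To make the intended selection concrete, here is a small in-memory sketch of that filter. The PR does not define "timeout" at this point, so the rule used here (an execution still in STARTING or STOPPING whose last update is older than a limit) is an assumption, and `TimeoutFilterSketch`, `Execution`, `isTimedOut` and `findTimedOut` are hypothetical names rather than JBeret API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory model for illustration; not JBeret's actual API.
final class TimeoutFilterSketch {

    // Simplified view of one job_execution row.
    static final class Execution {
        final long id;
        final String batchStatus;
        final Instant lastUpdated;

        Execution(final long id, final String batchStatus, final Instant lastUpdated) {
            this.id = id;
            this.batchStatus = batchStatus;
            this.lastUpdated = lastUpdated;
        }
    }

    // Assumption: "timed out" means still STARTING or STOPPING and not
    // updated for longer than the given limit.
    static boolean isTimedOut(final Execution e, final Duration limit, final Instant now) {
        final boolean transitional =
                "STARTING".equals(e.batchStatus) || "STOPPING".equals(e.batchStatus);
        return transitional && e.lastUpdated.plus(limit).isBefore(now);
    }

    // Returns every execution that the single-execution check flags.
    static List<Execution> findTimedOut(final List<Execution> all, final Duration limit, final Instant now) {
        final List<Execution> result = new ArrayList<>();
        for (final Execution e : all) {
            if (isTimedOut(e, limit, now)) {
                result.add(e);
            }
        }
        return result;
    }

    public static void main(final String[] args) {
        final Instant now = Instant.parse("2024-11-22T02:00:00Z");
        final List<Execution> all = Arrays.asList(
                new Execution(1, "STOPPING", now.minus(Duration.ofHours(2))),
                new Execution(2, "STARTING", now.minus(Duration.ofMinutes(5))),
                new Execution(3, "COMPLETED", now.minus(Duration.ofDays(1))));
        for (final Execution e : findTimedOut(all, Duration.ofMinutes(30), now)) {
            System.out.println("timed out: " + e.id); // prints "timed out: 1"
        }
    }
}
```

Passing `now` in explicitly keeps the check deterministic and easy to test. A repository-backed version would presumably push the same condition into the query.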

liweinan · 2024-11-22T02:18:24Z

There are several ways to use these APIs after they are implemented:

Add a batchlet that uses the new APIs to close timed-out jobs in bulk.

In addition, the WildFly Admin CLI and GUI can also use these APIs to provide functions like:

Find the timed-out jobs.
Provide a force-close() operation on one or more job executions.

I'll work on the above tasks, discuss them on the WildFly side, and write the RFE doc based on the outcome of that discussion.

That's my work plan for this topic. I plan to complete this PR as the first step by the end of this year. @jamezp Thanks for reviewing this PR! It's an experimental feature, and we can change the implementation if it doesn't work. Please let me know if you have any further suggestions.

provide a force stop method to job execution.

jamezp · 2024-12-11T16:27:12Z

In my opinion, forceStop() isn't the right name. It's not really stopping anything, it's just changing the job status. I think it might make more sense to have something named changeJobStatus(BatchStatus expectedStatus, BatchStatus newStatus).

My understanding is that the issue we are trying to solve is that a job is in an invalid state and we want to update that state. Nothing is really being stopped; the state is just being updated.
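One way to read the expectedStatus/newStatus pair is as a compare-and-set: the status only changes if it still holds the expected value. Below is a minimal in-memory sketch of that reading. The `executionId` parameter, the `ChangeStatusSketch` class and the local `Status` enum (a stand-in for `BatchStatus`, so the sketch compiles without the batch API) are assumptions made for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory sketch; not the PR's implementation.
final class ChangeStatusSketch {

    // Local stand-in for BatchStatus so the sketch has no external dependency.
    enum Status { STARTING, STARTED, STOPPING, STOPPED, FAILED, COMPLETED, ABANDONED }

    private final ConcurrentMap<Long, Status> statuses = new ConcurrentHashMap<>();

    void register(final long executionId, final Status status) {
        statuses.put(executionId, status);
    }

    Status statusOf(final long executionId) {
        return statuses.get(executionId);
    }

    // Changes the status only if it currently equals expectedStatus.
    // Returns false when the execution is unknown or has moved on.
    boolean changeJobStatus(final long executionId, final Status expectedStatus, final Status newStatus) {
        return statuses.replace(executionId, expectedStatus, newStatus);
    }

    public static void main(final String[] args) {
        final ChangeStatusSketch repo = new ChangeStatusSketch();
        repo.register(7L, Status.STOPPING);
        System.out.println(repo.changeJobStatus(7L, Status.STOPPING, Status.STOPPED)); // true
        System.out.println(repo.changeJobStatus(7L, Status.STOPPING, Status.FAILED));  // false
        System.out.println(repo.statusOf(7L));                                         // STOPPED
    }
}
```

In a JDBC-backed repository the same guard could be expressed as an extra `AND batchstatus = ?` condition on the UPDATE, with the affected row count telling the caller whether the change happened.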

liweinan · 2024-12-12T01:08:00Z

@jamezp Thanks for the review! I'll rename the method. This method changes three tables:

job_execution
partition_execution
step_execution

They all have batchstatus and exitstatus columns that need to be updated:

batch_db=# \d job_execution
                                               Table "public.job_execution"
     Column      |           Type           | Collation | Nullable |                        Default
-----------------+--------------------------+-----------+----------+-------------------------------------------------------
 jobexecutionid  | bigint                   |           | not null | nextval('job_execution_jobexecutionid_seq'::regclass)
 jobinstanceid   | bigint                   |           | not null |
 version         | integer                  |           |          |
 createtime      | timestamp with time zone |           |          |
 starttime       | timestamp with time zone |           |          |
 endtime         | timestamp with time zone |           |          |
 lastupdatedtime | timestamp with time zone |           |          |
 batchstatus     | character varying(30)    |           |          |
 exitstatus      | character varying(512)   |           |          |
 jobparameters   | character varying(3000)  |           |          |
 restartposition | character varying(255)   |           |          |
Indexes:
    "job_execution_pkey" PRIMARY KEY, btree (jobexecutionid)
Foreign-key constraints:
    "fk_job_execution_job_instance" FOREIGN KEY (jobinstanceid) REFERENCES job_instance(jobinstanceid) ON DELETE CASCADE
Referenced by:
    TABLE "step_execution" CONSTRAINT "fk_step_exe_job_exe" FOREIGN KEY (jobexecutionid) REFERENCES job_execution(jobexecutionid) ON DELETE CASCADE

batch_db=# \d step_execution
                                                  Table "public.step_execution"
        Column        |           Type           | Collation | Nullable |                         Default
----------------------+--------------------------+-----------+----------+---------------------------------------------------------
 stepexecutionid      | bigint                   |           | not null | nextval('step_execution_stepexecutionid_seq'::regclass)
 jobexecutionid       | bigint                   |           | not null |
 version              | integer                  |           |          |
 stepname             | character varying(255)   |           |          |
 starttime            | timestamp with time zone |           |          |
 endtime              | timestamp with time zone |           |          |
 batchstatus          | character varying(30)    |           |          |
 exitstatus           | character varying(512)   |           |          |
 executionexception   | character varying(2048)  |           |          |
 persistentuserdata   | bytea                    |           |          |
 readcount            | integer                  |           |          |
 writecount           | integer                  |           |          |
 commitcount          | integer                  |           |          |
 rollbackcount        | integer                  |           |          |
 readskipcount        | integer                  |           |          |
 processskipcount     | integer                  |           |          |
 filtercount          | integer                  |           |          |
 writeskipcount       | integer                  |           |          |
 readercheckpointinfo | bytea                    |           |          |
 writercheckpointinfo | bytea                    |           |          |
Indexes:
    "step_execution_pkey" PRIMARY KEY, btree (stepexecutionid)
Foreign-key constraints:
    "fk_step_exe_job_exe" FOREIGN KEY (jobexecutionid) REFERENCES job_execution(jobexecutionid) ON DELETE CASCADE
Referenced by:
    TABLE "partition_execution" CONSTRAINT "fk_partition_exe_step_exe" FOREIGN KEY (stepexecutionid) REFERENCES step_execution(stepexecutionid) ON DELETE CASCADE

batch_db=# \d partition_execution
                       Table "public.partition_execution"
        Column        |          Type           | Collation | Nullable | Default
----------------------+-------------------------+-----------+----------+---------
 partitionexecutionid | integer                 |           | not null |
 stepexecutionid      | bigint                  |           | not null |
 version              | integer                 |           |          |
 batchstatus          | character varying(30)   |           |          |
 exitstatus           | character varying(512)  |           |          |
 executionexception   | character varying(2048) |           |          |
 persistentuserdata   | bytea                   |           |          |
 readercheckpointinfo | bytea                   |           |          |
 writercheckpointinfo | bytea                   |           |          |
Indexes:
    "partition_execution_pkey" PRIMARY KEY, btree (partitionexecutionid, stepexecutionid)
Foreign-key constraints:
    "fk_partition_exe_step_exe" FOREIGN KEY (stepexecutionid) REFERENCES step_execution(stepexecutionid) ON DELETE CASCADE

batch_db=#

I plan to put these update actions into one transaction.

In the future, if we do need a forceStop() method that not only changes the statuses in these columns but also takes other actions in a single transaction, we can add it then.
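Based on the table definitions above, the three updates could look roughly like the sketch below. The SQL text is illustrative and is not taken from JBeret's SQL properties. The guard on STARTING, STARTED and STOPPING is an assumption, added so that rows which already finished are left alone. `StatusSqlSketch` and `apply` are hypothetical names. The method does not manage the transaction itself: the caller is expected to disable auto-commit and to commit or roll back around it.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;

// Hypothetical sketch; the SQL is illustrative, not JBeret's actual statements.
final class StatusSqlSketch {

    // Assumption: only rows that have not reached a final status are touched.
    private static final String UNFINISHED = "batchstatus IN ('STARTING', 'STARTED', 'STOPPING')";

    static final String UPDATE_JOB =
            "UPDATE job_execution SET batchstatus = ?, exitstatus = ?, endtime = ?, lastupdatedtime = ? "
            + "WHERE jobexecutionid = ? AND " + UNFINISHED;

    static final String UPDATE_STEPS =
            "UPDATE step_execution SET batchstatus = ?, exitstatus = ?, endtime = ? "
            + "WHERE jobexecutionid = ? AND " + UNFINISHED;

    // partition_execution has no jobexecutionid column, so it is reached through step_execution.
    static final String UPDATE_PARTITIONS =
            "UPDATE partition_execution SET batchstatus = ?, exitstatus = ? "
            + "WHERE " + UNFINISHED + " AND stepexecutionid IN "
            + "(SELECT stepexecutionid FROM step_execution WHERE jobexecutionid = ?)";

    // Runs the three updates on the given connection and returns the total row count.
    static int apply(final Connection connection, final long jobExecutionId,
                     final String batchStatus, final String exitStatus, final Instant now)
            throws SQLException {
        final Timestamp endTime = Timestamp.from(now);
        int updated = 0;
        try (PreparedStatement job = connection.prepareStatement(UPDATE_JOB)) {
            job.setString(1, batchStatus);
            job.setString(2, exitStatus);
            job.setTimestamp(3, endTime);
            job.setTimestamp(4, endTime);
            job.setLong(5, jobExecutionId);
            updated += job.executeUpdate();
        }
        try (PreparedStatement steps = connection.prepareStatement(UPDATE_STEPS)) {
            steps.setString(1, batchStatus);
            steps.setString(2, exitStatus);
            steps.setTimestamp(3, endTime);
            steps.setLong(4, jobExecutionId);
            updated += steps.executeUpdate();
        }
        try (PreparedStatement partitions = connection.prepareStatement(UPDATE_PARTITIONS)) {
            partitions.setString(1, batchStatus);
            partitions.setString(2, exitStatus);
            partitions.setLong(3, jobExecutionId);
            updated += partitions.executeUpdate();
        }
        return updated;
    }
}
```

A call such as `apply(connection, executionId, "STOPPED", "CRASHED", Instant.now())` would match the exit status and end time suggested earlier in this thread. Note that partition_execution has no endtime column, so only its two status columns are set.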

liweinan requested a review from a team as a code owner November 21, 2024 15:18

liweinan marked this pull request as draft November 21, 2024 15:18

liweinan self-assigned this Nov 22, 2024

liweinan force-pushed the WFLY-19706 branch from c4bb3c9 to d6be965 Compare November 28, 2024 16:28

liweinan added 7 commits December 4, 2024 00:00

WFLY-19706

febeb8d

provide a force stop method to job execution.

cleaup

dd8a143

add ForceStopJobOperatorImpl

5e3cc58

refactor

6c6f52d

refactor the test

105bfac

add the getTimeoutJobExecutions() method in JobRepository

d41ccd8

save progress

8da5c4f

liweinan force-pushed the WFLY-19706 branch from 907af1e to 8da5c4f Compare December 3, 2024 16:00

liweinan added 5 commits December 4, 2024 00:26

update getTimeoutJobExecutions() in JdbcRepository

5cf6943

update

795b0cc

update

0c07ea7

update

191106a

update SQL

d44eb4b
