Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve job acquisition job execution logging #2666

Open
2 tasks
ThorbenLindhauer opened this issue Feb 4, 2022 · 6 comments
Open
2 tasks

Improve job acquisition job execution logging #2666

ThorbenLindhauer opened this issue Feb 4, 2022 · 6 comments
Assignees
Labels
scope:core-api Changes to the core API: engine, dmn-engine, feel-engine, REST API, OpenAPI type:bug Issues that describe a user-facing bug in the project.

Comments

@ThorbenLindhauer
Copy link
Member

ThorbenLindhauer commented Feb 4, 2022

This issue was imported from JIRA:

Field Value
JIRA Link CAM-14334
Reporter @daniel-ewing
Has restricted visibility comments true

Environment (Required on creation):

Camunda Platform, 2 or more node cluster.

Description (Required on creation; please attach any relevant screenshots, stacktraces, log files, etc. to the ticket):

As described in SUPPORT-12684, with:

  • Multiple nodes
  • The first node in the BPMN after start is an async before delegate task that takes too long long to execute, e.g.:
    • Property lock-time-in-millis is 5 minutes.
    • Sometimes the delegate takes 15 + minutes to execute, e.g: an unexpected slowdown in SOAP response from external system.

It is possible for the nodes to become locked in a cycle of constantly processing the same jobs, getting OLEs, and produce debug logging like:

2022-01-28 11:58:59.849 DEBUG 1188403 --- <JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor>] org.camunda.bpm.engine.jobexecutor       : ENGINE-14012 Job acquisition thread woke up
2022-01-28 11:58:59.849 DEBUG 1188403 --- <JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor>] org.camunda.bpm.engine.jobexecutor       : ENGINE-14022 Acquired 0 jobs for process engine 'default': <]
2022-01-28 11:58:59.849 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor>] org.camunda.bpm.engine.jobexecutor       : ENGINE-14023 Execute jobs for process engine 'default': <32480db8-8029-11ec-acaf-005056bf79af>
2022-01-28 11:58:59.850 DEBUG 1188403 --- <JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor>] org.camunda.bpm.engine.jobexecutor       : ENGINE-14023 Execute jobs for process engine 'default': <334d21ec-8029-11ec-acaf-005056bf79af>
2022-01-28 11:58:59.850 DEBUG 1188403 --- <JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor>] org.camunda.bpm.engine.jobexecutor       : ENGINE-14023 Execute jobs for process engine 'default': <359d2bb0-8029-11ec-acaf-005056bf79af>
2022-01-28 11:58:59.850 DEBUG 1188403 --- <JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor>] org.camunda.bpm.engine.jobexecutor       : ENGINE-14011 Job acquisition thread sleeping for 99 millis
2022-01-28 11:58:59.949 DEBUG 1188403 --- <JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor>] org.camunda.bpm.engine.jobexecutor       : ENGINE-14012 Job acquisition thread woke up
2022-01-28 11:58:59.949 DEBUG 1188403 --- <JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor>] org.camunda.bpm.engine.jobexecutor       : ENGINE-14022 Acquired 0 jobs for process engine 'default': <]
2022-01-28 11:58:59.949 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor>] org.camunda.bpm.engine.jobexecutor       : ENGINE-14023 Execute jobs for process engine 'default': <32480db8-8029-11ec-acaf-005056bf79af>
2022-01-28 11:58:59.949 DEBUG 1188403 --- <JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor>] org.camunda.bpm.engine.jobexecutor       : ENGINE-14023 Execute jobs for process engine 'default': <334d21ec-8029-11ec-acaf-005056bf79af>
2022-01-28 11:58:59.949 DEBUG 1188403 --- <JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor>] org.camunda.bpm.engine.jobexecutor       : ENGINE-14023 Execute jobs for process engine 'default': <359d2bb0-8029-11ec-acaf-005056bf79af>
2022-01-28 11:58:59.949 DEBUG 1188403 --- <JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor>] org.camunda.bpm.engine.jobexecutor       : ENGINE-14011 Job acquisition thread sleeping for 100 millis
2022-01-28 11:59:00.049 DEBUG 1188403 --- <JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor>] org.camunda.bpm.engine.jobexecutor       : ENGINE-14012 Job acquisition thread woke up
2022-01-28 11:59:00.049 DEBUG 1188403 --- <JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor>] org.camunda.bpm.engine.jobexecutor       : ENGINE-14022 Acquired 0 jobs for process engine 'default': <]
2022-01-28 11:59:00.049 DEBUG 1188403 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor>] org.camunda.bpm.engine.jobexecutor       : ENGINE-14023 Execute jobs for process engine 'default': <32480db8-8029-11ec-acaf-005056bf79af>
2022-01-28 11:59:00.050 DEBUG 1188403 --- <JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor>] org.camunda.bpm.engine.jobexecutor       : ENGINE-14023 Execute jobs for process engine 'default': <334d21ec-8029-11ec-acaf-005056bf79af>
2022-01-28 11:59:00.050 DEBUG 1188403 --- <JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor>] org.camunda.bpm.engine.jobexecutor       : ENGINE-14023 Execute jobs for process engine 'default': <359d2bb0-8029-11ec-acaf-005056bf79af>
2022-01-28 11:59:00.050 DEBUG 1188403 --- <JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor>] org.camunda.bpm.engine.jobexecutor       : ENGINE-14011 Job acquisition thread sleeping for 99 millis

Steps to reproduce (Required on creation):

Analysis from this comment regarding the scenario that causes the above logging:

  • As transaction boundary / asynchronous continuation is reached, job is created and inserted to ACTRUJOB
  • Job acquisition of both the nodes polls the database
  • Job acquisition of both the nodes competes and tries to lock the job in ACTRUJOB
  • But job acquisition of one of the process engine succeeds and locks the job in ACTRUJOB
  • Job acquisition submits the job to execution thread pool's queue
  • Execution thread picks up job and executes it
  • Job takes more than 5 minutes to finish (because SOAP requests took more than 5 minutes to return the response)
  • Job gets unlocked in the database (as lock-time-in-millis which is 5 minutes got expired)
  • The same job is now available to be picked up by Job executor
  • Another node picks the same Job
  • Now, both the nodes are executing same Job
  • This leads to the situation where in same entity is updated by both the jobs at the same time - causing OptimisticLockingExceptions
  • In case of OptimisticLockingException, job keeps on retrying until it succeeds. However it you case because of this situation, OLE keeps on happening and hence job keeps on retrying.
  • As Thread pool queue gets full with such long running Jobs, new jobs are rejected and eventually no new Jobs are acquired, leading to the job blocked situation.

Observed Behavior (Required on creation):

The DEBUG logs are unhelpful and a bit confusing:

  • All that is logged is:
    • Acquired 0 jobs for process engine 'default'
    • Then both nodes process 3 jobs.
      • Repeat above for the same exact 3 jobs, ad infinitum.

If 0 jobs were acquired, how were there 3 jobs to execute?
Where are the OLE exception errors?

Expected behavior (Required on creation):

Helpful DEBUG logs, including:

  • available job execution threads
  • jobs rejected
  • number of jobs to acquire

Root Cause (Required on prioritization):

Solution Ideas (Optional):

Hints (optional):

Links:

Pull requests

  1. prakashpalanisamy
  2. ci:all-as ci:all-db ci:default-build ci:migration ci:rolling-update ci:spring-boot
    yanavasileva
@prakashpalanisamy
Copy link

I would like to pick this one to start contributing. Please let me know if that's okay..

@yanavasileva
Copy link
Member

yanavasileva commented Jan 23, 2024

Hi @prakashpalanisamy,

Great to hear you would like to make a contribution! We are always open to assist you to do so.
Are you interested in this particular problem to resolve? Or you were exploring a good first issue to start?
Nevertheless, check our CONTRIBUTING guide as a starting and let me know if you have any questions. Don't forget to link the newly create pull request to the issue that you are working on.

Best,
Yana

@prakashpalanisamy
Copy link

Hi @yanavasileva ,

Thank you for responding. I found this issue interesting when browsing through the open issues. Going through the guidelines for contribution. Will reachout in case of questions. Thank you!!

@prakashpalanisamy
Copy link

Hi @yanavasileva, Hope you are doing good. I was able to setup local camunda clusters (Camunda 7.20) pointing to single postgres DB Instance and was able to reproduce the Issue. Was using a model with Async continuation task calling a delegate which takes more time to complete the task compared to the value in property lock-time-in-millis. Im working with a single process instance of the model and hence a single Job to be executed.

I was able to observe the below from the logs:

  • JE of a node acquired 1 Job and proceeds to execute it.
  • Since the execution takes time than the lock-time, the Job is eventually taken up by another node's JE.
  • and multiple nodes executing the same Job causing OptimisticLockingException while trying to update the same.
  • the above repeats

The difference that I observed compared to the above logs in the description was, I see that 1 Job was acquired, as opposed to the above, where it states 0 Acquired jobs followed by Execute Job logs. Since that (number of Job acquired part) conflicts with the Observed Behaviour stated above, wanted to check with you.

Attaching my Log snippets below for the reference.

2024-02-21T18:02:10.602+05:30 DEBUG 18768 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14022 Acquired 1 jobs for process engine 'default': [[3bc2bd0b-d0b5-11ee-ab52-bc03580de904]]
2024-02-21T18:02:10.602+05:30 DEBUG 18768 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14023 Execute jobs for process engine 'default': [3bc2bd0b-d0b5-11ee-ab52-bc03580de904]
2024-02-21T18:02:10.603+05:30 DEBUG 18768 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.cmd               : ENGINE-13005 Starting command -------------------- AcquireJobsCmd ----------------------
2024-02-21T18:02:10.604+05:30 DEBUG 18768 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.cmd               : ENGINE-13009 opening new command context
2024-02-21T18:02:10.649+05:30  INFO 18768 --- [camundaTaskExecutor-1] com.test.delegates.JeDelegate         : Starting the JE Delegate... by camundaTaskExecutor-1
2024-02-21T18:02:15.610+05:30 DEBUG 18768 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.cmd               : ENGINE-13006 Finishing command -------------------- AcquireJobsCmd ----------------------
2024-02-21T18:02:15.610+05:30 DEBUG 18768 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14022 Acquired 0 jobs for process engine 'default': []
2024-02-21T18:02:15.610+05:30 DEBUG 18768 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14011 Job acquisition thread sleeping for 9994 millis
2024-02-21T18:02:25.605+05:30 DEBUG 18768 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14012 Job acquisition thread woke up
2024-02-21T18:02:25.605+05:30 DEBUG 18768 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.cmd               : ENGINE-13005 Starting command -------------------- AcquireJobsCmd ----------------------
2024-02-21T18:02:25.608+05:30 DEBUG 18768 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.cmd               : ENGINE-13009 opening new command context
2024-02-21T18:02:25.613+05:30 DEBUG 18768 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.cmd               : ENGINE-13006 Finishing command -------------------- AcquireJobsCmd ----------------------
2024-02-21T18:02:25.613+05:30 DEBUG 18768 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14022 Acquired 0 jobs for process engine 'default': []
2024-02-21T18:02:25.613+05:30 DEBUG 18768 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14011 Job acquisition thread sleeping for 19992 millis
2024-02-21T18:02:25.681+05:30 DEBUG 9060 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.jobexecutor       : ENGINE-14012 Job acquisition thread woke up
2024-02-21T18:02:25.682+05:30 DEBUG 9060 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.cmd               : ENGINE-13005 Starting command -------------------- AcquireJobsCmd ----------------------
2024-02-21T18:02:25.684+05:30 DEBUG 9060 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.cmd               : ENGINE-13009 opening new command context
2024-02-21T18:02:25.685+05:30 DEBUG 9060 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] o.c.b.e.i.p.e.J.selectNextJobsToExecute  : ==>  Preparing: select RES.ID_, RES.REV_, RES.DUEDATE_, RES.PROCESS_INSTANCE_ID_, RES.EXCLUSIVE_ from ACT_RU_JOB RES where (RES.RETRIES_ > 0) and ( RES.DUEDATE_ is null or RES.DUEDATE_ <= ? ) and (RES.LOCK_OWNER_ is null or RES.LOCK_EXP_TIME_ < ?) and RES.SUSPENSION_STATE_ = 1 and ( ( RES.EXCLUSIVE_ = true and not exists( select J2.ID_ from ACT_RU_JOB J2 where J2.PROCESS_INSTANCE_ID_ = RES.PROCESS_INSTANCE_ID_ -- from the same proc. inst. and (J2.EXCLUSIVE_ = true) -- also exclusive and (J2.LOCK_OWNER_ is not null and J2.LOCK_EXP_TIME_ >= ?) -- in progress ) ) or RES.EXCLUSIVE_ = false ) LIMIT ? OFFSET ?
2024-02-21T18:02:25.685+05:30 DEBUG 9060 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] o.c.b.e.i.p.e.J.selectNextJobsToExecute  : ==> Parameters: 2024-02-21 18:02:25.685(Timestamp), 2024-02-21 18:02:25.685(Timestamp), 2024-02-21 18:02:25.685(Timestamp), 3(Integer), 0(Integer)
2024-02-21T18:02:25.687+05:30 DEBUG 9060 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] o.c.b.e.i.p.e.J.selectNextJobsToExecute  : <==      Total: 0
2024-02-21T18:02:25.687+05:30 DEBUG 9060 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.cmd               : ENGINE-13011 closing existing command context
2024-02-21T18:02:25.687+05:30 DEBUG 9060 --- [JobExecutor[org.camunda.bpm.engine.spring.components.jobexecutor.SpringJobExecutor]] org.camunda.bpm.engine.persistence       : ENGINE-03006 Cache state after flush: [
]  

2024-02-21T18:13:25.690+05:30 DEBUG 18768 --- [camundaTaskExecutor-2] org.camunda.bpm.engine.context           : ENGINE-16002 Exception while closing command context: ENGINE-03005 Execution of 'DELETE MessageEntity[3bc2bd0b-d0b5-11ee-ab52-bc03580de904]' failed. Entity was updated by another transaction concurrently.

org.camunda.bpm.engine.OptimisticLockingException: ENGINE-03005 Execution of 'DELETE MessageEntity[3bc2bd0b-d0b5-11ee-ab52-bc03580de904]' failed. Entity was updated by another transaction concurrently.
	at org.camunda.bpm.engine.impl.db.EnginePersistenceLogger.concurrentUpdateDbEntityException(EnginePersistenceLogger.java:139) ~[camunda-engine-7.20.0.jar:7.20.0]
	at org.camunda.bpm.engine.impl.db.entitymanager.DbEntityManager.handleConcurrentModification(DbEntityManager.java:412) ~[camunda-engine-7.20.0.jar:7.20.0]
	at org.camunda.bpm.engine.impl.db.entitymanager.DbEntityManager.flushDbOperations(DbEntityManager.java:355) ~[camunda-engine-7.20.0.jar:7.20.0]
	at org.camunda.bpm.engine.impl.db.entitymanager.DbEntityManager.flushDbOperationManager(DbEntityManager.java:323) ~[camunda-engine-7.20.0.jar:7.20.0]
	at org.camunda.bpm.engine.impl.db.entitymanager.DbEntityManager.flush(DbEntityManager.java:295) ~[camunda-engine-7.20.0.jar:7.20.0]
	at org.camunda.bpm.engine.impl.interceptor.CommandContext.flushSessions(CommandContext.java:272) ~[camunda-engine-7.20.0.jar:7.20.0]
	at org.camunda.bpm.engine.impl.interceptor.CommandContext.close(CommandContext.java:188) ~[camunda-engine-7.20.0.jar:7.20.0]
	at org.camunda.bpm.engine.impl.interceptor.CommandContextInterceptor.execute(CommandContextInterceptor.java:119) ~[camunda-engine-7.20.0.jar:7.20.0]
	at org.camunda.bpm.engine.spring.SpringTransactionInterceptor.lambda$execute$0(SpringTransactionInterceptor.java:71) ~[camunda-engine-spring-6-7.20.0.jar!/:7.20.0]
	at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140) ~[spring-tx-6.0.10.jar!/:6.0.10]
	at org.camunda.bpm.engine.spring.SpringTransactionInterceptor.execute(SpringTransactionInterceptor.java:71) ~[camunda-engine-spring-6-7.20.0.jar!/:7.20.0]
	at org.camunda.bpm.engine.impl.interceptor.ProcessApplicationContextInterceptor.execute(ProcessApplicationContextInterceptor.java:70) ~[camunda-engine-7.20.0.jar:7.20.0]
	at org.camunda.bpm.engine.impl.interceptor.CommandCounterInterceptor.execute(CommandCounterInterceptor.java:35) ~[camunda-engine-7.20.0.jar:7.20.0]
	at org.camunda.bpm.engine.impl.interceptor.LogInterceptor.execute(LogInterceptor.java:33) ~[camunda-engine-7.20.0.jar:7.20.0]
	at org.camunda.bpm.engine.impl.interceptor.ExceptionCodeInterceptor.execute(ExceptionCodeInterceptor.java:55) ~[camunda-engine-7.20.0.jar:7.20.0]
	at org.camunda.bpm.engine.impl.jobexecutor.ExecuteJobHelper.executeJob(ExecuteJobHelper.java:57) ~[camunda-engine-7.20.0.jar:7.20.0]
	at org.camunda.bpm.engine.impl.jobexecutor.ExecuteJobsRunnable.executeJob(ExecuteJobsRunnable.java:110) ~[camunda-engine-7.20.0.jar:7.20.0]
	at org.camunda.bpm.engine.impl.jobexecutor.ExecuteJobsRunnable.run(ExecuteJobsRunnable.java:71) ~[camunda-engine-7.20.0.jar:7.20.0]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[na:na]
	at java.base/java.lang.Thread.run(Thread.java:840) ~[na:na]```

@yanavasileva
Copy link
Member

Hi @prakashpalanisamy,

Sounds great that you're getting started!

The difference that I observed compared to the above logs in the description was, I see that 1 Job was acquired, as opposed to the above, where it states 0 Acquired jobs followed by Execute Job logs.

You do have the Acquired 0 jobs occurrences in the output after the first acquisition. The snippet in the description just doesn't contain the very first acquisitions, therefore only the following 0 jobs acquired is there.
In other words, that's expected and no need to worry, you can continue with next steps.

Best,
Yana

@prakashpalanisamy
Copy link

@yanavasileva : I have created the below PR for the above Issue. Requesting you to please take a look. Thanks!

#4217

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
scope:core-api Changes to the core API: engine, dmn-engine, feel-engine, REST API, OpenAPI type:bug Issues that describe a user-facing bug in the project.
Projects
None yet
Development

No branches or pull requests

3 participants