
Release BLOCKED triggers in releaseAcquiredTrigger #146

Merged
merged 1 commit into quartz-scheduler:master on Feb 12, 2019

Conversation

shelmling
Contributor

No description provided.

@shelmling
Contributor Author

Dear Quartz Team,

Based on my findings from issue #145, I'd like to propose the following change; a rough sketch of the idea is below. I hope this is useful for you.
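For reference, here is a rough sketch of what the proposed change amounts to inside JobStoreSupport.releaseAcquiredTrigger, reconstructed from the calls quoted later in this thread rather than copied from the actual diff:

    // Sketch only, not the exact committed diff. STATE_WAITING, STATE_ACQUIRED,
    // STATE_BLOCKED and getDelegate() are members of the enclosing JobStoreSupport.
    protected void releaseAcquiredTrigger(Connection conn, OperableTrigger trigger)
            throws JobPersistenceException {
        try {
            // Existing behaviour: return the ACQUIRED trigger to WAITING.
            getDelegate().updateTriggerStateFromOtherState(conn,
                    trigger.getKey(), STATE_WAITING, STATE_ACQUIRED);
            // Proposed addition: also return BLOCKED triggers to WAITING, so they
            // are not left stuck when the acquiring node gives the trigger back.
            getDelegate().updateTriggerStateFromOtherState(conn,
                    trigger.getKey(), STATE_WAITING, STATE_BLOCKED);
            getDelegate().deleteFiredTrigger(conn, trigger.getFireInstanceId());
        } catch (SQLException e) {
            throw new JobPersistenceException(
                    "Couldn't release acquired trigger: " + e.getMessage(), e);
        }
    }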

Thanks,
Sebastian

@mstead

mstead commented Nov 6, 2017

Any update on when this will get merged? We are currently getting hit by this issue and need it fixed ASAP.

@pbuckley

pbuckley commented Dec 1, 2017

👍 on fixing, I think we've been hit by this issue as well

@IshwarKhandelwal

When will this fix be available in the next release of the Quartz scheduler?

@dersteve

Why don't we merge this fix in? We are seeing similar problems in our environment.

@fbokovikov

Let's merge this fix. @zemian @jhouserizer @chrisdennis

@zemian
Contributor

zemian commented Feb 11, 2019

Hello folks, sorry it took so long to respond. I will take a look at this and will try to merge it in the next day or so.

@zemian zemian merged commit d8497ff into quartz-scheduler:master Feb 12, 2019
zemian added a commit to zemian/quartz that referenced this pull request Feb 12, 2019
@dersteve

@zemian Thanks for the merge! What are your plans for releasing this fix in a new version?

@zemian
Contributor

zemian commented Feb 27, 2019

Hi @dersteve, the next release should be 2.3.1. See https://github.com/quartz-scheduler/quartz/blob/quartz-2.3.x/docs/changelog.adoc

I don't have a date, but it should be soon. I am trying to get it published with the help of the Terracotta folks.

@fbokovikov

@zemian Thanks for the merge! Can you specify the release date of the new Quartz version, please? We really need this fix!

@zemian
Contributor

zemian commented Mar 6, 2019

Hi @fbokovikov, no release date yet :( Hopefully soon. In the meantime, you can simply do a local build from the latest branch.

@vincentjames501

I think we've traced down an issue related to this commit/fix. When running in a clustered environment with @DisallowConcurrentExecution and lots of triggers for that job, something appears to "hang" for several minutes doing nothing (all triggers are in the WAITING state, none are in COMPLETED/BLOCKED/ACQUIRED, and the fire time is still valid and within our 30-minute misfire range). I'm not sure why this would be, as I don't know the Quartz data model too well; however, if I comment out this line:

 getDelegate().updateTriggerStateFromOtherState(conn,
                    trigger.getKey(), STATE_WAITING, STATE_BLOCKED);

The issue goes away. Also, I merged these two calls into a single one locally and was then no longer able to reproduce the hang:

getDelegate().updateTriggerStateFromOtherState(conn,
                    trigger.getKey(), STATE_WAITING, STATE_ACQUIRED);	                    
getDelegate().updateTriggerStateFromOtherState(conn,
                    trigger.getKey(), STATE_WAITING, STATE_BLOCKED);

with

getDelegate().updateTriggerStateFromOtherStates(conn,
                    trigger.getKey(), STATE_WAITING, STATE_ACQUIRED, STATE_BLOCKED);	

Can anyone hypothesize why this would be?
Is there something about this being done in two separate queries that could introduce race conditions?

Here is my theory:

  • Node A acquires a trigger (one trigger acquired, the rest blocked).
  • Node A begins to release the acquired trigger by executing getDelegate().updateTriggerStateFromOtherState(conn, trigger.getKey(), STATE_WAITING, STATE_ACQUIRED); (now all triggers are in WAITING).
  • Before Node A executes getDelegate().updateTriggerStateFromOtherState(conn, trigger.getKey(), STATE_WAITING, STATE_BLOCKED);, Node B acquires a trigger, because everything looks WAITING to it at that point (RACE CONDITION) (one trigger acquired, the rest blocked).
  • Node A then executes getDelegate().updateTriggerStateFromOtherState(conn, trigger.getKey(), STATE_WAITING, STATE_BLOCKED); while Node B has one trigger acquired and the rest blocked (now those triggers are incorrectly set back to WAITING).
  • Node B then releases its acquired trigger and things are hosed (I'm not sure exactly why this last step breaks, but I do think there is a race condition above).

It would probably not be a bad idea to merge these into a single query anyway, for performance. CC @zemian @shelmling
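For what it's worth, the single-query version closes that window because both transitions happen in one atomic UPDATE with a TRIGGER_STATE IN (...) predicate, so another node can never observe the trigger half-released. A rough sketch of such a combined release as a standalone helper (an illustration only, not the stock StdJDBCDelegate API; table and column names assume the default QRTZ_ prefix):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import org.quartz.TriggerKey;

    // Hypothetical helper: moves a trigger to WAITING from either ACQUIRED or
    // BLOCKED in a single UPDATE, mirroring the merged call above.
    final class AtomicTriggerRelease {
        private static final String SQL =
            "UPDATE QRTZ_TRIGGERS SET TRIGGER_STATE = ? "
          + "WHERE SCHED_NAME = ? AND TRIGGER_NAME = ? AND TRIGGER_GROUP = ? "
          + "AND TRIGGER_STATE IN (?, ?)";

        static int releaseToWaiting(Connection conn, String schedName, TriggerKey key,
                                    String oldState1, String oldState2) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(SQL)) {
                ps.setString(1, "WAITING");
                ps.setString(2, schedName);
                ps.setString(3, key.getName());
                ps.setString(4, key.getGroup());
                ps.setString(5, oldState1);   // e.g. "ACQUIRED"
                ps.setString(6, oldState2);   // e.g. "BLOCKED"
                return ps.executeUpdate();    // both transitions commit as one statement
            }
        }
    }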

lahma added a commit to quartznet/quartznet that referenced this pull request Nov 7, 2019
@oridool

oridool commented Feb 9, 2021

I still have a similar issue on v2.3.2, when using cluster mode and enabling @DisallowConcurrentExecution.
I'm not sure the issue is fixed.
It happens only occasionally, not always.
I have a log statement just before my job execution ends, so I'm sure the job itself is no longer running. From the application side everything seems normal, but the trigger still hangs in the BLOCKED status.

Is there any workaround or fix?

Thanks.

@IovanAlexandru


@zemian @oridool Having the same issue on 2.3.2 as stated in #145 while using the @DisallowConcurrentExecution annotation (issue #145 says the problem was fixed by this PR #146 as of version 2.3.2).

If we set up the following:

  • TRIGGERS table: next_fire_time in the past, trigger_state set to BLOCKED
  • FIRED_TRIGGERS table: empty (or at least not containing the blocked trigger)

then the job associated with this blocked trigger will never run again.
This PR (#146, Release BLOCKED triggers in releaseAcquiredTrigger) mostly takes care of proper clean-up, but that is not guaranteed when instances die suddenly or in the middle of the release process of the BLOCKED triggers.
I think the problem should be addressed at the moment Quartz polls triggers from the DB. It currently polls only for the WAITING state (a sketch for detecting the orphaned rows follows the query below):

SELECT TRIGGER_NAME, TRIGGER_GROUP, NEXT_FIRE_TIME, PRIORITY FROM TRIGGERS WHERE SCHED_NAME = '?' AND TRIGGER_STATE = 'WAITING' AND NEXT_FIRE_TIME <= ? AND (MISFIRE_INSTR = -1 OR (MISFIRE_INSTR != -1 AND NEXT_FIRE_TIME >= ?)) ORDER BY NEXT_FIRE_TIME ASC, PRIORITY DESC
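As a concrete illustration of the stuck state described above, something along these lines can list BLOCKED triggers that have no matching row in FIRED_TRIGGERS (a hypothetical standalone helper, not part of Quartz; it assumes the default QRTZ_ table prefix):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class StuckTriggerReport {
        // BLOCKED triggers with no corresponding FIRED_TRIGGERS row: the orphaned
        // case described above, which the WAITING-only poll query never picks up.
        private static final String FIND_ORPHANED_BLOCKED =
            "SELECT t.SCHED_NAME, t.TRIGGER_NAME, t.TRIGGER_GROUP, t.NEXT_FIRE_TIME "
          + "FROM QRTZ_TRIGGERS t "
          + "WHERE t.TRIGGER_STATE = 'BLOCKED' "
          + "AND NOT EXISTS (SELECT 1 FROM QRTZ_FIRED_TRIGGERS f "
          + "  WHERE f.SCHED_NAME = t.SCHED_NAME "
          + "  AND f.TRIGGER_NAME = t.TRIGGER_NAME "
          + "  AND f.TRIGGER_GROUP = t.TRIGGER_GROUP)";

        public static void main(String[] args) throws SQLException {
            String jdbcUrl = args[0]; // JDBC URL of the Quartz database (placeholder)
            try (Connection conn = DriverManager.getConnection(jdbcUrl);
                 PreparedStatement ps = conn.prepareStatement(FIND_ORPHANED_BLOCKED);
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s %s.%s next_fire_time=%d%n",
                            rs.getString("SCHED_NAME"),
                            rs.getString("TRIGGER_GROUP"),
                            rs.getString("TRIGGER_NAME"),
                            rs.getLong("NEXT_FIRE_TIME"));
                }
            }
        }
    }

Manually flipping TRIGGER_STATE back to WAITING for such rows gets the job firing again, which matches the manual correction mentioned elsewhere in this thread.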

@borisvaningelgom

@zemian We also face this issue in our production environment and have to manually correct the job trigger state in the database to solve it. Our job runs every 5 minutes.
We see no exception or errors in the logs. It just stops.
Even when the job DOES throw an exception, it doesn't necessarily get blocked.

Quartz version v2.3.2
Job is marked with @DisallowConcurrentExecution

@koti-muppavarapu


@borisvaningelgom We are facing the exact same issue in our production. Did you find a solution or workaround for it? I am also using Quartz 2.3.2 and my job is marked with @DisallowConcurrentExecution as well.

@borisvaningelgom

@koti-muppavarapu We solved it by properly configuring Quartz to run in clustered mode. It looks like if you don't do this, @DisallowConcurrentExecution creates issues.

We had some missing properties that were the root cause.
Make sure "org.quartz.jobStore.isClustered" is set to true; a minimal example of the clustering-related settings is sketched below.

@koti-muppavarapu

Thanks for your reply @borisvaningelgom, I will try this property and see if it fixes the issue. It is a very random issue which happens rarely, so hopefully this will fix it.

@vincentjames501

@borisvaningelgom @koti-muppavarapu we’ve been running clustered mode since the beginning and that doesn’t solve it for us.
