Distributed Lock is causing issues #329

Open
phillijw opened this issue Oct 6, 2023 · 5 comments

Comments


phillijw commented Oct 6, 2023

The default lock timeout is 10 minutes, but if a job takes longer than that, Hangfire assumes the lock timed out and spawns a new instance of the cron job. So we set our lock timeout to something more like 6 hours instead. Of course, when the application shuts down, this lock doesn't get cleared, which means the job can't run again until the lock expires.

This seems like a bug. Can anyone explain how this works compared to the MSSQL version, and how I should handle it better on my end?

@dmitry-vychikov
Contributor

Hi @phillijw

  • What version of Hangfire and Hangfire Postgres are you using?
  • Can you share a reproducible application where we can observe the issue?

I have just tested with a small app on my side. Recurring jobs work fine, and no locks get stuck.

Can anyone provide more info on how this works

If you are asking about recurring jobs: AFAIR, the distributed lock is used to ensure that only one worker executes the cron-scheduling part. When it is time to trigger a recurring job, the worker takes the lock, creates a normal (non-recurring) job in the queue that does the actual work, and then releases the lock immediately.

In other words, job execution time and lock timeout are NOT related. This should work normally without you having to set a 6-hour timeout.
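
The scheduling pattern described above can be sketched as follows. This is a simplified, in-process stand-in (a `threading.Lock` and a `Queue` instead of Hangfire's real distributed lock and job storage), just to show why job runtime and lock duration are unrelated:

```python
import threading
from queue import Queue

scheduler_lock = threading.Lock()   # stand-in for the distributed lock
job_queue = Queue()                 # stand-in for Hangfire's job queue

def trigger_recurring_job(job_id: str) -> bool:
    """Enqueue one run of a recurring job, guarded by the scheduler lock.

    The lock is held only while the background job is *enqueued*, not while
    it executes, which is why job runtime and lock timeout are unrelated.
    """
    if not scheduler_lock.acquire(blocking=False):
        return False                # another worker is scheduling right now
    try:
        job_queue.put(("background-job", job_id))  # fast: just an insert
        return True
    finally:
        scheduler_lock.release()    # released immediately after enqueueing
```

The worker that later dequeues and executes `("background-job", job_id)` never touches `scheduler_lock`, so a long-running job cannot hold up scheduling.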


phillijw commented Oct 10, 2023

  • Hangfire 1.8.1
  • Hangfire.PostgreSql 1.19.12

I think the lock is actually coming from the DisableConcurrentExecution attribute. I'm still trying to repro it properly, and I think that attribute is the trick: you have to add it to the method to avoid overlapping jobs.

@phillijw
Author

Ok, yes, so the main issue I'm seeing is that when I set DisableConcurrentExecution and I stop the process in the middle of a job, the lock doesn't get removed during shutdown. It stays there even after the application restarts, and the job won't run again until the lock times out.

@dmitry-vychikov
Contributor

@phillijw

Ok, yes, so the main issue I'm seeing is that when I set DisableConcurrentExecution and I stop the process in the middle of a job, that lock doesn't get removed during shutdown. It stays there even after the application restarts and then it won't run the job again until that times out.

It makes sense now.

In terms of distributed locking, Hangfire's SqlServer and Postgres storages are different. SqlServer uses application locks (sp_getapplock), which are released automatically when the connection closes. Postgres, in comparison, uses a plain table to store locks, and rows are cleaned up only by application code. That makes the Postgres storage vulnerable to cases where a job is interrupted ungracefully.
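
The failure mode of a table-based lock can be illustrated with a toy version (a dict standing in for the Postgres lock table; the function names and timeout-in-seconds convention are mine, not the library's actual code):

```python
# resource name -> timestamp when the lock row was written
# (stand-in for the Postgres "lock" table)
lock_table: dict[str, float] = {}

def try_acquire(resource: str, timeout: float, now: float) -> bool:
    """Take the lock if it is free, or if the existing row has expired."""
    acquired_at = lock_table.get(resource)
    if acquired_at is not None and now - acquired_at < timeout:
        return False                # held by someone and not yet expired
    lock_table[resource] = now      # free, or stale row we can take over
    return True

def release(resource: str) -> None:
    lock_table.pop(resource, None)  # only runs if the process survives!
```

If the process is killed between `try_acquire` and `release`, nothing ever deletes the row, and the next acquire succeeds only after `timeout` elapses. That is exactly the stuck-lock-until-expiry behavior reported in this issue, and it is what a session-scoped lock (like sp_getapplock) avoids.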

There have been discussions in this repo about improving the lock mechanism, but as I understand it, the effort got stuck on a major issue that has not been solved so far.

Anyway, for your scenario consider the following recommendations:

  • If you need to stop a job, do it gracefully. Pass a cancellation token, check its status periodically, and exit your method when cancellation is requested. This will prevent the lock from getting stuck.
  • You mentioned a 6-hour timeout. Is that how long your job actually takes? Consider improving execution times or splitting the job into smaller pieces.
  • Stop using the DisableConcurrentExecution attribute. It is the worst Hangfire invention: it forces every worker that hits the lock to sit IDLE. Those workers do NOT pick up other available jobs while waiting on DisableConcurrentExecution. At least, that's how it worked several years ago.
    • Back then we built our own semaphore-like concurrency limiter: a Hangfire extension that stored extra metadata in job properties to track how many executions of a job were running, and workers exceeding the threshold were pushed off to look for other jobs.
    • Unfortunately, I no longer have access to that code.
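
A minimal in-process sketch of that semaphore-like limiter (names are hypothetical; the real extension tracked counts in job properties rather than using an in-memory `Semaphore`, but the key idea is the same: acquire never blocks, so a worker that loses the race requeues the job and moves on instead of sitting idle):

```python
import threading

class ConcurrencyLimiter:
    """Allow at most `limit` concurrent executions of a job type."""

    def __init__(self, limit: int):
        self._slots = threading.Semaphore(limit)

    def acquire(self) -> bool:
        # Non-blocking, unlike DisableConcurrentExecution's lock wait.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()

limiter = ConcurrencyLimiter(limit=1)

def worker_try_run(job, requeue) -> bool:
    """Run `job` if a slot is free; otherwise requeue it and move on."""
    if not limiter.acquire():
        requeue(job)        # don't sit idle: put it back, take other jobs
        return False
    try:
        job()
        return True
    finally:
        limiter.release()
```

The difference from DisableConcurrentExecution is the `blocking=False`: a worker that cannot get a slot is immediately free to process other queued jobs.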


VmPeter commented Mar 7, 2024

Hi all, just wanted to respond to this!

In comparison, Postgres uses a simple table to store locks, which cleans up only inside of code. Postgres is vulnerable to cases when job is interrupted ungracefully

We're using Hangfire but with MSSQL. We specifically looked into this because we wanted to make sure app crashes did not leave any locks hanging. MSSQL uses sp_getapplock, which is tied to the session or transaction, so even killing the process (kill -9; we did not test a power outage, but I hope it suffices) should release the lock. https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-getapplock-transact-sql?view=sql-server-ver16

I'm NOT a Postgres guy, so this is just me suggesting something (because I want to help).

For Postgres I see there's something called pg_try_advisory_lock and pg_advisory_unlock, which work at the session level if I read https://www.postgresql.org/docs/9.1/functions-admin.html correctly.

There's also one tied to transaction (see link).

We have also been using the DistributedLock NuGet package from madelson, but the SQL version. There's a Postgres version that uses these advisory locks: https://github.com/madelson/DistributedLock/blob/master/docs/DistributedLock.Postgres.md so maybe their approach could be a guide to implementing this the same way as the MSSQL locks?
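
One wrinkle with advisory locks: Hangfire lock resources are strings, while pg_try_advisory_lock keys on a 64-bit integer, so the name has to be hashed first. The helper names and the SHA-256 convention below are my own assumptions for illustration, not code from Hangfire.PostgreSql or DistributedLock:

```python
import hashlib

def advisory_lock_key(resource: str) -> int:
    """Map a lock resource name to a signed 64-bit advisory-lock key.

    Postgres only sees the integer, so the mapping just has to be
    deterministic; taking 8 bytes of a SHA-256 digest is one common choice.
    """
    digest = hashlib.sha256(resource.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)

def acquire_sql(resource: str) -> str:
    # Non-blocking: returns true/false immediately. The lock is held until
    # pg_advisory_unlock is called or the session (connection) ends, which
    # is what gives automatic cleanup after a crash.
    return f"SELECT pg_try_advisory_lock({advisory_lock_key(resource)})"

def release_sql(resource: str) -> str:
    return f"SELECT pg_advisory_unlock({advisory_lock_key(resource)})"
```

Because the lock dies with the session, a killed process releases its locks as soon as the server notices the connection is gone, matching the sp_getapplock behavior described above.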

Hope this can help. Take care.
