The fetch operation seems unsafe #4
Hi,
I am honestly not well-acquainted with PostgreSQL, but I have developed transaction-safe queues for distributed (multi-consumer) access on SQL Server. It seems your fetching statement fetchJobSqlTemplate in PostgreSqlJobQueue.cs fails under concurrent load, leading to jobs being dequeued multiple times. You may use optimistic locking or a SELECT FOR UPDATE statement to let a farm of applications run safely.
More info: http://www.postgresql.org/docs/9.1/static/explicit-locking.html
Comments
I come from distributed MS SQL Servers too ;-) |
UPD. Currently I'm an idiot and all of my conclusions are weak and unfounded. Going to sleep :-) |
Well, I think I have enough information now. There are two solutions to this issue. The first is to keep the default (ReadCommitted) isolation level as now, but pick the job to dequeue with a SELECT FOR UPDATE statement instead of a plain SELECT. This places an explicit lock on the row, so no other transaction can cause a conflict. However, this does not seem to be a viable option because of the I/O needed to persist the lock; it just casts away the beauty of MVCC. The other option is the RepeatableRead isolation level, in which the engine holds no locks and all concurrency is controlled by means of MVCC; in other words, no lock I/O is necessary. A transaction that induces a conflicting update will get back an error code, and the client must handle it just as in the case of the PostgreSqlDistributedLock constructor. I have been away from my computer, running on mobile, for the last two days; so I will try and send a PR when available. (: More info: http://www.postgresql.org/docs/9.1/static/transaction-iso.html By the way, it was really nice to learn that there is actually no such thing as ReadUncommitted in PostgreSQL. |
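For reference, a minimal sketch of that repeatable-read approach with a retry on serialization failure, not the eventual PR. It assumes Npgsql 3+ (where PostgresException.SqlState carries the SQLSTATE), filters on 'fetchedat' only, and uses illustrative names throughout:

```csharp
using System;
using System.Data;
using Npgsql;

static class RepeatableReadFetch
{
    // SQLSTATE reported when a repeatable-read transaction loses a race.
    const string SerializationFailure = "40001";

    public static long? TryDequeue(NpgsqlConnection connection)
    {
        while (true)
        {
            using (var tx = connection.BeginTransaction(IsolationLevel.RepeatableRead))
            {
                try
                {
                    object picked;
                    using (var select = new NpgsqlCommand(
                        @"SELECT ""id"" FROM ""hangfire"".""jobqueue""
                          WHERE ""fetchedat"" IS NULL
                          ORDER BY ""queue"", ""fetchedat"" LIMIT 1",
                        connection, tx))
                    {
                        picked = select.ExecuteScalar();
                    }

                    if (picked == null)
                    {
                        tx.Rollback();
                        return null; // nothing to dequeue
                    }

                    var id = Convert.ToInt64(picked);
                    using (var update = new NpgsqlCommand(
                        @"UPDATE ""hangfire"".""jobqueue""
                          SET ""fetchedat"" = NOW() AT TIME ZONE 'UTC'
                          WHERE ""id"" = @id",
                        connection, tx))
                    {
                        update.Parameters.AddWithValue("id", id);
                        update.ExecuteNonQuery();
                    }

                    tx.Commit();
                    return id;
                }
                catch (PostgresException e) when (e.SqlState == SerializationFailure)
                {
                    // A concurrent transaction updated the row after our snapshot
                    // was taken; roll back and retry the whole fetch.
                    tx.Rollback();
                }
            }
        }
    }
}
```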
A possible solution could be to add a 'touchcount' field to the jobqueue table. Then first get the id and the touchcount with SQL similar to this:

```sql
SELECT "id", "touchcount"
FROM "hangfire"."jobqueue"
WHERE "queue" = ANY @queues
AND "fetchedat" {0}
ORDER BY "queue", "fetchedat"
LIMIT 1
```

After that I would execute:

```sql
UPDATE "hangfire"."jobqueue"
SET "fetchedat" = NOW() AT TIME ZONE 'UTC', "touchcount" = "touchcount" + 1
WHERE "id" = {retrievedIdFromFirstSql}
AND "touchcount" = {retrievedTouchCountFirstSql}
RETURNING "id" AS "Id", "jobid" AS "JobId", "queue" AS "Queue";
```

If the second SQL doesn't return anything, the row has been picked up by another server and I need to repeat the loop. Does that make sense to you? |
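For illustration, the two-step loop above might look like this with Dapper. The query is simplified to a single queue parameter and an IS NULL condition in place of the provider's {0} placeholder, and the class names are made up for the example:

```csharp
using System.Data;
using System.Linq;
using Dapper;

public class QueuedJob
{
    public long Id { get; set; }
    public long JobId { get; set; }
    public string Queue { get; set; }
}

class Candidate
{
    public long Id { get; set; }
    public int TouchCount { get; set; }
}

public static class TouchCountQueue
{
    public static QueuedJob Dequeue(IDbConnection connection, string queue)
    {
        while (true)
        {
            // Step 1: pick a candidate together with its current version counter.
            var candidate = connection.Query<Candidate>(
                @"SELECT ""id"" AS ""Id"", ""touchcount"" AS ""TouchCount""
                  FROM ""hangfire"".""jobqueue""
                  WHERE ""queue"" = @queue AND ""fetchedat"" IS NULL
                  ORDER BY ""queue"", ""fetchedat""
                  LIMIT 1",
                new { queue }).FirstOrDefault();

            if (candidate == null) return null; // queue is empty

            // Step 2: claim the row only if nobody touched it in between.
            var job = connection.Query<QueuedJob>(
                @"UPDATE ""hangfire"".""jobqueue""
                  SET ""fetchedat"" = NOW() AT TIME ZONE 'UTC',
                      ""touchcount"" = ""touchcount"" + 1
                  WHERE ""id"" = @Id AND ""touchcount"" = @TouchCount
                  RETURNING ""id"" AS ""Id"", ""jobid"" AS ""JobId"", ""queue"" AS ""Queue""",
                new { candidate.Id, candidate.TouchCount }).FirstOrDefault();

            if (job != null) return job;
            // Zero rows back means another server won the race; loop and retry.
        }
    }
}
```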
Yeah, it solves the problem in a very generic way called optimistic locking, which I mentioned in the issue-opening comment. That solution is applicable to almost all RDBMSes. However, PostgreSQL can handle the update this way without an extra column, since it can use the xmin system column for this purpose. The ServiceStack.OrmLite documentation makes reference to the use of this column, so we can look at the generated SQL statements to get an idea of how it works: https://github.com/ServiceStack/ServiceStack.OrmLite#optimistic-concurrency |
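A hedged sketch of that xmin-based variant follows: xmin is a system column that changes on every update of a row, so it can stand in for a manual version counter. Reading it back as text sidesteps its xid type; all the C# names here are illustrative:

```csharp
using System.Data;
using System.Linq;
using Dapper;

class VersionedJob
{
    public long Id { get; set; }
    public string Version { get; set; } // xmin, read back as text
}

static class XminDequeue
{
    // Returns the claimed job id, or null if the queue is empty or the row
    // was taken by a concurrent worker (in which case the caller retries).
    public static long? TryClaim(IDbConnection connection)
    {
        var row = connection.Query<VersionedJob>(
            @"SELECT ""id"" AS ""Id"", xmin::text AS ""Version""
              FROM ""hangfire"".""jobqueue""
              WHERE ""fetchedat"" IS NULL
              ORDER BY ""queue"", ""fetchedat""
              LIMIT 1").FirstOrDefault();

        if (row == null) return null;

        var updated = connection.Execute(
            @"UPDATE ""hangfire"".""jobqueue""
              SET ""fetchedat"" = NOW() AT TIME ZONE 'UTC'
              WHERE ""id"" = @Id AND xmin::text = @Version",
            new { row.Id, row.Version });

        // 0 rows means a concurrent update bumped xmin first.
        return updated == 1 ? row.Id : (long?)null;
    }
}
```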
I think I prefer to build my own solution, because I may want to make this work for other RDBMSes as well in the future. |
Guys, we should make use of ServiceStack.OrmLite to have a Hangfire storage extension working for all supported DBMSes. It can even be merged with the one for SQL Server. (: |
I don't have experience with OrmLite. The piece of software we are writing that uses Hangfire is built on NHibernate, so that would be the logical ORM for me to use. |
I have just had a chance to look at your solution attentively and found that the 'touchcount' column is actually unnecessary for implementing an optimistic lock. Just select 'fetchedat' along with 'id' and use it as the safety check for the update. |
Optimistic locking is already achieved by setting the current timestamp on 'fetchedat'. |
@frankhommers, in cases like this I prefer not to think and use benchmarking :) |
@odinserj I hate the idea of building a cluster just for a benchmark at the moment. I'm quite busy with other things ;-) |
@odinserj Optimistic locking is not employed in the current query. To implement it properly, the query should be something like the following:

```sql
SELECT
    "id",
    "fetchedat"
INTO TEMP TABLE jobToFetch
FROM "hangfire"."jobqueue"
WHERE
    "queue" = ANY @queues AND
    "fetchedat" {0}
ORDER BY "queue", "fetchedat"
LIMIT 1;

UPDATE "hangfire"."jobqueue"
SET
    "fetchedat" = NOW() AT TIME ZONE 'UTC'
WHERE
    "id" IN (SELECT "id" FROM jobToFetch) AND
    "fetchedat" IN (SELECT "fetchedat" FROM jobToFetch)
RETURNING
    "id" AS "Id",
    "jobid" AS "JobId",
    "queue" AS "Queue";
```

The query used by the SQL Server storage provider consists of a single UPDATE, but that makes it impossible to order by 'queue' and 'fetchedat'. @frankhommers It is a much-used, reliable pattern underpinning Multi-Version Concurrency Control: you are not allowed to update stale records. There is no need to form a cluster to assess the results; you can walk through it step by step on a single machine using a client like pgAdmin. |
@barhun I'd like to avoid using a temp table, and I'm not totally confident that the fetchedat in the jobToFetch table is always different when another server updates it. About assessing the results: what @odinserj and I were talking about is whether the "touchcount" mechanism (or your jobToFetch one, which is very similar, I think) is faster than transaction locking in a cluster, and to test that you need a cluster. About assessing the behaviour, you are right. |
@frankhommers Considering that what is selected as 'fetchedat' must be either NULL or a timestamp earlier than NOW() AT TIME ZONE 'UTC' - INTERVAL '[SOME] SECONDS', we can be sure it will be different if the row has already been updated. |
Without a temp table I must rely on Dapper to retrieve and send back the DateTime without any changes. See my code here: da4d1cc#diff-ef5e0ac23ea818747b64429c83bb4ca5 I still like "touchcount" more because it prevents possible DateTime precision issues. |
Guys, why not use a simple repeatable-read transaction with auto-retries on serialization issues? It is much easier to understand than the other solutions. Btw, I have a simple benchmarking program for Hangfire; if the other solutions perform much better, then they are worth the additional programming and complexity. But I have no access to it since I am out of office until Jan 13. |
I am still confused about why the PostgreSQL extension orders by both 'queue' and 'fetchedat' while the SQL Server provider does not. The sanest expectation is that they should be compatible with each other. However, I've changed my mind about SELECT FOR UPDATE vs REPEATABLE READ. Locking the row for update may not, after all, cost much overhead, considering the lock is taken during the first block access for the SELECT and released right after the following UPDATE; this way the application wouldn't lose time retrying as it would with a repeatable-read transaction. Adding a FOR UPDATE clause to the subquery appears to be sufficient to solve the issue:

```sql
UPDATE "hangfire"."jobqueue"
SET "fetchedat" = NOW() AT TIME ZONE 'UTC'
WHERE "id" IN (
    SELECT "id"
    FROM "hangfire"."jobqueue"
    WHERE "queue" = ANY @queues
    AND "fetchedat" {0}
    ORDER BY "queue", "fetchedat"
    LIMIT 1
    FOR UPDATE
)
RETURNING "id" AS "Id", "jobid" AS "JobId", "queue" AS "Queue";
```

In the end, I think repeatable-read isolation and optimistic locking are quite similar in terms of the conflict-handling logic required of the application, and it seems better to hand this burden of concurrency management over to the RDBMS. I have had enough words on this subject and will have sent two PRs, for both FOR UPDATE and REPEATABLE READ, by Jan 13. We can then see the benchmark results for the three solutions. |
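A sketch of how that statement might be driven from C# with Dapper. The template mechanism mirrors the fetchJobSqlTemplate named in this issue, but the single-queue parameter, the 30-second timeout, and the helper names are assumptions for the example:

```csharp
using System.Linq;
using Dapper;
using Npgsql;

static class ForUpdateFetch
{
    // {0} is substituted with one of the two fetch conditions below.
    const string FetchJobSqlTemplate = @"
        UPDATE ""hangfire"".""jobqueue""
        SET ""fetchedat"" = NOW() AT TIME ZONE 'UTC'
        WHERE ""id"" IN (
            SELECT ""id""
            FROM ""hangfire"".""jobqueue""
            WHERE ""queue"" = @queue
            AND ""fetchedat"" {0}
            ORDER BY ""queue"", ""fetchedat""
            LIMIT 1
            FOR UPDATE
        )
        RETURNING ""id"" AS ""Id"", ""jobid"" AS ""JobId"", ""queue"" AS ""Queue"";";

    public static dynamic Fetch(NpgsqlConnection connection, string queue)
    {
        // First look for jobs that were never fetched, then for jobs whose
        // earlier fetch appears to have timed out.
        foreach (var condition in new[]
        {
            "IS NULL",
            "< NOW() AT TIME ZONE 'UTC' - INTERVAL '30 SECONDS'"
        })
        {
            var job = connection
                .Query(string.Format(FetchJobSqlTemplate, condition), new { queue })
                .FirstOrDefault();
            if (job != null) return job;
        }
        return null;
    }
}
```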
It might be better to order by 'fetchedat', 'jobid'. OK, I will put the FOR UPDATE variant into master for this version. Are these the 3 solutions you mean?

1. SELECT FOR UPDATE
2. REPEATABLE READ with retries on serialization failures
3. The optimistic 'touchcount'/'updatecount' update
|
I think so, that's the behaviour we expect. However, it still breaks compatibility between the storage providers, but that isn't a big problem. Yes, I'm talking about those three. I like being mostly busy talking about things. (: Kidding, I just haven't had time to set up, fork & patch, and test. |
@frankhommers @odinserj Could you have a look at the PR #6? |
@barhun I will have a look at it soon. But I've been thinking: I really want the updatecount flavour as well, so I think I will make it optional within the StorageOptions as "UseNativeTransaction". When it's false, it will fall back to updatecount. The reason for this is that it then won't depend on transactions, which will make it easier (in my opinion) to build a generic IDbConnection storage provider. |
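For illustration, the proposed switch might look like this from the consumer's side. PostgreSqlStorageOptions and UseNativeDatabaseTransactions follow the names used later in this thread; treat the exact shape as an assumption:

```csharp
using Hangfire;
using Hangfire.PostgreSql;

var storage = new PostgreSqlStorage(
    "Server=127.0.0.1;Database=hangfire;User Id=hangfire;Password=secret;",
    new PostgreSqlStorageOptions
    {
        // true  -> FOR UPDATE inside a native transaction (the default)
        // false -> fall back to the optimistic update-count scheme
        UseNativeDatabaseTransactions = false
    });

JobStorage.Current = storage; // servers started afterwards use the fallback path
```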
I've just fallen foul of this, getting multiple dequeues of jobs with Hangfire.PostgreSql even when running with a single server but multiple workers. I've been chasing this down all day and thought it was something to do with my DI registrations or the fancy filters I'd added - but obviously not :) Is there anything I can do to help here? Provide a test case/unit test? I can't reproduce this 100% of the time, but approximately 1 in 5 runs on a dual-core i7 with Postgres running locally (Windows 8.1 x64). Is the PR ready to be pushed to a nuget release? EDIT: @odinserj I was also wondering whether it would be worth adding something to HF core to at least detect the double-processing state (in the dashboard only?) to make it obvious what has happened? It took me a while to dig through all my logs/jobs and see this: |
Everything discussed here is in the master branch, and it's all ready to push to a nuget package. If you like, I can do that tomorrow. I would appreciate it if you tested it by compiling your own build from the current master branch. The unit tests are all OK, but they were in the previous nupkg as well ;-) |
@frankhommers I haven't been able to break a build from master when testing by hand. Can you please push to nuget and then I'll try and expand my testing suite? |
@kieranbenton I have published it. Just one little request: Can you try to run it with UseNativeDatabaseTransactions = false as well? Please report the results ;-) |
Will do Frank - it might be a day or two until I get round to it, but I will. |
Right, I'm no longer getting multiple executions but I do get: |
@kieranbenton The default value is true (as preferred by @odinserj and @barhun ;-)). Could you test with both true and false? And can you include more of the stack trace as well? |
@kieranbenton @frankhommers Serialization errors are supposed to provide the application with the error code 40001, so that kind of error is treated as a signal to retry, then smoothed over and not allowed to escape the scope of the utility method. I may be mistaken, though, so we ought to verify we are filtering on the right error code. |
That's what's being done with UseNativeDatabaseTransactions set to true at https://github.com/frankhommers/Hangfire.PostgreSql/blob/master/src/Hangfire.PostgreSql/PostgreSqlJobQueue.cs#L106 |
@frankhommers Yeah, I know. I wrote that piece of code. (: What I was trying to say was that I couldn't be sure which error code is returned when serialization fails. I just assumed it would be 40001 for any concurrent update, based on a quick look at the documentation's listing of error conditions. EDIT: I didn't have time yet, but I will write, asap, the code to figure out which error code Npgsql gives us when serialization attempts fail. The property named ErrorCode is of type int, but I now see there are error codes whose values consist of alphabetical characters, so there may be a mapping between the PostgreSQL error codes and those of Npgsql. I will come back to clarify. |
@odinserj @frankhommers @kieranbenton Oh, forgive me. I have just created another PR for the recent inconvenience. Check it: #7 |
@kieranbenton @barhun I will push a new version to nuget tomorrow. That will have @kieranbenton's problem fixed ;-) Thanks @barhun |
And I hadn't even had a chance to reply! Thanks guys. Do you still want me to test it with native transactions off? |
Yes! |
@frankhommers @kieranbenton You're welcome. Sorry, again, for the mess I caused. By the way, the property ErrorCode has the same value as HResult, which is obviously irrelevant to PostgreSQL error codes. I feel like an idiot. (: |
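For the record, a short sketch of where the SQLSTATE actually lives: on Npgsql 3+ it is the string property PostgresException.SqlState, not the int ErrorCode, which, as noted above, just mirrors HResult. Older Npgsql versions exposed the same value under a different member, so check the version you target:

```csharp
using Npgsql;

try
{
    // ... run the dequeue transaction ...
}
catch (PostgresException e) when (e.SqlState == "40001")
{
    // serialization_failure: another worker raced us; retry the fetch.
}
```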
@barhun No problem man, we're all human ;-) @kieranbenton A new version has been pushed to NuGet. Any testing results yet? |
@kieranbenton Hey, is it all OK? I'm going to close this one if so. (: |