
The fetch operation seems unsafe #4

Closed
barhun opened this issue Jan 7, 2015 · 40 comments
@barhun
Contributor

barhun commented Jan 7, 2015

Hi,

I am honestly not well-acquainted with PostgreSQL, but I have developed transaction-safe queues for distributed (multi-consumer) access on SQL Server. It seems your fetching statement, fetchJobSqlTemplate in PostgreSqlJobQueue.cs, fails under concurrent load, leading to jobs being dequeued multiple times. You could use optimistic locking or a SELECT ... FOR UPDATE statement to let a farm of applications run safely.

More info: http://www.postgresql.org/docs/9.1/static/explicit-locking.html
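
To make the race concrete, here is a minimal two-session sketch (the table and column names are taken from the queries quoted later in this thread; the queue name and job id are hypothetical):

-- Session A: pick a job under READ COMMITTED (the default level)
SELECT "id" FROM "hangfire"."jobqueue"
WHERE "queue" = ANY ('{default}') AND "fetchedat" IS NULL
ORDER BY "queue", "fetchedat" LIMIT 1;   -- returns id = 42

-- Session B: runs the same SELECT before A updates -- also returns id = 42

-- Both sessions now mark "their" job as fetched:
UPDATE "hangfire"."jobqueue"
SET "fetchedat" = NOW() AT TIME ZONE 'UTC'
WHERE "id" = 42;   -- executed by both A and B: job 42 is dequeued twice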

@frankhommers
Collaborator

I come from distributed MS SQL Servers too ;-)
Also, Npgsql doesn't fully implement TransactionScope correctly yet, so I think you are right. Can you supply me with the full SQL showing how you think it's safe? (Or a pull request would be even better ;-)).

@frankhommers frankhommers self-assigned this Jan 7, 2015
@odinserj

odinserj commented Jan 7, 2015

Ahh, sorry again, all of my previous comments were wrong (I need to sleep) :( @barhun, I agree with you: there is a mistake in the SQL query – a row may be changed between the SELECT and UPDATE operations, so the latter will use an obsolete FetchedAt value. I think a table-level lock will be enough.

UPD. Currently I'm an idiot and all of my conclusions are weak and unfounded. Going to sleep :-)

@barhun
Contributor Author

barhun commented Jan 8, 2015

Well, I think I have enough information now. There are two solutions to this issue.

The first is to keep the default (ReadCommitted) isolation level but pick the job to dequeue with a SELECT ... FOR UPDATE statement instead of a plain SELECT. This places an explicit lock on the row, so no other transaction can conflict with it. However, it does not seem to be a viable option because of the I/O needed to persist the lock; it casts away the beauty of MVCC.

The other option is to use the RepeatableRead isolation level, in which no locks are held by the engine and all concurrency is controlled by means of MVCC. In other words, no lock-related I/O is necessary. A transaction that induces a conflicting update will fail with an error code, which must be handled by the client, just as in the PostgreSqlDistributedLock constructor.

I have been away from my computer for the last two days, working from my phone, so I will try to send a PR when I can. (:

More info: http://www.postgresql.org/docs/9.1/static/transaction-iso.html

By the way, it was really nice to learn that there is actually no such thing as ReadUncommitted in PostgreSQL (requesting it just gives you ReadCommitted).
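
For concreteness, a rough sketch of both options (the queue name, the {selectedId} placeholder and the exact statements are illustrative, not the provider's actual code):

-- Option 1: explicit row lock under the default READ COMMITTED level
BEGIN;
SELECT "id" FROM "hangfire"."jobqueue"
WHERE "queue" = ANY ('{default}') AND "fetchedat" IS NULL
ORDER BY "queue", "fetchedat" LIMIT 1
FOR UPDATE;   -- a concurrent fetcher blocks here instead of reading the same row
UPDATE "hangfire"."jobqueue"
SET "fetchedat" = NOW() AT TIME ZONE 'UTC'
WHERE "id" = {selectedId};
COMMIT;

-- Option 2: MVCC only, no explicit lock
BEGIN ISOLATION LEVEL REPEATABLE READ;
-- same SELECT and UPDATE as above, without FOR UPDATE; if another transaction
-- commits an update of the same row first, this one fails with SQLSTATE 40001
-- ("could not serialize access due to concurrent update") and the client
-- retries the whole fetch
COMMIT;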

@frankhommers
Collaborator

A possible solution could be to add a 'touchcount' field to the jobqueue table. Then first get the id and the touchcount with SQL like this:

SELECT ""id"", ""touchcount""
FROM ""hangfire"".""jobqueue"" 
WHERE ""queue"" = ANY @queues 
AND ""fetchedat"" {0} 
ORDER BY ""queue"", ""fetchedat"" 
LIMIT 1

After that I would execute:

UPDATE ""hangfire"".""jobqueue"" 
SET ""fetchedat"" = NOW() AT TIME ZONE 'UTC', ""touchcount"" = ""touchcount"" + 1
WHERE ""id"" = {retrievedIdFromFirstSql} 
AND ""touchcount"" = {retrievedTouchCountFirstSql}
RETURNING ""id"" AS ""Id"", ""jobid"" AS ""JobId"", ""queue"" AS ""Queue"";

If the second statement doesn't return anything, the row has been picked up by another server and I need to repeat the loop.

Does that make sense to you?

@barhun
Contributor Author

barhun commented Jan 8, 2015

Yeah, it solves the problem in a very generic way called optimistic locking, which I mentioned in the issue-opening comment. That solution is applicable to almost any RDBMS. However, PostgreSQL can handle the update without an extra column, since the xmin system column can be used for this purpose. The ServiceStack.OrmLite documentation references the use of this column, so we can look at the generated SQL statements to get an idea.

https://github.com/ServiceStack/ServiceStack.OrmLite#optimistic-concurrency
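
A rough sketch of the xmin variant (assuming the client reads xmin back as text and feeds it into the update's guard; the id and version values are hypothetical):

-- Read the candidate together with its row-version system column:
SELECT "id", xmin::text AS "version"
FROM "hangfire"."jobqueue"
WHERE "queue" = ANY ('{default}') AND "fetchedat" IS NULL
ORDER BY "queue", "fetchedat" LIMIT 1;   -- e.g. id = 42, version = '12345'

-- Update only if nothing has touched the row since (xmin changes on every update):
UPDATE "hangfire"."jobqueue"
SET "fetchedat" = NOW() AT TIME ZONE 'UTC'
WHERE "id" = 42 AND xmin = '12345'::xid;
-- zero rows affected means another worker won, so re-run the SELECT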

@frankhommers
Collaborator

I think I prefer to build my own solution, because I might want to make this work for other RDBMSes in the future as well.

@barhun
Contributor Author

barhun commented Jan 8, 2015

Guys, we should make use of ServiceStack.OrmLite to get a Hangfire storage extension working for all supported DBMSes. It could even be merged with the one for SQL Server. (:

@frankhommers
Collaborator

I don't have experience with OrmLite. The piece of software we are writing that uses Hangfire is built on NHibernate, so that would be the logical ORM for me to use.

@barhun
Contributor Author

barhun commented Jan 8, 2015

I have just had a chance to look at your solution attentively, and I found that the 'touchcount' column is actually unnecessary for implementing an optimistic lock. Just select 'fetchedat' as well as 'id' and use it as the safety check for the update, as sketched below.
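
A sketch of that variant, reusing the two statements above with 'fetchedat' as the guard ({retrievedFetchedAtFromFirstSql} is a placeholder in the same spirit as the earlier ones; IS NOT DISTINCT FROM is needed because "=" never matches a NULL 'fetchedat'):

SELECT "id", "fetchedat"
FROM "hangfire"."jobqueue"
WHERE "queue" = ANY @queues
AND "fetchedat" {0}
ORDER BY "queue", "fetchedat"
LIMIT 1;

UPDATE "hangfire"."jobqueue"
SET "fetchedat" = NOW() AT TIME ZONE 'UTC'
WHERE "id" = {retrievedIdFromFirstSql}
AND "fetchedat" IS NOT DISTINCT FROM {retrievedFetchedAtFromFirstSql}
RETURNING "id" AS "Id", "jobid" AS "JobId", "queue" AS "Queue";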

@odinserj

odinserj commented Jan 8, 2015

Optimistic locking is already achieved by setting the current timestamp on the FetchedAt column. The problem described is a non-repeatable read, so a simple transaction around the query, with the corresponding isolation level, will solve it.

@frankhommers
Collaborator

@barhun Because all dates in my provider are determined by the database server(s), this would make it somewhat unreliable.
@odinserj I think the way I proposed will perform better in most cases. Don't you think?

@odinserj

odinserj commented Jan 8, 2015

@frankhommers, in cases like this I prefer not to think and use benchmarking :)

@frankhommers
Collaborator

@odinserj I hate the idea of building a cluster just for a benchmark at the moment. I'm quite busy with other things ;-)

@barhun
Contributor Author

barhun commented Jan 9, 2015

@odinserj Optimistic locking is not employed in the current query. To implement it properly, the query should be something like the following:

SELECT
    "id",
    "fetchedat"
INTO TEMP TABLE jobToFetch
FROM "hangfire"."jobqueue"
WHERE
    "queue" = ANY @queues AND
    "fetchedat" {0}
ORDER BY "queue", "fetchedat"
LIMIT 1;

UPDATE "hangfire"."jobqueue"
SET
    "fetchedat" = NOW() AT TIME ZONE 'UTC'
WHERE
    "id" IN (SELECT "id" FROM jobToFetch) AND
    "fetchedat" IN (SELECT "fetchedat" FROM jobToFetch)
RETURNING
    "id" AS "Id",
    "jobid" AS "JobId",
    "queue" AS "Queue";

The query used by the SQL Server storage provider consists of a single UPDATE, but that makes it impossible to order by 'queue' and 'fetchedat'.

@frankhommers It is a widely used, reliable pattern underpinning the Multi-Version Concurrency Control method: you cannot update stale records this way.

There is no need to form a cluster to assess the results; you can practice it step by step on a single machine using a client like pgAdmin (the expected outcome is sketched after the list):

  1. Open a transaction and execute the select statement.
  2. Open another transaction and execute the select statement.
  3. Execute the update and commit the first or the second transaction.
  4. Execute the update and commit the remaining transaction.
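
With an optimistic guard in place, the expected outcome of those steps looks like this (queue name and id hypothetical):

-- Transaction 1 (step 1):
BEGIN;
SELECT "id", "fetchedat" FROM "hangfire"."jobqueue"
WHERE "queue" = 'default' AND "fetchedat" IS NULL
ORDER BY "queue", "fetchedat" LIMIT 1;   -- returns id = 42

-- Transaction 2, second session (step 2): same SELECT, also returns id = 42

-- Transaction 1 (step 3):
UPDATE "hangfire"."jobqueue" SET "fetchedat" = NOW() AT TIME ZONE 'UTC'
WHERE "id" = 42 AND "fetchedat" IS NULL;   -- UPDATE 1
COMMIT;

-- Transaction 2 (step 4): same UPDATE; it blocks until transaction 1 commits,
-- then matches 0 rows -- the guard no longer holds, so this worker knows
-- another server took the job and goes back to step 1.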

@frankhommers
Collaborator

@barhun I'd like to avoid the use of a temp table. And I'm not totally confident that the fetchedat in the jobToFetch table is always different when another server updates it.

About assessing the results: what @odinserj and I were talking about is whether the "touchcount" mechanism (or your jobToFetch, which is very similar, I think) is faster than transaction locking in a cluster, and to test that you need a cluster. About assessing the behaviour, you are right.

@barhun
Contributor Author

barhun commented Jan 9, 2015

@frankhommers Considering that what is selected as 'fetchedat' must be either NULL or a timestamp

< NOW() AT TIME ZONE 'UTC' - INTERVAL '[SOME] SECONDS'

we can be sure that it will be different if the row has already been updated.

@frankhommers
Collaborator

Without a temp table I must rely on Dapper to retrieve and pass back the DateTime without any changes. See my code here: da4d1cc#diff-ef5e0ac23ea818747b64429c83bb4ca5

I still like "touchcount" more because it prevents possible DateTime precision issues.

@odinserj

odinserj commented Jan 9, 2015

Guys, why not use a simple repeatable-read transaction with auto-retries on serialization failures? It is much easier to understand than the other solutions.

Btw, I have a simple benchmarking program for Hangfire. If the other solutions perform much better, then they are worth the additional programming and complexity. But I have no access to it, since I am out of the office until Jan 13.

@barhun
Contributor Author

barhun commented Jan 9, 2015

I am still confused about why the PostgreSQL extension orders by both 'queue' and 'fetchedat' while the SQL Server provider does not. The sanest expectation is that they should be compatible with each other.

However, I've changed my mind about SELECT FOR UPDATE vs. REPEATABLE READ. Locking the row for update may not, after all, cause much overhead, considering the lock is taken during the first block access for the SELECT and released right after the following UPDATE. This way the application wouldn't lose time retrying, as it would with a repeatable-read transaction. Adding a FOR UPDATE clause to the subquery appears to be sufficient to solve the issue:

UPDATE ""hangfire"".""jobqueue"" 
SET ""fetchedat"" = NOW() AT TIME ZONE 'UTC'
WHERE ""id"" IN (
    SELECT ""id"" 
    FROM ""hangfire"".""jobqueue"" 
    WHERE ""queue"" = ANY @queues 
    AND ""fetchedat"" {0} 
    ORDER BY ""queue"", ""fetchedat"" 
    LIMIT 1
    FOR UPDATE
)
RETURNING ""id"" AS ""Id"", ""jobid"" AS ""JobId"", ""queue"" AS ""Queue"";

In the end, I think repeatable-read isolation and optimistic locking are quite similar in terms of the conflict-handling logic required of the application; it seems better to hand this burden of concurrency management over to the RDBMS. I have said enough on this subject and will send two PRs, for both FOR UPDATE and REPEATABLE READ, by Jan 13. We can then see the benchmark results for the three solutions.
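
For what it's worth, the expected two-session behaviour of the query above (hypothetical id):

-- Session A runs the query and locks, say, the row with id = 42 in the subquery.
-- Session B runs the same query before A commits: its subquery blocks on the
-- row lock; once A commits, PostgreSQL re-checks the row against B's WHERE
-- clause, sees "fetchedat" has just been set, and skips it, so B locks the
-- next candidate instead (or returns zero rows when none is left).
-- Either way no job is fetched twice, and an empty result just means "fetch again".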

@frankhommers
Collaborator

It might be better to order by fetchedat, jobid.
That way the oldest jobs are processed first (which is logical to users).

OK, I will put the FOR UPDATE version in master. Are these the 3 solutions you mean:

  1. For update,
  2. Repeatable Read,
  3. The "updatecount" in the latest dev commit.

@barhun
Contributor Author

barhun commented Jan 9, 2015

I think so; it is the behavior that users expect. However, it still breaks compatibility between storage providers, though that isn't a big problem.

Yes, I am talking about those three. I like being mostly busy talking about things. (: Kidding, I just haven't had time to set up, fork, patch and test.

@barhun
Contributor Author

barhun commented Jan 13, 2015

@frankhommers @odinserj Could you have a look at PR #6?

@frankhommers
Collaborator

@barhun I will have a look at it soon. But I've been thinking: I really want the updatecount flavor as well, so I think I will make it optional via the "UseNativeTransaction" StorageOptions setting. When it's false, it will fall back to updatecount.

The reason for this is that it will not depend on transactions that way, which will make it easier (in my opinion) to build a generic IDbConnection storage provider.

@kieranbenton
Contributor

I've just fallen foul of this, getting multiple dequeues of jobs with Hangfire.PostgreSql even when running a single server with multiple workers. I've been chasing this down all day and thought it was something to do with my DI registrations or the fancy filters I'd added - but obviously not :)

Is there anything I can do to help here? Provide a test case/unit test? I can't reproduce this 100% of the time, but approximately 1 run in 5 on a dual-core i7 with Postgres running locally (Windows 8.1 x64).

Is the PR ready to be pushed to a NuGet release?

EDIT: @odinserj I was also wondering whether it would be worth adding something to HF core to at least detect the double-processing state (in the dashboard only?) and make it obvious what has happened. It took me a while to dig through all my logs/jobs and see this:

[screenshot of the duplicated job executions]

@frankhommers
Collaborator

Everything discussed here is in the master branch, and it's all ready to push to a NuGet package. If you like, I can do that tomorrow. I would appreciate it if you tested this by compiling your own build from the current master branch. The unit tests are all OK, but they were in the previous nupkg as well ;-)

@kieranbenton
Contributor

@frankhommers I haven't been able to break a build from master when testing by hand. Can you please push to NuGet, and then I'll try to expand my testing suite?

@frankhommers
Collaborator

@kieranbenton I have published it. Just one little request: Can you try to run it with UseNativeDatabaseTransactions = false as well? Please report the results ;-)

@kieranbenton
Contributor

Will do Frank - it might be a day or two until I get round to it, but will let you know :)

@kieranbenton
Contributor

Right, I'm no longer getting multiple executions, but I do get: "could not serialize access due to concurrent update". That's with UseNativeDatabaseTransactions at its default value (false, I assume?).

@frankhommers
Collaborator

@kieranbenton The default value is true (as preferred by @odinserj and @barhun ;-)). Could you test with both true and false? And can you include more of the stack trace as well?

@barhun
Contributor Author

barhun commented Jan 27, 2015

@kieranbenton @frankhommers Serialization errors are supposed to provide the application with the error code 40001, so errors of that kind are treated as a signal to retry, then swallowed and not allowed to escape the scope of the utility method. I could be mistaken, though, so we ought to confirm the right error code to filter on.
http://www.postgresql.org/docs/9.4/static/errcodes-appendix.html

@frankhommers
Collaborator

That's what's being done when UseNativeDatabaseTransactions is set to true: https://github.com/frankhommers/Hangfire.PostgreSql/blob/master/src/Hangfire.PostgreSql/PostgreSqlJobQueue.cs#L106

@barhun
Contributor Author

barhun commented Jan 27, 2015

@frankhommers Yeah, I know. I wrote that piece of code. (: What I was trying to say was that I couldn't be sure which error code is returned when serialization fails. I just assumed the error code would be 40001 in any case of concurrent updates, going by the titles of the errors in the documentation.

EDIT: I haven't had the time yet, but I will write, asap, the code to figure out what error code Npgsql provides us with when a serialization attempt fails. The ErrorCode property is of type int, but I now see there are error codes whose values contain alphabetical characters. There may be a mapping between the PostgreSQL error codes and Npgsql's. I will come back to clarify.

@barhun
Contributor Author

barhun commented Jan 27, 2015

@odinserj @frankhommers @kieranbenton Oh, forgive me. I have just created another PR for the recent inconvenience. Check it out: #7

@frankhommers
Collaborator

@kieranbenton @barhun I will push a new version to nuget tomorrow. That will have @kieranbenton's problem fixed ;-) Thanks @barhun

@kieranbenton
Contributor

And I hadn't even had a chance to reply! Thanks guys. Do you still want it tested with native transactions off?

@frankhommers
Collaborator

Yes!

@barhun
Contributor Author

barhun commented Jan 27, 2015

@frankhommers @kieranbenton You're welcome. Sorry, again, for the mess I caused.

By the way, the ErrorCode property has the same value as HResult, which is obviously unrelated to PostgreSQL error codes. I feel like an idiot. (:
[screenshot of the ErrorCode and HResult values]

@frankhommers
Collaborator

@barhun No problem man, we're all human ;-) @kieranbenton A new version is pushed to NuGet. Any testing results yet?

@barhun
Contributor Author

barhun commented Feb 9, 2015

@kieranbenton Hey, is everything OK? I'm going to close this one if so. (:
