Duplicate key value violates unique constraint #1719

Closed
mpanne opened this issue Feb 6, 2020 · 22 comments
Labels: bug (Something is not working.) · package/oauth2 · upstream (Issue is caused by an upstream dependency.)

mpanne commented Feb 6, 2020

Describe the bug

For our project we are using Hydra with a PostgreSQL database. Sometimes, when an access token is requested via the /oauth2/token endpoint, Hydra returns a 500 Internal Server Error and logs the following error message:

level=error msg="An error occurred" debug="pq: duplicate key value violates unique constraint \"hydra_oauth2_access_request_id_idx\": Unable to insert or update resource because a resource with that value exists already" description="The authorization server encountered an unexpected condition that prevented it from fulfilling the request" error=server_error

Hydra seems to try to insert an access token request into the database with a request ID that already exists. What we observed is that every time this happened, multiple (2–3) token requests had been received within a very short time period (<100 ms).

Have you experienced this at some point, or do you have any hint as to what the root cause could be?
We are using an older version (v1.0.0), but I couldn't find anything regarding this in the changelog.

Reproducing the bug

We could not reproduce the bug, not even when sending ~500 token requests (almost) simultaneously.

Environment

  • Version: v1.0.0
  • Environment: Docker, PostgreSQL DB
aeneasr (Member) commented Feb 6, 2020

Never seen this before, could you set LOG_LEVEL=trace and print out the trace? That would help a lot! Please also include logs surrounding the error (so we can see what requests are happening etc).

It would also be interesting to see if this happens with the latest version.

mpanne (Author) commented Feb 7, 2020

We will set the log level as requested and provide additional logs as soon as it happens again, thank you!
We are also about to update to a newer Hydra version soon.

mpanne (Author) commented Feb 7, 2020

Is log level trace even supported? As stated here, only debug is supported. And even when setting the log level to debug, we cannot observe any debug logs.

aeneasr (Member) commented Feb 7, 2020

Yeah, that's a docs issue. trace should work, but you can also use debug if you want. The error messages, if any occur, should be enhanced with stack traces.

mpanne (Author) commented Feb 10, 2020

It happened again today. Below are the logs, including the debug output. The request_id is randomly generated and added to the request headers by our nginx ingress controller.

time="2020-02-10T09:42:18Z" level=info msg="started handling request" method=POST remote=172.20.6.158 request=/oauth2/token request_id=16ebd03dde4752f4d6b551c05e885e06
time="2020-02-10T09:42:18Z" level=info msg="started handling request" method=POST remote=172.20.2.196 request=/oauth2/token request_id=59f3ef389a2228f18ed3fd996d10a2d8
time="2020-02-10T09:42:19Z" level=debug msg="Got an empty session in toRequest"
time="2020-02-10T09:42:19Z" level=debug msg="Got an empty session in toRequest"
time="2020-02-10T09:42:19Z" level=error msg="An error occurred" debug="pq: duplicate key value violates unique constraint \"hydra_oauth2_access_request_id_idx\": Unable to insert or update resource because a resource with that value exists already" description="The authorization server encountered an unexpected condition that prevented it from fulfilling the request" error=server_error
time="2020-02-10T09:42:19Z" level=debug msg="Stack trace:
github.com/ory/fosite/handler/oauth2.(*RefreshTokenGrantHandler).PopulateTokenEndpointResponse
	/go/pkg/mod/github.com/ory/fosite@v0.30.0/handler/oauth2/flow_refresh.go:164
github.com/ory/fosite.(*Fosite).NewAccessResponse
	/go/pkg/mod/github.com/ory/fosite@v0.30.0/access_response_writer.go:36
github.com/ory/hydra/oauth2.(*Handler).TokenHandler
	/go/src/github.com/ory/hydra/oauth2/handler.go:585
net/http.HandlerFunc.ServeHTTP
	/usr/local/go/src/net/http/server.go:2007
github.com/julienschmidt/httprouter.(*Router).Handler.func1
	/go/pkg/mod/github.com/julienschmidt/httprouter@v1.2.0/params_go17.go:26
github.com/julienschmidt/httprouter.(*Router).ServeHTTP
	/go/pkg/mod/github.com/julienschmidt/httprouter@v1.2.0/router.go:334
github.com/urfave/negroni.Wrap.func1
	/go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:46
github.com/urfave/negroni.HandlerFunc.ServeHTTP
	/go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:29
github.com/urfave/negroni.middleware.ServeHTTP
	/go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:38
net/http.HandlerFunc.ServeHTTP
	/usr/local/go/src/net/http/server.go:2007
github.com/ory/hydra/x.RejectInsecureRequests.func1
	/go/src/github.com/ory/hydra/x/tls_termination.go:55
github.com/urfave/negroni.HandlerFunc.ServeHTTP
	/go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:29
github.com/urfave/negroni.middleware.ServeHTTP
	/go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:38
github.com/ory/x/metricsx.(*Service).ServeHTTP
	/go/pkg/mod/github.com/ory/x@v0.0.73/metricsx/middleware.go:261
github.com/urfave/negroni.middleware.ServeHTTP
	/go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:38
github.com/ory/hydra/metrics/prometheus.(*MetricsManager).ServeHTTP
	/go/src/github.com/ory/hydra/metrics/prometheus/middleware.go:26
github.com/urfave/negroni.middleware.ServeHTTP
	/go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:38
github.com/meatballhat/negroni-logrus.(*Middleware).ServeHTTP
	/go/pkg/mod/github.com/meatballhat/negroni-logrus@v0.0.0-20170801195057-31067281800f/middleware.go:136
github.com/urfave/negroni.middleware.ServeHTTP
	/go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:38
github.com/urfave/negroni.(*Negroni).ServeHTTP
	/go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:96
net/http.serverHandler.ServeHTTP
	/usr/local/go/src/net/http/server.go:2802
net/http.(*conn).serve
	/usr/local/go/src/net/http/server.go:1890
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1357"
time="2020-02-10T09:42:19Z" level=info msg="completed handling request" measure#hydra/public: https://***.latency=110682774 method=POST remote=172.20.2.196 request=/oauth2/token request_id=59f3ef389a2228f18ed3fd996d10a2d8 status=500 text_status="Internal Server Error" took=110.682774ms
time="2020-02-10T09:42:19Z" level=info msg="completed handling request" measure#hydra/public: https://***.latency=146429073 method=POST remote=172.20.6.158 request=/oauth2/token request_id=16ebd03dde4752f4d6b551c05e885e06 status=200 text_status=OK took=146.429073ms

aeneasr (Member) commented Feb 10, 2020

This is still 1.0.0, right?

aeneasr (Member) commented Feb 10, 2020

It appears that this is caused by two processes trying to refresh the access token at the same time, and because the first transaction is not yet committed, one of them fails. It is actually intended that this results in an error condition (although this error isn't really a "good" symptom), because a refresh token is supposed to be used for a refresh only once. Otherwise one refresh token could be used to create two (three, four, five) tokens and so on.

aeneasr (Member) commented Feb 10, 2020

I've looked around in some other projects. The general guideline is to have clients refresh tokens only once. This can be tricky sometimes, though, when there is no shared state between the components. It appears that some providers implement a grace period (e.g. 60 seconds) during which the old refresh token returns the same response. I don't think that would solve your problem though, because the requests are apparently only nanoseconds apart - so fast that the first database transaction hasn't even completed yet.

I'd suggest you add a random tick (e.g. -10 ±rnd(0,5)s) to your clients' refresh schedule to prevent synchronization issues. They can still occur, of course, but they should be much less common.
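
For clients you do control, a minimal sketch of such a randomized refresh schedule might look like this (plain Go; the refresh callback is a stand-in, not part of Hydra):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// refreshWithJitter waits until shortly before the access token expires and
// then calls refresh(). The random offset spreads refreshes from multiple
// client instances so they don't hit /oauth2/token at the same moment.
func refreshWithJitter(expiresIn time.Duration, refresh func() error) error {
	// Refresh ~10s before expiry, shifted by a random ±5s
	// (the "-10 ±rnd(0,5)s" tick suggested above).
	jitter := time.Duration(rand.Int63n(int64(10*time.Second))) - 5*time.Second
	wait := expiresIn - 10*time.Second + jitter
	if wait < 0 {
		wait = 0
	}
	time.Sleep(wait)
	return refresh()
}

func main() {
	// Dummy callback standing in for the real refresh_token grant request.
	err := refreshWithJitter(12*time.Second, func() error {
		fmt.Println("refreshing token now")
		return nil
	})
	if err != nil {
		fmt.Println("refresh failed:", err)
	}
}
```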

mpanne (Author) commented Feb 10, 2020

Yes, it's still 1.0.0.
Yes, we are aware of the fact that the refresh token is only valid for a single refresh request.
Unfortunately we do not own the clients requesting/refreshing the tokens, so we cannot implement the random tick on the client side. We will try to clarify why the same refresh token is used twice almost simultaneously, though. Maybe we can find a solution together with our customer. Thank you!

aeneasr (Member) commented Feb 10, 2020

No problem, I'm pretty sure that's what's causing it :)

mpanne (Author) commented Feb 10, 2020

Yes, that sounds reasonable. Could using a newer version fix this issue?

aeneasr (Member) commented Feb 10, 2020

I don't think so. I've also looked into this a bit more. We can't reliably echo the same response twice, because that would mean we would have to cache the full valid access token somewhere (e.g. the database), which we have explicitly said we won't do.

What you could obviously try is to reduce the latency between Hydra and the DB - 130ms seems like a lot to me.

aaslamin (Contributor)

What I am stumped by is how two requests with the same request ID were proxied to your Hydra nodes?

That should definitely not happen. What does your NGINX config look like for your Hydra routes? If I remember correctly, some old versions of NGINX would even retry non-idempotent upstream requests by default. Although I doubt you are on one of those versions.

With that said, I noticed you are not injecting UUID v4s, which would make it very unlikely to generate the same ID twice. Finally, double-check your LB config to make sure a client cannot add a request ID header to their request.

aeneasr (Member) commented Feb 11, 2020

@aaslamin great to see you here again :)

I checked the code, and I think the problem is happening because we use the original request ID to track refresh/access tokens over time. So what's happening is:

  1. refresh token is fetched from store (request id is extracted)
  2. old tokens are removed
  3. new tokens are minted with extracted request id
  4. new tokens are inserted into store

What I'm curious about is that the transaction apparently isn't helping here. Instead of a conflict, it should give a 404 (refresh token not found). It might be because the first SELECT does not lock the row (which is, performance-wise, a good thing lmao), so we get into a state where the 429 can happen.

I'm not sure if there's really a way for us to fix this. If we generated a new request ID, you would end up with two tokens. The only thing we could do is figure out how to return 404 instead of 429, but it would definitely touch some internals, potentially even individually for every database.
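
To make the race concrete, here is a toy sketch of why the unique index fires under Postgres' default READ COMMITTED isolation - it is not Hydra's real schema or storage code, just two tables with an analogous unique index and a placeholder DSN:

```go
// Toy reproduction of the race. NOT Hydra's schema: toy_access has a unique
// index on request_id, analogous to hydra_oauth2_access_request_id_idx.
package main

import (
	"database/sql"
	"fmt"
	"sync"

	_ "github.com/lib/pq"
)

const setup = `
CREATE TABLE IF NOT EXISTS toy_refresh (signature TEXT PRIMARY KEY, request_id TEXT);
CREATE TABLE IF NOT EXISTS toy_access  (signature TEXT PRIMARY KEY, request_id TEXT UNIQUE);
DELETE FROM toy_access; DELETE FROM toy_refresh;
INSERT INTO toy_refresh VALUES ('refresh-sig', 'req-1');
INSERT INTO toy_access  VALUES ('old-access-sig', 'req-1');
`

// refreshOnce mimics steps 1-4 above: read the refresh token's request id,
// delete the old access token, insert a new one under the SAME request id.
// When two of these run concurrently, both reads succeed before either
// delete is committed, so the loser's insert violates the unique index.
func refreshOnce(db *sql.DB, n int) error {
	tx, err := db.Begin() // READ COMMITTED unless TxOptions say otherwise
	if err != nil {
		return err
	}
	defer tx.Rollback()

	var requestID string
	if err := tx.QueryRow(`SELECT request_id FROM toy_refresh WHERE signature = 'refresh-sig'`).Scan(&requestID); err != nil {
		return err
	}
	if _, err := tx.Exec(`DELETE FROM toy_access WHERE request_id = $1`, requestID); err != nil {
		return err
	}
	if _, err := tx.Exec(`INSERT INTO toy_access (signature, request_id) VALUES ($1, $2)`,
		fmt.Sprintf("new-access-sig-%d", n), requestID); err != nil {
		return err // "duplicate key value violates unique constraint" for the loser
	}
	return tx.Commit()
}

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/toy?sslmode=disable") // placeholder DSN
	if err != nil {
		panic(err)
	}
	if _, err := db.Exec(setup); err != nil {
		panic(err)
	}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			if err := refreshOnce(db, n); err != nil {
				fmt.Println("refresh", n, "failed:", err)
			}
		}(i)
	}
	wg.Wait()
}
```

Under REPEATABLE READ the losing transaction would instead fail earlier, on the DELETE, with "could not serialize access due to concurrent update" (see the Postgres docs quoted further down), which maps much more cleanly onto a "refresh token not found" response.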

aeneasr (Member) commented Feb 11, 2020

This is where the request ID ends up causing the 429:

https://github.com/ory/fosite/blob/master/handler/oauth2/flow_refresh.go#L159-L160

This is the "fetch from store" I was referring to:

https://github.com/ory/fosite/blob/master/handler/oauth2/flow_refresh.go#L135-L140

Actually wondering now if a rollback could cause the old refresh token to come back into the store...

aeneasr (Member) commented Feb 11, 2020

Looks like this can be solved by changing the IsolationLevel: https://golang.org/pkg/database/sql/#IsolationLevel

Which can be set for transactions: https://golang.org/pkg/database/sql/#TxOptions

So basically, by picking another isolation level, this wouldn't return the 429 but instead a 404 (or rather a 401/403 as the appropriate error type).
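
For reference, requesting that per transaction with database/sql is a one-liner (placeholder DSN; not Hydra's actual persistence code):

```go
// Minimal illustration of requesting a per-transaction isolation level via
// database/sql, as in the links above.
package main

import (
	"context"
	"database/sql"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/hydra?sslmode=disable") // placeholder DSN
	if err != nil {
		panic(err)
	}

	// Ask for REPEATABLE READ instead of Postgres' default READ COMMITTED.
	tx, err := db.BeginTx(context.Background(), &sql.TxOptions{Isolation: sql.LevelRepeatableRead})
	if err != nil {
		panic(err)
	}
	// ... run the refresh-flow statements here ...
	_ = tx.Rollback()
}
```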

aeneasr (Member) commented Feb 11, 2020

Yeah I think by setting the isolation level to LevelRepeatableRead the 429 should go away

aeneasr added the "bug" label Feb 11, 2020
aeneasr (Member) commented Feb 11, 2020

Last comment for now - marking this as a bug :)

aeneasr added this to the v1.3.0 milestone Feb 11, 2020
aeneasr added the "package/oauth2" and "upstream" labels Feb 11, 2020
aaslamin (Contributor) commented Feb 11, 2020

@aaslamin great to see you here again :)

🍻 cheers!

The only thing we could do is figure out how to return 404 instead of 429

You mean a 409 (Conflict), right? It's currently returning a 5xx error.

Yeah I think by setting the isolation level to LevelRepeatableRead the 429 should go away

I think you are right 🙏

  • The default isolation level in MySQL using InnoDB as the storage engine (the most common one, I believe) is Repeatable Read.
  • Postgres only has one storage engine (as opposed to nine in the MySQL 🌎...) and its default isolation level is Read Committed.

The behavioral difference in this isolation mode is significant:

UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the transaction start time. However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the repeatable read transaction will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first updater rolls back, then its effects are negated and the repeatable read transaction can proceed with updating the originally found row. But if the first updater commits (and actually updated or deleted the row, not just locked it) then the repeatable read transaction will be rolled back with the message:
ERROR: could not serialize access due to concurrent update

Source: https://www.postgresql.org/docs/12/transaction-iso.html
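
If in doubt, the server defaults can be checked directly; a quick sketch with placeholder DSNs (the MySQL variable name assumes 5.7.20+/8.0):

```go
// Double-check the default isolation levels mentioned above.
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
	_ "github.com/lib/pq"
)

func main() {
	pg, err := sql.Open("postgres", "postgres://user:pass@localhost/hydra?sslmode=disable") // placeholder DSN
	if err != nil {
		panic(err)
	}
	var pgLevel string
	if err := pg.QueryRow(`SHOW default_transaction_isolation`).Scan(&pgLevel); err == nil {
		fmt.Println("postgres default:", pgLevel) // expected: "read committed"
	}

	my, err := sql.Open("mysql", "user:pass@tcp(localhost:3306)/hydra") // placeholder DSN
	if err != nil {
		panic(err)
	}
	var myLevel string
	if err := my.QueryRow(`SELECT @@transaction_isolation`).Scan(&myLevel); err == nil {
		fmt.Println("mysql default:", myLevel) // expected: "REPEATABLE-READ"
	}
}
```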


For a fix, I would love to see:

  • A test case that brings up a PostgreSQL instance and does a stress test on the refresh token flow using both isolation levels:
    • Repeatable Read
    • Read Committed (just to prove that you would run into an error w/ this isolation level)
  • It may be a bit gnarly to write this test, but basically you would need to:
    • Prior to the start of the test, seed the test DB so there is an active refresh token session
    • Have N refresh token requests lined up ready to go
    • Fire all requests concurrently and let them fight it out 👊

This brings up another point. Ideally, any first-class supported database in Hydra should have full support for the Repeatable Read isolation level - perhaps this could be a mandatory requirement if someone wants to add another implementation of the fosite storage interfaces for a new DB in the future. More accurately, any database added to Hydra that wishes to implement fosite's transactional interface should support this isolation level.

aeneasr (Member) commented Feb 11, 2020

Yeah, 409! :)

I think Repeatable Read is an acceptable trade-off for refreshing the tokens, since that operation doesn't happen often for a specific row (which, if I understood correctly, will be locked, not the entire table).

I think the test case makes sense. One refresh token should be enough, though. You can probably fire n goroutines that try to refresh the token concurrently, and you should definitely see the error.
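
A rough sketch of what such a concurrency check could look like from the outside, against the HTTP token endpoint (the URL, client credentials, and seeded refresh token below are all placeholders):

```go
// Fire n concurrent refresh_token grant requests with the same refresh token
// and inspect the status codes. Assumes a running Hydra and a seeded token.
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
	"sync"
)

const (
	hydraTokenURL    = "http://127.0.0.1:4444/oauth2/token" // placeholder
	clientID         = "test-client"                        // placeholder
	clientSecret     = "secret"                             // placeholder
	seedRefreshToken = "REPLACE_WITH_SEEDED_REFRESH_TOKEN"  // placeholder
)

func refresh(token string) (int, error) {
	form := url.Values{
		"grant_type":    {"refresh_token"},
		"refresh_token": {token},
	}
	req, err := http.NewRequest("POST", hydraTokenURL, strings.NewReader(form.Encode()))
	if err != nil {
		return 0, err
	}
	req.SetBasicAuth(clientID, clientSecret)
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	const n = 10
	var wg sync.WaitGroup
	codes := make(chan int, n)

	// Fire n concurrent refresh requests using the SAME refresh token.
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			code, err := refresh(seedRefreshToken)
			if err != nil {
				fmt.Println("request error:", err)
				return
			}
			codes <- code
		}()
	}
	wg.Wait()
	close(codes)

	// Expectation after the fix: exactly one 200, the rest 4xx - and no 500s.
	for code := range codes {
		fmt.Println("status:", code)
	}
}
```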

Regarding isolation - this really is only required for things where a write is dependent on a previous read. I'm not sure if setting RepeatableRead for every transaction is performant? But then again, the performance impact is probably minimal while the consistency is much better. Maybe MaybeBeginTx should always request RepeatableRead isolation.

aaslamin (Contributor)

Maybe MaybeBeginTx should always request RepeatableRead isolation.

That's an interesting approach, although currently fosite doesn't have any understanding of what an isolation level is. Maybe it should? 😀

Perhaps, before BeginTx is called, fosite could inject a recommended/desired isolation level under a unique key into the proxied context. We could even use the types defined in the sql package as the context values, although that may make it too tightly coupled 🤷‍♂

Finally, as a best practice, a fosite-compliant authorization server should check for this key and configure its TxOptions accordingly prior to creating the transaction.

This way we can have flexibility in isolation levels for various flows instead of going all out with RepeatableRead, which, as you pointed out, could impact performance.
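
Sketched out, that handoff could look roughly like this - the context key and helper names are hypothetical, not an existing fosite API:

```go
// Hypothetical context-based handoff of a recommended isolation level.
package main

import (
	"context"
	"database/sql"
)

type isolationKey struct{}

// WithRecommendedIsolation is what fosite could (hypothetically) call before
// starting a flow that needs stronger guarantees, e.g. the refresh grant.
func WithRecommendedIsolation(ctx context.Context, level sql.IsolationLevel) context.Context {
	return context.WithValue(ctx, isolationKey{}, level)
}

// txOptionsFromContext is what the storage side (e.g. Hydra's MaybeBeginTx)
// could use to build its TxOptions, falling back to the driver default when
// nothing was requested.
func txOptionsFromContext(ctx context.Context) *sql.TxOptions {
	if level, ok := ctx.Value(isolationKey{}).(sql.IsolationLevel); ok {
		return &sql.TxOptions{Isolation: level}
	}
	return nil // nil TxOptions means "use the driver's default"
}

func main() {
	ctx := WithRecommendedIsolation(context.Background(), sql.LevelRepeatableRead)
	_ = txOptionsFromContext(ctx) // would be passed to db.BeginTx(ctx, opts)
}
```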

aeneasr (Member) commented Feb 12, 2020

That's an interesting approach, although currently fosite doesn't have any understanding of what an isolation level is. Maybe it should? 😀

Yeah, I think that's ok. There's actually so much implicit knowledge already when it comes to the storage implementation that I gave up on imposing that in fosite to be honest.

I think setting the appropriate isolation level in BeginTx in Hydra would be a good solution, plus having a failing test to verify the implementation.

aaslamin added a commit to aaslamin/hydra that referenced this issue Feb 19, 2020
aaslamin added a commit to aaslamin/hydra that referenced this issue Mar 21, 2020
aaslamin added a commit to aaslamin/hydra that referenced this issue Mar 22, 2020
aaslamin added a commit to aaslamin/hydra that referenced this issue Mar 22, 2020
aaslamin added a commit to aaslamin/fosite that referenced this issue Mar 24, 2020
This commit provides the functionality required to address ory/hydra#1719 & ory/hydra#1735 by adding error checking to the RefreshTokenGrantHandler's PopulateTokenEndpointResponse method so it can deal with errors due to concurrent access. This will allow the authorization server to render a better error to the user-agent.
aaslamin added a commit to aaslamin/fosite that referenced this issue Mar 24, 2020
aaslamin added a commit to aaslamin/fosite that referenced this issue Mar 24, 2020
aeneasr pushed a commit to ory/fosite that referenced this issue Mar 25, 2020
…t handler (#402)

This commit provides the functionality required to address ory/hydra#1719 & ory/hydra#1735 by adding error checking to the RefreshTokenGrantHandler's PopulateTokenEndpointResponse method so it can deal with errors due to concurrent access. This will allow the authorization server to render a better error to the user-agent.

No longer returns fosite.ErrServerError in the event the storage lookup fails. Instead, a wrapped fosite.ErrNotFound is returned when fetching the refresh token fails due to it no longer being present. This scenario is caused when the user sends two or more requests to refresh using the same token and one request gets into the handler just after the prior request finished and successfully committed its transaction.

Adds unit test coverage for transaction error handling logic added to the RefreshTokenGrantHandler's PopulateTokenEndpointResponse method
aaslamin added a commit to aaslamin/hydra that referenced this issue Mar 25, 2020
aaslamin added a commit to aaslamin/hydra that referenced this issue Mar 25, 2020
aaslamin added a commit to aaslamin/hydra that referenced this issue Mar 26, 2020