Duplicate key value violates unique constraint #1719

Closed
mpanne opened this issue Feb 6, 2020 · 22 comments
Labels: bug (Something is not working.) · package/oauth2 · upstream (Issue is caused by an upstream dependency.)

mpanne commented Feb 6, 2020

Describe the bug

For our project we are using Hydra with a PostgreSQL database. Sometimes, when an access token is requested via the /oauth2/token endpoint, Hydra returns a 500 Internal Server Error and logs the following error message:

level=error msg="An error occurred" debug="pq: duplicate key value violates unique constraint \"hydra_oauth2_access_request_id_idx\": Unable to insert or update resource because a resource with that value exists already" description="The authorization server encountered an unexpected condition that prevented it from fulfilling the request" error=server_error

Hydra seems to try to insert an access token request into the database with a request ID that already exists. What we observed is that every time this happened, multiple (2–3) token requests had been received within a very short time period (<100 ms).

Have you experienced this at some point, or do you have any hint as to what the root cause could be?
We are using an older version (v1.0.0), but I couldn't find anything regarding this in the changelog.

Reproducing the bug

We could not reproduce the bug, not even when sending ~500 token requests (almost) simultaneously.

Environment

  • Version: v1.0.0
  • Environment: Docker, PostgreSQL DB
aeneasr (Member) commented Feb 6, 2020

Never seen this before, could you set LOG_LEVEL=trace and print out the trace? That would help a lot! Please also include logs surrounding the error (so we can see what requests are happening etc).

It would also be interesting to see if this happens with the latest version.

mpanne (Author) commented Feb 7, 2020

We will set the log level as requested and provide additional logs as soon as it happens again, thank you!
We are also about to update to a newer Hydra version soon.

mpanne (Author) commented Feb 7, 2020

Is log level trace even supported? As stated here, only debug is supported. And even when setting the log level to debug, we cannot observe any debug logs.

aeneasr (Member) commented Feb 7, 2020

Yeah, that's a docs issue. trace should work, but you can also use debug if you want. The error messages, if any occur, should be enhanced with stack traces.

mpanne (Author) commented Feb 10, 2020

It happened again today. Below are the logs, including the debug output. The request_id is randomly generated and added to the request headers by our nginx ingress controller.

time="2020-02-10T09:42:18Z" level=info msg="started handling request" method=POST remote=172.20.6.158 request=/oauth2/token request_id=16ebd03dde4752f4d6b551c05e885e06
time="2020-02-10T09:42:18Z" level=info msg="started handling request" method=POST remote=172.20.2.196 request=/oauth2/token request_id=59f3ef389a2228f18ed3fd996d10a2d8
time="2020-02-10T09:42:19Z" level=debug msg="Got an empty session in toRequest"
time="2020-02-10T09:42:19Z" level=debug msg="Got an empty session in toRequest"
time="2020-02-10T09:42:19Z" level=error msg="An error occurred" debug="pq: duplicate key value violates unique constraint \"hydra_oauth2_access_request_id_idx\": Unable to insert or update resource because a resource with that value exists already" description="The authorization server encountered an unexpected condition that prevented it from fulfilling the request" error=server_error
time="2020-02-10T09:42:19Z" level=debug msg="Stack trace:
github.com/ory/fosite/handler/oauth2.(*RefreshTokenGrantHandler).PopulateTokenEndpointResponse
	/go/pkg/mod/github.com/ory/fosite@v0.30.0/handler/oauth2/flow_refresh.go:164
github.com/ory/fosite.(*Fosite).NewAccessResponse
	/go/pkg/mod/github.com/ory/fosite@v0.30.0/access_response_writer.go:36
github.com/ory/hydra/oauth2.(*Handler).TokenHandler
	/go/src/github.com/ory/hydra/oauth2/handler.go:585
net/http.HandlerFunc.ServeHTTP
	/usr/local/go/src/net/http/server.go:2007
github.com/julienschmidt/httprouter.(*Router).Handler.func1
	/go/pkg/mod/github.com/julienschmidt/httprouter@v1.2.0/params_go17.go:26
github.com/julienschmidt/httprouter.(*Router).ServeHTTP
	/go/pkg/mod/github.com/julienschmidt/httprouter@v1.2.0/router.go:334
github.com/urfave/negroni.Wrap.func1
	/go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:46
github.com/urfave/negroni.HandlerFunc.ServeHTTP
	/go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:29
github.com/urfave/negroni.middleware.ServeHTTP
	/go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:38
net/http.HandlerFunc.ServeHTTP
	/usr/local/go/src/net/http/server.go:2007
github.com/ory/hydra/x.RejectInsecureRequests.func1
	/go/src/github.com/ory/hydra/x/tls_termination.go:55
github.com/urfave/negroni.HandlerFunc.ServeHTTP
	/go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:29
github.com/urfave/negroni.middleware.ServeHTTP
	/go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:38
github.com/ory/x/metricsx.(*Service).ServeHTTP
	/go/pkg/mod/github.com/ory/x@v0.0.73/metricsx/middleware.go:261
github.com/urfave/negroni.middleware.ServeHTTP
	/go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:38
github.com/ory/hydra/metrics/prometheus.(*MetricsManager).ServeHTTP
	/go/src/github.com/ory/hydra/metrics/prometheus/middleware.go:26
github.com/urfave/negroni.middleware.ServeHTTP
	/go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:38
github.com/meatballhat/negroni-logrus.(*Middleware).ServeHTTP
	/go/pkg/mod/github.com/meatballhat/negroni-logrus@v0.0.0-20170801195057-31067281800f/middleware.go:136
github.com/urfave/negroni.middleware.ServeHTTP
	/go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:38
github.com/urfave/negroni.(*Negroni).ServeHTTP
	/go/pkg/mod/github.com/urfave/negroni@v1.0.0/negroni.go:96
net/http.serverHandler.ServeHTTP
	/usr/local/go/src/net/http/server.go:2802
net/http.(*conn).serve
	/usr/local/go/src/net/http/server.go:1890
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1357"
time="2020-02-10T09:42:19Z" level=info msg="completed handling request" measure#hydra/public: https://***.latency=110682774 method=POST remote=172.20.2.196 request=/oauth2/token request_id=59f3ef389a2228f18ed3fd996d10a2d8 status=500 text_status="Internal Server Error" took=110.682774ms
time="2020-02-10T09:42:19Z" level=info msg="completed handling request" measure#hydra/public: https://***.latency=146429073 method=POST remote=172.20.6.158 request=/oauth2/token request_id=16ebd03dde4752f4d6b551c05e885e06 status=200 text_status=OK took=146.429073ms

aeneasr (Member) commented Feb 10, 2020

This is still 1.0.0, right?

aeneasr (Member) commented Feb 10, 2020

It appears that this is caused by two processes trying to refresh the access token at the same time, and because the first transaction is not yet committed, one of them fails. It is actually intended that this results in an error condition (although this error isn't really a "good" symptom), because a refresh token is supposed to be used for a refresh only once. Otherwise one refresh token could be used to create two (three, four, five) tokens and so on.

aeneasr (Member) commented Feb 10, 2020

I've looked around in some other projects. The general guideline is to have clients refresh tokens only once. This can be tricky sometimes, though, when there is no shared state between the components. It appears that some providers implement a grace period (e.g. 60 seconds) during which the old refresh token returns the same response. I don't think that would solve your problem though, because the requests are apparently only nanoseconds apart - so fast that the first database transaction hasn't even completed yet.

I'd suggest you add a random tick (e.g. -10 ±rnd(0,5)s) to your clients' refresh schedule to prevent synchronization issues. They can still occur, of course, but they should be much less common.
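
For clients you do control, a minimal sketch of such a randomized refresh schedule might look like this (plain Go; the refresh callback is a stand-in, not part of Hydra):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// refreshWithJitter waits until shortly before the access token expires and
// then calls refresh(). The random offset spreads refreshes from multiple
// client instances so they don't hit /oauth2/token at the same moment.
func refreshWithJitter(expiresIn time.Duration, refresh func() error) error {
	// Refresh ~10s before expiry, shifted by a random ±5s
	// (the "-10 ±rnd(0,5)s" tick suggested above).
	jitter := time.Duration(rand.Int63n(int64(10*time.Second))) - 5*time.Second
	wait := expiresIn - 10*time.Second + jitter
	if wait < 0 {
		wait = 0
	}
	time.Sleep(wait)
	return refresh()
}

func main() {
	// Dummy callback standing in for the real refresh_token grant request.
	err := refreshWithJitter(12*time.Second, func() error {
		fmt.Println("refreshing token now")
		return nil
	})
	if err != nil {
		fmt.Println("refresh failed:", err)
	}
}
```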

mpanne (Author) commented Feb 10, 2020

Yes, it's still 1.0.0.
Yes, we are aware of the fact that the refresh token is only valid for a single refresh request.
Unfortunately we do not own the clients requesting/refreshing the tokens, so we cannot implement the random tick on the client side. We will try to clarify why the same refresh token is used twice almost simultaneously, though. Maybe we can find a solution together with our customer. Thank you!

aeneasr (Member) commented Feb 10, 2020

No problem, I'm pretty sure that's what's causing it :)

mpanne (Author) commented Feb 10, 2020

Yes, that sounds reasonable. Could using a newer version fix this issue?

aeneasr (Member) commented Feb 10, 2020

I don't think so. I've also looked into this a bit more. We can't reliably echo the same response twice, because that would mean we would have to cache the full valid access token somewhere (e.g. the database), which we have explicitly said we won't do.

What you could obviously try is to reduce the latency between Hydra and the DB - 130ms seems like a lot to me.

aaslamin (Contributor)

What I am stumped by is how two requests with the same request ID were proxied to your Hydra nodes?

That should definitely not happen. What does your NGINX config look like for your Hydra routes? If I remember correctly, some old versions of NGINX would even retry non-idempotent upstream requests by default. Although I doubt you are on one of those versions.

With that said, I noticed you are not injecting UUID v4s, which would make it very unlikely to generate the same ID twice. Finally, double-check your LB config to make sure a client cannot add a request ID header to their request.

aeneasr (Member) commented Feb 11, 2020

@aaslamin great to see you here again :)

I checked the code, and I think the problem is happening because we use the original request ID to track refresh/access tokens over time. So what's happening is:

  1. refresh token is fetched from store (request id is extracted)
  2. old tokens are removed
  3. new tokens are minted with extracted request id
  4. new tokens are inserted into store

What I'm curious about is that the transaction apparently isn't helping here. Instead of a conflict, it should give a 404 (refresh token not found). It might be because the first SELECT does not lock the row (which is, performance-wise, a good thing lmao), so we get into a state where the 429 can happen.

I'm not sure if there's really a way for us to fix this. If we generated a new request ID, you would end up with two tokens. The only thing we could do is figure out how to return 404 instead of 429, but it would definitely touch some internals, potentially even individually for every database.
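
To make the race concrete, here is a toy sketch of why the unique index fires under Postgres' default READ COMMITTED isolation - it is not Hydra's real schema or storage code, just two tables with an analogous unique index and a placeholder DSN:

```go
// Toy reproduction of the race. NOT Hydra's schema: toy_access has a unique
// index on request_id, analogous to hydra_oauth2_access_request_id_idx.
package main

import (
	"database/sql"
	"fmt"
	"sync"

	_ "github.com/lib/pq"
)

const setup = `
CREATE TABLE IF NOT EXISTS toy_refresh (signature TEXT PRIMARY KEY, request_id TEXT);
CREATE TABLE IF NOT EXISTS toy_access  (signature TEXT PRIMARY KEY, request_id TEXT UNIQUE);
DELETE FROM toy_access; DELETE FROM toy_refresh;
INSERT INTO toy_refresh VALUES ('refresh-sig', 'req-1');
INSERT INTO toy_access  VALUES ('old-access-sig', 'req-1');
`

// refreshOnce mimics steps 1-4 above: read the refresh token's request id,
// delete the old access token, insert a new one under the SAME request id.
// When two of these run concurrently, both reads succeed before either
// delete is committed, so the loser's insert violates the unique index.
func refreshOnce(db *sql.DB, n int) error {
	tx, err := db.Begin() // READ COMMITTED unless TxOptions say otherwise
	if err != nil {
		return err
	}
	defer tx.Rollback()

	var requestID string
	if err := tx.QueryRow(`SELECT request_id FROM toy_refresh WHERE signature = 'refresh-sig'`).Scan(&requestID); err != nil {
		return err
	}
	if _, err := tx.Exec(`DELETE FROM toy_access WHERE request_id = $1`, requestID); err != nil {
		return err
	}
	if _, err := tx.Exec(`INSERT INTO toy_access (signature, request_id) VALUES ($1, $2)`,
		fmt.Sprintf("new-access-sig-%d", n), requestID); err != nil {
		return err // "duplicate key value violates unique constraint" for the loser
	}
	return tx.Commit()
}

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/toy?sslmode=disable") // placeholder DSN
	if err != nil {
		panic(err)
	}
	if _, err := db.Exec(setup); err != nil {
		panic(err)
	}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			if err := refreshOnce(db, n); err != nil {
				fmt.Println("refresh", n, "failed:", err)
			}
		}(i)
	}
	wg.Wait()
}
```

Under REPEATABLE READ the losing transaction would instead fail earlier, on the DELETE, with "could not serialize access due to concurrent update" (see the Postgres docs quoted further down), which maps much more cleanly onto a "refresh token not found" response.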

aeneasr (Member) commented Feb 11, 2020

This is where the request ID ends up causing the 429:

https://github.com/ory/fosite/blob/master/handler/oauth2/flow_refresh.go#L159-L160

This is the "fetch from store" I was referring to:

https://github.com/ory/fosite/blob/master/handler/oauth2/flow_refresh.go#L135-L140

Actually wondering now if a rollback could cause the old refresh token to come back into the store...

aeneasr (Member) commented Feb 11, 2020

Looks like this can be solved by changing the IsolationLevel: https://golang.org/pkg/database/sql/#IsolationLevel

Which can be set for transactions: https://golang.org/pkg/database/sql/#TxOptions

So basically, by picking another isolation level, this wouldn't return the 429 but instead a 404 (or rather a 401/403 as the appropriate error type).
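
For reference, requesting that per transaction with database/sql is a one-liner (placeholder DSN; not Hydra's actual persistence code):

```go
// Minimal illustration of requesting a per-transaction isolation level via
// database/sql, as in the links above.
package main

import (
	"context"
	"database/sql"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/hydra?sslmode=disable") // placeholder DSN
	if err != nil {
		panic(err)
	}

	// Ask for REPEATABLE READ instead of Postgres' default READ COMMITTED.
	tx, err := db.BeginTx(context.Background(), &sql.TxOptions{Isolation: sql.LevelRepeatableRead})
	if err != nil {
		panic(err)
	}
	// ... run the refresh-flow statements here ...
	_ = tx.Rollback()
}
```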

aeneasr (Member) commented Feb 11, 2020

Yeah I think by setting the isolation level to LevelRepeatableRead the 429 should go away

aeneasr added the "bug" label Feb 11, 2020
aeneasr (Member) commented Feb 11, 2020

Last comment for now - marking this as a bug :)

aeneasr added this to the v1.3.0 milestone Feb 11, 2020
aeneasr added the "package/oauth2" and "upstream" labels Feb 11, 2020
aaslamin (Contributor) commented Feb 11, 2020

@aaslamin great to see you here again :)

🍻 cheers!

The only thing we could do is figure out how to return 404 instead of 429

You mean a 409 (Conflict), right? It's currently returning a 5xx error.

Yeah I think by setting the isolation level to LevelRepeatableRead the 429 should go away

I think you are right 🙏

  • The default isolation level in MySQL using InnoDB as the storage engine (the most common one, I believe) is Repeatable Read.
  • Postgres only has one storage engine (as opposed to nine in the MySQL 🌎...) and its default isolation level is Read Committed.

The behavioral difference in this isolation mode is significant:

UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the transaction start time. However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the repeatable read transaction will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first updater rolls back, then its effects are negated and the repeatable read transaction can proceed with updating the originally found row. But if the first updater commits (and actually updated or deleted the row, not just locked it) then the repeatable read transaction will be rolled back with the message:
ERROR: could not serialize access due to concurrent update

Source: https://www.postgresql.org/docs/12/transaction-iso.html
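
If in doubt, the server defaults can be checked directly; a quick sketch with placeholder DSNs (the MySQL variable name assumes 5.7.20+/8.0):

```go
// Double-check the default isolation levels mentioned above.
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
	_ "github.com/lib/pq"
)

func main() {
	pg, err := sql.Open("postgres", "postgres://user:pass@localhost/hydra?sslmode=disable") // placeholder DSN
	if err != nil {
		panic(err)
	}
	var pgLevel string
	if err := pg.QueryRow(`SHOW default_transaction_isolation`).Scan(&pgLevel); err == nil {
		fmt.Println("postgres default:", pgLevel) // expected: "read committed"
	}

	my, err := sql.Open("mysql", "user:pass@tcp(localhost:3306)/hydra") // placeholder DSN
	if err != nil {
		panic(err)
	}
	var myLevel string
	if err := my.QueryRow(`SELECT @@transaction_isolation`).Scan(&myLevel); err == nil {
		fmt.Println("mysql default:", myLevel) // expected: "REPEATABLE-READ"
	}
}
```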


For a fix, I would love to see:

  • A test case that brings up a PostgreSQL instance and does a stress test on the refresh token flow using both isolation levels:
    • Repeatable Read
    • Read Committed (just to prove that you would run into an error w/ this isolation level)
  • It may be a bit gnarly to write this test, but basically you would need to:
    • Prior to the start of the test, seed the test DB so there is an active refresh token session
    • Have N refresh token requests lined up ready to go
    • Fire all requests concurrently and let them fight it out 👊

This brings up another point. Ideally, any first-class supported database in Hydra should have full support for the Repeatable Read isolation level - perhaps this could be a mandatory requirement if someone wants to add another implementation of the fosite storage interfaces for a new DB in the future. More accurately, any database added to Hydra that wishes to implement fosite's transactional interface should support this isolation level.

aeneasr (Member) commented Feb 11, 2020

Yeah, 409! :)

I think Repeatable Read is an acceptable trade-off for refreshing the tokens, since that operation doesn't happen often for a specific row (which, if I understood correctly, will be locked, not the entire table).

I think the test case makes sense. One refresh token should be enough, though. You can probably fire n goroutines that try to refresh the token concurrently, and you should definitely see the error.
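
A rough sketch of what such a concurrency check could look like from the outside, against the HTTP token endpoint (the URL, client credentials, and seeded refresh token below are all placeholders):

```go
// Fire n concurrent refresh_token grant requests with the same refresh token
// and inspect the status codes. Assumes a running Hydra and a seeded token.
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
	"sync"
)

const (
	hydraTokenURL    = "http://127.0.0.1:4444/oauth2/token" // placeholder
	clientID         = "test-client"                        // placeholder
	clientSecret     = "secret"                             // placeholder
	seedRefreshToken = "REPLACE_WITH_SEEDED_REFRESH_TOKEN"  // placeholder
)

func refresh(token string) (int, error) {
	form := url.Values{
		"grant_type":    {"refresh_token"},
		"refresh_token": {token},
	}
	req, err := http.NewRequest("POST", hydraTokenURL, strings.NewReader(form.Encode()))
	if err != nil {
		return 0, err
	}
	req.SetBasicAuth(clientID, clientSecret)
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	const n = 10
	var wg sync.WaitGroup
	codes := make(chan int, n)

	// Fire n concurrent refresh requests using the SAME refresh token.
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			code, err := refresh(seedRefreshToken)
			if err != nil {
				fmt.Println("request error:", err)
				return
			}
			codes <- code
		}()
	}
	wg.Wait()
	close(codes)

	// Expectation after the fix: exactly one 200, the rest 4xx - and no 500s.
	for code := range codes {
		fmt.Println("status:", code)
	}
}
```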

Regarding isolation - this really is only required for things where a write is dependent on a previous read. I'm not sure if setting RepeatableRead for every transaction is performant? But then again, the performance impact is probably minimal while the consistency is much better. Maybe MaybeBeginTx should always request RepeatableRead isolation.

aaslamin (Contributor)

Maybe MaybeBeginTx should always request RepeatableRead isolation.

That's an interesting approach, although currently fosite doesn't have any understanding of what an isolation level is. Maybe it should? 😀

Perhaps, before BeginTx is called, fosite could inject a recommended/desired isolation level under a unique key into the proxied context. We could even use the types defined in the sql package as the context values, although that may make it too tightly coupled 🤷‍♂

Finally, as a best practice, a fosite-compliant authorization server should check for this key and configure its TxOptions accordingly prior to creating the transaction.

This way we can have flexibility in isolation levels for various flows instead of going all out with RepeatableRead, which, as you pointed out, could impact performance.
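
Sketched out, that handoff could look roughly like this - the context key and helper names are hypothetical, not an existing fosite API:

```go
// Hypothetical context-based handoff of a recommended isolation level.
package main

import (
	"context"
	"database/sql"
)

type isolationKey struct{}

// WithRecommendedIsolation is what fosite could (hypothetically) call before
// starting a flow that needs stronger guarantees, e.g. the refresh grant.
func WithRecommendedIsolation(ctx context.Context, level sql.IsolationLevel) context.Context {
	return context.WithValue(ctx, isolationKey{}, level)
}

// txOptionsFromContext is what the storage side (e.g. Hydra's MaybeBeginTx)
// could use to build its TxOptions, falling back to the driver default when
// nothing was requested.
func txOptionsFromContext(ctx context.Context) *sql.TxOptions {
	if level, ok := ctx.Value(isolationKey{}).(sql.IsolationLevel); ok {
		return &sql.TxOptions{Isolation: level}
	}
	return nil // nil TxOptions means "use the driver's default"
}

func main() {
	ctx := WithRecommendedIsolation(context.Background(), sql.LevelRepeatableRead)
	_ = txOptionsFromContext(ctx) // would be passed to db.BeginTx(ctx, opts)
}
```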

aeneasr (Member) commented Feb 12, 2020

That's an interesting approach, although currently fosite doesn't have any understanding of what an isolation level is. Maybe it should? 😀

Yeah, I think that's ok. There's actually so much implicit knowledge already when it comes to the storage implementation that I gave up on imposing that in fosite to be honest.

I think setting the appropriate isolation level in BeginTx in Hydra would be a good solution, plus having a failing test to verify the implementation.

aaslamin added a commit to aaslamin/hydra that referenced this issue Feb 19, 2020
aaslamin added a commit to aaslamin/hydra that referenced this issue Mar 21, 2020
aaslamin added a commit to aaslamin/hydra that referenced this issue Mar 22, 2020
aaslamin added a commit to aaslamin/hydra that referenced this issue Mar 22, 2020
aaslamin added a commit to aaslamin/fosite that referenced this issue Mar 24, 2020
This commit provides the functionality required to address ory/hydra#1719 & ory/hydra#1735 by adding error checking to the RefreshTokenGrantHandler's PopulateTokenEndpointResponse method so it can deal with errors due to concurrent access. This will allow the authorization server to render a better error to the user-agent.
aaslamin added a commit to aaslamin/fosite that referenced this issue Mar 24, 2020
aaslamin added a commit to aaslamin/fosite that referenced this issue Mar 24, 2020
aeneasr pushed a commit to ory/fosite that referenced this issue Mar 25, 2020
…t handler (#402)

This commit provides the functionality required to address ory/hydra#1719 & ory/hydra#1735 by adding error checking to the RefreshTokenGrantHandler's PopulateTokenEndpointResponse method so it can deal with errors due to concurrent access. This will allow the authorization server to render a better error to the user-agent.

No longer returns fosite.ErrServerError in the event the storage lookup fails. Instead, a wrapped fosite.ErrNotFound is returned when fetching the refresh token fails due to it no longer being present. This scenario is caused when the user sends two or more requests to refresh using the same token and one request gets into the handler just after the prior request finished and successfully committed its transaction.

Adds unit test coverage for transaction error handling logic added to the RefreshTokenGrantHandler's PopulateTokenEndpointResponse method
aaslamin added a commit to aaslamin/hydra that referenced this issue Mar 25, 2020
aaslamin added a commit to aaslamin/hydra that referenced this issue Mar 25, 2020
aaslamin added a commit to aaslamin/hydra that referenced this issue Mar 26, 2020