Prod Incident 25/07/24 - Hasura 429/504 errors, Postgres Connection Limit, and Unexpected Deprovisioning #930

@darunrs

On July 25, 2024, the production Hasura instance began returning Rate Exceeded errors. An investigation followed with the help of SRE. One of the actions taken was raising the concurrent request limit of each Hasura instance from 80 to 200 and the maximum number of instances from 5 to 10. This increase was sufficient to stop the Rate Exceeded errors.

The following morning, it was discovered that the number of DB connections had spiked to 600 and was holding there, with the number of active connections stuck at roughly 400. As a result, Hasura did not have enough connections to maintain its metadata, which fell out of sync, and QueryApi began experiencing issues again.

After QueryApi was shut down in prod and the database was restarted, the connection count fell. However, when QueryApi was restarted, it immediately began deprovisioning many indexers without cause. QueryApi was shut down again and the deprovisioning was investigated. Once the impacted indexers were documented, QueryApi was restarted with a custom commit that increased the timeout between restart attempts for stalled streams/executors and disabled deprovisioning. The deprovisioned indexers were then all brought back and backfilled on July 26, 2024.
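To put the scaling change in perspective, here is a rough back-of-the-envelope sketch of how many requests the Hasura fleet could accept at once before and after the change. The instance counts and per-instance limits come from the description above. Treating the observed plateau of 600 connections as the effective Postgres limit is an assumption, and the class and constant names are made up for illustration. A request does not necessarily hold one database connection, so this is only an upper bound on demand, not a measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HasuraScaling:
    """Hypothetical container for the two scaling knobs named in the incident."""

    max_instances: int
    concurrent_requests_per_instance: int

    @property
    def max_in_flight_requests(self) -> int:
        # Upper bound on requests the whole fleet will accept at the same time.
        return self.max_instances * self.concurrent_requests_per_instance


# Figures taken from the incident description.
before = HasuraScaling(max_instances=5, concurrent_requests_per_instance=80)
after = HasuraScaling(max_instances=10, concurrent_requests_per_instance=200)

# Assumption: the plateau of 600 connections is the effective Postgres limit.
ASSUMED_POSTGRES_CONNECTION_LIMIT = 600


def oversubscription(scaling: HasuraScaling, connection_limit: int) -> float:
    """Ratio of worst-case in-flight requests to available DB connections."""
    return scaling.max_in_flight_requests / connection_limit


print(before.max_in_flight_requests)  # 400
print(after.max_in_flight_requests)   # 2000
print(oversubscription(after, ASSUMED_POSTGRES_CONNECTION_LIMIT) > 1)  # True
```

Under these assumptions, the fleet could admit fewer concurrent requests than there were connections before the change and several times more afterwards, which is consistent with the connection count later rising to its ceiling.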

TLDR:

  • Hasura rejected all requests due to accumulated timed-out queries, either from a block stream that was being repeatedly restarted or from KitWallet, which tried accessing Hasura after Postgres connections reached the limit
  • Postgres connections rapidly rose to the maximum because the timed-out queries from QueryApi described above left connections permanently active
  • Indexers were suddenly deprovisioned even though they had not been deleted from the contract
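The last point suggests a defensive check before any deprovisioning. The following is a minimal sketch of such a guard, not QueryApi's actual logic: the root cause is not stated here, and every name below (`RegistryLookup`, `should_deprovision`, the threshold of three checks) is hypothetical. The idea is to deprovision only when the indexer has been positively confirmed absent from the contract several times in a row, to treat a failed or timed-out lookup as "unknown" rather than "deleted", and to keep a switch that disables deprovisioning entirely, as the custom commit did.

```python
from enum import Enum


class RegistryLookup(Enum):
    """Hypothetical result of checking the contract for an indexer."""

    PRESENT = "present"
    ABSENT = "absent"
    UNKNOWN = "unknown"  # lookup failed or timed out; says nothing about deletion


def should_deprovision(
    lookup: RegistryLookup,
    consecutive_absent_checks: int,
    required_absent_checks: int = 3,
    deprovisioning_enabled: bool = True,
) -> bool:
    """Return True only when deletion from the contract is positively confirmed."""
    if not deprovisioning_enabled:
        # Kill switch: mirrors disabling deprovisioning during the incident.
        return False
    if lookup is not RegistryLookup.ABSENT:
        # PRESENT and UNKNOWN are both treated as "keep the indexer".
        return False
    # Require several consecutive confirmations before acting.
    return consecutive_absent_checks >= required_absent_checks
```

With this shape, a degraded dependency can at worst delay a legitimate deprovisioning; it cannot trigger one on its own.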

More details are in the Incident Document.

I've separated the task list into two as the two incidents are unrelated.

### Hasura and Postgres Incident
- [ ] https://github.com/near/queryapi/issues/931
- [ ] https://github.com/near/queryapi/issues/938
- [ ] https://github.com/near/queryapi/issues/948
- [ ] https://github.com/near/queryapi/issues/946
- [ ] https://github.com/near/queryapi/issues/947
- [ ] https://github.com/near/queryapi/issues/949
- [ ] https://github.com/near/queryapi/issues/950
- [ ] https://github.com/near/queryapi/issues/951
- [ ] https://github.com/near/queryapi/issues/952
- [ ] https://github.com/near/queryapi/issues/978
- [ ] Migrate Hasura to work through pgBouncer
- [ ] https://github.com/near/queryapi/issues/967
### Deprovisioning Incident
- [ ] https://github.com/near/queryapi/issues/940
- [ ] https://github.com/near/queryapi/issues/941
- [ ] https://github.com/near/queryapi/issues/979
- [ ] https://github.com/near/queryapi/issues/942
- [ ] https://github.com/near/queryapi/issues/965
- [ ] https://github.com/near/queryapi/issues/915
- [ ] https://github.com/near/queryapi/issues/964
- [ ] https://github.com/near/queryapi/issues/953
- [ ] https://github.com/near/queryapi/issues/943
- [ ] https://github.com/near/queryapi/issues/944
- [ ] https://github.com/near/queryapi/issues/945
- [ ] https://github.com/near/queryapi/issues/954
- [ ] https://github.com/near/queryapi/issues/968