Cortex store gateway keeps going into crash/failure during startup #4993
Comments
Is it going OOM or terminating for another reason? If it is not OOM, can you try to fetch the logs from the dead container (the -p option on kubectl logs)?
Yes, we checked the dead container logs and it was not due to OOM, as we recently increased memory by a fair bit. The errors were mostly memberlist failures (relatively few of them, though), with not much information on the cause of termination:
caller=tcp_transport.go:428 component="memberlist TCPTransport" msg="WriteTo failed"
Can you try setting the lazy load config to true? And also the bucket index?
Yes, we have these enabled too. Also, there is no definite pattern to these failures/crashes; they occur intermittently, but more often than not. bucket_index:
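For reference, a minimal sketch of what those two settings look like under the blocks storage section of the Cortex config. Key names follow the Cortex configuration reference, but double-check them against your Cortex version; the idle timeout value is illustrative only:

```yaml
# Illustrative Cortex config snippet: enable the bucket index and
# lazy loading of index-headers in the store gateway.
blocks_storage:
  bucket_store:
    bucket_index:
      enabled: true
    index_header_lazy_loading_enabled: true
    # Optional (assumed setting): how long an unused index-header stays
    # loaded before it is released.
    index_header_lazy_loading_idle_timeout: 1h
```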
If there are not enough logs from the pods, could you please increase the log level to debug and try again?
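A minimal sketch of raising verbosity, assuming the log level lives in the standard server block of the Cortex config (the CLI flag equivalent would be -log.level=debug):

```yaml
# Illustrative only: raise log verbosity to debug for the store gateway.
server:
  log_level: debug
```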
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
After taking another look at this issue, I believe it is related to thanos-io/thanos#6509. This bug caused the SG initial sync to take too much memory, which is totally unnecessary. The fix was included in the latest release RC, so I will close this issue. Feel free to try it out and let us know whether it works. https://github.com/cortexproject/cortex/releases/tag/v1.16.0-rc.0
Issue -
On Cortex hosted in an AKS distributed environment, the store gateway keeps going into CrashLoopBackOff during startup/new deployments and never really comes up. The store gateway has a PVC mount and its associated blob storage, and currently runs with a replication factor of 3. We tuned the readiness/liveness probe timeouts to give the ring more time to become healthy, but it is not really helping.
All 3 instances eventually end up in a crash loop.
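For context, a sketch of the kind of probe tuning described above for the store-gateway container. The path, port, and threshold values are illustrative assumptions and need to match the actual deployment; Cortex exposes a /ready endpoint on its HTTP listen port:

```yaml
# Illustrative store-gateway container probes; all values are examples only.
readinessProbe:
  httpGet:
    path: /ready           # Cortex readiness endpoint
    port: 8080             # assumed HTTP listen port; adjust to your config
  initialDelaySeconds: 60  # give the initial bucket sync time to start
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 10
livenessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 300 # avoid killing the pod during a long initial sync
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 10
```

A generous initialDelaySeconds (or a separate startup probe) on the liveness side helps avoid restart loops while the store gateway is still syncing blocks.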
When deployed fresh with the blob storage and PVC deleted, the store gateway comes up normally without any issues. But in a shared cluster environment, this is not really a permanent option.
K8s events don't really help narrow down what makes the SG fail, nor do the SG logs.
Infra - K8s Istio environment
Arch - Microservices