Internal Server Error when activating Cap'n Proto in routing receivers #7944

Closed
verejoel opened this issue Nov 28, 2024 · 3 comments

@verejoel
Contributor

Thanos, Prometheus and Golang version used: 0.37.0

Object Storage Provider: Azure

What happened: Rolled out receivers and routing receivers with v0.37.0. The receivers are running with --receive.capnproto-address=0.0.0.0:19391.

We then enabled --receive.replication-protocol=capnproto on the routing receivers. Immediately, around 10% of remote-write requests to the routing receivers began failing with an internal server error, accompanied by the following log line:

ts=2024-11-27T17:36:07.120121599Z caller=handler.go:611 level=error component=receive component=receive-handler tenant=cloudinfrastructure err="2 errors: forwarding request to endpoint {thanos-ingester-0.thanos-ingester.thanos.svc.cluster.local:10901 thanos-ingester-0.thanos-ingester.thanos.svc.cluster.local:19391 }: failed writing to peer: pkg/receive/writecapnp/write_request.capnp:Writer.write: rpc: send message: rpc: send message: rpc: build message: rpc: place arguments: new struct: preferred segment is not part of the arena; forwarding request to endpoint {thanos-ingester-4.thanos-ingester.thanos.svc.cluster.local:10901 thanos-ingester-4.thanos-ingester.thanos.svc.cluster.local:19391 }: failed writing to peer: pkg/receive/writecapnp/write_request.capnp:Writer.write: rpc: send message: rpc: send message: rpc: build message: rpc: place arguments: new struct: preferred segment is not part of the arena" msg="internal server error"

However, metrics seem to be shipped successfully. I didn't notice anything actually being dropped.

Our hashring is managed with the hashring controller, and has the following format:

[
  {
    "hashring": "soft-tenants",
    "endpoints": [
      "thanos-ingester-0.thanos-ingester.thanos.svc.cluster.local:10901",
      "thanos-ingester-1.thanos-ingester.thanos.svc.cluster.local:10901",
      "thanos-ingester-2.thanos-ingester.thanos.svc.cluster.local:10901",
      "thanos-ingester-3.thanos-ingester.thanos.svc.cluster.local:10901",
      "thanos-ingester-4.thanos-ingester.thanos.svc.cluster.local:10901",
      "thanos-ingester-5.thanos-ingester.thanos.svc.cluster.local:10901"
    ]
  }
]
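
For illustration, here is a minimal Go sketch of how a hashring file of this shape lines up with the endpoint pairs in the log line above, where each gRPC endpoint (port 10901) appears next to a Cap'n Proto address on the same host (port 19391, as set via --receive.capnproto-address). The `hashring` type and the `parseHashrings` and `capnpAddress` helpers are hypothetical names made up for this example, and the same-host pairing is an assumption read off the log line, not a description of how Thanos derives the address internally.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// hashring is a hypothetical, minimal mirror of one entry in the hashring
// file above. It models only the two fields that appear in this issue.
type hashring struct {
	Name      string   `json:"hashring"`
	Endpoints []string `json:"endpoints"`
}

// parseHashrings decodes a hashring file whose endpoints are plain strings.
func parseHashrings(data []byte) ([]hashring, error) {
	var rings []hashring
	if err := json.Unmarshal(data, &rings); err != nil {
		return nil, fmt.Errorf("decode hashrings: %w", err)
	}
	return rings, nil
}

// capnpAddress keeps the host of a gRPC endpoint and swaps in capnpPort.
// Assumption: the Cap'n Proto listener runs on the same host as the gRPC one.
func capnpAddress(grpcAddr, capnpPort string) (string, error) {
	host, _, err := net.SplitHostPort(grpcAddr)
	if err != nil {
		return "", fmt.Errorf("split %q: %w", grpcAddr, err)
	}
	return net.JoinHostPort(host, capnpPort), nil
}

func main() {
	data := []byte(`[{"hashring": "soft-tenants", "endpoints": [
		"thanos-ingester-0.thanos-ingester.thanos.svc.cluster.local:10901",
		"thanos-ingester-4.thanos-ingester.thanos.svc.cluster.local:10901"
	]}]`)

	rings, err := parseHashrings(data)
	if err != nil {
		panic(err)
	}
	for _, ring := range rings {
		for _, ep := range ring.Endpoints {
			capnp, err := capnpAddress(ep, "19391")
			if err != nil {
				panic(err)
			}
			fmt.Printf("%s: grpc=%s capnproto=%s\n", ring.Name, ep, capnp)
		}
	}
}
```

Running it prints one line per endpoint, which can be compared against the address pairs that show up in the failing log lines.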

What you expected to happen: No HTTP 500 errors or internal server error logs.

How to reproduce it (as minimally and precisely as possible): Upgrade routing receivers and receivers to 0.37.0, then enable capnproto in the routing receivers.

Full logs to relevant components: As above.

Anything else we need to know:


dosubot bot commented Nov 28, 2024

The error message you're encountering, "preferred segment is not part of the arena," suggests an issue with Cap'n Proto's message segmentation. This error can occur if there's an attempt to access or manipulate a segment that isn't part of the single segment arena used by Thanos for message construction [1].

Here are a few steps you can take to troubleshoot and potentially resolve this issue:

  1. Ensure Consistent Versioning: Make sure all your receivers and routing receivers are updated to version 0.37.0, as Cap'n Proto support was introduced in this version. Inconsistent versions might lead to compatibility issues [2].

  2. Check Configuration: Verify that the Cap'n Proto configurations are correctly set. Ensure that the --receive.capnproto-address and --receive.replication-protocol=capnproto flags are correctly configured across all relevant components [3][4].

  3. Review Network and Resource Limits: Since Cap'n Proto is designed to reduce CPU and memory usage, ensure that your system resources are not being exhausted. Check for any network issues that might be causing message segmentation problems.

  4. Examine Logs for Patterns: Look for any patterns in the logs that might indicate specific conditions under which the errors occur. This can help in identifying if the issue is related to specific endpoints or network conditions.

  5. Test with Different Configurations: If possible, test with different configurations or in a staging environment to see if the issue persists. This can help isolate whether the problem is related to specific settings or environmental factors.

If these steps do not resolve the issue, further investigation into the specific implementation of Cap'n Proto in your setup might be necessary.
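
As one way to look for patterns in the logs (step 4), the following Go sketch pulls the endpoint pairs out of an error line shaped like the one in the issue, so failures can be tallied per peer. The `peer` type and `failedPeers` function are hypothetical names for this example, and the regular expression assumes the exact "forwarding request to endpoint {grpc capnproto }" wording seen in the reported log line; other Thanos versions may format it differently.

```go
package main

import (
	"fmt"
	"regexp"
)

// peer holds the two addresses logged for one failed forward.
type peer struct {
	GRPC      string
	CapNProto string
}

// endpointRE matches "forwarding request to endpoint {<grpc> <capnproto> }".
// Assumption: the format is taken from the single log line in this issue.
var endpointRE = regexp.MustCompile(`forwarding request to endpoint \{(\S+) (\S+)\s*\}`)

// failedPeers returns every endpoint pair mentioned in a log line,
// in the order in which they appear.
func failedPeers(line string) []peer {
	var peers []peer
	for _, m := range endpointRE.FindAllStringSubmatch(line, -1) {
		peers = append(peers, peer{GRPC: m[1], CapNProto: m[2]})
	}
	return peers
}

func main() {
	// Shortened sample modeled on the reported error; host names are abbreviated.
	line := `err="2 errors: forwarding request to endpoint {ingester-0:10901 ingester-0:19391 }: ` +
		`failed writing to peer: ...; forwarding request to endpoint {ingester-4:10901 ingester-4:19391 }: ` +
		`failed writing to peer: ..." msg="internal server error"`

	// Tally failures per Cap'n Proto address across one or more lines.
	counts := map[string]int{}
	for _, p := range failedPeers(line) {
		counts[p.CapNProto]++
	}
	fmt.Println(counts)
}
```

Feeding every matching log line through `failedPeers` and summing the counts shows whether the errors cluster on particular ingesters or are spread evenly across the hashring.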


@fpetkovski
Contributor

@verejoel should we close this?

@verejoel
Contributor Author

verejoel commented Dec 1, 2024

Yes, should be fixed with 0.37.1

@verejoel verejoel closed this as completed Dec 2, 2024