Error in ingester: cannot grow buffer beyond 2 gigabytes #1258
Comments
Hi, thanks for reporting this. It seems that this error is occurring while writing the search data (the flatbuffer files). One way to reduce the size of that data is to deny-list tags that are not needed for search:
search_tags_deny_list:
  - opencensus.exporterversion
  - sampler.type
  - sampler.param
  - client-uuid
  - component
  - ip
Hi @mdisibio, we are writing large volumes of data, running load tests with thousands of requests per minute continuously. Can you suggest an appropriate configuration to flush the data more frequently?
Hi. Could you repost those graphs? These are the flush-related settings that we use and that are known to work well up to 2M spans/s:
ingester:
  flush_check_period: 5s
  trace_idle_period: 5s
  max_block_duration: 15m0s
  max_block_bytes: 734003200
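For anyone tuning these, here is the same block annotated with what each option controls, per my reading of the ingester configuration; treat the comments as a rough guide rather than official documentation:
ingester:
  # how often the ingester checks whether traces or blocks are ready to be cut/flushed
  flush_check_period: 5s
  # a trace that has received no new spans for this long is appended to the head block
  trace_idle_period: 5s
  # cut the head block after at most this much wall-clock time
  max_block_duration: 15m0s
  # cut the head block once it reaches roughly this many bytes (700 MiB here),
  # which also bounds how large the per-block search data can grow
  max_block_bytes: 734003200
The idea is that smaller, more frequently cut blocks should keep any single block's search data well under the 2GB buffer limit.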
Hi @mdisibio, we have just applied your suggested configuration and will report back on this.
Hi @mdisibio,
Current configuration:
When a trace disappears and reappears, it is usually explained by a mismatch between component timings, for example a querier that cannot yet see a block. To start, I would adjust this setting (the default is 3x the blocklist poll period):
ingester:
  complete_block_timeout: 15m
There are a few more ways to dig deeper if it is still occurring. (2) Capture tracing data for Tempo itself and look up the failed request that was logged in (1). The log message will contain another trace ID, which is Tempo's own trace; from there we can inspect the work done by the querier and ingester.
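For step (2), one approach is to point Tempo's own Jaeger-based instrumentation at a tracing backend via environment variables on the ingester and querier pods. A rough sketch, assuming your Tempo version honors the standard Jaeger client variables; the agent address below is a placeholder:
env:
  # send Tempo's own traces to a Jaeger-compatible agent (placeholder address)
  - name: JAEGER_AGENT_HOST
    value: otel-collector.observability.svc
  # sample everything while debugging
  - name: JAEGER_SAMPLER_TYPE
    value: const
  - name: JAEGER_SAMPLER_PARAM
    value: "1"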
Hi, this issue seems to have moved on from the originally reported flatbuffer error. If you can provide the requested information, we can keep troubleshooting; otherwise I think it makes sense to close this issue and open new ones for the other problems. We can always do general help in the forums or the Slack channel.
We had this happen to us just now, so here's the /var/tempo/wal directory (ls -ltR):
The PV is
So it should have plenty of space. Some logs:
EDIT: we have
and
Hi, thanks for the information! Sorry that this is happening; my thought is that it's very traffic-dependent and not necessarily a bug. We haven't seen this ourselves. If this is causing immediate trouble, the best workaround is to disable search and delete the search files from the WAL directory.
Internally our ratio looks like this: ~60% smaller. Please see question (3) in the comment above. Do your traces include any very large attributes, like HTTP request/response bodies? If so, we can stop them from being recorded by adding them to the search_tags_deny_list shown earlier. The 2GB limit is a hard limit that can't be changed, so the goal is to determine what is using that much space and eliminate it.
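As a concrete example of that workaround, a sketch of what the relevant config pieces might look like; check the placement of these options against the docs for your Tempo version (search_enabled is the 1.x opt-in flag for search, and the deny list is the same one from the first comment, with a hypothetical large tag added):
# turn off the (experimental, 1.x) search feature entirely while troubleshooting
search_enabled: false

# or, if search stays enabled, keep known-large or high-cardinality tags out of the search data
search_tags_deny_list:
  - opencensus.exporterversion
  - client-uuid
  - http.request.body   # hypothetical example of a very large attribute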
We've had search enabled for a while now, but this seems to have started happening either because of the upgrade to 1.4.1 or because we switched to tail-based sampling in otel-collector (the two happened around the same time). I'm not aware of any huge tags, though I did find one Java command line that was quite large, and I excluded it. Is there a way to look at the data in the search files to find any large tags and/or verify that my exclusion worked? The crash just happened again (sadly, I deleted the PVs and recreated the ingester StatefulSet to get things working again, so I can't get a directory listing; I have a feeling I can get it later, when it crashes again, though):
BTW: once an ingester gets into this state, it seems to be hosed until I manually "delete the large file" (i.e. kill the PV everything is stored on, since I don't think I should be messing with ingester-internal files with a scalpel). If nothing else, it might be good to fix it so that ingesters can recover without manual intervention (perhaps that's a tall order, though). Suggestion 2: perhaps there should be (or perhaps there is? Let me know!) a setting for rejecting large tags, or for not appending to a file that is already too big, so we don't get into this unrecoverable state in the first place.
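One way to keep oversized attributes (like that Java command line) out of Tempo entirely is to drop them in the OpenTelemetry Collector before they ever reach the ingester. A sketch using the collector's attributes processor; the key below is a placeholder for whatever large tag you identified:
processors:
  attributes/strip-large-tags:
    actions:
      # delete span attributes known to carry very large values
      # (placeholder key; substitute the tag you found)
      - key: db.statement
        action: delete
Add the processor to the traces pipeline ahead of the exporter. Note that the attributes processor only touches span attributes; for resource-level attributes (such as a process command line) there is a similar resource processor.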
This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
Describe the bug
Tempo ingesters are failing with the below error:
error in ingester: cannot grow buffer beyond 2 gigabytes
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Ingesters should be running fine.
Environment:
Additional Context
Here are the values I am using: