out_cloudwatch: Segmentation fault #6616
@supergillis did you ever find a workaround for this? We are hitting the exact same thing, but only when we enable the
Tagging some of the devs, as this is stopping us from moving to production with this solution.
Does this only happen on 2.0? If anyone is able to obtain a core dump, that'd be super helpful: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/tutorials/remote-core-dump/README.md
The AWS Distro does not support 2.x yet. We will soon. Currently we recommend our 2.29.0+ versions for CloudWatch customers. We have an upcoming concurrency issue fix for S3 in 2.31.0: aws/aws-for-fluent-bit@21c287e
Enabling debug logging may give more info. Does anyone have a stack trace for S3? @cdancy, can you share your config?
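For anyone following along, here is a minimal sketch of turning on debug logging in a classic Fluent Bit config; `log_level` under `[SERVICE]` is the relevant key, and the other values are illustrative placeholders:

```
[SERVICE]
    # Raise verbosity so more context is logged before a crash.
    # Accepted levels include error, warn, info, debug, and trace.
    flush        1
    log_level    debug
```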
@PettitWesley we're using OOTB fluent-bit here and not the AWS distro. I'm assuming that's OK? We've only been playing with 2.x, but I can downgrade to 1.x and see what happens. I can certainly enable debug logs to see if it dumps anything else, as well as get a core dump. Config is as follows (being done through the Helm chart's values.yaml):
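A hypothetical sketch of what the `cloudwatch_logs` output section of such a config typically looks like; every value below is a placeholder for illustration, not the poster's actual config:

```
[OUTPUT]
    # Placeholder values for illustration only.
    Name                cloudwatch_logs
    Match               kube.*
    region              us-east-1
    log_group_name      /eks/example-cluster/application
    log_stream_prefix   fluent-bit-
    auto_create_group   On
```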
EDIT: it should be noted ... running with either only the
Another data point... when things do fail, the fluent-bit pods crash a few times and eventually work themselves out... but there is always a single one that doesn't. Every time it plays out in exactly the same way.
@PettitWesley another data point: downgrading to the latest 1.9 produces the same issue.
@cdancy thanks for letting me know. I see this:
Similar to the trace already in this issue. I am wondering if it's related to the S3 concurrency issues we recently discovered: #6573. The fix will be released in AWS4FB 2.31.0 very soon; not sure when it will get into this distro: https://github.com/aws/aws-for-fluent-bit/releases
@PettitWesley is there anything we can do config-wise to get around this for our purposes? Any way we can impress upon whoever owns it to get that PR merged and a new release out?
@cdancy I have a pre-release build of the S3 concurrency fix here: aws/aws-for-fluent-bit#495 (comment). Keep checking the AWS distro; we will release it very, very soon. If you can repro it with an AWS distro version and can open an issue at our repo, that will also help with tracking.
@PettitWesley thanks! Is it "better" to use the aws-for-fluent-bit project over this one? We don't seem to need all the bells and whistles the aws-for-fluent-bit project adds, but if it has more fixes than this one, that could be a reason to switch.
@cdancy the AWS distro is focused on AWS customers, and thus yes, it is always the most up-to-date distro in terms of fixes for AWS customers.
We're distributing the same code; the only thing we add is the old AWS Go plugins. What do you mean? https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#aws-go-plugins-vs-aws-core-c-plugins
That's the reason: it seemed, at least when we were looking into things, that for our use cases it was unnecessary to go that route, as there was nothing we needed within it, nor did we need to use the older plugins. With that said, I will keep an eye on the aws-for-fluent-bit project to see if something is coming out sooner rather than later.
@PettitWesley how soon is "very very soon"? Any timetable one way or another? Just wondering if we should be spinning our wheels testing out hacks, or if the release is imminent and we should just hold off.
@PettitWesley so ... another data point ... we switched to using aws-for-fluent-bit and still hit the same problem. However, if we changed from using the
@cdancy I wrote a guide on reproducing crash reports; it's focused on ECS FireLens, but the core dump S3 uploader technique can be used anywhere: https://github.com/aws/aws-for-fluent-bit/pull/533/files You'd need to customize it to build 2.0. My team is not set up to support 2.0 yet (the AWS distro is still on 1.9), but if you can use this to get a core dump from 2.0 and send it to us, we could help fix it.
I want to note here that we think we've found a potential issue when you have multiple cloudwatch_logs outputs matching the same tag. Not sure which versions this impacts yet.
@PettitWesley we're only using a single cloudwatch_logs output here, for what that's worth. I see you all released a new version, which we have yet to try.
@PettitWesley we tried using the
@cdancy actually, the issue seems to be caused simply by having any two outputs match the same tag. We've also discovered another issue on 1.9 (not sure if it applies to 2.0) which is much more likely to occur if you have
The stack trace in this issue matches the "two outputs match same tag" issue we are still investigating. Sorry, that's all I can say right now. Will post more when I have clearer info.
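To illustrate the pattern being described, here is a hypothetical config in which two `cloudwatch_logs` outputs match the same tag; all group and stream names are placeholders:

```
# Both [OUTPUT] sections below match records tagged "kube.*",
# the overlap suspected of triggering the crash.
[OUTPUT]
    Name             cloudwatch_logs
    Match            kube.*
    region           us-east-1
    log_group_name   /example/group-a
    log_stream_name  stream-a

[OUTPUT]
    Name             cloudwatch_logs
    Match            *
    region           us-east-1
    log_group_name   /example/group-b
    log_stream_name  stream-b
```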
Please see the notice for known high-impact issues as of the beginning of this year: aws/aws-for-fluent-bit#542
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the
This should have been fixed in these: |
Bug Report
Describe the bug
`fluent-bit` immediately crashes with a segmentation fault when the `cloudwatch_logs` output is used on ARM64.
To Reproduce
Configure `fluent-bit` to output to `cloudwatch_logs`. The following error is logged on startup and `fluent-bit` crashes:
Expected behavior
The application should start successfully and begin logging to CloudWatch.
Your Environment
v1.24.7-eks-fb459a0
arm64
kubernetes
,cloudwatch_logs
Additional context
There is another issue logged with a similar error. My expectation was that it would be resolved by version 2.0.8, which includes the fix, but that seems not to be the case. See #6451