[Datadog output] Version 2.29.0 Causing Task to stop #491

Closed
aidant opened this issue Dec 7, 2022 · 85 comments

@aidant

aidant commented Dec 7, 2022

When updating to version 2.29.0 (previously 2.28.4) of aws-observability/aws-for-fluent-bit we are observing one of our task definitions entering a cycle of provisioning and de-provisioning.

We are running ECS with Fargate and aws-observability/aws-for-fluent-bit plus datadog/agent version 7.40.1 as sidecars.

We have not had an opportunity to look into the cause of this. Hopefully, you can provide some insights into how we can debug this further. Our next steps will likely be to try the FLB_LOG_LEVEL=debug environment variable and report back.
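
For anyone who wants to do the same, turning up the log level is just an environment variable on the FireLens container in the task definition. A minimal sketch (the container name and image tag here are illustrative, not our actual values):

{
  "name": "log-router",
  "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:2.29.0",
  "essential": true,
  "firelensConfiguration": { "type": "fluentbit" },
  "environment": [
    { "name": "FLB_LOG_LEVEL", "value": "debug" }
  ]
}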

@pylebecq

pylebecq commented Dec 7, 2022

Same for us. We retrieved this in our logs:

AWS for Fluent Bit Container Image Version 2.29.0
--
Fluent Bit v1.9.10
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2022/12/07 10:10:47] [ info] [fluent bit] version=1.9.10, commit=760956f50c, pid=1
[2022/12/07 10:10:47] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/12/07 10:10:47] [ info] [cmetrics] version=0.3.7
[2022/12/07 10:10:47] [ info] [input:tcp:tcp.0] listening on 127.0.0.1:8877
[2022/12/07 10:10:47] [ info] [input:forward:forward.1] listening on unix:///var/run/fluent.sock
[2022/12/07 10:10:47] [ info] [input:forward:forward.2] listening on 127.0.0.1:24224
[2022/12/07 10:10:47] [ info] [output:null:null.0] worker #0 started
[2022/12/07 10:10:47] [ info] [sp] stream processor started
[2022/12/07 10:11:00] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory
[2022/12/07 10:11:00] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device
[2022/12/07 10:11:01] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory
[2022/12/07 10:11:01] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device
[2022/12/07 10:11:02] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory
[2022/12/07 10:11:02] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device
[2022/12/07 10:11:32] [engine] caught signal (SIGSEGV)

@rolandaskozakas

Oh this made my day. Restoring to 2.28.4 helps.

@tsgoff

tsgoff commented Dec 7, 2022

Same error since 2.29.0:
[error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate...
[error] [src/flb_sds.c:109 errno=12] Cannot allocate memory

Platform: ARM (Graviton)

Quick fix: switch to the stable tag.
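
For reference, that quick fix is just a change to the image reference in the task definition, e.g. (illustrative):

  "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable"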

@ndevvy

ndevvy commented Dec 8, 2022

Same here, also on ECS/Fargate with datadog-agent 7.40.1.

@sribas

sribas commented Dec 8, 2022

Same problem, also on Fargate with datadog-agent.

@besbes

besbes commented Dec 8, 2022

We are experiencing the same problem. We are not using datadog-agent, but the Fluent Bit task seems to stop randomly after 15-60 minutes. Switching back to 2.28.4 did resolve the issue.

@MikeWise01

We are running into this issue as well; it seems to be related to fluent/fluent-bit#6512.

@mattjamesaus

Got burned hard by this today; swapping to the stable tag fixed us up. We were using latest, as per old documentation.

@jonrose-dev

Same for us. We had an hour-long outage trying to diagnose this. Using the stable tag fixed it for us. Same setup as everyone else, with Fargate and Datadog.

@nbutkowski-chub

nbutkowski-chub commented Dec 8, 2022

+1 also seeing this. Running it sidecar style in the same TaskDefinition (fargate) with datadog-agent container.

@smai-f

smai-f commented Dec 9, 2022

+1. Any update?

@PettitWesley
Contributor

Are you all facing the error Cannot allocate memory?

We are actively investigating this issue.

In the meantime, please check out our relevant guides on high memory usage:

  1. https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#what-to-do-when-fluent-bit-memory-usage-is-high
  2. https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/oomkill-prevention
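
For anyone who needs a stopgap before a fix lands, those guides essentially come down to capping how much memory an input may buffer and, optionally, spilling the excess to filesystem storage. A minimal standalone sketch with illustrative values (under FireLens the forward input is generated for you, so exactly where these settings live will differ):

[SERVICE]
    # Only needed if an input uses storage.type filesystem
    storage.path      /var/log/flb-storage/

[INPUT]
    Name              forward
    Listen            127.0.0.1
    Port              24224
    # Cap in-memory buffering for this input; once the limit is hit, new data
    # goes to disk (with filesystem storage) instead of growing without bound
    Mem_Buf_Limit     25MB
    storage.type      filesystem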

@PettitWesley
Contributor

PettitWesley commented Dec 9, 2022

Are you all using the Fluent Bit datadog output? Can you please share your Fluent Bit configuration files?

@lubingfeng
Contributor

Has someone from Datadog looked into this? Has anyone engaged the Datadog team?

@lubingfeng
Contributor

lubingfeng commented Dec 9, 2022

Has someone from Datadog looked into this? Has anyone engaged the Datadog team?

I have sent a note to DataDog public Slack channel - https://datadoghq.slack.com/archives/C8PV5LVDX/p1670625992901619 for someone to take a look.

@lubingfeng
Contributor

When updating to version 2.29.0 (previously 2.28.4) of aws-observability/aws-for-fluent-bit we are observing one of our task definitions entering a cycle of provisioning and de-provisioning.

We are running ECS with Fargate and aws-observability/aws-for-fluent-bit plus datadog/agent version 7.40.1 as sidecars.

We have not had an opportunity to look into the cause of this. Hopefully, you can provide some insights into how we can debug this further. Our next steps will likely be to try the FLB_LOG_LEVEL=debug environment variable and report back.

Can you share your ECS task definition pertaining to FireLens / DataDog log routing? This will help us isolate the problem.

@PettitWesley changed the title from "Version 2.29.0 Causing Task Definition to Deprovision" to "[Datadog output] Version 2.29.0 Causing Task to stop" on Dec 10, 2022
@tessro

tessro commented Dec 10, 2022

We consume Fluent Bit via the CDK. This was the offending code, in our case. It worked fine until we bumped to 2.29.0:

this.taskDefinition.addFirelensLogRouter('LogRouter', {
  image: ecs.obtainDefaultFluentBitECRImage(this.taskDefinition),
  essential: true,
  memoryLimitMiB: 256,
  firelensConfig: {
    type: ecs.FirelensLogRouterType.FLUENTBIT,
    options: {
      enableECSLogMetadata: true,
      configFileType: ecs.FirelensConfigFileType.FILE,
      configFileValue: '/fluent-bit/configs/parse-json.conf',
    },
  },
  logging: new ecs.FireLensLogDriver({
    options: {
      Name: 'datadog',
      Host: 'http-intake.logs.datadoghq.com',
      dd_service: 'log-router',
      dd_source: 'fluentbit',
      dd_tags: `env:production`,
      dd_message_key: 'log',
      TLS: 'on',
      provider: 'ecs',
    },
    secretOptions: {
      apikey: this.datadogToken,
    },
  }),
});

@srussellextensis

We are experiencing the same problem. We are not using datadog-agent, but the Fluent Bit task seems to stop randomly after 15-60 minutes. Switching back to 2.28.4 did resolve the issue.

This ticket title has been changed to reference Datadog specifically, but @besbes above mentioned seeing it without Datadog in the mix. Has there been more confirmation that it's only Datadog-related?

@matthewfala
Contributor

Could someone share the Fluent Bit configuration file that is in use when the crash occurs?

@tessro

tessro commented Dec 10, 2022

In our case it's the one that ships in the image: /fluent-bit/configs/parse-json.conf
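
For anyone unfamiliar with that bundled config: I haven't pasted its exact contents here, but the general shape of a JSON-parsing filter in Fluent Bit is roughly the following (illustrative only, not the actual contents of parse-json.conf):

[FILTER]
    Name          parser
    Match         *
    # Parse the container's "log" field with the built-in json parser,
    # keeping any other fields already on the record
    Key_Name      log
    Parser        json
    Reserve_Data  true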

@aidant
Author

aidant commented Dec 10, 2022

Apologies it's the weekend here in Australia, I'll talk to our DevOps about sharing the task definition on Monday.

@tessro

tessro commented Dec 10, 2022

Here's the ECS configuration that gave us problems:

{
  "name": "LogRouter",
  "image": "906394416424.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:2.29.0",
  "cpu": 0,
  "memory": 256,
  "links": [],
  "portMappings": [],
  "essential": true,
  "entryPoint": [],
  "command": [],
  "environment": [],
  "environmentFiles": [],
  "mountPoints": [],
  "volumesFrom": [],
  "secrets": [],
  "user": "0",
  "dnsServers": [],
  "dnsSearchDomains": [],
  "extraHosts": [],
  "dockerSecurityOptions": [],
  "dockerLabels": {},
  "ulimits": [],
  "logConfiguration": {
    "logDriver": "awsfirelens",
    "options": {
      "Host": "http-intake.logs.datadoghq.com",
      "Name": "datadog",
      "TLS": "on",
      "dd_message_key": "log",
      "dd_service": "log-router",
      "dd_source": "fluentbit",
      "dd_tags": "env:production",
      "provider": "ecs"
    },
    "secretOptions": [
      {
        "name": "apikey",
        "valueFrom": "[REDACTED]"
      }
    ]
  },
  "systemControls": [],
  "firelensConfiguration": {
    "type": "fluentbit",
    "options": {
      "config-file-type": "file",
      "config-file-value": "/fluent-bit/configs/parse-json.conf",
      "enable-ecs-log-metadata": "true"
    }
  }
}

@tessro

tessro commented Dec 10, 2022

Changing the version back to 2.28.4 resolves the issue with no other modifications.
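
For anyone else consuming Fluent Bit through the CDK, one way to make that rollback explicit is to pin the image to a known-good tag instead of relying on ecs.obtainDefaultFluentBitECRImage. A minimal sketch (assuming CDK v2; the task definition variable and tag are illustrative):

// Pin the FireLens log router to a specific aws-for-fluent-bit tag.
import * as ecs from 'aws-cdk-lib/aws-ecs';

declare const taskDefinition: ecs.FargateTaskDefinition;

taskDefinition.addFirelensLogRouter('LogRouter', {
  image: ecs.ContainerImage.fromRegistry(
    'public.ecr.aws/aws-observability/aws-for-fluent-bit:2.28.4'
  ),
  essential: true,
  memoryLimitMiB: 256,
  firelensConfig: {
    type: ecs.FirelensLogRouterType.FLUENTBIT,
  },
  // logging and secretOptions omitted for brevity
});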

@matthewfala
Contributor

Thank you for the information, @paulrosania.
We're trying to reproduce the issue. Would it be possible to get a sample log? We're currently testing with random logs of random length (0 to 100 characters) at a rate of 5 to 100 logs per second, and have also tried sending some one-off large logs. We have not yet seen the segfault, though judging by the number of people affected, it's probably not too hard to reproduce.

@tessro

tessro commented Dec 10, 2022

Unfortunately since Fluent Bit routes our logs straight from ECS, I don't think we have a capture of the log contents from the time of the issue. 😞

@PettitWesley
Contributor

AWS is actively working on reproducing this bug report. You can help us out by providing us with more information. We will try to fix it as quickly as possible, but please see: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#i-reported-an-issue-how-long-will-it-take-to-get-fixed

@PettitWesley
Contributor

If possible, please, can anyone share example log messages that are emitted by their apps when experiencing the issue?

@matthewfala
Contributor

@aidant Thank you!

@dhpiggott, thank you for the information. Would you try the prerelease images?

public.ecr.aws/fala-fluentbit/aws-for-fluent-bit:2.31.1-prerelease
public.ecr.aws/fala-fluentbit/aws-for-fluent-bit:init-2.31.1-prerelease

There are 2 sets of datadog segfaults

  1. Buffer resize segfault (Affects 2.28.4 and prior, and also in 2.29.1)
  2. Network segfault (Affects 2.29.0)

The 2.31.1-prerelease should get rid of both of these segfaults. We're hesitant to introduce these fixes in 2.29.1 due to a need for more validation that it actually resolves the segfaults we are experiencing here. It does resolve the segfaults from my reproduction tests.

Also, @dhpiggott, are you using the cloudwatch_logs plugin? If so, note that the /cloudwatch.so extension is the CloudWatch Go plugin, not the cloudwatch_logs C plugin, so it will not update the C cloudwatch_logs plugin.

@dhpiggott

Would you try the prerelease images?

I've put it on my list for next week :)

There are 2 sets of datadog segfaults

Ah sorry, I wasn't aware of that!

are you using the cloudwatch_logs plugin? If so, the /cloudwatch.so extension is for the cloudwatch go plugin not cloudwatch_logs c plugin

We are. I didn't realise that, thanks for correcting me! We've only just recently started dual-writing to Datadog, and none of the changes relating to that are in production yet, which is still using 2.29.0 directly. So we're only using the workaround I shared above in pre-production load tests at the moment. I'll make sure we don't inadvertently switch production back to the Go plugin now that I know - thanks :)

@besbes

besbes commented Jan 14, 2023

@matthewfala Is there a linux/arm64/v8 version of public.ecr.aws/fala-fluentbit/aws-for-fluent-bit:2.31.1-prerelease available? Would be happy to try it. Thank you!

@dhpiggott

Oh that’s a good point - our service runs on Graviton 2 task instances, so I would also need arm64 images (I hadn’t checked availability, but it sounds like there might not be any).

@dhpiggott

Thanks @PettitWesley - I watched https://www.youtube.com/watch?v=F73MgV_c2MM when I first set our service up with Fluent Bit and was researching the options. The performance section was very informative :) I just completely missed the significance of the plugin names when I put that Dockerfile patch together. We are using the core C plugin, and for now we're using 2.28.4, which I realise means we don't benefit from fluent/fluent-bit#6339.

@PettitWesley
Contributor

PettitWesley commented Jan 18, 2023

@dhpiggott For this reason we just put out a new release, which should run the same Datadog code as 2.28.4 but also includes the 2.29.0 cloudwatch_logs synchronous task scheduler fix that you mentioned: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.29.1

For everyone in this issue, please try out the new release and let us know your findings.

@matthewfala
Contributor

matthewfala commented Jan 18, 2023

Please note that 2.29.1 and 2.30.0 have been officially released, and should be at least as stable as 2.28.4 for Datadog while also adding the CloudWatch hang fix.

Thank you @dhpiggott! I'll hopefully build an arm version of the prerelease 2.31.1 image today.

So, to summarize all the versions and their potential problems:
2.28.4 and all prior versions: CloudWatch hang issue, and a rare Datadog buffer overrun segfault
2.29.0: Common Datadog network segfault
2.29.1: At least as stable as 2.28.4; rare Datadog buffer overrun segfault (revert)
2.31.1 prerelease: Hopefully no issues

@dhpiggott

I'll hopefully build an arm version of the prerelease 2.31.1 image today.

That would be great! At the moment we're a bit stuck with our service - we want to ship our changes that dual-write to Datadog and CloudWatch Logs. We're currently running 2.29.0 in prod and it writes to CloudWatch Logs great. The Datadog output fails if I try to enable it with 2.29.0 or 2.29.1, so we'd have to switch back to 2.28.4, but given the message about that having a CloudWatch hang issue, that doesn't seem safe either. So it sounds like the 2.31.1 prerelease is what we need. If the arm64 build passes all our tests once it's available, my plan will be to ship our dual-write change to prod using it.

@matthewfala
Contributor

@dhpiggott 2.30.0 should be at least as stable as 2.28.4 in terms of datadog, but more stable in terms of cloudwatch_logs. 2.30.0 has been released officially! Feel free to use that one in production while 2.31.1 is in testing.

Thank you for waiting for the arm prerelease! My arm machine wasn't working, so I had to load up a new one.

Here are the arm prerelease images

public.ecr.aws/fala-fluentbit/aws-for-fluent-bit:2.31.1-prerelease-arm
public.ecr.aws/fala-fluentbit/aws-for-fluent-bit:init-2.31.1-prerelease-arm

The prerelease should be more stable than 2.28.4 in terms of datadog and cloudwatch_logs.

Hope this helps.

@aidant
Author

aidant commented Jan 20, 2023

@matthewfala I've run the following images for about 15 min each and have not seen the issue arise.

public.ecr.aws/fala-fluentbit/aws-for-fluent-bit:2.31.1-prerelease
public.ecr.aws/fala-fluentbit/aws-for-fluent-bit:init-2.31.1-prerelease

@matthewfala
Contributor

@aidant
Thank you for your help in testing the prerelease. That's great news, as the 2.31.1 image should be more stable than 2.28.4, since it resolves the buffer overrun errors some users were facing in 2.28.4.

For now, if an official release image is needed, please use 2.30.0, as it is as stable as 2.28.4 for Datadog and resolves the cloudwatch hang issues.

2.31.1 will be coming out soon! I'll keep everyone posted here.

If anyone else wants to help validate the 2.31.1 image by checking it for segfaults in your workflow, it would be greatly appreciated.

@matthewfala
Contributor

@dhpiggott, @besbes, @tw-sarah, @atlantis, @paulrosania, @MikeWise01,

Would any of you be willing to help test the 2.31.1 prerelease image? We would like to be sure that new segfaults are not introduced in unexpected workflows.

AWS is not actively maintaining the Datadog plugin except in rare circumstances, like the recent segfaults found in 2.28.4 and prior, which heavily impact customers. As a result we don't have as many test cases for this plugin, and we want to make sure it gets community validation across as many workflows as possible before release.

@dhpiggott

Hey @matthewfala - thanks for the arm64 build. I've been OoO today so I haven't had the opportunity to test it yet, but I do plan to try it next week when I'm back in.

@dhpiggott

dhpiggott commented Jan 23, 2023

I just tried 2.31.1 but saw segfaults within a minute of container startup. Unfortunately that's the only detail I saw in the Fluent Bit container logs:

23/01/2023, 12:16:02 | [2023/01/23 12:16:02] [engine] caught signal (SIGSEGV) | fluent-bit

@matthewfala
Contributor

@dhpiggott,
Thank you for your response.
Are you able to use 2.29.1 on ARM without segfaults? How about 2.28.4?

@ArtDubya

Just wanted to add my own results... We had the issue of Fargate Tasks recycling over and over because the 'aws-for-fluent-bit:2.29.0' container was marked as essential and would crash with the 'Cannot allocate memory' line. We are configured to send logs to DataDog.

Just updated to 2.30.0 and all is well... container starts, forwards the logs, and Tasks have been healthy for over an hour.

1/27/2023, 11:19:02 AM	AWS for Fluent Bit Container Image Version 2.30.0	log-router
1/27/2023, 11:19:02 AM	Fluent Bit v1.9.10	log-router
1/27/2023, 11:19:02 AM	* Copyright (C) 2015-2022 The Fluent Bit Authors	log-router
1/27/2023, 11:19:02 AM	* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd	log-router
1/27/2023, 11:19:02 AM	* https://fluentbit.io	log-router
1/27/2023, 11:19:02 AM	[2023/01/27 19:19:02] [ info] [fluent bit] version=1.9.10, commit=6345dd7422, pid=1	log-router
1/27/2023, 11:19:02 AM	[2023/01/27 19:19:02] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128	log-router
1/27/2023, 11:19:02 AM	[2023/01/27 19:19:02] [ info] [cmetrics] version=0.3.7	log-router
1/27/2023, 11:19:02 AM	[2023/01/27 19:19:02] [ info] [input:tcp:tcp.0] listening on 127.0.0.1:8877	log-router
1/27/2023, 11:19:02 AM	[2023/01/27 19:19:02] [ info] [input:forward:forward.1] listening on unix:///var/run/fluent.sock	log-router
1/27/2023, 11:19:02 AM	[2023/01/27 19:19:02] [ info] [input:forward:forward.2] listening on 127.0.0.1:24224	log-router
1/27/2023, 11:19:02 AM	[2023/01/27 19:19:02] [ info] [output:null:null.0] worker #0 started	log-router
1/27/2023, 11:19:02 AM	[2023/01/27 19:19:02] [ info] [sp] stream processor started	log-router

@besbes

besbes commented Jan 29, 2023

@matthewfala I just tried 2.30.0 on ECS Fargate (no Datadog involved but we are running on Graviton) and got a segfault a few minutes after starting the container:

[2023/01/29 00:02:17] [engine] caught signal (SIGSEGV)

No other logs unfortunately.

@dhpiggott

@matthewfala

Are you able to use 2.29.1 with ARM without segfaults

I tried 2.29.1 a couple of weeks ago but did get segfaults (#491 (comment)).

how about 2.28.4?

I think when I used 2.28.4 it ran OK, but I saw in #491 (comment) and #491 (comment) that you mentioned 2.28.4 has a hang issue with the CloudWatch Logs output. At the moment our service uses CloudWatch Logs as the source of truth, and we're in the process of transitioning to Datadog logs. We've never run 2.28.4 in production (prior to that we were using the awslogs Docker log driver directly), so I'm very hesitant to downgrade production from 2.29 (with Datadog disabled, since we don't depend on it yet) to 2.28.4 (which we've never run in prod). In other words, we need our CloudWatch Logs delivery to be solid until we switch to Datadog (which will happen gradually).

@PettitWesley
Contributor

@besbes (and others): this issue is for segfaults associated with using the datadog output. If you are not using datadog and you see a crash, please open a new issue for that and give us your config, task def/pod YAML, etc.

@matthewfala
Contributor

@besbes
@Claych is looking into resolving the arm segfault issue you and some others identified. Thank you for letting us know.

We'll be releasing aws-for-fluent-bit v2.31.1 shortly, which should contain the most stable version of Datadog, as it has fixes for both of the recently identified (non-ARM) segfaults related to the datadog plugin.

@dhpiggott

To update from my end, I've been trying each new release as I've seen them, and saw 2.31.2 was released yesterday. In all the tests I've done so far it works great! Hopefully we can go to prod with it next week.

@dhpiggott

Resolve cloudwatch_logs duplicate tag match SIGSEGV issue introduced in 2.29.0

And that fits my observations - our config has a CloudWatch output and a Datadog output that both match all tags.
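
For reference, the shape of a config that hits that code path - two outputs that both match every tag - is roughly the following (a simplified sketch, not our actual config; the region, log group, and service names are placeholders):

[OUTPUT]
    Name                cloudwatch_logs
    Match               *
    region              eu-west-1
    log_group_name      /ecs/my-service
    log_stream_prefix   ecs-
    auto_create_group   true

[OUTPUT]
    Name                datadog
    Match               *
    Host                http-intake.logs.datadoghq.com
    TLS                 on
    apikey              ${DD_API_KEY}
    dd_service          my-service
    dd_source           fluentbit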

@lubingfeng
Contributor

This is great to hear!

@PettitWesley
Contributor

Closing this old issue
