Stops sending logs after connection/tls failure #410
Comments
Just checking: are you using this blog/example? https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/oomkill-prevention Also, are any logs successfully making it to Loki? From here it mostly looks like every request fails, so this seems to be mainly a network failure. Can you curl the loki endpoint from an instance inside the same subnet? Since you indicated it's crashing/stopping unexpectedly, this is the technique for getting a stack trace so we can take it to upstream and fix it: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#segfaults-and-crashes-sigsegv
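For reference, the reachability check being asked for is along these lines; the hostname and port below are placeholders, not values from this thread:

```sh
# Hypothetical Loki endpoint; substitute the real host and port.
# /ready is Loki's readiness probe and returns 200 when it can accept traffic.
curl -sv https://loki.example.internal:3100/ready
```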
Yes, I looked at that blog/example along with many (many, many) others. Yes, some logs are successfully making it to Loki. In the example I pasted above, the service was sending logs for about 90 minutes before it failed. Curling the loki endpoint still works. That service is one of more than 10 on the same subnet, and all of the others continued working even after this one failed. The time it takes to fail, and the service it fails on, are not repeatable. I'll see what I can do re: getting the stack trace.
I'll post more of the log when it crashes/exits, but on startup I see a bunch of "invalid read" and "invalid write" now.
I just noticed that the example to add valgrind had an old version of aws-for-fluent-bit, version 2.21.0. The log above was from running that version. It ran successfully for nearly two days without exiting/crashing. I'm going to try with version 2.28.0 now.
@hankwallace Where did you see this? Our Dockerfile.debug is always supposed to be on the same version as the prod release: https://github.com/aws/aws-for-fluent-bit/blob/mainline/Dockerfile.debug#L4
@PettitWesley I saw it here: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md Also, I added the
@PettitWesley there's another change needed to get it to work with 2.28. Want me to submit a PR for it?
@hankwallace Sure
@PettitWesley I just submitted #413
I have not seen a full crash in the last 2 days, but there are still lots of errors in the logs now that I'm using a build with valgrind. Is it possible that valgrind is preventing the crashes? Would uploading a log file just with the errors be helpful?
@hankwallace Please do, more data is usually helpful.
@PettitWesley Here are two log files. It's very interesting that I haven't seen a crash/exit since running it with valgrind. The error that usually occurs a little while before a crash/exit is this one:
That error often signals a retry loop that runs until it reaches the retry limit, followed shortly afterwards by a crash/exit. There is at least one EOF error in each log file, but the retries succeeded in these examples.
We're also running into this issue with a similar setup. Our ECS Fargate tasks would successfully send some logs to Grafana Cloud's Loki before encountering a TLS issue, retrying, and then hitting the retry limit. The problem happened with Fluent Bit v1.9.3 (amazon/aws-for-fluent-bit:stable) and v1.9.7.
service.conf
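For context, a minimal Fluent Bit service section of the kind such a file usually carries; the values below are illustrative, not taken from this setup:

```
[SERVICE]
    Flush        1
    Grace        30
    Log_Level    info
```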
task definition:
@PettitWesley any update on this? I am still running all containers in debug mode to prevent them from failing.
@PettitWesley we are not currently losing logs, but ONLY because we are running the debug configuration via valgrind. The known issues list doesn't appear to include the issue described here.
Hey all, I'm a fellow logging enthusiast and wanted to chime in here on the loki issue I see, and to provide a potential solution to a problem I've had wanting to send ECS logs to both loki and cloudwatch with firelens. TL;DR: I created a firelens docker image built from the init image. It appears that @hankwallace has got the built-in loki output configured directly.
While this can work, I believe @PettitWesley has pushed other ways to ship logs: using the plugin .so files that the grafana team provides (created by cloning the loki repo and building the .so file). I got set up by looking at this comment about using the newrelic plugin and creating my own image from the base aws-for-fluent-bit image.
I too was trying to find a solution, because I was able to get cloudwatch logs working properly with the base aws-for-fluent-bit image. Then I tried Loki's firelens solution, where they create the dynamic plugin library and place it into an image to be used as a firelens image. When I used their proposed image, I ran into problems as well. I also share @hankwallace's concern about not being able to find much within the grafana or fluentbit communities; it seemed like there wasn't a lot of tutorial support on how to diagnose or fix an issue like this. But you left a lot of great nuggets in your comments @PettitWesley, and I appreciate all the AWS blog posts around firelens. So I wanted to combine the two solutions. First, create the loki plugin from source:
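Something along these lines should work; the release tag is an assumption, and older loki releases kept the plugin under cmd/fluent-bit rather than clients/cmd/fluent-bit:

```sh
# Build the Grafana Loki output plugin for Fluent Bit as a shared object.
git clone https://github.com/grafana/loki.git
cd loki
git checkout v2.6.1   # assumed tag; pick the release you target
go build -buildmode=c-shared -o out_grafana_loki.so ./clients/cmd/fluent-bit/
```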
Then I created my own docker image from the init image, loading in the plugin:
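A minimal sketch, assuming the public init tag and an arbitrary plugin path:

```dockerfile
# Base on the aws-for-fluent-bit init image (tag is an assumption).
FROM public.ecr.aws/aws-observability/aws-for-fluent-bit:init-latest
# Copy in the Golang plugin built above.
COPY out_grafana_loki.so /fluent-bit/plugins/out_grafana_loki.so
```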
I created my own alternate fluentbit conf to be used (altered here to just use samples; I have the task definition configured to create the log groups and the other components necessary to make this work):
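A dual-output sketch of that idea; the region, log group, and URL are sample values, and the grafana-loki name assumes the Golang plugin registered above:

```
[OUTPUT]
    Name              cloudwatch_logs
    Match             *
    region            us-east-1
    log_group_name    /ecs/sample-app
    log_stream_prefix ecs-
    auto_create_group true

[OUTPUT]
    Name   grafana-loki
    Match  *
    Url    https://loki.example.com/loki/api/v1/push
    Labels {job="firelens"}
```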
This is how I'm attempting to resolve having two stream endpoints, and I hope it might be useful documentation for firelens that @PettitWesley could use in the future to showcase how to have firelens send logs to two different endpoints using their own plugins. I believe using the init image is the cleaner approach. I think fluentbit probably does have fixes, but they're all in 2.x.x versions of fluentbit, and while firelens is working, it's still on fluentbit 1.9.x. There's a lot more to dive into and figure out, but I hope this has helped.
So my work wasn't complete: I didn't fully grasp the init process at first. So add in the plugin line to the init image:
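That is, a plugins file entry along these lines (the path is an assumption):

```
# plugins.conf - registers the external Golang plugin with Fluent Bit.
[PLUGINS]
    Path /fluent-bit/plugins/out_grafana_loki.so
```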
Then tell the init process to include that plugin in the base command:
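For instance, by starting Fluent Bit with the external-plugin flag; the binary and file paths here are assumptions about the image layout:

```sh
# -e loads an external .so plugin; -c points at the config to run.
exec /fluent-bit/bin/fluent-bit \
    -e /fluent-bit/plugins/out_grafana_loki.so \
    -c /fluent-bit/etc/fluent-bit.conf
```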
Sorry to extend this thread; if you need to move this to a separate issue, or if there is a way to include this in a new feature to add in more plugins, please do so. I just wanted to get this information to a fellow AWS user who wants to use loki, and to see if this can help resolve their issue from a better-maintained angle.
@wick02 Yea, I think a different issue is needed for this. Also, explain in it why the upstream loki output doesn't work for you: https://docs.fluentbit.io/manual/pipeline/outputs/loki
Any updates on this issue? We are still running the debug container (using valgrind) to limit the crashes/failures.
Any updates? How can I help to move this forward? We are still using the debug container because the non-debug one fails intermittently.
Describe the question/issue
The `aws-for-fluent-bit` log router stops sending logs to Loki through an HTTPS proxy after a connection/TLS failure. The container sometimes exits shortly after and doesn't have anything in its log to indicate why. This causes the entire ECS task to restart, because I have the log router container marked `essential=true` so that we don't lose logs for a long period of time.

I have searched the issues here and in the fluent-bit repo. I have also searched the Grafana and Fluent slack communities.
Configuration

Deployment:
- `awsfirelens` to route logs to the `aws-for-fluent-bit` container
- `aws-for-fluent-bit` container is routing logs to a loki task in the same cluster

Relevant parts of the ECS task definition. The first container is the web app and the second is the log router:
The `extra.conf` file contains:
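Roughly this shape, with a loki output over TLS; the host, port, and labels below are placeholders rather than the actual values:

```
[OUTPUT]
    Name    loki
    Match   *
    host    loki.example.internal
    port    443
    tls     On
    labels  job=firelens
```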
Fluent Bit Log Output

Here's a partial log file where the error starts, the container fails to send any more logs (even on the retries), and then exits, killing the entire task because I have `essential=true`.

Fluent Bit Version Info
We are running the current stable version.
Cluster Details