Logging stack using Loki #72

Closed
32 of 43 tasks
Kristian-ZH opened this issue Aug 7, 2020 · 6 comments

@Kristian-ZH
Contributor

Kristian-ZH commented Aug 7, 2020

Current tasks

  • Logging documentation: this and this.
  • gardenctl related changes.
  • Add fluent-bit filters and parsers for all of the extensions.
  • Make fluent-bit aware of testing and hibernated clusters (currently, if there are log messages just before hibernation or messages from tf-pods, fluent-bit tries to flush them to a non-existent Loki, because it has already been deleted).
  • Remove the admission that is annotating pods with fluentbit.io/exclude [bug]
  • Upgrade loki to 1.6.0
  • Deploy logging stack as early as possible in shoot reconciliation flow: gardener#2457
  • Fix parser:lbReadvertiserParser time format PR
  • Use Fluent-bit tags to extract k8s metadata when it cannot be obtained from the API server (this change will prevent sending shoot logs to the garden namespace)
  • Investigate Fluent-bit entry out of order issue
    • Container restart scenario: Add ContainerID in stream identity in Fluent-bit plugin
    • fluent-bit logs are flushed async -> try to increase the flush interval
    • implement ID per stream
  • Expose Logging configurations
  • Deliver Fluent-bit plug-in lib only: Inject the custom plugin as a side car #62
  • PoC/Research bulk send logs to Loki - contribute to Fluent-bit plug-in
  • Rework Logging dashboards: gardener/gardener#2945
  • Review the labels and remove the unnecessary ones
  • Integration test which simulates logging stack in a loaded seed
    • Tail plugin silently loses logs under high load: fluent-bit #2723
    • Mitigate in the integration test with a 1-minute waiting time.
  • Deploy custom plugin as init container
  • Implement Loki tenants (Security)
  • Add priorityClassName to fluent-bit pods
  • Metrics for the custom plugin
  • Expose Cloud-Controller-Manager and CSI-Driver to the end-users via Dashboards
  • Separate the seed and shoot logging stack deletion to avoid unwanted cluster scope resource deletion.
  • Investigate slow volume attachment for Loki.
  • Make the Loki curator inode- and volume-based.
  • Make VPA logs available for end-users.
  • Add an integration test that checks for the entry out of order issue
  • Make integration test stable
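
The "entry out of order" items above (ContainerID in the stream identity, ID per stream) can be illustrated with a small sketch. Loki at these versions rejects an entry whose timestamp is older than the newest entry already accepted for the same stream, so if a restarted container shares a stream with its predecessor, late lines from the old container can be rejected. The sketch below is not the plugin's actual code: the function name `streamID` and the label names are hypothetical, and it only shows how adding a container ID yields a distinct identity per container instance.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// streamID builds a deterministic identity string from a label set plus
// the container ID. The name and the label keys are assumptions made for
// this illustration; they are not taken from the Fluent-bit Loki plugin.
func streamID(labels map[string]string, containerID string) string {
	// Sort the keys so the identity does not depend on map iteration order.
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s=%q,", k, labels[k])
	}
	// Appending the container ID separates a restarted container from
	// the previous instance that had the same pod and container name.
	fmt.Fprintf(&b, "container_id=%q", containerID)
	return b.String()
}

func main() {
	labels := map[string]string{
		"pod_name":       "etcd-main-0",
		"container_name": "etcd",
	}
	before := streamID(labels, "aaa111") // container before the restart
	after := streamID(labels, "bbb222")  // container after the restart
	fmt.Println(before)
	fmt.Println(after)
	fmt.Println("distinct streams:", before != after)
}
```

A trade-off worth keeping in mind: every extra identity component increases the number of streams in Loki, which is one reason the task list also calls for reviewing and removing unnecessary labels.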

Future improvements

  • Fetch logs from a shoot that is being deleted and store them in the central Loki instance
  • Fetch logs from the shoot nodes
  • Upgrade to Loki 2.0
  • Fetch Kernel logs: Make kernel logs accessible for seed machines. gardener#2991 - TBD
  • Throttling filter in fluent-bit plugin - cut logs from pods which are flooding aggressively - TBD
    • Create throttling filter
    • Create metrics based on which pod logs are throttled
    • Create alert when logs are throttled
@Kristian-ZH
Contributor Author

/assign

@rfranzke
Member

rfranzke commented Aug 7, 2020

/area logging

Just saw the following issue:

$ kg logs fluent-bit-trpr9
[proxy] error opening plugin /fluent-bit/plugins/out_loki.so: "/fluent-bit/plugins/out_loki.so: cannot open shared object file: No such file or directory"

@Kristian-ZH
Contributor Author

/area logging

Just saw the following issue:

$ kg logs fluent-bit-trpr9
[proxy] error opening plugin /fluent-bit/plugins/out_loki.so: "/fluent-bit/plugins/out_loki.so: cannot open shared object file: No such file or directory"

@vlvasilev

@vpnachev
Member

Remove the admission that is annotating pods with fluentbit.io/exclude as we now have a controller inside the fluent-bit dropping the unneeded logs.

@Kristian-ZH Kristian-ZH changed the title from "Logging stabilization" to "Logging stack using Loki" on Aug 18, 2020
@rfranzke
Member

rfranzke commented Sep 22, 2020

Given gardener/gardener#2865, let's add the following item as well:

  • Rework logging dashboards to make it easy to expose additional control plane component logs (e.g., cluster-autoscaler)

@Kristian-ZH
Contributor Author

Since the issues have been moved to gardener/gardener, we no longer need to track the backlog here.

/close

@Kristian-ZH Kristian-ZH unpinned this issue Feb 12, 2021

4 participants