issue - telegraf-operator - MountVolume.SetUp failed for volume "telegraf-config" : secret "telegraf-XXXX" not found #137

Open
anthosz opened this issue Feb 7, 2024 · 10 comments

anthosz commented Feb 7, 2024

Hello,

For the past few months we have been hitting this issue: about 50% of the time during an upgrade (when the pod respawns) and about 20% of the time during a pod reschedule (when it moves from one node to another).

The affected pod is part of a Varnish StatefulSet.

Template

apiVersion: v1
kind: Secret
metadata:
  name: varnish
[...]
data:
  key: "XXXXXXX"
---
[...]
---
apiVersion: apps/v1
kind: StatefulSet
[...]
spec:
  template:
    metadata:
      annotations:
        telegraf.influxdata.com/env-secretkeyref-SECRET_VALUE: varnish.secret
        telegraf.influxdata.com/volume-mounts: '{"datas":"/datas"}'
        telegraf.influxdata.com/inputs: |+
[...]

How to reproduce

Deploy a new version or move the pod to another node.

Current behavior (randomly):

Warning FailedMount 2m23s (x242 over 7h58m) kubelet MountVolume.SetUp failed for volume "telegraf-config" : secret "telegraf-config-varnish-0" not found

As a result, the pod cannot start.

Workaround:

Kill the pod; the secret is then recreated correctly.
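Since the workaround is a manual pod deletion, it could in principle be scripted. The following is only a minimal sketch, not something provided by telegraf-operator: the function names are hypothetical, it assumes `kubectl` is on the PATH with access to the cluster, and it matches on the event message shown above.

```python
import json
import subprocess


def pods_missing_telegraf_secret(events: dict) -> set:
    """Return names of pods whose FailedMount events mention a missing telegraf-config secret."""
    stuck = set()
    for item in events.get("items", []):
        obj = item.get("involvedObject", {})
        message = item.get("message", "")
        if (
            item.get("reason") == "FailedMount"
            and obj.get("kind") == "Pod"
            and 'secret "telegraf-config-' in message
            and "not found" in message
        ):
            stuck.add(obj["name"])
    return stuck


def restart_stuck_pods(namespace: str) -> None:
    """Query events with kubectl and delete every affected pod (needs cluster access)."""
    raw = subprocess.run(
        ["kubectl", "get", "events", "-n", namespace,
         "--field-selector", "reason=FailedMount", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for pod in sorted(pods_missing_telegraf_secret(json.loads(raw))):
        # Same action as the manual workaround: the StatefulSet recreates the pod.
        subprocess.run(["kubectl", "delete", "pod", pod, "-n", namespace], check=True)
```

Something like `restart_stuck_pods("my-namespace")` (placeholder namespace) could be run periodically. Note that events linger for a while after a pod recovers, so a more careful version would also check that the pod is still not running before deleting it.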

Expected behavior:

The secret is found

Other information

The source secret is more than 100 days old, so the problem cannot be related to it.

However, the telegraf secret seems to be recreated every time the pod is spawned, and that seems to be where the problem lies: the secret does not get created, so Telegraf cannot start (the missing secret cannot be mounted), and the pod stays stuck until we terminate it.

Versions

  • K8S: 1.28.4
  • Telegraf: 1.28.5
  • Telegraf operator: chart: 1.3.12 / APP version: 1.3.11
@GBlodgett35

We have the same issue. It looks like there is a race condition in the webhook with StatefulSets, because the pod names, and therefore the secret names, are the same from deploy to deploy. What is supposed to happen:

  • Old pod is deleted
  • Webhook deletes secret
  • New pod comes up
  • Webhook updates or creates secret

But what ends up happening is:

  • Old pod is deleted
  • New pod comes up
  • Webhook updates the existing secret
  • Webhook deletes secret due to the pod being deleted
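The two orderings above can be replayed in a small simulation. This is only a sketch of the suspected race, not the operator's actual code: the `SecretStore` class, the UID values, and the `check_owner` guard are hypothetical, and the guard is just one possible mitigation (skip the delete when the secret was last written for a different pod UID).

```python
from dataclasses import dataclass, field


@dataclass
class SecretStore:
    """In-memory stand-in for the cluster's Secret objects."""
    secrets: dict = field(default_factory=dict)  # secret name -> UID of the pod it was written for

    def on_pod_created(self, pod_name: str, pod_uid: str) -> None:
        # Create the secret, or update it if one with this name already exists.
        self.secrets[f"telegraf-config-{pod_name}"] = pod_uid

    def on_pod_deleted(self, pod_name: str, pod_uid: str, check_owner: bool) -> None:
        name = f"telegraf-config-{pod_name}"
        if check_owner and self.secrets.get(name) != pod_uid:
            return  # the secret was rewritten for a newer pod; leave it alone
        self.secrets.pop(name, None)


def replay(events, check_owner: bool) -> bool:
    """Apply events in order; return True if the secret exists at the end."""
    store = SecretStore()
    for kind, pod_name, pod_uid in events:
        if kind == "created":
            store.on_pod_created(pod_name, pod_uid)
        else:
            store.on_pod_deleted(pod_name, pod_uid, check_owner)
    return "telegraf-config-varnish-0" in store.secrets


OLD, NEW = "uid-old", "uid-new"  # same pod name, different pod UIDs
expected_order = [("created", "varnish-0", OLD), ("deleted", "varnish-0", OLD), ("created", "varnish-0", NEW)]
racy_order = [("created", "varnish-0", OLD), ("created", "varnish-0", NEW), ("deleted", "varnish-0", OLD)]

print(replay(expected_order, check_owner=False))  # True
print(replay(racy_order, check_owner=False))      # False: the late delete removes the new pod's secret
print(replay(racy_order, check_owner=True))       # True: the stale delete is ignored
```

The point of the sketch is that the secret name alone cannot tell the old pod from the new one, while something unique per pod instance can.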

@conman2305

Throwing in a +1 for this issue happening for us as well. Running influxdb/telegraf-operator:v1.3.10

anthosz commented Mar 8, 2024

@GBlodgett35 did you find a workaround? It's a pity that we cannot tell the pod to respawn when this issue occurs :/

anthosz commented Apr 2, 2024

Not sure why, but it is becoming recurrent during deployments and especially during node consolidation (and in some cases it is a real blocker, because there is no way to automatically trigger a restart when the pod is in error due to this missing secret).

I'm starting to wonder whether this project is still maintained, or whether we need to look for another solution. @gitirabassi @wojciechka

I can't really help with Go, but if you need testing or more information, don't hesitate to ask.

@GBlodgett35

@anthosz We ended up embedding Telegraf on the image instead of using the operator :(

anthosz commented Apr 10, 2024

That's what I feared; it seems this project is not maintained anymore 😑

@tlereste

Hello,
I have the same issue with telegraf-operator 1.3.11 and kubernetes 1.27.3.
And also, this issue is random and appears 50% of the time.
I welcome any information on this problem.
Thanks.

anthosz commented Apr 19, 2024

According to influxdata/telegraf#15192 (comment), it seems that this project is no longer maintained by InfluxData. The only way forward is to open a PR with the fix ourselves (not sure whether anyone here has Go/operator knowledge).

jmickey commented Jun 24, 2024

@anthosz @tlereste @GBlodgett35 Hi folks, as this project appears to no longer be maintained, I've gone ahead and rewritten it from scratch over at https://github.com/jmickey/telegraf-sidecar-operator.

It currently supports the majority of the same annotations as this project, with one notable exception: telegraf.influxdata.com/istio- annotations aren't currently supported.

The project is technically pre-1.0.0, but I've been running it on a staging cluster for about a week and it's been working well.

It also resolves this issue.

anthosz commented Jun 25, 2024

Thank you for the feedback. For now, I have personally moved everything to a sidecar and removed the operator.
