find files with '/var/log/pods/*/*/*.log' pattern: open .: permission denied #33083

Closed
MathiasPius opened this issue May 15, 2024 · 15 comments
Labels: bug (Something isn't working), needs triage (New item requiring triage), receiver/filelog

Comments

@MathiasPius

Component(s)

receiver/filelog

What happened?

Description

Filesystem permission issues when attempting to set up an OpenTelemetry Collector to hoover up pod log files.

The collector is deployed using the opentelemetry operator.

I have a suspicion this might be caused by the way doublestar traverses the filesystem.

Steps to Reproduce

Deploy an OpenTelemetryCollector agent into a Talos Kubernetes cluster.

Expected Result

Logs are read from /var/log/pods/ correctly.

Actual Result

No logs are collected, and the collector agent pod reports permission denied and "no files match the configured criteria".

Collector version

v0.99.0 & v0.100.0 (probably others)

Environment information

Environment

OS: Talos Linux 1.6.4, 1.7.0, 1.7.1

OpenTelemetry Collector configuration

config:
  receivers:
    filelog/std:
      exclude:
        - /var/log/pods/*/otel-collector/*.log
        - /var/log/pods/*/otc-container/*.log
      include:
        - /var/log/pods/*/*/*.log
      include_file_name: false
      include_file_path: true
      start_at: end

Log output

2024-05-15T12:40:06.200Z        warn    fileconsumer/file.go:43 finding files: no files match the configured criteria
find files with '/var/log/pods/*/*/*.log' pattern: open .: permission denied    {"kind": "receiver", "name": "filelog/std", "data_type": "logs", "component": "fileconsumer"}

Additional context

Talos creators suggested the issue might be resolved by granting the daemonset/pod the CAP_DAC_READ_SEARCH capability, but this did not work. I additionally tried adding the following security context, which also did not work:

securityContext:
    privileged: true
    capabilities:
      add:
        - DAC_READ_SEARCH
        - SYS_ADMIN

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@djaglowski
Member

A very similar issue was reported yesterday. Notably, using ** for intermediate directories is necessary. Please try this and let us know the result.
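For example, adapting the include pattern from the config above (the excludes would change in the same way; shown only as an illustration):

include:
  - /var/log/pods/**/*.log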

@MathiasPius
Author

That changed the error from open .: permission denied to stat .: permission denied, but the outcome remains the same.

Also, the OpenTelemetry docs themselves say to use singular *.

I tried using /var/log/containers/*.log as well, which is what the fluentbit deployment I'm using in the meantime is configured to use, and that also fails:

2024-05-16T16:29:43.668Z        warn    fileconsumer/file.go:43 finding files: no files match the configured criteria
find files with '/var/log/containers/*.log' pattern: open .: permission denied  {"kind": "receiver", "name": "filelog/std", "data_type": "logs", "component": "fileconsumer"}

The pods have identical hostPath mounts:

# Fluentbit
apiVersion: v1
kind: Pod
metadata:
  name: fluentbit-fluent-bit-dpnnk
  namespace: openobserve
spec:
  containers:
  - name: fluent-bit
    volumeMounts:
    - mountPath: /var/log
      name: varlog
  volumes:
  - hostPath:
      path: /var/log
      type: ""
    name: varlog
# agent-collector
apiVersion: v1
kind: Pod
metadata:
  name: openobserve-collector-agent-collector-bvrln
  namespace: openobserve
spec:
  containers:
  - name: otc-container
    volumeMounts:
    - mountPath: /var/log
      name: varlog
  volumes:
  - hostPath:
      path: /var/log
      type: ""
    name: varlog

@ChrsMark
Member

Hey @MathiasPius, could you share more details about the environment you are running in?
Did you maybe also try the collector Helm chart?

It could either be a permission issue that comes from the operator, or it could be platform-specific 🤔.

I wasn't able to reproduce it locally (using the collector Helm chart) on k8s v1.27.3 (kind v0.20.0 go1.20.4 linux/amd64). In the meantime, could you try running a busybox Pod with the same mounts and stat-ing the files from inside it?

Sharing the values file I used for reference (using latest main collector-contrib):

mode: daemonset
presets:
  logsCollection:
    enabled: true

command:
  name: otelcontribcol

config:
  exporters:
    debug:
      verbosity: detailed
    otlp/some:
      ....
  receivers:
    filelog:
      start_at: end
      include_file_name: false
      include_file_path: true
      exclude:
        - /var/log/pods/default_daemonset-opentelemetry-collector*_*/opentelemetry-collector/*.log
      include:
        - /var/log/pods/*/*/*.log

  service:
    pipelines:
      logs:
        receivers: [filelog]
        processors: [batch]
        exporters: [otlp/some]

@MathiasPius
Author

MathiasPius commented May 18, 2024

I am using an OpenTelemetryCollector with the opentelemetry operator, because the helm charts I have found depend on a resolvable spec.nodeName to access the kubelet, which is not the case for my current setup. The platform is Talos Linux 1.7.1 using Kubernetes 1.29.1.

The whole config is available here: https://github.com/MathiasPius/kronform/blob/main/manifests/infrastructure/openobserve/agent-collector.yaml#L82-L146

Note that this is the version using /var/log/containers/*.log, but I previously tried using /var/log/pods/*/*/*.log as well as /var/log/pods/**/*.log.

I deployed the following pod:

apiVersion: v1
kind: Pod
metadata:
  name: investigator
  namespace: openobserve
spec:
  containers:
  - name: pod
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - mountPath: /var/log
      name: varlog
  volumes:
  - hostPath:
      path: /var/log
      type: ""
    name: varlog

And did some digging around:

$ kubectl apply -f pod.yaml
pod/investigator created

$ kubectl exec -it -n openobserve investigator -- /bin/sh

/ # ls -lah /var/log/
total 28K
drwx------    5 root     root          49 May 14 11:29 .
drwxr-xr-x    1 root     root          28 May 18 09:55 ..
drwx------    3 root     root          18 May 14 11:29 audit
drwx------    2 root     root       12.0K May 18 09:55 containers
drwx------   47 root     root        8.0K May 18 09:55 pods

/ # ls -lah /var/log/pods/ | grep 'openobserve'
drwxr-xr-x    3 root     root          24 May 16 18:18 openobserve_fluentbit-fluent-bit-9hjjc_da25b3b5-87df-42a3-82e4-6337c0a16c83
drwxr-xr-x    3 root     root          17 May 18 09:55 openobserve_investigator_d52df42b-c9bf-4d09-983a-e63dc2f9b051
drwxr-xr-x    3 root     root          27 May 16 16:29 openobserve_openobserve-collector-agent-collector-bvrln_8cebb3f7-621b-4bd0-badc-855fed175f7d
drwxr-xr-x    3 root     root          27 May 15 23:46 openobserve_openobserve-collector-gateway-collector-0_6806e47b-9a89-44f2-9391-bf81508fd284

/ # ls -lah /var/log/pods/openobserve_openobserve-collector-agent-collector-bvrln_8cebb3f7-621b-4bd0-badc-855fed175f7d/otc-container/0.log
-rw-r-----    1 root     root        8.0K May 18 00:40 /var/log/pods/openobserve_openobserve-collector-agent-collector-bvrln_8cebb3f7-621b-4bd0-badc-855fed175f7d/otc-container/0.log

/ # stat /var/log/pods/openobserve_openobserve-collector-agent-collector-bvrln_8cebb3f7-621b-4bd0-badc-855fed175f7d/otc-container/0.log
  File: /var/log/pods/openobserve_openobserve-collector-agent-collector-bvrln_8cebb3f7-621b-4bd0-badc-855fed175f7d/otc-container/0.log
  Size: 8234      	Blocks: 24         IO Block: 4096   regular file
Device: 10306h/66310d	Inode: 850643      Links: 1
Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2024-05-16 16:29:43.462394419 +0000
Modify: 2024-05-18 00:40:14.707549589 +0000
Change: 2024-05-18 00:40:14.707549589 +0000

/ # tail -n 5 /var/log/pods/openobserve_openobserve-collector-agent-collector-bvrln_8cebb3f7-621b-4bd0-badc-855fed175f7d/otc-container/0.log
2024-05-16T16:29:43.6693092Z stderr F 2024-05-16T16:29:43.668Z	warn	fileconsumer/file.go:43	finding files: no files match the configured criteria
2024-05-16T16:29:43.66931705Z stderr F find files with '/var/log/containers/*.log' pattern: open .: permission denied	{"kind": "receiver", "name": "filelog/std", "data_type": "logs", "component": "fileconsumer"}
2024-05-16T16:29:43.669320868Z stderr F 2024-05-16T16:29:43.668Z	info	service@v0.100.0/service.go:195	Everything is ready. Begin running and processing data.
2024-05-16T16:29:43.669324467Z stderr F 2024-05-16T16:29:43.668Z	warn	localhostgate/featuregate.go:63	The default endpoints for all servers in components will change to use localhost instead of 0.0.0.0 in a future version. Use the feature gate to preview the new default.{"feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-05-18T00:40:14.709220437Z stderr F 2024-05-18T00:40:14.709Z	warn	kubelet/accumulator.go:102	failed to fetch container metrics	{"kind": "receiver", "name": "kubeletstats", "data_type": "metrics", "pod": "renovate-28599880-gfgpg", "container": "renovate", "error": "failed to set extra labels from metadata: pod \"9a657f71-e9f9-4365-b1e0-e07dc9b0628c\" with container \"renovate\" has an empty containerID"}

I won't rule out that Talos might have something to do with it, but since both my own pod shown above and Fluent-bit work out of the box, filelog seems to be the odd one out.

@ChrsMark
Member

Thanks @MathiasPius.

I tried to reproduce the issue (on a GKE cluster) using the operator as well, but I can't.

I'm using:

  • collector version: opentelemetry-collector-k8s:0.99.0
  • operator version: opentelemetry-operator:0.99.0
  • k8s version: v1.29.1-gke.1589020

Sharing the manifest I used for reference:

otel-col.yaml
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
 name: otel-collector
 labels:
   rbac.authorization.k8s.io/aggregate-to-admin: "true"
   rbac.authorization.k8s.io/aggregate-to-edit: "true"
   rbac.authorization.k8s.io/aggregate-to-view: "true"
rules:
 - apiGroups: [""] # "" indicates the core API group
   resources:
     - nodes
     - namespaces
     - events
     - pods
     - services
     - persistentvolumes
     - persistentvolumeclaims
   verbs: ["get", "watch", "list"]
 - apiGroups: [ "extensions" ]
   resources:
     - replicasets
   verbs: [ "get", "list", "watch" ]
 - apiGroups: [ "apps" ]
   resources:
     - statefulsets
     - deployments
     - replicasets
     - daemonsets
   verbs: [ "get", "list", "watch" ]
 - apiGroups: [ "batch" ]
   resources:
     - jobs
     - cronjobs
   verbs: [ "get", "list", "watch" ]
 - apiGroups: [ "storage.k8s.io" ]
   resources:
     - storageclasses
   verbs: [ "get", "list", "watch" ]
 - apiGroups:
     - ""
   resources:
     - nodes/stats
   verbs:
     - get
 - nonResourceURLs:
     - "/metrics"
   verbs:
     - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
 name: otelcol
subjects:
 - kind: ServiceAccount
   name: daemonset-collector # name of your service account
   namespace: default
roleRef: # referring to your ClusterRole
 kind: ClusterRole
 name: otel-collector
 apiGroup: rbac.authorization.k8s.io
---
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
 name: daemonset
spec:
 mode: daemonset
 serviceAccount:
 hostNetwork: true
 envFrom:
   - secretRef:
       name: otlp-secret
 volumeMounts:
   - name: varlogpods
     mountPath: /var/log/pods
     readOnly: true
 volumes:
   - name: varlogpods
     hostPath:
       path: /var/log/pods
 config: |
   exporters:
     debug: {}
     otlp:
       compression: none
       endpoint: ${env:otlp_endpoint}
       headers:
         Authorization: Bearer ${env:otlp_secret_token}
   extensions:
     health_check: {}
   processors:
     batch: {}
     filter/logs_instrumented_pods:
       logs:
         log_record:
           - resource.attributes["logs.exporter"] == "otlp"
     resource/k8s:
       attributes:
         - key: service.name
           from_attribute: app.label.component
           action: insert
     k8sattributes:
       extract:
         metadata:
         - k8s.namespace.name
         - k8s.deployment.name
         - k8s.statefulset.name
         - k8s.daemonset.name
         - k8s.cronjob.name
         - k8s.job.name
         - k8s.node.name
         - k8s.pod.name
         - k8s.pod.uid
         - k8s.pod.start_time
         - container.id
         labels:
         - tag_name: app.label.component
           key: app.kubernetes.io/component
           from: pod
         - tag_name: logs.exporter
           key: otel.logs.exporter
           from: pod
       filter:
         node_from_env_var: K8S_NODE_NAME
       passthrough: false
       pod_association:
       - sources:
         - from: resource_attribute
           name: k8s.pod.ip
       - sources:
         - from: resource_attribute
           name: k8s.pod.uid
       - sources:
         - from: connection
     memory_limiter:
       check_interval: 5s
       limit_percentage: 80
       spike_limit_percentage: 25
   receivers:
     filelog:
       exclude:
       - /var/log/pods/default_daemonset-collector*_*/opentelemetry-collector/*.log
       include:
       - /var/log/pods/*/*/*.log
       include_file_name: false
       include_file_path: true
       operators:
       - id: get-format
         routes:
         - expr: body matches "^\\{"
           output: parser-docker
         - expr: body matches "^[^ Z]+ "
           output: parser-crio
         - expr: body matches "^[^ Z]+Z"
           output: parser-containerd
         type: router
       - id: parser-crio
         regex: ^(?P<time>[^ Z]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$
         timestamp:
           layout: 2006-01-02T15:04:05.999999999Z07:00
           layout_type: gotime
           parse_from: attributes.time
         type: regex_parser
       - combine_field: attributes.log
         combine_with: ""
         id: crio-recombine
         is_last_entry: attributes.logtag == 'F'
         max_log_size: 102400
         output: extract_metadata_from_filepath
         source_identifier: attributes["log.file.path"]
         type: recombine
       - id: parser-containerd
         regex: ^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$
         timestamp:
           layout: '%Y-%m-%dT%H:%M:%S.%LZ'
           parse_from: attributes.time
         type: regex_parser
       - combine_field: attributes.log
         combine_with: ""
         id: containerd-recombine
         is_last_entry: attributes.logtag == 'F'
         max_log_size: 102400
         output: extract_metadata_from_filepath
         source_identifier: attributes["log.file.path"]
         type: recombine
       - id: parser-docker
         output: extract_metadata_from_filepath
         timestamp:
           layout: '%Y-%m-%dT%H:%M:%S.%LZ'
           parse_from: attributes.time
         type: json_parser
       - id: extract_metadata_from_filepath
         parse_from: attributes["log.file.path"]
         regex: ^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[a-f0-9\-]+)\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$
         type: regex_parser
       - from: attributes.stream
         to: attributes["log.iostream"]
         type: move
       - from: attributes.container_name
         to: resource["k8s.container.name"]
         type: move
       - from: attributes.namespace
         to: resource["k8s.namespace.name"]
         type: move
       - from: attributes.pod_name
         to: resource["k8s.pod.name"]
         type: move
       - from: attributes.restart_count
         to: resource["k8s.container.restart_count"]
         type: move
       - from: attributes.uid
         to: resource["k8s.pod.uid"]
         type: move
       - from: attributes.log
         to: body
         type: move
       - type: json_parser
         if: 'body matches "^{.*}$"'
         severity:
           parse_from: attributes.level
       start_at: end
   service:
     extensions:
     - health_check
     pipelines:
       logs:
         exporters:
         - otlp
         processors:
         - k8sattributes
         - batch
         - resource/k8s
         - filter/logs_instrumented_pods
         receivers:
         - filelog
---

I would suggest simplifying your configuration to focus only on the filelog receiver, and changing the mounts so that only /var/log/pods is mounted (adjusting the filelog receiver's config accordingly):

  volumeMounts:
    - name: varlogpods
      mountPath: /var/log/pods
      readOnly: true
  volumes:
    - name: varlogpods
      hostPath:
        path: /var/log/pods 

@open-telemetry/operator-approvers any ideas on whether this could be an operator-specific issue, specifically regarding the privileges it sets?

@MathiasPius
Author

Just tried applying your settings to my setup: MathiasPius/kronform@13014b1#diff-90c4b8f3bdde68e8eefedefaa8f8a89b6f5360e5135ccc32c524ca49e9057d9dR84-R299

2024-05-22T19:23:29.139Z        warn    fileconsumer/file.go:47 finding files   {"kind": "receiver", "name": "filelog/std", "data_type": "logs", "component": "fileconsumer", "component": "fileconsumer", "error": "no files match the configured criteria\nfind files with '/var/log/pods/*/*/*.log' pattern: open .: permission denied"}

Same exact result.

I'm a little curious about the open call. I dug through filelog and into the doublestar implementation and saw that it uses fs.ReadDir, which calls Open if the object does not implement ReadDir. I'm not very familiar with Go, so I'm not exactly sure what it means to implement the ReadDirFS interface, nor why my filesystem supposedly does not implement it, but the call and the source of the error seem suspect to me.
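For reference, io/fs.ReadDir only falls back to Open when the filesystem in question does not implement fs.ReadDirFS; a simplified, self-contained sketch of that stdlib behavior (not the actual doublestar or fileconsumer code) looks like this:

package main

import (
	"fmt"
	"io/fs"
	"os"
)

// readDir mirrors, in simplified form, what io/fs.ReadDir does: if the
// filesystem implements fs.ReadDirFS it calls that method directly;
// otherwise it falls back to Open + ReadDir, which is where an
// "open <dir>: permission denied" error can surface.
func readDir(fsys fs.FS, name string) ([]fs.DirEntry, error) {
	if rd, ok := fsys.(fs.ReadDirFS); ok {
		return rd.ReadDir(name)
	}
	f, err := fsys.Open(name) // needs read permission on the directory itself
	if err != nil {
		return nil, err
	}
	defer f.Close()
	dir, ok := f.(fs.ReadDirFile)
	if !ok {
		return nil, fmt.Errorf("%s: not a directory", name)
	}
	return dir.ReadDir(-1)
}

func main() {
	// os.DirFS implements fs.ReadDirFS in recent Go versions, but either
	// code path ultimately needs permission to list the directory, so a
	// 0700 root-owned /var/log fails either way for a non-root process.
	entries, err := readDir(os.DirFS("/var/log/pods"), ".")
	if err != nil {
		fmt.Println("error:", err) // e.g. "open .: permission denied"
		return
	}
	for _, e := range entries {
		fmt.Println(e.Name())
	}
}

So regardless of which branch is taken, the collector process needs read access to the directories it globs through.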

@swiatekm
Contributor

swiatekm commented May 22, 2024

One obvious difference between the Fluent Bit/busybox container images and the otel contrib image is that the former run as root by default, while the latter runs as a normal user. Can you try explicitly setting the securityContext to run the otel container as root?

In the Collector CR, it'd be something like:

  securityContext:
    runAsUser: 0
    runAsGroup: 0
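Roughly, in the OpenTelemetryCollector resource it would sit at the top level of the spec (a minimal sketch, assuming the rest of the spec stays as shown earlier in this thread):

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: daemonset
spec:
  mode: daemonset
  securityContext:
    runAsUser: 0
    runAsGroup: 0
  # volumes, volumeMounts and config unchanged from the earlier examples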

@MathiasPius
Author

@swiatekm-sumo That's such an obvious oversight on my part 🤡. I had simply assumed that it ran as root since those logs are often owned by root, and frankly a lot of Kubernetes-packaged software doesn't bother configuring a correct securityContext or running as non-root.

Explicitly running as root like you suggested fixed the issue!

@Ivalberto

Hi @MathiasPius, where did you add:

 securityContext:
    runAsUser: 0
    runAsGroup: 0

Which file? I'm facing the same issue.

@MathiasPius
Author

@Ivalberto Here's my entire OpenTelemetryCollector configuration, with the securityContext section highlighted: https://github.com/MathiasPius/kronform/blob/4055fbc830cb829d247be5759f14dae44d1ceb6e/manifests/infrastructure/openobserve/agent-collector.yaml#L275-L277

@zviratko

zviratko commented Nov 4, 2024

This solved it for me (Talos 1.8.0 + 1.8.2).
The problem is that /var/log on Talos is owned by root and doesn't allow any other user in (rwx------).
I'm not thrilled about it running as root, though...

otelAgent:
  securityContext:
    runAsUser: 0
    runAsGroup: 0
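One way to confirm this is from a pod with /var/log host-mounted (the busybox investigator pod from earlier in the thread works); on the affected Talos versions the output should look roughly like:

/ # stat -c '%A %U:%G %n' /var/log /var/log/pods
drwx------ root:root /var/log
drwx------ root:root /var/log/pods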

@zviratko

Just a heads up, Talos Linux 1.8.3 no longer requires this workaround because the permissions have been changed.

@MathiasPius
Author

> Just a heads up, Talos Linux 1.8.3 no longer requires this workaround because the permissions have been changed.

Thanks for reporting it to Talos!

@zviratko

> Just a heads up, Talos Linux 1.8.3 no longer requires this workaround because the permissions have been changed.
>
> Thanks for reporting it to Talos!

No, thank you for your blog! It pushed me (hopefully :)) in the right direction and helped immensely, especially in the beginning :)
