filelogreceiver tries to read logs from beginning even start_at=end #24839

Closed
ashlee3209 opened this issue Aug 3, 2023 · 6 comments
Labels: bug, question, receiver/filelog, waiting for author

Comments

@ashlee3209

Component(s)

receiver/filelog

What happened?

Description

I have set up an otel-collector with a log pipeline consisting of the filelog receiver and the loki exporter. The filelog receiver is configured with start_at: end, which should make it read only new logs from the moment the collector starts. However, it seems to respect start_at only for the first few log paths; all other log paths are still read from the beginning, and apparently out of order, so the loki exporter returns an 'entry too far behind' error.

Steps to Reproduce

I started this collector with a clean Loki instance and no existing checkpoint file. Shortly after startup I saw errors like the following:

2023-08-02T19:49:11.119Z error exporterhelper/queued_retry.go:391 Exporting failed. The error is not retryable. Dropping data.
{"kind": "exporter", "data_type": "logs", "name": "loki", "error": "Permanent error: HTTP 400 "Bad Request":
entry with timestamp 2022-07-21 14:27:48.808964741 +0000 UTC ignored, reason: 'entry too far behind, oldest acceptable timestamp is: 2023-02-03T08:02:57Z'
for stream: {exporter="OTLP", k8s_cluster_name="my_cluster", k8s_namespace_name="openshift-ingress-operator", k8s_pod_name="ingress-operator-6c46f79974-bcrpp", log_iostream="stderr"},", "dropped_items": 100}

I found the timestamp 2022-07-21 14:27:48.808964741 +0000 in one of the old log files written before my collector started, which should not have been read since I set start_at: end.

Also, in the collector log I can see Started watching file instead of Started watching file from end for those files:

2023-08-02T19:49:06.389Z info fileconsumer/file.go:194 Started watching file from end. To read preexisting logs, configure the argument 'start_at' to 'beginning' {"kind": "receiver", "name": "filelog", "data_type": "logs", "component": "fileconsumer", "path": "/var/log/pods/default_testpod_44f7ded7-03a6-42d3-82d7-f5126a4ed125/testcontainer/0.log"}
2023-08-02T19:49:07.389Z info fileconsumer/file.go:194 Started watching file from end. To read preexisting logs, configure the argument 'start_at' to 'beginning' {"kind": "receiver", "name": "filelog", "data_type": "logs", "component": "fileconsumer", "path": "/var/log/pods/openshift-cluster-machine-approver_machine-approver-9bb775975-bwpd6_582d5452-0bfd-4790-8d83-adc9a9394ef2/machine-approver-controller/8.log"}
2023-08-02T19:49:07.901Z info fileconsumer/file.go:192 Started watching file {"kind": "receiver", "name": "filelog", "data_type": "logs", "component": "fileconsumer", "path": "/var/log/pods/openshift-ingress-canary_ingress-canary-5tb8p_57266302-0dff-4c53-9a07-93c2f8e6acd0/serve-healthcheck-canary/3.log"}

Expected Result

If start_at is end, then the receiver should only read incoming logs from the moment it starts.

Actual Result

It reads old logs.

Collector version

0.82.0

Environment information


OpenShift Kubernetes cluster.

OpenTelemetry Collector configuration

mode: "daemonset"
presets:
  # Configures the collector to collect logs.
  # Adds the filelog receiver to the logs pipeline
  # and adds the necessary volumes and volume mounts.
  # Best used with mode = daemonset.
  logsCollection:
    enabled: true
    includeCollectorLogs: true
    # Enabling this writes checkpoints in /var/lib/otelcol/ host directory.
    # Note this changes collector's user to root, so that it can write to host directory.
    storeCheckpoints: true
  # Configures the Kubernetes Processor to add Kubernetes metadata.
  # Adds the k8sattributes processor to all the pipelines
  # and adds the necessary rules to ClusterRole.
  # Best used with mode = daemonset.
  kubernetesAttributes:
    enabled: true

config:
  extensions:
    # The health_check extension is mandatory for this chart.
    # Without the health_check extension the collector will fail the readiness and liveness probes.
    # The health_check extension can be modified, but should never be removed.
    health_check: { }
    memory_ballast:
      #  The ballast should be configured to be 1/3 to 1/2 of the memory allocated to the collector.
      size_in_percentage: 50
  exporters:
    logging:
      verbosity: detailed
    loki:
      endpoint: "http://loki_address:loki_port/loki/api/v1/push"
  processors:
    attributes:
      actions:
        - action: insert
          key: loki.attribute.labels
          value: log.iostream
    resource:
      attributes:
        - action: insert
          key: loki.resource.labels
          value: k8s.cluster.name,k8s.namespace.name,k8s.pod.name
    batch:
      send_batch_max_size: 800
      timeout: 10s
      send_batch_size: 500
    memory_limiter:
      # this calculation is based on the resource allocated to this collector, check resources/limits in the bottom
      check_interval: 1s
      # limit_mib is the hard limit
      limit_percentage: 90
      # The soft limit value will be equal to (limit_mib - spike_limit_mib).
      spike_limit_percentage: 20
    k8sattributes:
      passthrough: false
      filter:
        node_from_env_var: K8S_NODE_NAME
      pod_association:
        - sources:
            - from: resource_attribute
              name: k8s.pod.ip
        - sources:
            - from: resource_attribute
              name: k8s.pod.uid
        - sources:
            - from: connection
      extract:
        metadata:
          - k8s.namespace.name
          - k8s.deployment.name
          - k8s.statefulset.name
          - k8s.daemonset.name
          - k8s.cronjob.name
          - k8s.job.name
          - k8s.pod.name
          - k8s.pod.uid
          - k8s.node.name
  receivers:
    jaeger: null
    otlp: null
    prometheus: null
    zipkin: null
    filelog:
      include: [ /var/log/pods/*/*/*.log ]
      start_at: end
      poll_interval: 1s
      retry_on_failure:
        enabled: true
  service:
    telemetry:
      logs:
        level: "info"
      metrics:
        address: "[${env:MY_POD_IP}]:8888"
    extensions:
      - health_check
      - memory_ballast
    pipelines:
      logs:
        receivers:
          - filelog
        processors:
          - memory_limiter
          - k8sattributes
          - attributes
          - resource
        exporters:
          - loki
      metrics: null
      traces: null

# Resource limits & requests. Update according to your own use case as these values might be too low for a typical deployment.
resources:
  limits:
    cpu: 256m
    memory: 5120Mi

Log output

No response

Additional context

No response

@github-actions
Contributor

github-actions bot commented Aug 3, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@djaglowski
Member

I have been unable to replicate the behavior you are describing. Would it be possible to describe a minimal scenario (e.g. set of simple files, simplest possible config, precise steps to follow), or even better, a unit test that fails?
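
For example, something as small as the sketch below would be a helpful starting point (the directory and exporter settings here are only placeholders for illustration, not taken from the report above):

receivers:
  filelog:
    include: [ /tmp/filelog-repro/*.log ]
    start_at: end
    poll_interval: 1s
exporters:
  # The logging exporter prints every received record, so any preexisting
  # line showing up here was read despite start_at: end.
  logging:
    verbosity: detailed
service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [logging]

The steps would then be: create a few files under /tmp/filelog-repro containing old lines, start the collector, append a new line, and check whether the old lines appear in the logging exporter output.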

@fgalind1

Actually, we were having a similar issue. In our case we had more log files than the max_concurrent_files limit, which ended up enabling batch mode. It looks like when batching is enabled, fromBeginning is forced on (m.readerFactory.fromBeginning = true), and that causes only the first batch to behave correctly but not the rest.

In the meantime we worked around this by setting a very large limit of concurrent files to avoid batching, and that solved the problem. I also wasn't able to make it work with storage: file_storage; only when we did not configure a storage for the file position/offset did it stop reading files from the start and just tail from the end.
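
For reference, the workaround looks roughly like the sketch below; the exact limit is illustrative, anything comfortably above the number of files matched by include avoids batching:

receivers:
  filelog:
    include: [ /var/log/pods/*/*/*.log ]
    start_at: end
    # Default is 1024; raising it above the number of matched files
    # keeps the reader out of batch mode.
    max_concurrent_files: 40960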

@djaglowski
Member

@fgalind1, thanks for sharing your findings as well. Based on your description, I think this was likely fixed in #27773.

@atoulme
Contributor

atoulme commented Dec 6, 2023

Closing as completed; please reopen if the problem still occurs.

@obs-gh-enricogiorio

obs-gh-enricogiorio commented Oct 23, 2024

Hi @djaglowski, I am still facing this issue despite the fix.
I made sure start_at is set to "end", and I have checked the receiver's logs: nothing mentions that any batching happened.
Also, given the number of log files in the system, I can pretty much guarantee that the default limit of 1024 concurrent files has never been reached.
For instance, on 16/10/24 at 13:22:24.363 CEST the receiver read a log line with timestamp 14/05/24 14:40:41.042 CEST, which is more than 5 months old.
[screenshot: ingested log lines with BUNDLE_TIMESTAMP, timestamp, and delta columns]

From this screenshot you can see:

BUNDLE_TIMESTAMP is when the log line was parsed (technically it is when the line reached our system, but the actual parse time is likely only a few seconds earlier)
timestamp is the log's own timestamp
delta is the discrepancy between the two; the logs you see here are 4 months old
The purple bars at the top show when this data arrived over time. You can see that it did not happen constantly, but only twice, with many very old logs being read all at once

This is how I configured the receiver:

  filelog:
    exclude:  '[]'
    include:  '["/var/log/pods/*/*/*.log", "/var/log/kube-apiserver-audit.log"]'
    include_file_name: false
    include_file_path: true
    exclude_older_than: 24h
    operators:
    - id: container-parser
      max_log_size: 102400
      type: container
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval:  30s
      max_elapsed_time: 5m
    start_at: "end"
    storage: file_storage

Should I create a new issue, or is it fine to ask here?

Thanks
