Logparser plugin process half of lines when nginx access log buffer is enabled #3613

Closed
vvxvv opened this issue Dec 21, 2017 · 3 comments
Labels: area/tail, bug (unexpected problem or unintended behavior)

Comments


vvxvv commented Dec 21, 2017

Bug report

Relevant telegraf.conf:

[[inputs.logparser]]
files = ["/var/log/nginx/fastcgi-metrics.log"]
from_beginning = false
[inputs.logparser.grok]
patterns = ["%{NGINX_FCGI_METRICS}"]
measurement = "fcgi_metrics"
custom_patterns = '''
NGINX_FCGI_METRICS [%{HTTPDATE:ts:ts-httpd}] %{IPORHOST:server:tag} %{NUMBER:status:tag} %{NUMBER:upstream_status:tag} %{NUMBER:upstream_response_time:float} %{WORD:upstream_cache_status:tag}
'''

Relevant nginx.conf:
log_format telegraf '[$time_local] $server_name $status $upstream_status '
'$upstream_response_time $upstream_cache_status';

access_log /var/log/nginx/fastcgi-metrics.log telegraf buffer=1m flush=1;

System info:

OS: Debian Stretch
telegraf: v1.5

Steps to reproduce:

  1. Run: h2load -n 1000 -c 10 --h1 http://...
  2. SELECT count("upstream_response_time") AS "mean_upstream_response_time" FROM "telegraf"."autogen"."fcgi_metrics" WHERE time > now() - 10m GROUP BY time(10s) ORDER BY time DESC
  3. Expected 1000 counts. Actual 533.
  4. Removing "buffer=1m flush=1" from the access_log directive restores the count to 1000, as expected.
@vvxvv changed the title from "Logparser plugin process half of lines when nginx access log buffer enabled" to "Logparser plugin process half of lines when nginx access log buffer is enabled" Dec 21, 2017
@danielnelson added the area/tail and bug (unexpected problem or unintended behavior) labels Dec 22, 2017

bolek2000 commented Mar 19, 2018

I also discovered this effect. We use buffer=32k. I already discussed it with Daniel in this thread: https://community.influxdata.com/t/telegraf-logparser-and-duplicate-data-points/3346
After investigating further, I found that this problem always occurs when the data is not in chronological order, which is mostly the case when buffering is on. In most production environments buffering of webserver logs is practically mandatory because of I/O optimization, so with the current state of logparser/telegraf we cannot have (near) real-time data: we need to sort the log files before we can ingest them. I do it with a shell script (a rough sketch is at the end of this comment). Otherwise we lose a varying number of series.
I also tested with the $msec timestamp (e.g. 1521355646.617) in the access logs, but the problem still occurs for frequently occurring tag sets, such as busy URLs, because there are multiple requests per millisecond.
The only way around it I can think of would be a configurable buffer that stores log entries for a certain time period and sorts them by timestamp before applying the internal offset and sending the data to InfluxDB. By the way, the offset for second and millisecond timestamps works great (for chronological data), but it does not help here.

> select count(*) from proxy_access_log_unsorted
name: proxy_access_log_unsorted
time count_request_time count_resp_bytes count_upstream_time
---- ------------------ ---------------- -------------------
0    1520089            1520089          648454
> select count(*) from proxy_access_log_sorted
name: proxy_access_log_sorted
time count_request_time count_resp_bytes count_upstream_time
---- ------------------ ---------------- -------------------
0    1520351            1520351          648482 

More info on the nginx side is in the links below, but I suspect that even flushing the buffer more frequently would not help, because the different worker processes seem to have their own buffers:
https://stackoverflow.com/questions/41901948/nginx-logs-are-out-of-order-probably-due-to-buffered-logging
https://stackoverflow.com/questions/30761786/nginx-access-log-default-flush-time
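
For illustration, here is a minimal sketch of the sort-before-ingest workaround mentioned above. It assumes the access log lines start with a numeric $msec timestamp and that one rotated slice of the buffered log is processed per pass; the file names are placeholders, not our actual paths:

# Sort one rotated slice of the buffered access log by its leading
# $msec timestamp, then append the chronologically ordered lines to
# the file that telegraf tails.
sort -n -k1,1 /var/log/nginx/fastcgi-metrics.log.1 \
    >> /var/log/nginx/fastcgi-metrics.sorted.log

Handling rotation and avoiding re-reading slices that were already sorted is left out here, since that part is specific to our setup.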

@sjwang90

@vvxvv @bolek2000 Checking in on whether this is still a problem. We are evaluating enhancements to the inputs.tail plugin and wanted to confirm whether this issue still occurs.

@bolek2000

I am not sure whether this problem is solved (I guess not), because we built a workaround that is still in place. We have our own nginx access log preprocessing script, which sorts time slices of the buffered logs chronologically and then appends the sorted lines to a new log file, which gets ingested by the inputs.tail plugin. I'm not sure I will have the time to investigate this again in detail, but you could test how telegraf constructs the timestamps and tag sets in a scenario where the same tag set reoccurs at the same nginx timestamp. When log buffering is enabled, the nginx workers write the log lines out of chronological order, and the same nginx timestamp can reoccur multiple times; if the timestamp+offset+tagset combination reoccurs, the point is overwritten in InfluxDB and metrics get lost.
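
For example (hypothetical values, using the measurement and tag keys from the config at the top of this issue): two requests that resolve to the same timestamp and the same tag set collapse into a single point in InfluxDB, because the later write replaces the earlier one:

fcgi_metrics,server=example.com,status=200,upstream_status=200,upstream_cache_status=MISS upstream_response_time=0.012 1521355646000000000
fcgi_metrics,server=example.com,status=200,upstream_status=200,upstream_cache_status=MISS upstream_response_time=0.034 1521355646000000000

Only the second point survives, so one of the two requests disappears from the count.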
