Logparser plugin process half of lines when nginx access log buffer is enabled #3613

Closed
vvxvv opened this issue Dec 21, 2017 · 3 comments
Labels: area/tail, bug (unexpected problem or unintended behavior)

Comments


vvxvv commented Dec 21, 2017

Bug report

Relevant telegraf.conf:

[[inputs.logparser]]
files = ["/var/log/nginx/fastcgi-metrics.log"]
from_beginning = false
[inputs.logparser.grok]
patterns = ["%{NGINX_FCGI_METRICS}"]
measurement = "fcgi_metrics"
custom_patterns = '''
NGINX_FCGI_METRICS [%{HTTPDATE:ts:ts-httpd}] %{IPORHOST:server:tag} %{NUMBER:status:tag} %{NUMBER:upstream_status:tag} %{NUMBER:upstream_response_time:float} %{WORD:upstream_cache_status:tag}
'''

Relevant nginx.conf:
log_format telegraf '[$time_local] $server_name $status $upstream_status '
'$upstream_response_time $upstream_cache_status';

access_log /var/log/nginx/fastcgi-metrics.log telegraf buffer=1m flush=1;

System info:

OS: Debian Stretch
telegraf: v1.5

Steps to reproduce:

  1. Run: h2load -n 1000 -c 10 --h1 http://...
  2. SELECT count("upstream_response_time") AS "mean_upstream_response_time" FROM "telegraf"."autogen"."fcgi_metrics" WHERE time > now() - 10m GROUP BY time(10s) ORDER BY time DESC
  3. Expected 1000 counts. Actual 533.
  4. Removing "buffer=1m flush=1" from the access_log directive restores the count to 1000, as expected.
@vvxvv changed the title from "Logparser plugin process half of lines when nginx access log buffer enabled" to "Logparser plugin process half of lines when nginx access log buffer is enabled" Dec 21, 2017
@danielnelson added the area/tail and bug (unexpected problem or unintended behavior) labels Dec 22, 2017

bolek2000 commented Mar 19, 2018

I also discovered this effect. We use buffer=32k. I already discussed it with Daniel in this thread: https://community.influxdata.com/t/telegraf-logparser-and-duplicate-data-points/3346
After investigating further, I found that this problem always occurs when the data is not in chronological order, which is mostly the case when buffering is on. In most production environments buffering of webserver logs is practically mandatory because of I/O optimization, so with the current state of logparser/telegraf we cannot have (near) real-time data: we need to sort the log files before we can ingest them. I do it with a shell script (a rough sketch is at the end of this comment). Otherwise we lose a varying number of series.
I also tested with the $msec timestamp (e.g. 1521355646.617) in the access logs, but the problem still occurs for frequently occurring tag sets, such as busy URLs, because there are multiple requests per millisecond.
The only way around it I can think of would be a configurable buffer that stores log entries for a certain time period and sorts them by timestamp before applying the internal offset and sending the data to InfluxDB. By the way, the offset for second and millisecond timestamps works great (for chronological data), but it does not help here.

> select count(*) from proxy_access_log_unsorted
name: proxy_access_log_unsorted
time count_request_time count_resp_bytes count_upstream_time
---- ------------------ ---------------- -------------------
0    1520089            1520089          648454
> select count(*) from proxy_access_log_sorted
name: proxy_access_log_sorted
time count_request_time count_resp_bytes count_upstream_time
---- ------------------ ---------------- -------------------
0    1520351            1520351          648482 

More info on the nginx side is in the links below, but I suspect that even flushing the buffer more frequently would not help, because the different worker processes seem to have their own buffers:
https://stackoverflow.com/questions/41901948/nginx-logs-are-out-of-order-probably-due-to-buffered-logging
https://stackoverflow.com/questions/30761786/nginx-access-log-default-flush-time
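
For illustration, here is a minimal sketch of the sort-before-ingest workaround mentioned above. It assumes the access log lines start with a numeric $msec timestamp and that one rotated slice of the buffered log is processed per pass; the file names are placeholders, not our actual paths:

# Sort one rotated slice of the buffered access log by its leading
# $msec timestamp, then append the chronologically ordered lines to
# the file that telegraf tails.
sort -n -k1,1 /var/log/nginx/fastcgi-metrics.log.1 \
    >> /var/log/nginx/fastcgi-metrics.sorted.log

Handling rotation and avoiding re-reading slices that were already sorted is left out here, since that part is specific to our setup.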

@sjwang90

@vvxvv @bolek2000 Checking in on whether this is still a problem. We are evaluating enhancements to the inputs.tail plugin and wanted to confirm whether this issue still occurs.

@bolek2000

I am not sure whether this problem is solved (I guess not), because we built a workaround that is still in place. We have our own nginx access log preprocessing script, which sorts time slices of the buffered logs chronologically and then appends the sorted lines to a new log file, which gets ingested by the inputs.tail plugin. I'm not sure I will have the time to investigate this again in detail, but you could test how telegraf constructs the timestamps and tag sets in a scenario where the same tag set reoccurs at the same nginx timestamp. When log buffering is enabled, the nginx workers write the log lines out of chronological order, and the same nginx timestamp can reoccur multiple times; if the timestamp+offset+tagset combination reoccurs, the point is overwritten in InfluxDB and metrics get lost.
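
For example (hypothetical values, using the measurement and tag keys from the config at the top of this issue): two requests that resolve to the same timestamp and the same tag set collapse into a single point in InfluxDB, because the later write replaces the earlier one:

fcgi_metrics,server=example.com,status=200,upstream_status=200,upstream_cache_status=MISS upstream_response_time=0.012 1521355646000000000
fcgi_metrics,server=example.com,status=200,upstream_status=200,upstream_cache_status=MISS upstream_response_time=0.034 1521355646000000000

Only the second point survives, so one of the two requests disappears from the count.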
