Expected behavior:
There are no counter reset events in the datastream
Actual behavior:
There are unexpected counter reset events in the datastream
Additional info:
I have been tracking an issue where our monitoring graph suddenly shows large spikes, which started happening after upgrading from telegraf-1.9.1 to telegraf-1.9.4.
I have been running tcpdump traces to capture the traffic between prometheus and telegraf on one specific server which is exhibiting these symptoms, and am seeing counter reset events over the wire when there has been no reset. E.g. for the system_uptime value -- I saw the following extracted values from a tcpdump of the http requests:
Date: Thu, 14 Feb 2019 15:04:20 GMT
system_uptime{host="<REMOVED>"} 211225
Date: Thu, 14 Feb 2019 15:04:30 GMT
system_uptime{host="<REMOVED>"} 211225
Date: Thu, 14 Feb 2019 15:04:40 GMT
system_uptime{host="<REMOVED>"} 211235
Date: Thu, 14 Feb 2019 15:04:50 GMT
system_uptime{host="<REMOVED>"} 211255
Date: Thu, 14 Feb 2019 15:05:00 GMT
system_uptime{host="<REMOVED>"} 211245
Date: Thu, 14 Feb 2019 15:05:10 GMT
system_uptime{host="<REMOVED>"} 211275
As all of these values were for the same host, the 211245 uptime being received by prometheus after the 211255 uptime has triggered a counter reset. Analysis of scrape durations on prometheus found no instances where these exceeded our 10s scrape time.
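To make the reset explicit, here is a minimal sketch that scans the values from the tcpdump extract above and flags any sample that is lower than the one before it, which is the condition treated here as a counter reset. The `find_resets` helper and the `SAMPLES` list are names made up for this illustration; they are not part of telegraf or prometheus.

```python
# Samples copied from the tcpdump extract above: (scrape time, system_uptime).
SAMPLES = [
    ("15:04:20", 211225),
    ("15:04:30", 211225),
    ("15:04:40", 211235),
    ("15:04:50", 211255),
    ("15:05:00", 211245),
    ("15:05:10", 211275),
]


def find_resets(samples):
    """Return (time, previous_value, current_value) for every decrease.

    Hypothetical helper: a decrease between consecutive scrapes is what
    this sketch treats as a counter reset.
    """
    resets = []
    for (_, prev_value), (cur_time, cur_value) in zip(samples, samples[1:]):
        if cur_value < prev_value:
            resets.append((cur_time, prev_value, cur_value))
    return resets


RESETS = find_resets(SAMPLES)
print(RESETS)
```

Only the 15:05:00 scrape is flagged, where the value drops from 211255 to 211245. The repeated 211225 at 15:04:30 is not a decrease, so it is not counted.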
I have been trying multiple versions to bisect the release that introduced this issue. This server ran telegraf-1.9.1 stably for several weeks, and the issue has only occurred since upgrading to telegraf-1.9.4. Downgrading to telegraf-1.9.2 also seemed to resolve it, while telegraf-1.9.3 definitely exhibits the same issue as telegraf-1.9.4, so I believe it was introduced in telegraf-1.9.3 and is still present in telegraf-1.9.4.
I'm seeing this counter reset across a wide variety of metrics, but not in any consistent manner, so unfortunately it's proving difficult to reproduce. Any help would be appreciated.
This must be caused by a change I made to the order in which metrics are passed to the outputs. The metrics within a batch are now ordered from newest to oldest.
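A rough sketch of how that ordering could produce the symptom, under one assumption that is not confirmed in this thread: that the output keeps only the last sample written for each series and exposes that value on the next scrape. The function names, the batch layout, and the timestamps 1 and 2 are all hypothetical; this is not telegraf's actual code.

```python
def last_write_wins(batch):
    """Assumed output behaviour: the last sample written for a series is exposed."""
    exposed = {}
    for name, _timestamp, value in batch:
        exposed[name] = value
    return exposed


def newest_by_timestamp(batch):
    """Order-independent alternative: keep the sample with the latest timestamp."""
    newest = {}
    for name, timestamp, value in batch:
        if name not in newest or timestamp >= newest[name][0]:
            newest[name] = (timestamp, value)
    return {name: value for name, (_ts, value) in newest.items()}


# Two samples of one series landing in the same batch (timestamps are made up).
batch_oldest_first = [
    ("system_uptime", 1, 211245),
    ("system_uptime", 2, 211255),
]
batch_newest_first = list(reversed(batch_oldest_first))

print(last_write_wins(batch_oldest_first))      # newer sample written last
print(last_write_wins(batch_newest_first))      # older sample written last
print(newest_by_timestamp(batch_newest_first))  # does not depend on batch order
```

With oldest-first ordering the newer value is written last and is the one exposed. With newest-first ordering the older value overwrites it, so a scrape can return a stale value and appear to go backwards. Comparing timestamps instead of relying on write order avoids that dependence in this sketch.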
Relevant telegraf.conf:
/etc/telegraf.conf
/etc/telegraf/telegraf.d/default_inputs.conf
/etc/telegraf/telegraf.d/default_outputs.conf
System info:
CentOS7, telegraf 1.9.4.
Steps to reproduce:
Unclear