
InfluxDB 0.10.1 with Telegraf 0.10.3 udp failing under increased load #5788

Closed
allen13 opened this issue Feb 23, 2016 · 6 comments


allen13 commented Feb 23, 2016

I am currently using InfluxDB/Telegraf to monitor two environments of differing size: one has about 80 servers, the other 250. UDP is failing in the larger environment for unknown reasons. The server itself has 56 cores and RAIDed SSDs, in addition to bonded 10Gb NICs. The only error indicator I could find was the batchesTxFail metric in the _internal db, which rises steadily at about 50 per minute.
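
For reference, this is roughly how I have been watching that counter. It assumes the UDP service stats land in the _internal database under a measurement named udp; the field names here are what my 0.10 instance reports and may differ between versions.

```sh
# Rough sketch: dump the UDP service counters from _internal over the last hour.
# Measurement and field names (udp, batchesTx, batchesTxFail, pointsRx) are
# assumptions based on what my instance exposes.
influx -database _internal -execute \
  'SELECT batchesTx, batchesTxFail, pointsRx FROM "udp" WHERE time > now() - 1h'
```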

Things that have been ruled out:

Network - even the local Telegraf instance on the same machine can't write to it
Telegraf - running nc -lu 8089 while InfluxDB is stopped shows Telegraf writing the proper metrics
Memory - The machine has 256GB. Also, these values have been set in the config (see the config sketch after this list):
cache-max-memory-size = 128849018880 (120GB)
cache-snapshot-memory-size = 107374182400 (100GB)
UDP read buffer - The sysctl and config are set to read-buffer = 1073741824 (1GB)
Any other part of InfluxDB aside from UDP - switching to HTTP works great
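
For completeness, this is roughly how those values are laid out in my influxdb.conf (0.10-style sections; the bind address and database name are just what my setup uses):

```toml
# Rough sketch of the relevant parts of my influxdb.conf; everything else is
# left at the defaults.
[data]
  cache-max-memory-size = 128849018880       # 120GB
  cache-snapshot-memory-size = 107374182400  # 100GB

[[udp]]
  enabled = true
  bind-address = ":8089"
  database = "telegraf"                      # whatever database Telegraf writes to
  read-buffer = 1073741824                   # 1GB, matching the sysctl value
```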

HTTP may be the more reliable way to go for now, but it would be nice to reduce resource usage with UDP.


jwilder commented Feb 23, 2016

Hard to say off-hand what the issue is, but cache-snapshot-memory-size is way too high. That value is the threshold at which the in-memory cached values are compacted to TSM files. That should really be left at the default value of ~25MB.
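
In other words, either remove the override entirely or set it back to something like this:

```toml
[data]
  # Default snapshot threshold (~25MB); deleting the line has the same effect.
  cache-snapshot-memory-size = 26214400
```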

@jonseymour (Contributor)

@allen13 if you are willing to run code currently in master (i.e. this isn't a critical production system), note that master now includes #5758, which implements some cache throughput metrics, described in #5499.

Comparing the cache throughput with the compacted bytes per second (see the explanation at the bottom of #5499) may give you some idea of whether your disks are falling behind your inbound traffic. That said, I agree with @jwilder that your snapshot memory size is probably too high.

Again, only if you can afford to lose data: it might be informative to take a stack dump of the server while it is under high UDP load. If there is a bottleneck somewhere in that path, it might be obvious from an analysis of the stack dumps. To obtain a stack dump, send the influxd process a SIGQUIT (kill -QUIT <pid>). Note that this will stop the server, and there is a small risk of data loss (i.e. the risk that the software designed to minimize data loss isn't working 100% reliably).
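
Roughly along these lines, assuming a systemd-based package install (the unit name and log location may differ on your system):

```sh
# Send influxd a SIGQUIT; being a Go program, it dumps all goroutine stacks
# to stderr and then exits, so the server will stop.
sudo kill -QUIT "$(pidof influxd)"

# With systemd, stderr usually ends up in the journal.
sudo journalctl -u influxdb | tail -n 1000 > influxd-stacks.txt
```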

@jonseymour (Contributor)

BTW @allen13 - if you do switch to a master build, make sure to back up first, particularly the meta directory, because of this issue (#5772), and don't do it if you can't afford the risk of data loss.
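
Something along these lines is probably the simplest way to do that; stop the server first so the copy is consistent (paths assume the default package layout):

```sh
# Snapshot the meta directory while influxd is stopped.
sudo systemctl stop influxdb
sudo cp -a /var/lib/influxdb/meta /var/lib/influxdb/meta.bak.$(date +%Y%m%d)
sudo systemctl start influxdb
```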

Update: just in case there is any confusion - I don't speak for Influx; I am just an interested user who contributed some of the cache statistics code.


allen13 commented Feb 23, 2016

Data loss is not an issue; I have wiped the whole /var/lib/influxdb a few times during troubleshooting. I'll definitely set the snapshot size back to the default. Trying -QUIT next, master after that.


allen13 commented Feb 23, 2016

Couldn't find anything obviously wrong in the -QUIT stack dump. Can't spend any more time on debugging, so I'm switching back to HTTP until UDP becomes more stable and performant. Thanks for the help!

allen13 closed this as completed Feb 23, 2016
@jonseymour (Contributor)

@allen13 I am not 100% sure, but it might be hard for the UDP interface to ever perform as well as HTTP because the number of points that can fit in a legal UDP packet will be much smaller than what can be streamed in an HTTP payload. What you might be seeing is simply the costs associated with fragmenting data into UDP packets.

If this is the problem, then putting a batching UDP gateway between the UDP source and the Influx HTTP endpoint might let you take advantage of Influx's HTTP performance and give you much better control over the memory used for aggregating UDP packets. Of course, you would probably need to write some code to achieve this.
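
Very roughly, the kind of thing I have in mind is below - an untested sketch, where the listen address, write URL, database name and batch limits are all placeholders you would need to adjust:

```go
// Untested sketch of a batching UDP -> HTTP relay: read line-protocol
// datagrams, accumulate them, and POST them in batches to InfluxDB's
// /write endpoint. Addresses, database name and batch limits are placeholders.
package main

import (
	"bytes"
	"log"
	"net"
	"net/http"
	"time"
)

const (
	listenAddr = ":8089"                                   // where the UDP sources send
	writeURL   = "http://localhost:8086/write?db=telegraf" // InfluxDB HTTP write endpoint
	maxBatch   = 5000                                      // datagrams per HTTP request
	maxWait    = time.Second                               // flush interval
)

// flush POSTs the accumulated line-protocol batch and resets the buffer.
func flush(batch *bytes.Buffer) {
	if batch.Len() == 0 {
		return
	}
	resp, err := http.Post(writeURL, "text/plain", bytes.NewReader(batch.Bytes()))
	if err != nil {
		log.Printf("write failed: %v", err)
	} else {
		resp.Body.Close()
	}
	batch.Reset()
}

func main() {
	conn, err := net.ListenPacket("udp", listenAddr)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	packets := make(chan []byte, 1024)

	// Reader: pull datagrams off the socket as quickly as possible.
	go func() {
		buf := make([]byte, 64*1024)
		for {
			n, _, err := conn.ReadFrom(buf)
			if err != nil {
				log.Printf("read failed: %v", err)
				continue
			}
			if n == 0 {
				continue
			}
			p := make([]byte, n)
			copy(p, buf[:n])
			packets <- p
		}
	}()

	// Batcher: aggregate datagrams and flush on size or time.
	var batch bytes.Buffer
	count := 0
	ticker := time.NewTicker(maxWait)
	for {
		select {
		case p := <-packets:
			batch.Write(p)
			if p[len(p)-1] != '\n' {
				batch.WriteByte('\n')
			}
			count++
			if count >= maxBatch {
				flush(&batch)
				count = 0
			}
		case <-ticker.C:
			flush(&batch)
			count = 0
		}
	}
}
```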

That said, I have zero experience with UDP and influx so I can't really comment on how influx's UDP server performs - take everything I just said with a large grain of salt.
