Add simple threading to UDP packet handling, dump message when UDP buffers likely overflowed. #196

Closed

Conversation

SpencerMalone (Contributor)

Followup to #187

What this does

  1. If it's likely that the UDP buffers overflowed on Linux, dump a message.
  2. Adds basic threading to listening and the first step of packet handling for the UDP listener (roughly the pattern sketched below). In reality this will eventually get blocked at the next stage of data processing, which needs threading as well, but it's a good first step.
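
For reference, here is a minimal sketch of the pattern item 2 describes: a read loop that hands each packet to a goroutine, with a semaphore bounding how many handlers run at once. The handler count and names are illustrative assumptions, not the exporter's actual code.

package main

import (
	"log"
	"net"
)

// handlePacket stands in for the exporter's statsd line parsing.
func handlePacket(packet []byte) {
	_ = packet
}

func main() {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 9125})
	if err != nil {
		log.Fatal(err)
	}

	handlers := 4                        // think --udp-listener.handlers
	sem := make(chan struct{}, handlers) // bounds the number of in-flight handlers

	buf := make([]byte, 65535)
	for {
		n, _, err := conn.ReadFromUDP(buf)
		if err != nil {
			log.Printf("read error: %v", err)
			continue
		}
		packet := make([]byte, n)
		copy(packet, buf[:n]) // copy out, since buf is reused on the next read

		sem <- struct{}{} // blocks the read loop when all handler slots are busy
		go func(p []byte) {
			defer func() { <-sem }()
			handlePacket(p)
		}(packet)
	}
}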

How I tested this

There were two parts I wanted to test: first, did the reporting work, and second, did the threading help. Here's my testing format:

Proving that UDP overflow prediction works:

Terminal 1:

docker build -t statsd-exporter .
docker run --rm -p 9125:9125/udp -p 9102:9102 statsd-exporter --udp-listener.handlers=1 --udp-listener.threads=1

Terminal 2:

for i in {1..400}; do echo "deploys.test.myservice:1|c" > /dev/udp/0.0.0.0/9125; done

Then in terminal 1, I see the following:

time="2019-04-02T17:34:48Z" level=info msg="UDP buffer size is high enough that packet drops may have occurred. You may need to allocate more resources for the exporter, or increase the statsd.read-buffer parameter and the kernel parameter net.core.rmem_max" source="exporter.go:703"
time="2019-04-02T17:34:48Z" level=info msg="UDP buffer size is high enough that packet drops may have occurred. You may need to allocate more resources for the exporter, or increase the statsd.read-buffer parameter and the kernel parameter net.core.rmem_max" source="exporter.go:703"
time="2019-04-02T17:34:48Z" level=info msg="UDP buffer size is high enough that packet drops may have occurred. You may need to allocate more resources for the exporter, or increase the statsd.read-buffer parameter and the kernel parameter net.core.rmem_max" source="exporter.go:703"
time="2019-04-02T17:34:48Z" level=info msg="UDP buffer size is high enough that packet drops may have occurred. You may need to allocate more resources for the exporter, or increase the statsd.read-buffer parameter and the kernel parameter net.core.rmem_max" source="exporter.go:703"
time="2019-04-02T17:34:48Z" level=info msg="UDP buffer size is high enough that packet drops may have occurred. You may need to allocate more resources for the exporter, or increase the statsd.read-buffer parameter and the kernel parameter net.core.rmem_max" source="exporter.go:703"

At which point I pushed it a little more (verified it did not continue spamming the above message), and decided to verify in procfs:

  sl  local_address                         remote_address                        st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
 8761: 00000000000000000000000000000000:23A5 00000000000000000000000000000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 623849 2 0000000000000000 36596

Drops proven!
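
For anyone who wants to run the same procfs check programmatically: the last column of /proc/net/udp and /proc/net/udp6 is the per-socket drop count. Below is a small, hedged Go sketch that just sums that column. The PR itself pulls in the mdlayher/netlink dependency for watching the socket, so treat this only as an independent spot check, not as how the exporter does it.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// sumDrops totals the trailing "drops" column of a /proc/net/udp-style table.
func sumDrops(path string) (uint64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	var total uint64
	s := bufio.NewScanner(f)
	s.Scan() // skip the header line
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) == 0 {
			continue
		}
		if d, err := strconv.ParseUint(fields[len(fields)-1], 10, 64); err == nil {
			total += d
		}
	}
	return total, s.Err()
}

func main() {
	for _, p := range []string{"/proc/net/udp", "/proc/net/udp6"} {
		if total, err := sumDrops(p); err == nil {
			fmt.Printf("%s: %d drops\n", p, total)
		}
	}
}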

Proving that the threading works

This was simple: I redid the last test, but with --udp-listener.handlers=10000 --udp-listener.threads=1000. It did not complain about potential drops in my lower-volume test. When I pushed it much harder, I still got it to complain, but at least we know it won't spew false positives.

Signed-off-by: SpencerMalone <malone.spencer@gmail.com>
Co-authored-by: AustinSchoen <austin.schoen@gmail.com>
SpencerMalone force-pushed the overflow-improvements branch from 5b125b5 to 7cd5bab on April 2, 2019 20:11

SpencerMalone (Contributor, Author) commented Apr 2, 2019:

One thing I'd like to add to this pre-merge: some exposed metrics about the potential # of times it's overflowed

I also don't love the package name, trying to think of an alternative, up for suggestions.

I'd love to make this a bit cleaner, but am struggling with the how around that.
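
For the exposed-metric idea, something along these lines would probably be enough. This is a hedged sketch using prometheus/client_golang; the metric name and the point where it gets incremented are made up for illustration, not anything the exporter exposes today.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// udpPacketDrops would be incremented wherever the listener decides it has
// (probably) lost packets; the name here is illustrative only.
var udpPacketDrops = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "statsd_exporter_udp_packet_drops_total",
	Help: "Number of UDP packets believed to have been dropped.",
})

func main() {
	prometheus.MustRegister(udpPacketDrops)

	// udpPacketDrops.Inc() // called at the point where a drop is detected

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9102", nil))
}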

@@ -10,6 +10,7 @@ require (
github.com/howeyc/fsnotify v0.0.0-20151003194602-f0c08ee9c607
github.com/kr/pretty v0.1.0 // indirect
github.com/mattn/go-isatty v0.0.4 // indirect
github.com/mdlayher/netlink v0.0.0-20190313131330-258ea9dff42c

SpencerMalone (Contributor, Author):

I went with this package because of its use in https://github.com/prometheus/node_exporter

matthiasr (Contributor) left a comment:

This looks pretty good already, thank you for putting it together!

"don't love the package name, trying to think of an alternative, up for suggestions"

bufwatch? bufferwatcher?

I agree that we need a metric for when it happens.

Would it make sense to restrict the whole buffer watching business to platforms where it will work? Or is this portable?

return buf.Bytes(), nil
}

func convert_addr_to_int32(ip []byte) (ret [4]uint32) {

Contributor:

please use camelCase instead of snake_case

case sem <- struct{}{}:
{
go func() {
l.handlePacket(data[0:n], e)

Contributor:

is it safe to call handlePacket concurrently?

why a semaphore and goroutines per handlePacket call, vs long-lived goroutines that consume off a (buffered) channel? IIUC, with the buffered channel it would be possible to try to write an event into this channel, but if the buffer is full, increment a drop counter and move on?

SpencerMalone (Contributor, Author):

It is safe! At least, I ran this with a -race build with no complaints, and have ~22 deployments of this happily chugging along at a pretty high volume and no related crashes.

Re: semaphore - I think that is a great suggestion; this was originally @AustinSchoen's work, so let me collab with him. The idea of reading UDP packets as fast as we can and dropping them if processing can't keep up, instead of blocking (but keeping track of that drop tally), is elegant; big +1 on heading down that path. That should completely remove my need for procfs metrics.

Contributor:

all credit for that idea goes to @rtreffer!

Reply:

I believe I went with a semaphore because it was the first example of a tunable queue of workers that I could find; there are multiple ways of handling this model. We were still early in the experimentation phase as to what amount and kind of concurrency we needed to handle the firehose of UDP packets coming at our exporters. Regardless, I don't think we'd want to just drop data on the floor every time we lack the capacity to handle it. Those packets should back up into the protocol buffer while waiting to be read by the configured number of handlers.

In situations where a system is overloaded for only a few (milli)seconds, it would be better to block on reads from the protocol buffer and wait to handle the packets than to drop them and log that we lost packets, IMO.

Contributor:

"I don't think we'd want to just drop data on the floor if we don't have the capacity to handle it in every situation"

I agree, and I don't propose that we do that. We can buffer again after the initial receive.

However, at some point we have to drop, and we want to know when that happens. I would rather guarantee that that happens in a place where we can easily observe it (i.e. in user-land Go code) than introduce complex platform-dependent monitoring of the socket buffers themselves.

Contributor:

With "buffer again" I mean stuffing them into a buffered channel; we don't have to get fancy about it.
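
To make the shape of that concrete, here is a minimal sketch of the alternative being discussed: long-lived workers draining a buffered channel, with a counter bumped instead of blocking when the channel is full. Sizes, names, and the counter are illustrative assumptions, not code from this PR.

package main

import (
	"log"
	"sync/atomic"
)

var droppedPackets uint64 // would back a Prometheus counter in the exporter

// startWorkers launches n long-lived goroutines that drain the queue.
func startWorkers(n int, queue <-chan []byte, handle func([]byte)) {
	for i := 0; i < n; i++ {
		go func() {
			for pkt := range queue {
				handle(pkt)
			}
		}()
	}
}

// enqueue never blocks the read loop: if the buffer is full, the packet is
// dropped and counted instead.
func enqueue(queue chan<- []byte, pkt []byte) {
	select {
	case queue <- pkt:
	default:
		atomic.AddUint64(&droppedPackets, 1)
	}
}

func main() {
	queue := make(chan []byte, 1024) // buffer between the UDP read loop and the workers
	startWorkers(8, queue, func(pkt []byte) { _ = pkt })
	enqueue(queue, []byte("deploys.test.myservice:1|c"))
	log.Printf("dropped so far: %d", atomic.LoadUint64(&droppedPackets))
}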


if err != nil {
shouldWatchBuffer = false
log.Debugf("Unable to watch UDP buffer size due to: %v", err)

Contributor:

would this always happen on non-Linux platforms?

SpencerMalone (Contributor, Author):

Would this always log? Yes, it would. I figured a debug log would be a good place to quietly drop that to minimize confusion over why they can't get the buffer tracking to work, but if you think it's unnecessary, I would be happy to more quietly short circuit that.

Contributor:

it's fine as it is, just trying to understand! Maybe add a comment to say how we expect the initialization to behave on non-Linux systems?
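
One way to make the non-Linux behavior explicit would be a build-tag split, roughly like the hypothetical sketch below. The file, package, and function names just follow the bufwatch suggestion above; they are not real files in this PR.

// bufwatch_linux.go
// +build linux

package bufwatch

// Watch relies on Linux-only socket statistics (netlink), so it is only
// compiled on Linux.
func Watch() { /* netlink-based buffer watching goes here */ }

// bufwatch_other.go
// +build !linux

package bufwatch

// Watch is a no-op on non-Linux platforms: the exporter runs exactly as
// before, and at most logs a debug message that buffer watching is disabled.
func Watch() {}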

matthiasr commented Apr 11, 2019 via email

SpencerMalone (Contributor, Author):

I like the implementation in "Add internal event queuing and flushing" a lot more. On Monday I'm going to see whether that plus the metric cache is enough that we don't need to dive into any of this threading nonsense; if so, I'll close this out and work on a follow-up to the internal event queue to add introspection there.

matthiasr (Contributor):

Great, I'll close this PR then, since it already has conflicts. Thank you for the impulse and discussion!
