Add simple threading to UDP packet handling, dump message when UDP buffers likely overflowed. #196

Closed

Conversation

SpencerMalone (Contributor)

Followup to #187

What this does

  1. If it's likely that the UDP buffers overflowed on Linux, dump a message.
  2. Adds basic threading to listening and the first step of packet handling for the UDP listener (roughly the pattern sketched below). In reality this will eventually get blocked at the next stage of data processing, which needs threading as well, but it's a good first step.
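
For reference, here is a minimal sketch of the pattern item 2 describes: a read loop that hands each packet to a goroutine, with a semaphore bounding how many handlers run at once. The handler count and names are illustrative assumptions, not the exporter's actual code.

package main

import (
	"log"
	"net"
)

// handlePacket stands in for the exporter's statsd line parsing.
func handlePacket(packet []byte) {
	_ = packet
}

func main() {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 9125})
	if err != nil {
		log.Fatal(err)
	}

	handlers := 4                        // think --udp-listener.handlers
	sem := make(chan struct{}, handlers) // bounds the number of in-flight handlers

	buf := make([]byte, 65535)
	for {
		n, _, err := conn.ReadFromUDP(buf)
		if err != nil {
			log.Printf("read error: %v", err)
			continue
		}
		packet := make([]byte, n)
		copy(packet, buf[:n]) // copy out, since buf is reused on the next read

		sem <- struct{}{} // blocks the read loop when all handler slots are busy
		go func(p []byte) {
			defer func() { <-sem }()
			handlePacket(p)
		}(packet)
	}
}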

How I tested this

There were two parts I wanted to test: first, did the reporting work, and second, did the threading help. Here's my testing format:

Proving that UDP overflow prediction works:

Terminal 1:

docker build -t statsd-exporter .
docker run --rm -p 9125:9125/udp -p 9102:9102 statsd-exporter --udp-listener.handlers=1 --udp-listener.threads=1

Terminal 2:

for i in {1..400}; do echo "deploys.test.myservice:1|c" > /dev/udp/0.0.0.0/9125; done

Then in terminal 1, I see the following:

time="2019-04-02T17:34:48Z" level=info msg="UDP buffer size is high enough that packet drops may have occurred. You may need to allocate more resources for the exporter, or increase the statsd.read-buffer parameter and the kernel parameter net.core.rmem_max" source="exporter.go:703"
time="2019-04-02T17:34:48Z" level=info msg="UDP buffer size is high enough that packet drops may have occurred. You may need to allocate more resources for the exporter, or increase the statsd.read-buffer parameter and the kernel parameter net.core.rmem_max" source="exporter.go:703"
time="2019-04-02T17:34:48Z" level=info msg="UDP buffer size is high enough that packet drops may have occurred. You may need to allocate more resources for the exporter, or increase the statsd.read-buffer parameter and the kernel parameter net.core.rmem_max" source="exporter.go:703"
time="2019-04-02T17:34:48Z" level=info msg="UDP buffer size is high enough that packet drops may have occurred. You may need to allocate more resources for the exporter, or increase the statsd.read-buffer parameter and the kernel parameter net.core.rmem_max" source="exporter.go:703"
time="2019-04-02T17:34:48Z" level=info msg="UDP buffer size is high enough that packet drops may have occurred. You may need to allocate more resources for the exporter, or increase the statsd.read-buffer parameter and the kernel parameter net.core.rmem_max" source="exporter.go:703"

At which point I pushed it a little more (verified it did not continue spamming the above message), and decided to verify in procfs:

  sl  local_address                         remote_address                        st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
 8761: 00000000000000000000000000000000:23A5 00000000000000000000000000000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 623849 2 0000000000000000 36596

Drops proven!
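
For anyone who wants to run the same procfs check programmatically: the last column of /proc/net/udp and /proc/net/udp6 is the per-socket drop count. Below is a small, hedged Go sketch that just sums that column. The PR itself pulls in the mdlayher/netlink dependency for watching the socket, so treat this only as an independent spot check, not as how the exporter does it.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// sumDrops totals the trailing "drops" column of a /proc/net/udp-style table.
func sumDrops(path string) (uint64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	var total uint64
	s := bufio.NewScanner(f)
	s.Scan() // skip the header line
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) == 0 {
			continue
		}
		if d, err := strconv.ParseUint(fields[len(fields)-1], 10, 64); err == nil {
			total += d
		}
	}
	return total, s.Err()
}

func main() {
	for _, p := range []string{"/proc/net/udp", "/proc/net/udp6"} {
		if total, err := sumDrops(p); err == nil {
			fmt.Printf("%s: %d drops\n", p, total)
		}
	}
}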

Proving that the threading works

This was simple: I redid the last test, but with --udp-listener.handlers=10000 --udp-listener.threads=1000. It did not complain about potential drops in my lower-volume test. When I pushed it much harder, I still got it to complain, but at least we know it won't spew false positives.

Signed-off-by: SpencerMalone <malone.spencer@gmail.com>
Co-authored-by: AustinSchoen <austin.schoen@gmail.com>
SpencerMalone force-pushed the overflow-improvements branch from 5b125b5 to 7cd5bab on April 2, 2019 20:11

SpencerMalone (Contributor, Author) commented Apr 2, 2019:

One thing I'd like to add to this pre-merge: some exposed metrics about the potential # of times it's overflowed

I also don't love the package name, trying to think of an alternative, up for suggestions.

I'd love to make this a bit cleaner, but am struggling with the how around that.
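
For the exposed-metric idea, something along these lines would probably be enough. This is a hedged sketch using prometheus/client_golang; the metric name and the point where it gets incremented are made up for illustration, not anything the exporter exposes today.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// udpPacketDrops would be incremented wherever the listener decides it has
// (probably) lost packets; the name here is illustrative only.
var udpPacketDrops = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "statsd_exporter_udp_packet_drops_total",
	Help: "Number of UDP packets believed to have been dropped.",
})

func main() {
	prometheus.MustRegister(udpPacketDrops)

	// udpPacketDrops.Inc() // called at the point where a drop is detected

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9102", nil))
}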

@@ -10,6 +10,7 @@ require (
github.com/howeyc/fsnotify v0.0.0-20151003194602-f0c08ee9c607
github.com/kr/pretty v0.1.0 // indirect
github.com/mattn/go-isatty v0.0.4 // indirect
github.com/mdlayher/netlink v0.0.0-20190313131330-258ea9dff42c

SpencerMalone (Contributor, Author):

I went with this package because of its use in https://github.com/prometheus/node_exporter

matthiasr (Contributor) left a comment:

This looks pretty good already, thank you for putting it together!

"don't love the package name, trying to think of an alternative, up for suggestions"

bufwatch? bufferwatcher?

I agree that we need a metric for when it happens.

Would it make sense to restrict the whole buffer watching business to platforms where it will work? Or is this portable?

return buf.Bytes(), nil
}

func convert_addr_to_int32(ip []byte) (ret [4]uint32) {

Contributor:

please use camelCase instead of snake_case

case sem <- struct{}{}:
{
go func() {
l.handlePacket(data[0:n], e)

Contributor:

is it safe to call handlePacket concurrently?

why a semaphore and goroutines per handlePacket call, vs long-lived goroutines that consume off a (buffered) channel? IIUC, with the buffered channel it would be possible to try to write an event into this channel, but if the buffer is full, increment a drop counter and move on?

SpencerMalone (Contributor, Author):

It is safe! At least, I ran this with a -race build with no complaints, and have ~22 deployments of this happily chugging along at a pretty high volume and no related crashes.

Re: semaphore - I think that is a great suggestion; this was originally @AustinSchoen's work, so let me collab with him. The idea of reading UDP packets as fast as we can and dropping them if processing can't keep up, instead of blocking (but keeping track of that drop tally), is elegant; big +1 on heading down that path. That should completely remove my need for procfs metrics.

Contributor:

all credit for that idea goes to @rtreffer!

Reply:

I believe I went with a semaphore because it was the first example of a tunable queue of workers that I could find; there are multiple ways of handling this model. We were still early in the experimentation phase as to what amount and kind of concurrency we needed to handle the firehose of UDP packets coming at our exporters. Regardless, I don't think we'd want to just drop data on the floor every time we lack the capacity to handle it. Those packets should back up into the protocol buffer while waiting to be read by the configured number of handlers.

In situations where a system is overloaded for only a few (milli)seconds, it would be better to block on reads from the protocol buffer and wait to handle the packets than to drop them and log that we lost packets, IMO.

Contributor:

"I don't think we'd want to just drop data on the floor if we don't have the capacity to handle it in every situation"

I agree, and I don't propose that we do that. We can buffer again after the initial receive.

However, at some point we have to drop, and we want to know when that happens. I would rather guarantee that that happens in a place where we can easily observe it (i.e. in user-land Go code) than introduce complex platform-dependent monitoring of the socket buffers themselves.

Contributor:

With "buffer again" I mean stuffing them into a buffered channel; we don't have to get fancy about it.
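
To make the shape of that concrete, here is a minimal sketch of the alternative being discussed: long-lived workers draining a buffered channel, with a counter bumped instead of blocking when the channel is full. Sizes, names, and the counter are illustrative assumptions, not code from this PR.

package main

import (
	"log"
	"sync/atomic"
)

var droppedPackets uint64 // would back a Prometheus counter in the exporter

// startWorkers launches n long-lived goroutines that drain the queue.
func startWorkers(n int, queue <-chan []byte, handle func([]byte)) {
	for i := 0; i < n; i++ {
		go func() {
			for pkt := range queue {
				handle(pkt)
			}
		}()
	}
}

// enqueue never blocks the read loop: if the buffer is full, the packet is
// dropped and counted instead.
func enqueue(queue chan<- []byte, pkt []byte) {
	select {
	case queue <- pkt:
	default:
		atomic.AddUint64(&droppedPackets, 1)
	}
}

func main() {
	queue := make(chan []byte, 1024) // buffer between the UDP read loop and the workers
	startWorkers(8, queue, func(pkt []byte) { _ = pkt })
	enqueue(queue, []byte("deploys.test.myservice:1|c"))
	log.Printf("dropped so far: %d", atomic.LoadUint64(&droppedPackets))
}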


if err != nil {
shouldWatchBuffer = false
log.Debugf("Unable to watch UDP buffer size due to: %v", err)

Contributor:

would this always happen on non-Linux platforms?

SpencerMalone (Contributor, Author):

Would this always log? Yes, it would. I figured a debug log would be a good place to quietly drop that to minimize confusion over why they can't get the buffer tracking to work, but if you think it's unnecessary, I would be happy to more quietly short circuit that.

Contributor:

it's fine as it is, just trying to understand! Maybe add a comment to say how we expect the initialization to behave on non-Linux systems?
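
One way to make the non-Linux behavior explicit would be a build-tag split, roughly like the hypothetical sketch below. The file, package, and function names just follow the bufwatch suggestion above; they are not real files in this PR.

// bufwatch_linux.go
// +build linux

package bufwatch

// Watch relies on Linux-only socket statistics (netlink), so it is only
// compiled on Linux.
func Watch() { /* netlink-based buffer watching goes here */ }

// bufwatch_other.go
// +build !linux

package bufwatch

// Watch is a no-op on non-Linux platforms: the exporter runs exactly as
// before, and at most logs a debug message that buffer watching is disabled.
func Watch() {}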

matthiasr commented Apr 11, 2019 via email

SpencerMalone (Contributor, Author):

I like the implementation in "Add internal event queuing and flushing" a lot more. On Monday I'm going to see whether that plus the metric cache is enough that we don't need to dive into any of this threading nonsense; if so, I'll close this out and work on a follow-up to the internal event queue to add introspection there.

matthiasr (Contributor):

Great, I'll close this PR then, since it already has conflicts. Thank you for the impulse and discussion!
