
Question: Any plans to add TCP server support? #276

Closed
cjagus opened this issue Nov 11, 2019 · 6 comments

Comments

@cjagus

cjagus commented Nov 11, 2019

No description provided.

@tiedotguy
Collaborator

There are no immediate plans. The current timeline is:

  • Improving and unifying the HTTP clients
  • Adding clustering for the aggregation layer
  • Adding streaming output (Kafka, Kinesis, etc.)

At that point, it will probably make sense to re-assess the input layer, so we can have streams as an input, not just an output, which will hopefully make a TCP input better.

Is there a specific thing you need TCP for? If it's reliability, could you use the HTTP forwarding with on-host collection?

@cjagus
Author

cjagus commented Nov 13, 2019

So we have local statsd installed on every machine already, and we use the repeater backend for sending metrics to the aggregator. We'll try using local gostatsd and forwarding via HTTP.

Question: if we use gostatsd forwarder > server (aggregation), does it aggregate locally and then forward to the server, or will it send all the metrics without any aggregation, just like the statsd repeater backend?

@tiedotguy
Collaborator

If you use a local forwarder on each host, and a central aggregation server, then the forwarder tier does what I call "consolidation" (as opposed to aggregation, mainly because they are slightly different).

Consolidation roughly means if you send statsd something like ..

http.request:1|c|#host:web1,service:frontend
http.timing:12|ms|#service:frontend
http.request:1|c|#host:web1,service:frontend
http.timing:15|ms|#service:frontend
http.request:1|c|#host:web1,service:frontend
http.timing:11|ms|#service:frontend

.. then the forwarder tier would send it to the aggregation server as ..

http.request:3|c|#host:web1,service:frontend
http.timing:12,15,11|ms|#service:frontend

(It's a bit more complicated than that, because a) it doesn't use statsd format, it has a protobuf format, and b) it needs to handle sample rates for both counters and timers)

.. so the counters are combined into a single value (just like aggregation), and the timers have all their values sent as an array (which is not aggregation, but is much more efficient to process). Gauges are sent as a single value, even if there were multiple, which is also just like aggregation.
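Purely as an illustration (this isn't gostatsd's actual code or wire format, and the metric keys are made up), a rough Go sketch of that consolidation behaviour: counters are summed, timer samples are kept as raw values, and gauges keep only the last value seen.

```go
package main

import "fmt"

// consolidated holds per-metric state between forwarder flushes.
// This is an illustrative sketch, not gostatsd's real data structures.
type consolidated struct {
	counters map[string]int64     // counter key -> summed value
	timers   map[string][]float64 // timer key -> raw samples
	gauges   map[string]float64   // gauge key -> last value seen
}

func newConsolidated() *consolidated {
	return &consolidated{
		counters: map[string]int64{},
		timers:   map[string][]float64{},
		gauges:   map[string]float64{},
	}
}

func (c *consolidated) addCounter(key string, v int64) { c.counters[key] += v }
func (c *consolidated) addTimer(key string, v float64) { c.timers[key] = append(c.timers[key], v) }
func (c *consolidated) setGauge(key string, v float64) { c.gauges[key] = v }

func main() {
	c := newConsolidated()
	// The three counter/timer pairs from the example above.
	for _, v := range []float64{12, 15, 11} {
		c.addCounter("http.request|#host:web1,service:frontend", 1)
		c.addTimer("http.timing|#service:frontend", v)
	}
	// A hypothetical gauge, to show that only the last value survives.
	c.setGauge("queue.depth|#service:frontend", 40)
	c.setGauge("queue.depth|#service:frontend", 42)

	fmt.Println(c.counters) // http.request -> 3
	fmt.Println(c.timers)   // http.timing -> [12 15 11]
	fmt.Println(c.gauges)   // queue.depth -> 42
}
```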

The aggregation tier can then do proper timer aggregation, because it sees data from all the hosts. You do need the flush interval on the forwarder tier to be more frequent - I suggest an order of magnitude faster than the flush interval on the aggregator: 1s vs 10s, or 5s vs 1m, for example.

Hopefully that answers your question, and helps solve the bigger-picture question. Let me know if it doesn't, and I'll get you sorted.

Note that in forwarder mode, there's no backend on the forwarder layer.

@cjagus
Author

cjagus commented Nov 15, 2019

Thanks for the detailed explanation. We started replacing statsd with gostatsd on Tier 3 apps [planning to use the HTTP forwarder next quarter], and for old apps we will be using the local statsd repeater backend > central gostatsd. We had Heka before for central statsd aggregation, and while testing gostatsd against Heka we noticed it starts dropping UDP packets, so we had to increase --max-parsers, --max-workers, --max-readers, etc.:
exec /var/gostatsd/gostatsd \
    --config-path /var/gostatsd/gostatsd.toml \
    --metrics-addr 10.14.94.88:9125 \
    --max-queue-size 60000 \
    --receive-batch-size 1000 \
    --max-readers 16 \
    --max-workers 16 \
    --max-parsers 16 \
    --percent-threshold 90 95 99 \
    --flush-interval 60s \
    --verbose

Another thing I noticed: even though --verbose is enabled, I don't see any verbose logging in stdout.
Do you have any recommendations for running gostatsd as a central statsd server with UDP input?

[screenshot: gostatsd internal stats]

@tiedotguy

@tiedotguy
Collaborator

If it's dropping packets, that's because something isn't processing fast enough, and back-pressure is propagating all the way through the system. I need to do a proper documentation write-up on where the pressure points are and how to monitor them.

I'll try to do a rough outline here (a bunch of it you may already know, since you're setting these flags already, but I'll probably turn this comment into proper tuning documentation at some point), working backwards from the backend through to the network buffers in the kernel. I believe the pressure points are:

  • Backend: if gostatsd spends more than the flush interval sending to the backend (for example, if it takes 11 seconds to write all the data to the backend, and there's a 10 second flush interval), then it will skip the interval, and the next interval will collect twice as much data (i.e., it will have 20 seconds' worth of counters). This doesn't create back pressure, but it can create unexpected data. The metric to watch is statsd.flusher.total_time; it should be under your flush interval.

  • Aggregation: the number of aggregators you have is --max-workers; they process metrics from a channel (queue). All metrics have an affinity for a specific aggregator. For example, http.request might go to aggregator 1 and http.timing might go to aggregator 2. This is so that each aggregator can see all the raw data for its space. If an aggregator can't keep up, the channel fills up, and back pressure is created towards the parser (see the sketch after this list). There is also a short period during flush where the aggregator stops reading the queue while it processes the data to be flushed. During this period, the pressure on the channel can spike. Metrics to watch are statsd.channel.max, with channel:dispatch_aggregator, for each aggregator_id, and statsd.aggregator.process_time, for each aggregator_id. The channel size is --max-queue-size. Increasing this can appear to help things (specifically it can help absorb the spike during flush), however that's generally the extent of the benefit. If an aggregator's max (or avg) is hitting that value, that's usually a sign of a hot aggregator (more on that below).

  • Parser: the parser is purely CPU bound, and is responsible for parsing incoming datagrams and sending metrics to the various channels. When an aggregator's channel is blocked, the entire parser will stall until it unblocks. If all parsers stall, that creates back pressure on the receivers. The only metric around this is the statsd.channel.max mentioned above. The number of parsers is controlled by --max-parsers.

  • Receiver: pulls data from the network in batches, and sends it to the parsers. This is network-bound work. If the parsers are stalled, then the receivers will also stall, which creates pressure in the network stack. There are no internal metrics to indicate when this is occurring. Receivers have three things that can help with throughput: --conn-per-reader, which causes each receiver to have its own network socket (if the OS supports it); the number of readers (--max-readers); and the number of datagrams they will attempt to read per batch (--receive-batch-size). statsd.receiver.avg_datagrams_in_batch can help determine if you're reading as fast as you can.

  • Kernel: this is where packet loss ultimately manifests (or on the network equipment). Monitoring that is something you're obviously doing :)
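To make the back-pressure mechanics concrete, here's a self-contained Go sketch (not gostatsd code; the channel capacity just stands in for --max-queue-size) showing how a slow consumer on a bounded channel stalls the sender, which is how a blocked aggregator channel stalls the parsers:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// A small bounded queue standing in for the dispatch_aggregator channel.
	// Its capacity plays the role of --max-queue-size.
	queue := make(chan string, 3)

	// Slow "aggregator": drains one item per 100ms.
	go func() {
		for m := range queue {
			time.Sleep(100 * time.Millisecond)
			_ = m
		}
	}()

	// Fast "parser": tries to enqueue 10 metrics as quickly as possible.
	for i := 0; i < 10; i++ {
		start := time.Now()
		queue <- fmt.Sprintf("metric-%d", i)
		// Once the buffer is full, each send blocks until the aggregator
		// drains an item - that blocked time is the back pressure.
		fmt.Printf("enqueued metric-%d after %v\n", i, time.Since(start).Round(time.Millisecond))
	}
	close(queue)
}
```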

So having said all of that, there are generally two sources of packet loss:

  • the receivers can't keep up; --conn-per-reader will usually help with this, as will increasing the number of readers or the receive batch size
  • a hot aggregator is putting pressure through the entire system, resulting in idle receivers. A hot aggregator typically occurs when there's a small subset of metrics overwhelming it. Metrics have their name and host tag hashed, and the hash is used to determine which aggregator they go to (see the sketch below). As a result, if you have something like http.request tagged by host, and http.timing that's not tagged by host, and you have a very high volume of both, you'll find that http.request is distributed evenly across all aggregators (if you have enough hosts), but http.timing all goes to a single aggregator, which overwhelms it. It's also possible that the hash function happens to interact badly with your metrics generally. In this case (which is hard to identify), changing the number of aggregators can result in a different distribution.
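Here's a rough sketch of the affinity idea (the actual hash function and key layout in gostatsd may differ; the metric names are just the ones from the example above): the metric name plus its host tag is hashed, and the hash modulo the number of aggregators picks the worker, so a high-volume metric without a host tag always lands on the same aggregator.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickAggregator is an illustrative stand-in for the affinity logic:
// hash the metric name plus host tag, then mod by the number of aggregators.
func pickAggregator(name, host string, numAggregators int) int {
	h := fnv.New32a()
	h.Write([]byte(name + "|host:" + host))
	return int(h.Sum32() % uint32(numAggregators))
}

func main() {
	const aggregators = 16

	// http.request is tagged by host, so it spreads across aggregators.
	for _, host := range []string{"web1", "web2", "web3", "web4"} {
		fmt.Printf("http.request host=%s -> aggregator %d\n",
			host, pickAggregator("http.request", host, aggregators))
	}

	// http.timing has no host tag, so every sample hashes to the same
	// aggregator, which is how a single hot aggregator can appear.
	fmt.Printf("http.timing (no host) -> aggregator %d\n",
		pickAggregator("http.timing", "", aggregators))
}
```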

When it's a hot aggregator, and you increase readers or parsers (or the aggregator channel size), what you're really doing is putting a little bit more buffer space into the system. Each reader and parser might be holding on to 10 datagrams, so by increasing the number of them, you're not actually improving throughput, you're just adding a bit of space to absorb the spike I mentioned previously. If you're so close to the edge that it helps, then it probably won't help for much longer.

As a general rule, gostatsd will scale linearly with the number of cores it has, until it hits a hot aggregator, at which point CPU will flatline. This is best measured by watching the CPU (spare CPU with packet loss means a hot aggregator; no spare CPU with packet loss means you need more cores).

This equation changes somewhat with HTTP forwarding. The collection will scale linearly, because the aggregators are purely for the backend. Because of the consolidation done on the collection hosts, the aggregator host has less work. It still has the same bottlenecks, but there's less actual work done. There's also a more efficient data structure used. For future work, #210 should remove most (all?) of the required aggregator affinity, allowing the system to scale linearly until the CPU or network saturates.

As for --verbose, that enables log.Debug which is only used in two places I think - the AWS cloud provider, and HTTP request logging for the forwarder. As a general rule, I've tried to build the system so that it can be monitored via its own internal metrics, rather than logs. When a host is processing a million metrics per second, I don't want to even pay the cost of asking "should I log this?" :)

Hope that's useful - as I mentioned, I'll probably reformat this comment into proper documentation at some point, so I apologize for any redundancy.

@tiedotguy
Collaborator

Closing this off for now, but feel free to re-open or open a new issue if you have further questions.
