Refactor scripts to facilitate dynamic monitoring #7
base: main
Conversation
This commit is a pretty big overhaul to the current testnet scripts that will allow us to implement use cases that involve nodes being dynamically added to and removed from the network. It does this by swapping out Prometheus for a combination of InfluxDB and Telegraf.

The "monitor" server now runs an InfluxDB instance that listens for incoming data from Telegraf agents installed on all of the nodes. The Telegraf agents are configured to poll the Tendermint test app's Prometheus endpoint (collecting all Prometheus metrics) as well as system metrics (CPU, memory, disk usage, etc.). Telegraf regularly pushes these metrics to the monitor's InfluxDB server.

The InfluxDB server also provides a convenient web-based UI to explore stored data, with graphical visualization tools similar to what Prometheus provides.

This commit also:
- simplifies the testnet deployment process,
- refactors the Ansible playbooks into roles, making them more reusable across playbooks,
- uses Terraform to reliably generate the Ansible hosts file (and delete it automatically once the infrastructure's been destroyed),
- refactors the Terraform scripts according to a more standardized layout,
- updates the usage instructions in the README.

Signed-off-by: Thane Thomson <connect@thanethomson.com>
I'm excited for a lot of this to land. Many of these fixes are just clear improvements over what's there at the moment. I have basically no objection to any Ansible- and Terraform-based changes as long as the crucial functionality is preserved. That functionality includes:
- Starting many nodes at a particular version
- Updating the version of tendermint running to see if a change provides a fix
- Restarting a node/set of nodes to see if a problem that is observed can be undone if goroutines are restarted and connections are re-established.
- Monitoring all metrics, including host-level metrics such as memory usage, CPU usage, in-use file descriptors, etc.
Those are the main needs at the moment, although more may come up as we make more use of these tools.
To the point of the metrics integration: I have no specific problem with InfluxDB. I haven't worked with it personally, so it would be a bit of a learning curve for me to use. The main reason I would like to suggest we look a bit more for a Prometheus-based solution is that that's what most users are likely to use. As we run and re-run the code in these test networks, using the same tool that consumers of the code will use provides a few benefits.
First, it means that we're familiar with the tools someone asking us for debug help will be using. No translation is necessary between what the user hands us and what we already understand. When someone links us to a Prometheus instance, the queries and such that we use to debug our code will be top of mind.
Second, it means that gaps in debugging and diagnosing issues from Prometheus-based queries on our metrics will be quickly caught by us when using the tool.
While the setup presented here definitely works to solve the problem, I think we may want to continue to consider a Prometheus-based approach. Prometheus has a DigitalOcean service discovery configuration field that appears to allow integration directly with the DO API. Using the __meta_digitalocean_tags field, we should be able to keep only the instances marked with our testnet tags as scrape targets for Prometheus.
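Roughly what I have in mind is something like the following (an untested sketch — the exact auth fields depend on the Prometheus version, and the "testnet" tag and port are assumptions that would need to match whatever our Terraform config actually applies):

```yaml
scrape_configs:
  - job_name: tendermint-testnet
    digitalocean_sd_configs:
      # Assumes a DO API token; newer Prometheus versions take it via an
      # 'authorization' block.
      - authorization:
          credentials: "<DO API token>"
        # Assumed Tendermint Prometheus metrics port.
        port: 26660
    relabel_configs:
      # __meta_digitalocean_tags is a comma-separated list of droplet tags;
      # keep only droplets carrying the (assumed) "testnet" tag.
      - source_labels: [__meta_digitalocean_tags]
        regex: "(.*,)?testnet(,.*)?"
        action: keep
```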
[nodes]
%{ for node in nodes ~}
${node.name} ansible_host=${node.ipv4_address}
Is ansible_host a special variable of some kind that Ansible knows to use as the IP address for connecting to?
Yes. The node name is used here as a kind of logical identifier for the host, and the IP address is given explicitly to Ansible through ansible_host. This allows for easier debugging of the generated hosts file, as the node names match exactly those specified in testnet.toml, allowing easier correlation between logical testnet nodes and DO VMs.
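For example, the rendered hosts file comes out looking something like this (illustrative names and addresses only, not actual output):

```ini
[monitor]
monitor ansible_host=203.0.113.10

[nodes]
validator01 ansible_host=203.0.113.11
validator02 ansible_host=203.0.113.12
```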
# Extract the IP addresses of all of the nodes (excluding the monitoring
# server) from the Ansible hosts file. IP addresses will be in the same order
# as those generated in the docker-compose.yml file, and will be separated by
# commas.
NEW_IPS=`cat ${ANSIBLE_HOSTS} | grep -v 'monitor' | grep 'ansible_host' | awk -F' ansible_host=' '{print $2}' | head -c -1 | tr '\n' ','`
What ensures that the order of these matches the order of the OLD_IPS? The previous iteration of this relied on sorting the hosts by node name to match the OLD_IPS order, but I'm not sure we're doing that here.
Oh, I was under the impression that the Docker compose file's ordering was deterministic - is that not the case? The hosts file output is deterministic as far as I can tell.
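If it turns out the compose file's ordering isn't guaranteed, one option would be to force the same ordering on both sides by sorting on the node name, e.g. something along these lines (untested):

```bash
# Sort by node name (the first field on each line) before extracting the IPs,
# so the result no longer depends on the order the inventory happens to use.
NEW_IPS=$(grep 'ansible_host' "${ANSIBLE_HOSTS}" | grep -v 'monitor' | sort \
  | awk -F' ansible_host=' '{print $2}' | paste -sd ',' -)
```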
@@ -0,0 +1,25 @@
---
# This playbook must be executed as root.
I'm assuming you mean root on the machine you're deploying to?
Yes, I'll clarify that 🙂
set -euo pipefail

if [ -f "{{ tendermint.pid_file }}" ]; then
    kill `cat {{ tendermint.pid_file }}` || true
Does the || true here ensure that the kill line does not return a non-zero code if the process doesn't exist? Otherwise this script will exit if the process is already gone and we'll never clean up the PID file.
Yeah, it's sometimes possible that the process was already killed, and this just makes that call idempotent. Otherwise the failed kill terminates the script immediately, as per the set -euo pipefail line, and we never clean up the PID file.
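In other words, the intent of the whole block is roughly the following (a sketch of the pattern rather than the literal template):

```bash
set -euo pipefail

if [ -f "{{ tendermint.pid_file }}" ]; then
    # kill returns non-zero if the process has already exited; '|| true' stops
    # 'set -e' from aborting the script so we still reach the PID file cleanup.
    kill `cat {{ tendermint.pid_file }}` || true
    rm -f "{{ tendermint.pid_file }}"
fi
```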
ANSIBLE_HOSTS=$1
LOAD_RUNNER_CMD=${LOAD_RUNNER_CMD:-"go run github.com/tendermint/tendermint/test/e2e/runner@51685158fe36869ab600527b852437ca0939d0cc"}
IP_LIST=`cat ${ANSIBLE_HOSTS} | grep -v 'monitor' | grep 'ansible_host' | awk -F' ansible_host=' '{print $2}' | head -c -1 | tr '\n' ','`
I discovered recently that there's an ansible command that outputs the IP addresses of the items in the inventory.
ansible all --list-hosts -i ./ansible/hosts
I'm not certain if it works with the ansible_host key used in the inventory file, but it's a bit more convenient, I've found, than just cat'ing the hosts file.
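If --list-hosts doesn't resolve ansible_host, ansible-inventory should expose it — a rough sketch (assumes jq is available wherever this runs):

```bash
# Dump the inventory as JSON and pull out the ansible_host value for every
# host except the monitor.
ansible-inventory -i ./ansible/hosts --list \
  | jq -r '._meta.hostvars | to_entries[] | select(.key != "monitor") | .value.ansible_host'
```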
Good to know! I may refactor this to use that then 👍
@@ -1,13 +0,0 @@
[Unit]
The node-exporter is a process that runs on each host and queries the /proc filesystem for things like open TCP connections, in-use file descriptors, memory, network, and CPU usage. It exposes these as Prometheus metrics on a /metrics endpoint for scraping. I'm not seeing this duplicated in the new code, but this is very important for checking host-level metrics.
Take a look at the input plugins section of the Telegraf config. Telegraf will automatically monitor all of those things and push those metrics, along with the polled Prometheus metrics from the Tendermint node, to the InfluxDB instance.
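To give an idea of the shape of that config, it's roughly along these lines (abridged sketch, not the exact file in this PR — the Tendermint metrics port and InfluxDB parameters here are placeholders):

```toml
# Host-level metrics (roughly what node-exporter would otherwise provide).
[[inputs.cpu]]
[[inputs.mem]]
[[inputs.disk]]
[[inputs.net]]

# Poll the Tendermint node's Prometheus endpoint (assumed default port).
[[inputs.prometheus]]
  urls = ["http://localhost:26660/metrics"]

# Push everything to the InfluxDB v2 instance on the monitor server.
[[outputs.influxdb_v2]]
  urls = ["http://monitor:8086"]
  token = "$INFLUX_TOKEN"
  organization = "tendermint"
  bucket = "testnet"
```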
Two things here:
1. Which users do you mean here, and do we have evidence for this? Operators such as Cephalopod, for example, use Nagios for operational metric monitoring and are currently building an ELK stack for log monitoring.
2. Personally I don't have a strong preference either way, as I've used neither of these systems before (InfluxDB v2 is basically a totally different product to InfluxDB v1, which I used several years ago). Learning how to construct queries in InfluxDB was about 5 minutes' work. Even Prometheus' own comparison between InfluxDB and Prometheus shows very little meaningful difference between the two products. My main concern here is that the approach you're suggesting is not yet proven to work, and I could end up wasting more time implementing it. I already brought up the option of using InfluxDB nearly 2 weeks ago and, given there was no feedback on the idea, I assumed it would be fine to spend several days' worth of effort implementing it.
A quick follow-up here: I've spent some time investigating options for getting InfluxDB to successfully ingest the Tendermint logs, so that it provides more than just metrics monitoring, and it appears as though Telegraf's just not capable of grokking them without substantial effort. Therefore I'll refactor this to make use of Prometheus for now. A good follow-up would be, in the longer term, to consider spinning up an ELK stack to handle both metric and log monitoring.
Closes #2
See the updated README for a high-level overview of this change. I'd recommend testing this change from this branch before merging it.
The Telegraf setup is supposed to facilitate log collection too (by tailing the Tendermint logs, which are being output in JSON format), but that seems to be a little buggy right now. I'll look into fixing that ASAP.
Additionally, we should eventually be able to reuse most of the infrastructural config here in the follow-ups I'm planning in tendermint/tendermint#8754.