Refactor scripts to facilitate dynamic monitoring #7
base: main
Conversation
This commit is a pretty big overhaul to the current testnet scripts that will allow us to implement use cases that involve nodes being dynamically added to and removed from the network. It does this by swapping out Prometheus for a combination of InfluxDB and Telegraf.

The "monitor" server now runs an InfluxDB instance that listens for incoming data from Telegraf agents installed on all of the nodes. The Telegraf agents are configured to poll the Tendermint test app's Prometheus endpoint (collecting all Prometheus metrics) as well as system metrics (CPU, memory, disk usage, etc.). Telegraf regularly pushes these metrics to the monitor's InfluxDB server.

The InfluxDB server also provides a convenient web-based UI to explore stored data, with graphical visualization tools similar to what Prometheus provides.

This commit also:
- simplifies the testnet deployment process,
- refactors the Ansible playbooks into roles, making them more reusable across playbooks,
- uses Terraform to reliably generate the Ansible hosts file (and delete it automatically once the infrastructure's been destroyed),
- refactors the Terraform scripts according to a more standardized layout,
- updates the usage instructions in the README.

Signed-off-by: Thane Thomson <connect@thanethomson.com>
I'm excited for a lot of this to land. Many of these fixes are just clear improvements over what's there at the moment. I have basically no objection to any Ansible- and Terraform-based changes as long as the crucial functionality is preserved. That functionality includes:
- Starting many nodes at a particular version
- Updating the version of tendermint running to see if a change provides a fix
- Restarting a node/set of nodes to see if a problem that is observed can be undone if goroutines are restarted and connections are re-established.
- Monitoring all metrics, including host-level metrics such as memory usage, CPU usage, in-use file descriptors, etc.
Those are the main needs at the moment, although more may come up as we make more use of these tools.
To the point of the metrics integration: I have no specific problem with InfluxDB. I haven't worked with it personally, so it would be a bit of a learning curve for me to use. The main reason I would like to suggest we look a bit more for a Prometheus-based solution is that that's what most users are likely to use. As we run and re-run the code in these test networks, using the same tool that consumers of the code will use provides a few benefits.
First, it means that we're familiar with the tools someone asking us for debug help will be using. No translation is necessary between what the user hands us and what we already understand. When someone links us to a Prometheus instance, the queries and such that we use to debug our code will be top of mind.
Second, it means that gaps in debugging and diagnosing issues from Prometheus-based queries on our metrics will be quickly caught by us when using the tool.
While the setup presented here definitely works to solve the problem, I think we may want to continue to consider a Prometheus-based approach. Prometheus has a DigitalOcean service discovery configuration field that appears to allow integration directly with the DO API. Using the __meta_digitalocean_tags field, we should be able to keep only the instances marked with our testnet tags as scrape targets for Prometheus.
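Roughly what I have in mind is something like the following (an untested sketch — the exact auth fields depend on the Prometheus version, and the "testnet" tag and port are assumptions that would need to match whatever our Terraform config actually applies):

```yaml
scrape_configs:
  - job_name: tendermint-testnet
    digitalocean_sd_configs:
      # Assumes a DO API token; newer Prometheus versions take it via an
      # 'authorization' block.
      - authorization:
          credentials: "<DO API token>"
        # Assumed Tendermint Prometheus metrics port.
        port: 26660
    relabel_configs:
      # __meta_digitalocean_tags is a comma-separated list of droplet tags;
      # keep only droplets carrying the (assumed) "testnet" tag.
      - source_labels: [__meta_digitalocean_tags]
        regex: "(.*,)?testnet(,.*)?"
        action: keep
```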
[nodes]
%{ for node in nodes ~}
${node.name} ansible_host=${node.ipv4_address}
Is ansible_host a special variable of some kind that Ansible knows to use as the IP address for connecting to?
Yes. The node name is used here as a kind of logical identifier for the host, and the IP address is given explicitly to Ansible through ansible_host. This allows for easier debugging of the generated hosts file, as the node names match exactly those specified in testnet.toml, allowing easier correlation between logical testnet nodes and DO VMs.
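For example, the rendered hosts file comes out looking something like this (illustrative names and addresses only, not actual output):

```ini
[monitor]
monitor ansible_host=203.0.113.10

[nodes]
validator01 ansible_host=203.0.113.11
validator02 ansible_host=203.0.113.12
```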
# Extract the IP addresses of all of the nodes (excluding the monitoring
# server) from the Ansible hosts file. IP addresses will be in the same order
# as those generated in the docker-compose.yml file, and will be separated by
# commas.
NEW_IPS=`cat ${ANSIBLE_HOSTS} | grep -v 'monitor' | grep 'ansible_host' | awk -F' ansible_host=' '{print $2}' | head -c -1 | tr '\n' ','`
What ensures that the order of these matches the order of the OLD_IPS? The previous iteration of this relied on sorting the hosts by node name to match the OLD_IPS order, but I'm not sure we're doing that here.
Oh, I was under the impression that the Docker compose file's ordering was deterministic - is that not the case? The hosts file output is deterministic as far as I can tell.
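If it turns out the compose file's ordering isn't guaranteed, one option would be to force the same ordering on both sides by sorting on the node name, e.g. something along these lines (untested):

```bash
# Sort by node name (the first field on each line) before extracting the IPs,
# so the result no longer depends on the order the inventory happens to use.
NEW_IPS=$(grep 'ansible_host' "${ANSIBLE_HOSTS}" | grep -v 'monitor' | sort \
  | awk -F' ansible_host=' '{print $2}' | paste -sd ',' -)
```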
@@ -0,0 +1,25 @@
---
# This playbook must be executed as root.
I'm assuming you mean root on the machine you're deploying to?
Yes, I'll clarify that 🙂
set -euo pipefail

if [ -f "{{ tendermint.pid_file }}" ]; then
    kill `cat {{ tendermint.pid_file }}` || true
Does the || true here ensure that the kill line does not return a non-zero code if the process doesn't exist? Otherwise this script will exit if the process is already gone and we'll never clean up the PID file.
Yeah, it's sometimes possible that the process was already killed, and this just makes that call idempotent. Otherwise the failed kill terminates the script immediately, as per the set -euo pipefail line, and we never clean up the PID file.
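In other words, the intent of the whole block is roughly the following (a sketch of the pattern rather than the literal template):

```bash
set -euo pipefail

if [ -f "{{ tendermint.pid_file }}" ]; then
    # kill returns non-zero if the process has already exited; '|| true' stops
    # 'set -e' from aborting the script so we still reach the PID file cleanup.
    kill `cat {{ tendermint.pid_file }}` || true
    rm -f "{{ tendermint.pid_file }}"
fi
```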
ANSIBLE_HOSTS=$1
LOAD_RUNNER_CMD=${LOAD_RUNNER_CMD:-"go run github.com/tendermint/tendermint/test/e2e/runner@51685158fe36869ab600527b852437ca0939d0cc"}
IP_LIST=`cat ${ANSIBLE_HOSTS} | grep -v 'monitor' | grep 'ansible_host' | awk -F' ansible_host=' '{print $2}' | head -c -1 | tr '\n' ','`
I discovered recently that there's an ansible command that outputs the IP addresses of the items in the inventory.
ansible all --list-hosts -i ./ansible/hosts
I'm not certain if it works with the ansible_host key used in the inventory file, but it's a bit more convenient, I've found, than just cat'ing the hosts file.
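If --list-hosts doesn't resolve ansible_host, ansible-inventory should expose it — a rough sketch (assumes jq is available wherever this runs):

```bash
# Dump the inventory as JSON and pull out the ansible_host value for every
# host except the monitor.
ansible-inventory -i ./ansible/hosts --list \
  | jq -r '._meta.hostvars | to_entries[] | select(.key != "monitor") | .value.ansible_host'
```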
Good to know! I may refactor this to use that then 👍
@@ -1,13 +0,0 @@
[Unit]
The node-exporter is a process that runs on each host and queries the /proc filesystem for things like open TCP connections, in-use file descriptors, memory, network, and CPU usage. It exposes these as Prometheus metrics on a /metrics endpoint for scraping. I'm not seeing this duplicated in the new code, but this is very important for checking host-level metrics.
Take a look at the input plugins section of the Telegraf config. Telegraf will automatically monitor all of those things and push those metrics, along with the polled Prometheus metrics from the Tendermint node, to the InfluxDB instance.
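To give an idea of the shape of that config, it's roughly along these lines (abridged sketch, not the exact file in this PR — the Tendermint metrics port and InfluxDB parameters here are placeholders):

```toml
# Host-level metrics (roughly what node-exporter would otherwise provide).
[[inputs.cpu]]
[[inputs.mem]]
[[inputs.disk]]
[[inputs.net]]

# Poll the Tendermint node's Prometheus endpoint (assumed default port).
[[inputs.prometheus]]
  urls = ["http://localhost:26660/metrics"]

# Push everything to the InfluxDB v2 instance on the monitor server.
[[outputs.influxdb_v2]]
  urls = ["http://monitor:8086"]
  token = "$INFLUX_TOKEN"
  organization = "tendermint"
  bucket = "testnet"
```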
Two things here:
1. Which users do you mean here, and do we have evidence for this? Operators such as Cephalopod, for example, use Nagios for operational metric monitoring and are currently building an ELK stack for log monitoring.
2. Personally I don't have a strong preference either way, as I've used neither of these systems before (InfluxDB v2 is basically a totally different product to InfluxDB v1, which I used several years ago). Learning how to construct queries in InfluxDB was about 5 minutes' work. Even Prometheus' own comparison between InfluxDB and Prometheus shows very little meaningful difference between the two products. My main concern here is that the approach you're suggesting is not yet proven to work, and I could end up wasting more time implementing it. I already brought up the option of using InfluxDB nearly 2 weeks ago and, given there was no feedback on the idea, I assumed it would be fine to spend several days' worth of effort implementing it.
A quick follow-up here: I've spent some time investigating options for getting InfluxDB to successfully ingest the Tendermint logs, so that it provides more than just metrics monitoring, and it appears as though Telegraf's just not capable of grokking them without substantial effort. Therefore I'll refactor this to make use of Prometheus for now. A good follow-up would be, in the longer term, to consider spinning up an ELK stack to handle both metric and log monitoring.
Closes #2
See the updated README for a high-level overview of this change. I'd recommend testing this change from this branch before merging it.
The Telegraf setup is supposed to facilitate log collection too (by tailing the Tendermint logs, which are being output in JSON format), but that seems to be a little buggy right now. I'll look into fixing that ASAP.
Additionally, we should eventually be able to reuse most of the infrastructural config here in the follow-ups I'm planning in tendermint/tendermint#8754.