WIP: First iteration of a prometheus exporter for ara #483

Draft: wants to merge 9 commits into master

dmsimard
Contributor

As discussed on the issue for this topic: #177

It's not finished and still very much a WIP but I figured it might be worthwhile to iterate under a branch in a PR instead of the gist: https://gist.github.com/dmsimard/68c149eea34dbff325c9e4e9c39980a0

If prometheus_client is installed, there will be an ara prometheus command to expose prometheus metrics gathered and parsed from an ara instance:

usage: ara prometheus [-h] [--client <client>] [--server <url>] [--timeout <seconds>] [--username <username>] [--password <password>] [--ssl-cert <path/to/certificate>] [--ssl-key <path/to/key>] [--ssl-ca <path/to/cacert>] [--insecure]
                      [--playbook-limit PLAYBOOK_LIMIT] [--task-limit TASK_LIMIT] [--host-limit HOST_LIMIT] [--poll-frequency POLL_FREQUENCY] [--prometheus-port PROMETHEUS_PORT]

Exposes a prometheus exporter to provide metrics from an instance of ara

options:
  -h, --help            show this help message and exit
  --client <client>
                        API client to use, defaults to ARA_API_CLIENT or 'offline'
  --server <url>
                        API server endpoint if using http client, defaults to ARA_API_SERVER or 'http://127.0.0.1:8000'
  --timeout <seconds>
                        Timeout for requests to API server, defaults to ARA_API_TIMEOUT or 30
  --username <username>
                        API server username for authentication, defaults to ARA_API_USERNAME or None
  --password <password>
                        API server password for authentication, defaults to ARA_API_PASSWORD or None
  --ssl-cert <path/to/certificate>
                        If a client certificate is required, the path to the certificate to use, defaults to ARA_API_CERT or None
  --ssl-key <path/to/key>
                        If a client certificate is required, the path to the private key to use, defaults to ARA_API_KEY or None
  --ssl-ca <path/to/cacert>
                        Path to a certificate authority for trusting the API server certificate, defaults to ARA_API_CA or None
  --insecure            Ignore SSL certificate validation, defaults to ARA_API_INSECURE or False
  --playbook-limit PLAYBOOK_LIMIT
                        Max number of playbooks to request at once (default: 1000)
  --task-limit TASK_LIMIT
                        Max number of tasks to request at once (default: 2500)
  --host-limit HOST_LIMIT
                        Max number of hosts to request at once (default: 2500)
  --poll-frequency POLL_FREQUENCY
                        Seconds to wait until querying ara for new metrics (default: 60)
  --prometheus-port PROMETHEUS_PORT
                        Port on which the prometheus exporter will listen (default: 8001)
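
For illustration, the exporter-specific options in the help output above could be declared roughly as follows. This is a minimal sketch using plain argparse, not the code from this PR: the flags, defaults and help strings come from the help text, while `build_parser` is a hypothetical name and the connection options (`--client`, `--server` and so on) are left out.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper declaring only the exporter-specific options."""
    parser = argparse.ArgumentParser(
        prog="ara prometheus",
        description="Exposes a prometheus exporter to provide metrics from an instance of ara",
    )
    # (flag, default, help text) taken from the help output above.
    options = [
        ("--playbook-limit", 1000, "Max number of playbooks to request at once"),
        ("--task-limit", 2500, "Max number of tasks to request at once"),
        ("--host-limit", 2500, "Max number of hosts to request at once"),
        ("--poll-frequency", 60, "Seconds to wait until querying ara for new metrics"),
        ("--prometheus-port", 8001, "Port on which the prometheus exporter will listen"),
    ]
    for flag, default, text in options:
        parser.add_argument(flag, type=int, default=default, help=f"{text} (default: %(default)s)")
    return parser


# Override one option and keep the defaults for the rest.
args = build_parser().parse_args(["--poll-frequency", "30"])
print(args.playbook_limit, args.poll_frequency, args.prometheus_port)
```

Declaring the options with `type=int` means a bad value such as `--prometheus-port abc` is rejected at parse time rather than when the exporter starts listening.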

Heavily a work in progress and learning experience over which we will
iterate a number of times.

The intent is to make a prometheus exporter gather metrics from an ara
instance and expose them so that prometheus can scrape them.
- Added support for querying results through pagination
- Added support for paginating through pages of results
- Query everything at boot via result limit (e.g., ?limit=1000) and pagination
- Store the latest object timestamp such that next scrape will only pick up
   objects created after that using ?created_after=<timestamp>
- Move it under our existing ara CLI so it can re-use all the
  boilerplate about instantiating an API client with all the settings
- Add args for limits, poll frequency and port for the exporter to
  listen on
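
The backfill-then-incremental approach described in the list above can be sketched as follows. This is not the exporter's code: `iter_results`, `newest_created` and `fake_get` are hypothetical names, the `offset` parameter and the `next`/`results` response keys are assumed to follow limit/offset pagination, and timestamps are assumed to be ISO 8601 strings in a single format so that they sort as plain strings.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional


def iter_results(get: Callable[..., Dict[str, Any]], endpoint: str, limit: int,
                 created_after: Optional[str] = None) -> Iterator[Dict[str, Any]]:
    """Yield every object from a paginated endpoint, one page at a time."""
    offset = 0
    while True:
        params: Dict[str, Any] = {"limit": limit, "offset": offset}
        if created_after is not None:
            params["created_after"] = created_after
        page = get(endpoint, **params)
        yield from page["results"]
        if not page.get("next"):
            break
        offset += limit


def newest_created(objects: List[Dict[str, Any]], previous: Optional[str] = None) -> Optional[str]:
    """Return the most recent 'created' timestamp seen so far."""
    stamps = [obj["created"] for obj in objects]
    if previous is not None:
        stamps.append(previous)
    return max(stamps) if stamps else None


# Stand-in for an API client so the sketch runs without a server.
DATA = [{"id": i, "created": f"2023-06-08T02:0{i}:00Z"} for i in range(1, 6)]


def fake_get(endpoint: str, limit: int, offset: int = 0,
             created_after: Optional[str] = None) -> Dict[str, Any]:
    rows = [r for r in DATA if created_after is None or r["created"] > created_after]
    has_next = offset + limit < len(rows)
    return {"count": len(rows), "next": "next-page" if has_next else None,
            "results": rows[offset:offset + limit]}


# First pass: backfill everything, two objects per request.
first_pass = list(iter_results(fake_get, "/api/v1/playbooks", limit=2))
checkpoint = newest_created(first_pass)

# A new playbook is recorded; the next poll only picks up what is newer.
DATA.append({"id": 6, "created": "2023-06-08T02:06:00Z"})
second_pass = list(iter_results(fake_get, "/api/v1/playbooks", limit=2, created_after=checkpoint))
print([o["id"] for o in first_pass], [o["id"] for o in second_pass])
```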
@softwarefactory-project-zuul

Build failed.
https://ansible.softwarefactory-project.io/zuul/buildset/f9d8f487b49d447d8f37dc2007613d34

✔️ ara-tox-py3 SUCCESS in 4m 09s
ara-tox-linters FAILURE in 3m 32s
✔️ ara-basic-ansible-core-devel SUCCESS in 5m 33s (non-voting)
✔️ ara-basic-ansible-6 SUCCESS in 5m 09s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 5m 35s
✔️ ara-basic-ansible-core-2.13 SUCCESS in 5m 03s
✔️ ara-basic-ansible-core-2.12 SUCCESS in 5m 04s
✔️ ara-basic-ansible-core-2.11 SUCCESS in 5m 20s
✔️ ara-basic-ansible-2.9 SUCCESS in 5m 08s
✔️ ara-container-images SUCCESS in 11m 19s

- Added --max-days to limit backfill at boot
- Added a bit of verbosity
- Adjust hosts to be scanned before tasks (there are way, way more tasks
  than hosts in terms of volume)
- First try at a playbook histogram containing the timestamp and
  duration
@dmsimard
Contributor Author

dmsimard commented Feb 24, 2023

I've added a bit more context in the issue (#177 (comment)) and got two quick iterations in:

  • Added --max-days to limit backfill at boot
  • Added a bit of verbosity
  • Adjust hosts to be scanned before tasks (there are way, way more tasks than hosts in terms of volume)
  • First try at a playbook histogram containing the timestamp and duration
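
As a rough illustration of how a `--max-days` value could translate into the timestamp used for the initial backfill query, here is a small sketch. The function name `backfill_cutoff` is hypothetical and the exact timestamp format expected by the API is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backfill_cutoff(max_days: int, now: Optional[datetime] = None) -> str:
    """Return an ISO 8601 timestamp max_days before now (UTC by default)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - timedelta(days=max_days)).isoformat()


# Objects created before this timestamp would be skipped at boot.
print(backfill_cutoff(90, now=datetime(2023, 6, 20, tzinfo=timezone.utc)))
```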

Edit: I've put up an example /metrics response from a single playbook's metric as a histogram in the gist: https://gist.github.com/dmsimard/68c149eea34dbff325c9e4e9c39980a0#file-playbooks_as_histogram-txt

Prometheus groups metrics based on the uniqueness of their labels. I suppose in our case we want each playbook to be represented individually, so we should include their id? More on that later.
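
To make the grouping behaviour concrete: a time series is identified by its metric name plus its full set of label values, so two samples with identical labels land in the same series. The sketch below models that with a plain dictionary. It is an illustration only, not prometheus_client, and the metric name and label names are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]


def record(series: Dict[SeriesKey, float], name: str, labels: Dict[str, str], value: float) -> None:
    """Store a sample; an identical name and label set reuses the same series."""
    series[(name, frozenset(labels.items()))] = value


# Two runs of the same playbook, labelled only by path and status.
without_id: Dict[SeriesKey, float] = {}
record(without_id, "ara_playbooks_duration", {"path": "smoke.yaml", "status": "completed"}, 12.5)
record(without_id, "ara_playbooks_duration", {"path": "smoke.yaml", "status": "completed"}, 14.0)

# The same two runs, with the playbook id included as a label.
with_id: Dict[SeriesKey, float] = {}
record(with_id, "ara_playbooks_duration", {"id": "29", "path": "smoke.yaml", "status": "completed"}, 12.5)
record(with_id, "ara_playbooks_duration", {"id": "30", "path": "smoke.yaml", "status": "completed"}, 14.0)

print(len(without_id), len(with_id))
```

Including the id keeps each playbook distinct, at the cost of one series per playbook, which is worth keeping in mind for instances that record many playbooks.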

@softwarefactory-project-zuul

Build failed.
https://ansible.softwarefactory-project.io/zuul/buildset/d069974d12c14515aded43c6df617003

✔️ ara-tox-py3 SUCCESS in 3m 24s
ara-tox-linters FAILURE in 3m 15s
✔️ ara-basic-ansible-core-devel SUCCESS in 5m 50s (non-voting)
✔️ ara-basic-ansible-6 SUCCESS in 5m 09s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 5m 26s
✔️ ara-basic-ansible-core-2.13 SUCCESS in 5m 15s
✔️ ara-basic-ansible-core-2.12 SUCCESS in 5m 16s
✔️ ara-basic-ansible-core-2.11 SUCCESS in 6m 29s
✔️ ara-basic-ansible-2.9 SUCCESS in 5m 28s
✔️ ara-container-images SUCCESS in 11m 56s

Still heavily a work in progress but getting a better understanding of
how things work.

Hosts and tasks now have gauges by status.
Disable playbook metrics temporarily until we revisit it with newfound
knowledge.
@dmsimard
Contributor Author

I think my brain is starting to understand what is happening.

I've temporarily commented out the current iteration of the playbook metrics until I revisit it with newfound knowledge.

This latest iteration re-works the host and tasks metrics to have gauges per status such that we are able to do graphs like this, for example:

[Screenshot: Prometheus task results in Grafana, 2023-06-18 19-53-59]

[Screenshot: Prometheus host results in Grafana, 2023-06-18 19-54-20]

A snippet of what this looks like when querying the prometheus exporter:

# HELP ara_tasks_total Number of tasks recorded by ara in prometheus
# TYPE ara_tasks_total gauge
ara_tasks_total 403.0
# HELP ara_tasks_range Limit metric collection to the N most recent tasks
# TYPE ara_tasks_range gauge
ara_tasks_range 2500.0
# HELP ara_tasks_completed Completed Ansible tasks
# TYPE ara_tasks_completed gauge
ara_tasks_completed{action="command",duration="00:00:00.294820",name="Echo the �abc binary string",path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/smoke.yaml",playbook="30",results="1",status="completed",updated="2023-06-08T02:43:29.665787Z"} 1.0
ara_tasks_completed{action="debug",duration="00:00:00.155210",name="Task with non-ascii characters - ä, ö, ü",path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/smoke.yaml",playbook="30",results="1",status="completed",updated="2023-06-08T02:43:29.317583Z"} 1.0
ara_tasks_completed{action="gather_facts",duration="00:00:01.035601",name="Gathering Facts",path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/smoke.yaml",playbook="30",results="1",status="completed",updated="2023-06-08T02:43:29.098823Z"} 1.0
# HELP ara_tasks_failed Failed Ansible tasks
# TYPE ara_tasks_failed gauge
ara_tasks_failed{action="command",duration="00:00:00.455411",name="smoke-tests : Return false",path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/roles/smoke-tests/tasks/test-ops.yaml",playbook="30",results="1",status="failed",updated="2023-06-08T02:43:25.190901Z"} 1.0
ara_tasks_failed{action="fail",duration="00:00:00.210469",name="fail",path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/failed.yaml",playbook="29",results="1",status="failed",updated="2023-06-08T02:43:07.648379Z"} 1.0
ara_tasks_failed{action="fail",duration="00:00:00.219566",name="Generate a failure that will be rescued",path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/lookups.yaml",playbook="26",results="1",status="failed",updated="2023-06-08T02:32:51.180755Z"} 1.0
# ...

# HELP ara_hosts_total Hosts recorded by ara
# TYPE ara_hosts_total gauge
ara_hosts_total 43.0
# HELP ara_hosts_range Limit metric collection to the N most recent hosts
# TYPE ara_hosts_range gauge
ara_hosts_range 2500.0
# HELP ara_hosts_changed Number of changes on a host
# TYPE ara_hosts_changed gauge
ara_hosts_changed{name="localhost",playbook="30",updated="2023-06-08T02:43:29.848077Z"} 10.0
ara_hosts_changed{name="localhost",playbook="28",updated="2023-06-08T02:33:20.625359Z"} 1.0
ara_hosts_changed{name="localhost",playbook="26",updated="2023-06-08T02:32:54.179356Z"} 1.0
# HELP ara_hosts_failed Number of failures on a host
# TYPE ara_hosts_failed gauge
ara_hosts_failed{name="localhost",playbook="29",updated="2023-06-08T02:43:07.767992Z"} 1.0
ara_hosts_failed{name="localhost",playbook="24",updated="2023-06-08T02:32:18.773096Z"} 1.0
ara_hosts_failed{name="localhost",playbook="23",updated="2023-06-08T02:04:04.810142Z"} 1.0
# ...
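
When iterating on the exporter, a quick way to sanity-check output like the snippet above is to count samples per metric name. The following is a rough sketch rather than a full parser of the exposition format: `count_samples` is a hypothetical helper, and the sample text is a shortened copy of the lines above.

```python
from collections import Counter


def count_samples(text: str) -> Counter:
    """Count sample lines per metric name, skipping comments and blank lines."""
    counts: Counter = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


SAMPLE = """\
# HELP ara_hosts_total Hosts recorded by ara
# TYPE ara_hosts_total gauge
ara_hosts_total 43.0
# HELP ara_hosts_changed Number of changes on a host
# TYPE ara_hosts_changed gauge
ara_hosts_changed{name="localhost",playbook="30",updated="2023-06-08T02:43:29.848077Z"} 10.0
ara_hosts_changed{name="localhost",playbook="28",updated="2023-06-08T02:33:20.625359Z"} 1.0
"""

for name, count in sorted(count_samples(SAMPLE).items()):
    print(name, count)
```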

@softwarefactory-project-zuul

Build failed.
https://ansible.softwarefactory-project.io/zuul/buildset/75ed0374bc6e4344af27503fe6350e60

✔️ ara-tox-py3 SUCCESS in 9m 57s
ara-tox-linters FAILURE in 9m 48s
✔️ ara-basic-ansible-core-devel SUCCESS in 4m 59s (non-voting)
✔️ ara-basic-ansible-6 SUCCESS in 6m 11s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 6m 01s
✔️ ara-basic-ansible-core-2.13 SUCCESS in 10m 57s
✔️ ara-basic-ansible-core-2.12 SUCCESS in 10m 38s
✔️ ara-basic-ansible-core-2.11 SUCCESS in 10m 51s
✔️ ara-basic-ansible-2.9 SUCCESS in 10m 50s
✔️ ara-container-images SUCCESS in 17m 13s

- Add a summary metric for tracking the duration of tasks.

This is what was intended when trying to do the playbook histogram so
we'll come back to that later.
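
For readers less familiar with summaries: a Prometheus summary keeps a running count and sum of observations, exposed as `<name>_count` and `<name>_sum`, from which an average duration can be derived. The sketch below mimics that bookkeeping in plain Python. It is an illustration, not prometheus_client and not the code in this PR; the class name, metric name and labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

LabelKey = Tuple[Tuple[str, str], ...]


class DurationSummary:
    """Tracks the count and sum of observed durations per label set."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._count: Dict[LabelKey, int] = defaultdict(int)
        self._sum: Dict[LabelKey, float] = defaultdict(float)

    def observe(self, labels: Dict[str, str], seconds: float) -> None:
        key = tuple(sorted(labels.items()))
        self._count[key] += 1
        self._sum[key] += seconds

    def expose(self) -> List[str]:
        """Render the samples in a text form similar to a /metrics response."""
        lines = []
        for key in sorted(self._count):
            labels = ",".join(f'{k}="{v}"' for k, v in key)
            lines.append(f"{self.name}_count{{{labels}}} {float(self._count[key])}")
            lines.append(f"{self.name}_sum{{{labels}}} {self._sum[key]}")
        return lines


tasks = DurationSummary("ara_tasks")
tasks.observe({"action": "command"}, 0.25)
tasks.observe({"action": "command"}, 0.5)
tasks.observe({"action": "debug"}, 0.125)
print("\n".join(tasks.expose()))
```

Dividing the sum by the count gives the average task duration per label set, which a single summary provides without needing one gauge per task.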
@softwarefactory-project-zuul

Build failed.
https://ansible.softwarefactory-project.io/zuul/buildset/4c6c9dea87f14d93aa1ec28b71ebc083

✔️ ara-tox-py3 SUCCESS in 4m 14s
ara-tox-linters FAILURE in 3m 12s
✔️ ara-basic-ansible-core-devel SUCCESS in 6m 20s (non-voting)
✔️ ara-basic-ansible-6 SUCCESS in 7m 07s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 8m 02s
✔️ ara-basic-ansible-core-2.13 SUCCESS in 6m 20s
✔️ ara-basic-ansible-core-2.12 SUCCESS in 5m 32s
✔️ ara-basic-ansible-core-2.11 SUCCESS in 6m 17s
✔️ ara-basic-ansible-2.9 SUCCESS in 5m 40s
✔️ ara-container-images SUCCESS in 11m 13s

@softwarefactory-project-zuul

Build succeeded.
https://ansible.softwarefactory-project.io/zuul/buildset/59731f5a132942749960db45ae05a18a

✔️ ara-tox-py3 SUCCESS in 4m 15s
✔️ ara-tox-linters SUCCESS in 3m 57s
✔️ ara-basic-ansible-core-devel SUCCESS in 7m 09s (non-voting)
✔️ ara-basic-ansible-6 SUCCESS in 6m 09s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 6m 24s
✔️ ara-basic-ansible-core-2.13 SUCCESS in 6m 01s
✔️ ara-basic-ansible-core-2.12 SUCCESS in 6m 30s
✔️ ara-basic-ansible-core-2.11 SUCCESS in 6m 08s
✔️ ara-basic-ansible-2.9 SUCCESS in 6m 31s
✔️ ara-container-images SUCCESS in 11m 36s

- Substantial cleanup and cut on code duplication
- Fix linting and style
- Metric labels moved to default constants, leaving the door open for
  the possibility of customizing them
- Retrofit what we learned back to the playbook metrics
- Re-enable playbook metrics
@dmsimard dmsimard force-pushed the prometheus_exporter branch from feadacf to 7558a6f Compare June 20, 2023 05:18
@dmsimard
Contributor Author

Lots of cleanup in this last iteration and I've done some tweaking on the grafana dashboard.

It looks like this now:
[Screenshot: Grafana dashboard, 2023-06-20 01-17-51]

[Screenshot: Grafana dashboard, 2023-06-20 01-18-23]

@softwarefactory-project-zuul

Build failed.
https://ansible.softwarefactory-project.io/zuul/buildset/0eed3702b4444312b85e762bc95e51dc

✔️ ara-tox-py3 SUCCESS in 3m 12s
ara-tox-linters FAILURE in 3m 12s
✔️ ara-basic-ansible-core-devel SUCCESS in 6m 16s (non-voting)
✔️ ara-basic-ansible-6 SUCCESS in 5m 58s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 5m 20s
✔️ ara-basic-ansible-core-2.13 SUCCESS in 6m 54s
✔️ ara-basic-ansible-core-2.12 SUCCESS in 4m 51s
✔️ ara-basic-ansible-core-2.11 SUCCESS in 6m 03s
✔️ ara-basic-ansible-2.9 SUCCESS in 5m 08s
✔️ ara-container-images SUCCESS in 11m 33s

- More cleanup
- Removed Gauges for each status of playbooks and tasks, they were not
  useful once understanding how to use Summaries and generated a lot of
  needless metrics in hindsight
- Added a package extra for [prometheus]
- First iteration of docs
- Add first iteration of grafana dashboard
@dmsimard dmsimard force-pushed the prometheus_exporter branch from b82da8c to 6283872 Compare June 21, 2023 03:41
@softwarefactory-project-zuul

Build failed.
https://ansible.softwarefactory-project.io/zuul/buildset/fe23cb058a504bc48f68b007b1d4de91

✔️ ara-tox-py3 SUCCESS in 3m 15s
ara-tox-linters FAILURE in 3m 07s
✔️ ara-tox-docs SUCCESS in 7m 57s
✔️ ara-basic-ansible-core-devel SUCCESS in 5m 09s (non-voting)
✔️ ara-basic-ansible-6 SUCCESS in 5m 03s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 11m 10s
✔️ ara-basic-ansible-core-2.13 SUCCESS in 5m 06s
✔️ ara-basic-ansible-core-2.12 SUCCESS in 5m 06s
✔️ ara-basic-ansible-core-2.11 SUCCESS in 4m 45s
✔️ ara-basic-ansible-2.9 SUCCESS in 5m 08s
✔️ ara-container-images SUCCESS in 10m 57s

@dmsimard
Contributor Author

I feel this is ready for a first look to a wider audience so I've asked around for testing and feedback:

The final implementation may change before landing (for example, if I got the metric types wrong), but this will be useful to make sure we made the right decisions and can make the necessary changes before merging.

I am narrowing the scope of this first PR to playbooks, tasks and hosts for now. Results and plays can come in a later patch as necessary.

@softwarefactory-project-zuul

Build failed.
https://ansible.softwarefactory-project.io/zuul/buildset/5332cbba06be4ca09a29ccbfe24bb719

✔️ ara-tox-py3 SUCCESS in 3m 50s
ara-tox-linters FAILURE in 3m 56s
✔️ ara-tox-docs SUCCESS in 3m 58s
✔️ ara-basic-ansible-core-devel SUCCESS in 6m 17s (non-voting)
✔️ ara-basic-ansible-8 SUCCESS in 6m 03s
✔️ ara-basic-ansible-core-2.15 SUCCESS in 6m 53s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 5m 23s
✔️ ara-basic-ansible-2.9 SUCCESS in 6m 06s
✔️ ara-container-images SUCCESS in 12m 00s

@dmsimard dmsimard force-pushed the prometheus_exporter branch from 0ce3cf1 to c92b29b Compare July 21, 2023 03:12
@softwarefactory-project-zuul

Build failed.
https://ansible.softwarefactory-project.io/zuul/buildset/51c4f4164d66409bbf48568389543706

✔️ ara-tox-py3 SUCCESS in 3m 49s
ara-tox-linters FAILURE in 3m 53s
✔️ ara-tox-docs SUCCESS in 3m 11s
✔️ ara-basic-ansible-core-devel SUCCESS in 6m 03s (non-voting)
✔️ ara-basic-ansible-8 SUCCESS in 6m 01s
✔️ ara-basic-ansible-core-2.15 SUCCESS in 7m 29s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 7m 20s
✔️ ara-basic-ansible-2.9 SUCCESS in 5m 55s
✔️ ara-container-images SUCCESS in 11m 19s

@dmsimard dmsimard marked this pull request as draft September 9, 2023 15:49
@dmsimard dmsimard force-pushed the prometheus_exporter branch from c92b29b to 6283872 Compare October 23, 2023 00:14
@dmsimard
Contributor Author

Nothing special pushed, just rebased on top of latest master.

@softwarefactory-project-zuul

Build failed.
https://ansible.softwarefactory-project.io/zuul/buildset/7f750024dd7b42b2987983a14fc3a884

✔️ ara-tox-py3 SUCCESS in 4m 05s
ara-tox-linters FAILURE in 3m 50s
✔️ ara-tox-docs SUCCESS in 3m 15s
✔️ ara-basic-ansible-core-devel SUCCESS in 6m 55s (non-voting)
✔️ ara-basic-ansible-8 SUCCESS in 7m 00s
✔️ ara-basic-ansible-core-2.15 SUCCESS in 6m 58s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 6m 21s
✔️ ara-container-images SUCCESS in 13m 52s

@dmsimard
Contributor Author

I will eventually include it in the docs but in the meantime, I've come up with the following graph that explains how one might use the exporter:

                                         ┌──────────────────┐
       ┌────────────┐ promql ┌─────────┐ │ ansible-playbook │
       │ Prometheus │◄───────┤ Grafana │ │    (with ara)    │
       └──────┬─────┘        └─────────┘ └───────┬──────────┘
              │                                  │
              │ scrapes /metrics                 │ collects data
              │ & stores results                 │ & sends it
              │                                  │
   ┌──────────▼──────────┐               ┌───────▼────────┐
   │ Prometheus Exporter ├──────────────►│ ara API server │
   │ (prometheus_client) │ query metrics │    (django) ┌──┴─────────┐
   └─────────────────────┘               └─────────────┤ recorded   │
                                                       │  playbooks │
                                                       └────────────┘


ara doesn't provide monitoring or alerting out of the box (they are out of scope) but it records a number of granular metrics about Ansible playbooks, tasks and hosts, amongst other things.

Starting with version 1.6.2, ara provides an integration of `prometheus_client <https://github.com/prometheus/client_python>`_ that queries the ara API and then exposes these metrics for prometheus to scrape.
Contributor Author

1.6.2 didn't pan out, we went straight to 1.7.0. It can be included in a release as soon as it's ready.

help='Maximum number of days to backfill metrics for (default: 90)',
default=90,
type=int
)
Contributor Author

I think it could be interesting for the exporter to be able to filter queries the way the general CLI commands do. For example, ara playbook list (docs) has:

  --ansible_version <ansible_version>
                        List playbooks that ran with the specified Ansible
                        version (full or partial)
  --client_version <client_version>
                        List playbooks that were recorded with the specified
                        ara client version (full or partial)
  --server_version <server_version>
                        List playbooks that were recorded with the specified
                        ara server version (full or partial)
  --python_version <python_version>
                        List playbooks that were recorded with the specified
                        python version (full or partial)
  --user <user>         List playbooks that were run by the specified user
                        (full or partial)
  --controller <controller>
                        List playbooks that ran from the provided controller
                        (full or partial)
  --name <name>         List playbooks matching the provided name (full or
                        partial)
  --path <path>         List playbooks matching the provided path (full or
                        partial)
  --status <status>     List playbooks matching a specific status
                        ('completed', 'running', 'failed')

@voileux

voileux commented Nov 17, 2023

Hi,
I was at the Ansible meetup at the OVH building in Montreal; your presentation was really good.
In Prometheus, it's a problem when the value of a label changes between polling intervals for a given metric; it's better to turn that label into a metric of its own.

I think you could transform, for example, this metric:
ara_tasks_completed{
action="command",
duration="00:00:00.294820",
name="Echo the �abc binary string",
path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/smoke.yaml",
playbook="30",
results="1",
status="completed",
updated="2023-06-08T02:43:29.665787Z"} 1.0

into several metrics:
ara_tasks_status{action="command", name="Echo the abc binary string", path="/home/.......", playbook="30"} 1 (you can map an integer value to each status name: 1 for 'completed', 2 for 'running', 3 for 'failed')

ara_tasks_duration{action="command", name="Echo the abc binary string", path="/home/.......", playbook="30"} <number of seconds> (or microseconds if needed)

ara_tasks_results{action="command", name="Echo the abc binary string", path="/home/.......", playbook="30"} 1

We can work together to define the right metrics, then write the corresponding Python for the exporter.
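
A minimal sketch of the transformation suggested above, assuming durations arrive as `HH:MM:SS.ffffff` strings (optionally prefixed by a day count, which is an assumption about the API's serialization). The status codes follow the mapping proposed in this comment; `duration_to_seconds` and `split_task` are hypothetical names, and the task data is taken from the /metrics snippet earlier in this conversation.

```python
from typing import Any, Dict, Tuple

# Mapping proposed in the comment above.
STATUS_CODES = {"completed": 1, "running": 2, "failed": 3}

# Labels that stay stable for a given task; volatile values become metric values.
STABLE_LABELS = ("action", "name", "path", "playbook")


def duration_to_seconds(duration: str) -> float:
    """Convert 'HH:MM:SS.ffffff' (optionally 'D HH:MM:SS.ffffff') to seconds."""
    days = 0
    if " " in duration:
        day_part, duration = duration.split(" ", 1)
        days = int(day_part)
    hours, minutes, seconds = duration.split(":")
    return days * 86400 + int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def split_task(task: Dict[str, Any]) -> Dict[str, Tuple[Dict[str, str], float]]:
    """Turn one task into several (labels, value) samples keyed by metric name."""
    labels = {key: str(task[key]) for key in STABLE_LABELS}
    return {
        "ara_tasks_status": (labels, float(STATUS_CODES[task["status"]])),
        "ara_tasks_duration": (labels, duration_to_seconds(task["duration"])),
        "ara_tasks_results": (labels, float(task["results"])),
    }


task = {
    "action": "gather_facts",
    "duration": "00:00:01.035601",
    "name": "Gathering Facts",
    "path": "/home/dmsimard/dev/git/ansible-community/ara/tests/integration/smoke.yaml",
    "playbook": "30",
    "results": "1",
    "status": "completed",
    "updated": "2023-06-08T02:43:29.098823Z",
}
metrics = split_task(task)
for name, (labels, value) in metrics.items():
    print(name, value)
```

With this shape, the duration, status and result count can change between scrapes without creating a new series, because none of them is part of the label set.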

@dmsimard
Contributor Author

Hi @voileux and thanks for reaching out!

What you suggest makes sense to me and it's worth looking into.

I don't have bandwidth to look into this /right now/ but I will revisit this in the near future.

@copolycube

copolycube commented Nov 23, 2023

Hello,

Depending on your goal here, it might be easier to limit the "exporter part" to what you want to monitor live (i.e., what you want to trigger alerts on).

For the visualization aspects, you could connect Grafana directly to your database with the matching Grafana data source.

Something like:

flowchart TD
    G[Grafana] -->|promql <br/> visualize <b>alerts</b><br/> and correlate current metrics| P(Prometheus )
    G -->|db datasource <br/> visualize <b>metrics</b> <br/>current and historical| D
    W(alertmanager) -->|promql<br/>trigger alerts| P
    P-->|scrapes /metrics<br/> stores data| E(Prometheus Exporter<br/>prometheus_client)
    E --> |query metrics| D(ara API server <br/> django <br/>fa:fa-database recorded playbooks)
    A(ansible playbook) -->|collects data<br/>& sends it| D

instead of (from your previous schema here)

flowchart TD
    G[Grafana] -->|promql| P(Prometheus)
    P-->|scrapes /metrics<br/> stores data| E(Prometheus Exporter<br/>prometheus_client)
    E --> |query metrics| D(ara API server <br/> django <br/>fa:fa-database recorded playbooks)
    A(ansible playbook) -->|collects data<br/>& sends it| D


(edit: I forgot to put the mermaid keyword, and took this opportunity to add alertmanager & clarify the schema equivalent to the one you presented before)

This does require you to rewrite your Grafana panels to use the proper SQL, and you will need to open a connection between Grafana and your DB.

It also avoids transforming the whole content of the DB into the opentelemetry format and scraping it each time, which will scale better :-D

@dmsimard
Contributor Author

Hi, I haven't revisited this in a little while but I wanted to say it was still on my radar and I plan to work on this some more in the near future.

@xlr-8

xlr-8 commented Oct 2, 2024

Hello @dmsimard,

Thank you for the great project! Really nice to see / use!

I'm interested in taking over the topic, if that's alright with you. I'm also willing to build the Grafana dashboard based on the metrics gathered. I'm no expert, but I've used them a bit.

I've created a branch on my repo and tried to take into account your suggestions and @voileux's. However, I'm currently stuck on the testing phase.

I've read your documentation and code, but I can't get the prometheus command exposed via the CLI.
So far I have:

  1. Built the container with buildah (why use buildah by the way?)
  2. Exported it to docker to run it
  3. Started the container
  4. Reinstalled ara (using: pip uninstall ara && pip install -e '.[server,prometheus]')

The project runs locally, I still have access to everything as before, but no way to get access to prometheus through the CLI:

# Can see the host for example but nothing for prom
 ara help | grep -E '(host|prometheus)'
  host delete    Deletes the specified host and associated resources
  host list      Returns a list of hosts based on search queries
  host metrics   Provides metrics about hosts
  host show      Returns a detailed view of a specified host

Logs of the previous steps:

> buildah images | grep ara-177
localhost/ara-api        ara-177   fcb72fa860d1   23 hours ago   295 MB
> docker images | grep ara-177
localhost/ara-api                               ara-177                        fcb72fa860d1   23 hours ago    280MB
> docker ps
CONTAINER ID   IMAGE                           COMMAND                  CREATED        STATUS        PORTS                                       NAMES
32dd5a2de5dc   localhost/ara-api:ara-177   "bash -c '/usr/local…"   21 hours ago   Up 21 hours   0.0.0.0:8000->8000/tcp, :::8000->8000/tcp   ara

I feel like this part of the documentation is a bit thin, and having to use and understand buildah/docker/tox (is it needed?) or the overall parser is difficult for me. My feeling is that either there's some cache I haven't cleaned, so it still uses an old version (ara 0.0.1.dev991), or some import of the Prometheus module is missing (in setup.cfg, or somewhere else), so it is never called or reachable.
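
Since the PR description says the ara prometheus command is only available when prometheus_client is installed, one quick check is whether that module is importable from the same Python environment that runs ara. This is a generic sketch using only the standard library; `has_module` is a hypothetical helper, and a positive result does not by itself prove that the command is registered.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(name) is not None


# Run this with the same interpreter (virtualenv or container) that runs ara.
for module in ("ara", "prometheus_client"):
    print(module, "found" if has_module(module) else "missing")
```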

I'm also willing to help with the docs on those parts so that other people can participate, but so far it's still too blurry for me to write anything clear.

If you can fill in the blanks, that would be amazing!

Thanks!

@dmsimard
Contributor Author

dmsimard commented Oct 2, 2024

Hi @xlr-8, thanks for your interest and for looking into this.

I haven't yet revisited this topic but I did talk about it at configuration management camp last year.
It wasn't recorded unfortunately but slides are available here: https://ara.recordsansible.org/presentations/cfg-mgmt-2024/ansible-metrics-in-prometheus.pdf (other presentations: https://ara.recordsansible.org/presentations/)

I am still interested in making this work :)

In the backup slides for last year's presentation there's a condensed how-to for testing this:

Demo: Trying out the exporter

# Install and run a prometheus exporter with metrics from ara
# https://github.com/ansible-community/ara/issues/177
# https://github.com/ansible-community/ara/pull/483
git clone https://github.com/dmsimard/ara
cd ara
git checkout prometheus_exporter

# Set up a virtualenv with ansible, ara and prometheus-client
tox -e ansible-integration --notest
source .tox/ansible-integration/bin/activate
pip install prometheus-client

# Metrics from localhost without needing to run a server
ara prometheus --max-days 1

# Metrics from a remote server running somewhere
ara prometheus --client http --server http://127.0.0.1:8000 --max-days 1

This should help you get started without needing to re-build container images after every change.
In terms of workflow, you basically:

  1. install ara from source (from git branch) including prometheus-client
  2. populate the ara database with data (real playbooks or you can use something like ansible-playbook tests/integration/hosts.yml with ANSIBLE_CALLBACK_PLUGINS=$(python3 -m ara.setup.callback_plugins))
  3. run the exporter
  4. have a prometheus scrape it

You can make changes to the exporter code and re-run the ara prometheus command for them to take effect.

The prometheus config supplied in the backup slides:
prometheus.yml:

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']

Start a Prometheus container:
{podman,docker} run -d --name prometheus \
  -p 9090:9090 \
  -v "$(pwd)/prometheus.yml":/etc/prometheus/prometheus.yml \
  quay.io/prom/prometheus

It's probably worthwhile for the branch to be rebased on top of the latest master by now. There haven't been changes that would impact the prometheus implementation, I don't think, but there have been things like django updates and such.

I can take care of that if you'd like.

Otherwise:

Built the container with buildah (why use buildah by the way?)

Personal preference :)

I can be reached over matrix (or the slack bridge) and maybe IRC for discussion.

@xlr-8

xlr-8 commented Oct 3, 2024

Awesome! Thank you so much for the detailed answer!

Personal preference :)

Alright, I figured perhaps there was some better integration with Red Hat and Red Hat-like distros, as I could see you were using Fedora/CentOS.

No worries for the rebase, I'll take care of it.

I should take a look at it within the next few days ❤️

@dmsimard
Contributor Author

@xlr-8 did you end up spending some cycles on this? It is coming back onto my radar in the not-too-distant future.
