Director Prometheus scraper of origins' metrics #289
When implementing the token auth for the director service discovery endpoint, the GHA tests are failing on my newly added unit tests. Ref: https://github.com/PelicanPlatform/pelican/actions/runs/6671169541/job/18132988943
The error is because when I verify the token, I used [...]. According to JWT doc, when doing [...]. I added one commit to fix this. I'm not sure if we want to make this change, or just stay at [...].
Sorry, meant to leave some one-off comments but it turned into a full-on review...
A few things embedded. The testing is a touch awkward right now but, in order to prevent a large PR from lingering, we might simply split this work across two releases.
Do you have any opinion on where we should split this PR?
Co-authored-by: Brian P Bockelman <bockelman@gmail.com>
Ah, I meant that we need to do further improvements to the functionality but it can be merged in the meantime.
Ok, I think this one is ready to go!
Testing Instructions
Since our current Prometheus setup has a fixed datastore location, it is very difficult to get more than one instance of service (director/origin/registry) to run Prometheus at the same time. To test this PR locally, you need to have some local patches to make it work. Here is how you can do it:
Code modification
First, make sure you specify the director's URL:
Next, in web_ui/prometheus.go, ConfigureEmbeddedPrometheus(), around line 218, you want to uncomment the code:
This is to make a separate directory for the director's Prometheus datastore location.
With this done, you can go ahead and build the program.
Orchestrating services
Another special sauce to make it work locally is the order of running the programs. You want to run director first, registry second, and origin last (this is important). After the three services are running, you want to kill registry, kill origin, and finally run origin again. It's a bit counter-intuitive, but the reason is that we want to ensure we can successfully register our origin at the director first. As I tested, if we launch origin prior to the registry, we will get this error when the director tries to fetch public keys from the registry:
This error only appears if you run services in the order of director, origin, registry. It might be worth another PR to investigate further.
Also, since the registry process locks the Prometheus datastore while it's running, when we then run origin, origin's Prometheus instance will fail to start, and we can't scrape anything from it. So after origin has successfully registered at director, we can turn off registry to release the datastore and then reboot origin.
With director and origin running, switch to the terminal where you ran director and confirm that the origin has been successfully registered (you may need to scroll up a bit). You should be able to see something like:
INFO[2023-10-27T15:19:41Z] Served Request client=127.0.0.1 daemon=gin fields.time=5.640583ms method=POST resource=/api/v1.0/director/registerOrigin status=200
This is the end of the setup.
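If scrolling through the director's terminal gets tedious, the check for the registration line could be scripted. The following is only a rough sketch: the helper names parseLogFields and isOriginRegistered are made up for illustration and are not part of Pelican, and it assumes the log line keeps the key=value layout shown above.

```go
package main

import (
	"fmt"
	"strings"
)

// parseLogFields (hypothetical helper) splits a logrus-style text line into
// key=value pairs. Tokens without '=' (the "INFO[...]" prefix, message words)
// are skipped. Quoted values containing spaces are not handled.
func parseLogFields(line string) map[string]string {
	fields := make(map[string]string)
	for _, tok := range strings.Fields(line) {
		if key, value, ok := strings.Cut(tok, "="); ok {
			fields[key] = value
		}
	}
	return fields
}

// isOriginRegistered (hypothetical helper) reports whether the line records
// a POST to the registerOrigin resource that returned status 200.
func isOriginRegistered(line string) bool {
	f := parseLogFields(line)
	return f["method"] == "POST" &&
		strings.HasSuffix(f["resource"], "/director/registerOrigin") &&
		f["status"] == "200"
}

func main() {
	line := "INFO[2023-10-27T15:19:41Z] Served Request client=127.0.0.1 daemon=gin " +
		"fields.time=5.640583ms method=POST resource=/api/v1.0/director/registerOrigin status=200"
	fmt.Println("origin registered:", isOriginRegistered(line))
}
```

Matching on the resource suffix rather than the full path is a deliberate choice so the check does not break if the API version prefix changes.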
You want to test that you are able to access origin's Prometheus endpoint first, say https://localhost:<origin-web-port>/metrics, and it should return metrics, not 503 (that means the origin's datastore is locked).
Now you should be able to go to director's Prometheus querying endpoint to verify that director is scraping metrics from origins:
https://localhost:8888/api/v1.0/prometheus/query?query=up
Or refer to a list of labels available by going to
https://localhost:8888/api/v1.0/prometheus/label/__name__/values
https://localhost:8888/api/v1.0/prometheus/query?query=xrootd_monitoring_packets_received
The "job" label of the returned data should be the hostname of the origin, and the "instance" label should be origin's web URL.
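To check the "job" and "instance" labels without reading the raw JSON by eye, something like the sketch below could help. It assumes the director passes through the standard Prometheus HTTP API response envelope (status, data.result[].metric); the function name scrapedTargets and the sample hostname and URL are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queryResponse models only the parts of the Prometheus instant-query
// response that this check needs.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
		} `json:"result"`
	} `json:"data"`
}

// target holds the two labels the instructions say to inspect.
type target struct{ Job, Instance string }

// scrapedTargets (hypothetical helper) extracts the job/instance label pair
// from every series in a query response body.
func scrapedTargets(body []byte) ([]target, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("query returned status %q", resp.Status)
	}
	targets := make([]target, 0, len(resp.Data.Result))
	for _, r := range resp.Data.Result {
		targets = append(targets, target{Job: r.Metric["job"], Instance: r.Metric["instance"]})
	}
	return targets, nil
}

func main() {
	// Sample body with a made-up origin; a real one would come from the
	// director's /api/v1.0/prometheus/query?query=up endpoint.
	sample := []byte(`{"status":"success","data":{"resultType":"vector","result":[
		{"metric":{"__name__":"up","job":"origin.example.org","instance":"https://origin.example.org:8444"},
		 "value":[1698419981,"1"]}]}}`)
	targets, err := scrapedTargets(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, t := range targets {
		fmt.Printf("job=%s instance=%s\n", t.Job, t.Instance)
	}
}
```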
I set a couple of labels for origin's SD response, which are shown above as "origin_*". Do you think we want to make all of them visible to the query? Or do you want to hide some labels as internal only, "__meta_origin_*"? @bbockelm
Also, the current service discovery token has a relatively long lifetime, 1 hour, and the token is refreshed at Prometheus every 50 min. If you want to test the end-to-end token refresh mechanism, you may change the code at director/director_api.go around line 134 for the token lifetime, and web_ui/prometheus.go line 547 for the refresh interval. Don't forget to rebuild before running.