Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Director Prometheus scraper of origins' metrics #289

Merged
merged 11 commits into from
Nov 6, 2023

Conversation

haoming29
Copy link
Contributor

@haoming29 haoming29 commented Oct 25, 2023

Testing Instructions

Since our current Prometheus setup has a fixed datastore location, it is very difficult to get more than one instance of service (director/origin/registry) to run Prometheus at the same time. To test this PR locally, you need to have some local patches to make it work. Here is how you can do it:

Code modification

First, make sure you specify the director's URL:

Federation:
  DirectorUrl: "https://localhost:8888"

Next, in web_ui/prometheus.go, ConfigureEmbeddedPrometheus() around line 218, you want to uncomment the code:

if isDirector {
    err := os.MkdirAll("/var/lib/pelican/director-monitoring/data", 0750)
if err != nil {
    return errors.New("Failure when creating a directory for the monitoring data")
}
    cfg.serverStoragePath = "/var/lib/pelican/director-monitoring/data"
} else {
    cfg.serverStoragePath = param.Monitoring_DataLocation.GetString()
}

This is to make a separate directory for the director's Prometheus datastore location.

With this done, you can go ahead and build the program.

Orchestrating services

Another special sauce to make it work locally is the order of running the programs. You want to run director first, registry second, origin last (this is important). After three services running, you want to kill registry, kill origin and finally run origin again. It's a bit counter-intuitive but the reason is that we want to ensure we can successfully register our origin at director first. As I tested, if we lunch origin prior to the registry, we will get this error when the director tries to fetch public keys from registry:

DEBUG[2023-10-27T15:37:47Z] Attempting to fetch keys from  https://localhost:9999/api/v1.0/registry/origin1/.well-known/issuer.jwks 
DEBUG[2023-10-27T15:37:47Z] Constructed JWKS from fetching jwks: null    
WARNING[2023-10-27T15:37:47Z] Failed to verify token: cached object is not a Set (was <nil>) 

This error only appears if you run services in the order of director origin registry. Might worth another PR to further investigate.

Also, since registry process will lock the Prometheus datastore when it's running, when we then ran origin, origin Prometheus instance will fail to start, and we can't scrape anything from it. So after origin successfully registered at director, we can turn off registry to release the datastore and then reboot origin.

With director and origin running, switch to the terminal where you ran director and confirm that the origin has been successfully registered (You may need to scroll up a bit). (You should be able to see something like:
INFO[2023-10-27T15:19:41Z] Served Request client=127.0.0.1 daemon=gin fields.time=5.640583ms method=POST resource=/api/v1.0/director/registerOrigin status=200)

This is the end of the setup.

You want to test that you are able to access origin's prometheus endpoint first, say https://localhost:<origin-web-port>/metrics and it should return metrics, not 503 (that means the origin's datastore is locked).

Now you should be able to go to director's Prometheus querying endpoint to verify that director is scraping metrics from origins:

https://localhost:8888/api/v1.0/prometheus/query?query=up

Or refer to a list of labels available by going to https://localhost:8888/api/v1.0/prometheus/label/__name__/values

https://localhost:8888/api/v1.0/prometheus/query?query=xrootd_monitoring_packets_received

The "job" label of the returned data should be the hostname of the origin, and the "instance" label should be origin's web URL.

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "xrootd_monitoring_packets_received",
          "instance": "localhost:8444",
          "job": "origins",
          "origin_auth_url": "https://localhost:8444",
          "origin_lat": "0.0000",
          "origin_long": "0.0000",
          "origin_name": "fae8c2865de4",
          "origin_url": "https://localhost:8444"
        },
        "value": [
          1698432545.437,
          "24"
        ]
      }
    ]
  }
}

I set a couple of labels for origin's SD response which are shown above "origin_*". Do you think we want to make all of them visible to the query? Or you want to hide some labels as internal only "__meta_origin_*"? @bbockelm

Also, the current service discovery token has a relatively long life time, 1 hour, and the token will be refreshed at Prometheus for every 50 min. If you want to test the end-to-end token refresh mechanism, you may change the code at director/director_api.go around line 134 for the token life time, and web_ui/prometheus.go line 547 for the refresh interval. Don't forget to rebuild before running.

@haoming29 haoming29 force-pushed the director-scraping branch 4 times, most recently from 128c4f5 to c9aa107 Compare October 27, 2023 18:58
@haoming29 haoming29 linked an issue Oct 27, 2023 that may be closed by this pull request
@haoming29
Copy link
Contributor Author

haoming29 commented Oct 27, 2023

When implementing the token auth for the director service discovery endpoint, the GHA tests are failing on my newly added unit tests. Ref: https://github.com/PelicanPlatform/pelican/actions/runs/6671169541/job/18132988943

The error is because when I verify the token, I used config.LoadPublicKey() which returns a jwks. I didn't extract the first key out of the set but feed the jwks to the jwt.Parse() like this: tok, err := jwt.Parse([]byte(strToken), jwt.WithKeySet(*key), jwt.WithValidate(true))

According to JWT doc, when doing jwt.WithKeySet, you need to ensure kid is present in your public key and matches the one in the private key, but currently, we never add a kid to the public key at init_server_cred. Interestingly, this behavior can't be reproduced locally. It only appears in GHA.

I added one commit to fix this kid missing issue by changing the order we assign kid to keys and ensure we have a shared private jwk that has the same kid as the public jwk at the start of the server. Please refer to the code change made in init_server_cred.

I'm not sure if we want to make this change, or just stay at keys.Key(0). @bbockelm any thoughts on this?

web_ui/ui.go Outdated Show resolved Hide resolved
web_ui/prometheus.go Outdated Show resolved Hide resolved
web_ui/prometheus.go Outdated Show resolved Hide resolved
director/redirect.go Outdated Show resolved Hide resolved
Copy link
Collaborator

@bbockelm bbockelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, meant to leave some one-off comments but it turned into a full-on review...

A few things embedded. The testing is a touch awkward right now but, in order to prevent a large PR from lingering, we might simply split this work across two releases.

@bbockelm bbockelm added the enhancement New feature or request label Oct 29, 2023
@bbockelm bbockelm added this to the v7.2.0 milestone Oct 29, 2023
@haoming29
Copy link
Contributor Author

Sorry, meant to leave some one-off comments but it turned into a full-on review...

A few things embedded. The testing is a touch awkward right now but, in order to prevent a large PR from lingering, we might simply split this work across two releases.

Do you have any opinion on where we should split this PR?

@haoming29 haoming29 changed the title Director scraping Director Prometheus scraper of origins' metrics Oct 31, 2023
@haoming29 haoming29 marked this pull request as ready for review October 31, 2023 20:11
@bbockelm
Copy link
Collaborator

bbockelm commented Nov 4, 2023

Do you have any opinion on where we should split this PR?

Ah, I meant that we need to do further improvements to the functionality but it can be merged in the meantime.

Copy link
Collaborator

@bbockelm bbockelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think this one is ready to go!

@haoming29 haoming29 merged commit a15f838 into PelicanPlatform:main Nov 6, 2023
6 checks passed
@haoming29 haoming29 deleted the director-scraping branch November 6, 2023 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Director-based Prometheus scraping
2 participants