Topology Downtimes not picked up by the director #1248

Closed
bbockelm opened this issue May 8, 2024 · 3 comments · Fixed by #1260
Comments

@bbockelm
Collaborator

bbockelm commented May 8, 2024

When a Pelican cache needs to go into downtime, operators still create the downtime in topology (which, for example, feeds the daily monitoring reports).

However, it seems that if there are Pelican server ads, then it overwrites the downtime from topology and we continue to use the cache. For example, check out sc-cache.chtc.wisc.edu that @matyasselmeci operates: it should be in a months-long downtime but it is actively receiving redirects.

Unlike other attributes where we always prefer the Pelican version, we should have the director enforce a downtime if either source (topology or the pelican origin/cache) has a downtime listed.
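The "either source" rule above could be sketched roughly as follows. This is only an illustration: `ServerAd`, its fields, `InDowntime`, and `Redirectable` are hypothetical names, not Pelican's actual types or functions, and `cache.example.org` is a made-up host.

```go
package main

import "fmt"

// ServerAd is a hypothetical, simplified view of what the director knows
// about one cache or origin. It is not the real Pelican type.
type ServerAd struct {
	Name             string
	TopologyDowntime bool // downtime declared in topology
	PelicanDowntime  bool // downtime declared by the Pelican origin/cache
}

// InDowntime reports whether the director should treat the server as down.
// A downtime from either source is enough; neither source overrides the other.
func InDowntime(ad ServerAd) bool {
	return ad.TopologyDowntime || ad.PelicanDowntime
}

// Redirectable returns only the servers that are not in downtime.
func Redirectable(ads []ServerAd) []ServerAd {
	var out []ServerAd
	for _, ad := range ads {
		if !InDowntime(ad) {
			out = append(out, ad)
		}
	}
	return out
}

func main() {
	ads := []ServerAd{
		{Name: "sc-cache.chtc.wisc.edu", TopologyDowntime: true},
		{Name: "cache.example.org"}, // made-up host with no downtime
	}
	for _, ad := range Redirectable(ads) {
		fmt.Println(ad.Name)
	}
}
```

The point of the sketch is that downtime is merged with a logical OR, unlike other attributes where the Pelican value simply replaces the topology one.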

Marking as critical for 7.9.0 but I think it should be backported to 7.8.x as well.

@bbockelm bbockelm added bug Something isn't working critical High priority for next release cache Issue relating to the cache component origin Issue relating to the origin component labels May 8, 2024
@bbockelm bbockelm added this to the v7.9.0 milestone May 8, 2024
@haoming29
Contributor

For this case, if a cache is registered at both sides, isn't it the admin's job to also disable the cache in the Pelican director? If the cache only lives in the topology, putting it in downtime removes it from the topology JSON, and there seems to be no explicit way to tell from the topology JSON whether a server is in downtime.

As @matyasselmeci pointed out, we can add ?include_downed=1 to show all the servers, then take the difference against the default listing to find the servers that are in downtime, but I'm not entirely sure this is the approach we want to take.
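A minimal sketch of that difference, assuming the two listings have already been fetched and reduced to server names (fetching and parsing the topology data is left out, and `DownedServers` plus the example hosts are hypothetical):

```go
package main

import (
	"fmt"
	"sort"
)

// DownedServers returns the names that appear in the full listing
// (the one requested with ?include_downed=1) but not in the default
// listing, i.e. the servers that are currently in downtime.
func DownedServers(all, active []string) []string {
	activeSet := make(map[string]struct{}, len(active))
	for _, name := range active {
		activeSet[name] = struct{}{}
	}
	var downed []string
	for _, name := range all {
		if _, ok := activeSet[name]; !ok {
			downed = append(downed, name)
		}
	}
	sort.Strings(downed) // stable output regardless of input order
	return downed
}

func main() {
	// Made-up host names for illustration only.
	all := []string{"cache-a.example.org", "cache-b.example.org", "cache-c.example.org"}
	active := []string{"cache-a.example.org", "cache-c.example.org"}
	fmt.Println(DownedServers(all, active))
}
```

This costs two requests per refresh and only infers downtime indirectly, which is part of why it may not be the preferred route.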

@bbockelm
Collaborator Author

bbockelm commented May 8, 2024

> For this case, if a cache is registered at both sides, isn't it the admin's job to also disable the cache in the Pelican director?

Which admin?

If you mean the cache admin: I don't want a cache admin to have to repeat themselves. They should only have to declare the downtime once. Since there's no way for a cache admin to declare a downtime in Pelican (see #1251), they have to do this via topology (plus our monitoring infrastructure only looks at topology right now).

If you're talking about the central services admin: I don't think they should be declaring downtimes for all caches.

@haoming29
Contributor

That makes sense. I'll figure out a way to let the director admin know that a downtime was fetched from topology rather than set at the director. That can then be expanded to show a generic downtime source: topology, director UI, director configuration, or the origin/cache server.
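One way such a source label might look, purely as a sketch: `DowntimeSource`, `Downtime`, and `Describe` are hypothetical names and do not reflect what was eventually merged.

```go
package main

import "fmt"

// DowntimeSource is a hypothetical label for where a downtime was declared.
type DowntimeSource string

const (
	SourceTopology       DowntimeSource = "topology"
	SourceDirectorUI     DowntimeSource = "director UI"
	SourceDirectorConfig DowntimeSource = "director configuration"
	SourceServer         DowntimeSource = "origin/cache server"
)

// Downtime pairs a server name with the source that declared it down.
type Downtime struct {
	Server string
	Source DowntimeSource
}

// Describe renders a one-line summary an admin-facing view could show.
func Describe(d Downtime) string {
	return fmt.Sprintf("%s is in downtime (source: %s)", d.Server, d.Source)
}

func main() {
	fmt.Println(Describe(Downtime{Server: "sc-cache.chtc.wisc.edu", Source: SourceTopology}))
}
```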
