Topology Downtimes not picked up by the director #1248

bbockelm · 2024-05-08T02:11:03Z

When a Pelican cache needs to go into downtime, operators still create the downtime in topology (that is what feeds the daily monitoring reports, for example).

However, it seems that if there are Pelican server ads for the cache, they override the downtime from topology and the director continues to use the cache. For example, check out sc-cache.chtc.wisc.edu that @matyasselmeci operates: it should be in a months-long downtime but it is actively receiving redirects.

Unlike other attributes where we always prefer the Pelican version, we should have the director enforce a downtime if either source (topology or the pelican origin/cache) has a downtime listed.
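A minimal sketch of the "either source" rule described above. This is not Pelican's actual code: `ServerAd`, its fields, and `redirect_candidates` are hypothetical names used only to illustrate the logic.

```python
from dataclasses import dataclass


@dataclass
class ServerAd:
    """Hypothetical, simplified view of a server known to the director."""
    hostname: str
    pelican_downtime: bool = False   # downtime reported by the Pelican origin/cache
    topology_downtime: bool = False  # downtime declared in topology


def in_downtime(ad: ServerAd) -> bool:
    # Downtime is the exception to "prefer the Pelican version":
    # a downtime from either source takes the server out of rotation.
    return ad.pelican_downtime or ad.topology_downtime


def redirect_candidates(ads):
    """Return hostnames that are still eligible to receive redirects."""
    return [ad.hostname for ad in ads if not in_downtime(ad)]
```

With this rule, a cache that has a live Pelican ad but a topology downtime (the sc-cache.chtc.wisc.edu situation) would be excluded from redirects.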

Marking as critical for 7.9.0 but I think it should be backported to 7.8.x as well.

haoming29 · 2024-05-08T16:17:49Z

For this case, if a cache is registered on both sides, isn't it the admin's job to also disable the cache in the Pelican director? If the cache only lives in topology, putting it in downtime removes it from the topology JSON, and there seems to be no explicit way of telling from the topology JSON whether a server is in downtime.

As @matyasselmeci pointed out, we can add ?include_downed=1 to show all the servers, and then take the difference against the default listing to find the servers that are in downtime, but I'm not entirely sure this is the approach we want to follow.
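The difference approach could look roughly like the sketch below. It assumes the two topology listings (with and without ?include_downed=1) have already been fetched and reduced to hostnames; the function name and the example hostname `other-cache.example.org` are hypothetical.

```python
def servers_in_downtime(all_servers, active_servers):
    """Hostnames listed with include_downed=1 but missing from the default listing."""
    # Anything that only appears when downed servers are included
    # must currently be in a topology downtime.
    return sorted(set(all_servers) - set(active_servers))


# Hypothetical inputs: hostnames parsed from the two topology responses.
with_downed = ["sc-cache.chtc.wisc.edu", "other-cache.example.org"]
default_listing = ["other-cache.example.org"]
downed = servers_in_downtime(with_downed, default_listing)
```

This costs two requests per refresh and infers downtime indirectly, which is part of why it may not be the right long-term approach.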

bbockelm · 2024-05-08T18:37:42Z

> For this case, if a cache is registered at both sides, isn't it the admin's job to also disable the cache in the Pelican director?

Which admin?

If you mean the cache admin: I don't want a cache admin to have to repeat themselves. They should only have to declare the downtime once. Since there's no way for a cache admin to declare a downtime in Pelican (see #1251), they have to do this via topology (plus our monitoring infrastructure only looks at topology right now).

If you're talking about the central services admin: I don't think they should be declaring downtimes for all caches.

haoming29 · 2024-05-08T18:44:25Z

That makes sense. I'll figure out a way to let the director admin know that a downtime was fetched from topology rather than set at the director. That can later be expanded to show a generic downtime source: topology, director UI, director configuration, or origin/cache server.
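One way the "generic source of downtime" idea could be modeled is sketched below. The enum and helper are hypothetical and only mirror the four sources named above; they are not taken from the Pelican codebase.

```python
from enum import Enum


class DowntimeSource(Enum):
    """Hypothetical labels for where a downtime was declared."""
    TOPOLOGY = "topology"
    DIRECTOR_UI = "director UI"
    DIRECTOR_CONFIG = "director configuration"
    SERVER = "origin/cache server"


def describe_downtime(hostname, sources):
    """Build a message telling the director admin where a downtime came from."""
    if not sources:
        return f"{hostname}: not in downtime"
    # Sort by label so the message is stable regardless of set ordering.
    names = ", ".join(sorted(s.value for s in sources))
    return f"{hostname}: in downtime (source: {names})"
```

Tracking a set of sources per server, rather than a single flag, keeps the "either source" rule intact while still showing the admin which system declared the downtime.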

bbockelm added the bug, critical, cache, and origin labels May 8, 2024

bbockelm added this to the v7.9.0 milestone May 8, 2024

bbockelm assigned haoming29 May 8, 2024

bbockelm mentioned this issue May 8, 2024

Explicitly ignore Pelican endpoints in topology #1258

Closed

haoming29 mentioned this issue May 9, 2024

Fetch topology downtime for cache servers #1260

Merged

jhiemstrawisc closed this as completed in #1260 May 10, 2024
