Support differentiating between static and dynamic resource groups #5162

Closed
ddanielr opened this issue Dec 10, 2024 · 5 comments · Fixed by #5226
Labels
enhancement This issue describes a new feature, improvement, or optimization.
Milestone

Comments

@ddanielr
Contributor

ddanielr commented Dec 10, 2024

In the past, Accumulo has tracked dead tservers.

For resource groups that have dynamic scaling workloads, there should be an option to disable this tracking of "dead" servers.

We probably don't want to remove this functionality entirely, because there might be a static group of tservers that could have issues, and failures for those machines should be tracked.

ddanielr converted this from a draft issue Dec 10, 2024
ddanielr added the enhancement label Dec 10, 2024
ddanielr changed the title from 'Add variable to resource groups for "dynamic" vs "static" scaling of tservers so tracking "Dead tservers" is disabled per resource group.' to 'Support differentiating between static and dynamic resource groups' Dec 11, 2024
ddanielr added this to the 4.0.0 milestone Dec 11, 2024
@dlmarion
Contributor

From what I can tell, the logic in LiveTServerSet.checkServer only adds the tablet server to the dead list if the tablet server path in ZooKeeper exists but there is no lock data, or if the instance does not match (for example, when the tserver was restarted at the same host/port location). The ZooKeeper node for the tserver is deleted after 10 minutes if there is no lock data.
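
As a rough illustration of the rule described above, here is a minimal Java sketch. It is not the actual LiveTServerSet.checkServer code: the class name, the method name, and the use of plain strings for the server instance are all hypothetical, and the 10-minute cleanup is not modeled.

```java
import java.util.Optional;

// Hypothetical helper, not part of Accumulo. It only restates the rule from the comment above.
final class DeadServerRule {

  // pathExists:    whether the tserver's path is present in ZooKeeper
  // lockInstance:  instance identifier read from the lock data, empty if there is no lock data
  // knownInstance: instance identifier previously seen for this host/port, empty if none
  static boolean shouldMarkDead(boolean pathExists, Optional<String> lockInstance,
      Optional<String> knownInstance) {
    if (!pathExists) {
      return false; // nothing in ZooKeeper, so nothing to report
    }
    if (!lockInstance.isPresent()) {
      return true; // path exists but there is no lock data
    }
    // Lock data exists: the previously known server is dead only if a different
    // instance now holds the same host/port location.
    return knownInstance.isPresent() && !knownInstance.get().equals(lockInstance.get());
  }
}
```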

This information is only shown on the Monitor. We could add a Monitor property that is a list of resource groups to ignore, then use that when calling DeadServerList.getList to skip dead servers with a matching resource group in the ServiceLockPath.
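
A minimal Java sketch of that filtering idea follows. Everything here is an assumption for illustration: the class and method names are hypothetical, dead servers are represented as plain path strings rather than ServiceLockPath objects, and the ignored groups are passed in directly instead of being read from a Monitor property. The path layout assumed is /accumulo/<instanceId>/<serverType>/<resourceGroup>/<address>.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Accumulo.
final class DeadServerGroupFilter {

  // Extracts <resourceGroup> from /accumulo/<instanceId>/<serverType>/<resourceGroup>/<address>.
  static Optional<String> resourceGroupOf(String path) {
    String[] parts = path.split("/");
    // parts[0] is the empty string before the leading "/"
    if (parts.length != 6 || !parts[0].isEmpty() || !parts[1].equals("accumulo")) {
      return Optional.empty();
    }
    return Optional.of(parts[4]);
  }

  // Keeps only dead servers whose resource group is not in the ignore list.
  // Paths that do not match the assumed layout are kept, so nothing is hidden by accident.
  static List<String> visibleDeadServers(List<String> deadServerPaths,
      Set<String> ignoredGroups) {
    return deadServerPaths.stream()
        .filter(p -> resourceGroupOf(p).map(g -> !ignoredGroups.contains(g)).orElse(true))
        .collect(Collectors.toList());
  }
}
```

With an ignore list containing a dynamically scaled group, only dead servers from the remaining (static) groups would be displayed.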

@dlmarion
Contributor

Another way to approach this would be to remove the tablet server's ZooKeeper entry on a normal shutdown. I believe that the tserver did not do this in earlier (< 4.0) versions because, when configured using cluster.yaml, the deployment was meant to be static.

In fact, this is a larger issue when a user runs on something like Kubernetes. It's possible that old server paths might linger for all the server processes. We may want to consider removing the server paths from ZooKeeper for all server types when performing a normal shutdown.

@dlmarion
Contributor

Another way to approach this would be to remove the tablet server's ZooKeeper entry on a normal shutdown.

In the ServiceLock.unlock code (below), for Compactors, Scan Servers, and TabletServers, path is /accumulo/<instanceId>/(compactors|sservers|tservers)/<resourceGroup>/<address> and localLock is the ephemeral node that is created. The code currently deletes only the ephemeral node.

We could modify ServiceLock.unlock to accept a boolean that, when true, performs a recursive delete on path. We would only set it to true for Compactors, ScanServers, and TabletServers when ServiceLock.unlock is called. This would mean that the LiveTServerSet running in the Manager would not find any TabletServers without locks when they are shut down gracefully. Graceful shutdown is being addressed in #5193.

public synchronized void unlock() throws InterruptedException, KeeperException {
  if (lockNodeName == null) {
    throw new IllegalStateException();
  }
  LockWatcher localLw = lockWatcher;
  String localLock = lockNodeName;
  lockNodeName = null;
  lockWatcher = null;
  final String pathToDelete = path + "/" + localLock;
  LOG.debug("[{}] Deleting all at path {} due to unlock", vmLockPrefix, pathToDelete);
  ZooUtil.recursiveDelete(zooKeeper, pathToDelete, NodeMissingPolicy.SKIP);
  // Wait for the delete to happen on the server before exiting method
  Timer start = Timer.startNew();
  while (zooKeeper.exists(pathToDelete, null) != null) {
    Thread.onSpinWait();
    if (start.hasElapsed(10, SECONDS)) {
      start.restart();
      LOG.debug("[{}] Still waiting for zookeeper to delete all at {}", vmLockPrefix,
          pathToDelete);
    }
  }
  localLw.lostLock(LockLossReason.LOCK_DELETED);
}
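
To make the proposed boolean flag concrete, here is a self-contained Java sketch that models it against an in-memory set of node paths instead of a real ZooKeeper client. FakeZk, SketchLock, the removeServerPath parameter name, and the lock node name used below are all hypothetical and exist only for illustration; the lock watcher notification and the wait loop from the real method are left out.

```java
import java.util.TreeSet;

// Hypothetical in-memory stand-in for ZooKeeper; it only tracks which node paths exist.
final class FakeZk {
  private final TreeSet<String> nodes = new TreeSet<>();

  void create(String path) {
    nodes.add(path);
  }

  boolean exists(String path) {
    return nodes.contains(path);
  }

  // Deletes the node at path and every descendant below it.
  void recursiveDelete(String path) {
    nodes.removeIf(n -> n.equals(path) || n.startsWith(path + "/"));
  }
}

// Hypothetical, simplified lock holder illustrating the flag. Not the real ServiceLock.
final class SketchLock {
  private final FakeZk zk;
  private final String path; // e.g. .../tservers/<resourceGroup>/<address>
  private String lockNodeName; // ephemeral child node under path

  SketchLock(FakeZk zk, String path, String lockNodeName) {
    this.zk = zk;
    this.path = path;
    this.lockNodeName = lockNodeName;
    zk.create(path);
    zk.create(path + "/" + lockNodeName);
  }

  // removeServerPath == false: delete only the lock node (the behavior quoted above).
  // removeServerPath == true:  delete the whole server path, lock node included.
  synchronized void unlock(boolean removeServerPath) {
    if (lockNodeName == null) {
      throw new IllegalStateException();
    }
    String localLock = lockNodeName;
    lockNodeName = null;
    String pathToDelete = removeServerPath ? path : path + "/" + localLock;
    zk.recursiveDelete(pathToDelete);
  }
}
```

In this model, calling unlock(false) leaves the server path behind without a lock node, which is the state that would be reported as a dead server, while unlock(true) leaves nothing for the Manager to find.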

@dlmarion
Contributor

dlmarion commented Jan 6, 2025

I created #5226 to suppress the display of dead servers for configured resource groups in the Monitor. I think this is the simplest approach at this point. My earlier comments about removing the lock in ZooKeeper may not fully resolve the issue you raise.

@ddanielr
Contributor Author

ddanielr commented Jan 8, 2025

I created #5226 to suppress the display of dead servers for configured resource groups in the Monitor. I think this is the simplest approach at this point. My earlier comments about removing the lock in ZooKeeper may not fully resolve the issue you raise.

Agreed. I think #5226 is a good solution for this issue: it's not tied to the cluster.yaml definition, and it also covers the situation where the user might want to shut down tservers faster than the "graceful shutdown" process allows, which would end up generating dead tserver locks.

Projects
Status: Done