Check free disk space on all nodes instead of just the downloading one #4184

Michal-Leszczynski · 2024-12-20T09:14:34Z

As per this comment about the test that makes one node run out of disk space and expects the restore to fail with a specific error message:

I'm not sure what's the point of asserting the error message at all.
Right now it is really vague and the real problem can be identified by looking at the Scylla logs or metrics.
Also, if we change the error message slightly, it will keep causing pain in future test runs.
I'm just pondering on the nature of this test and SM implementation.
Before downloading each batch, SM checks that the node downloading the batch has at least 10% free disk space, but it does not check the other nodes, because we don't know where the data will end up. That's why we no longer get the "not enough disk space" error: a node with enough disk space downloaded the batch, but it was unable to stream it to the node without free disk space.
Perhaps that's the real issue with SM: we should validate not only that the node doing the download has enough disk space, but that this is also the case for all other nodes in the cluster. Maybe 10% is a little too strict for such a check (given the current effort to reach 90% disk utilization), but something like 5% might be just fine.

It doesn't really make sense that we check free disk space only on the node that does the download; we should check all the other nodes as well. The checks for the other nodes could be less strict, but they are still required, because every node needs some free disk space in order to accept the data that the downloading node streams to it.
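A minimal sketch of what such a cluster-wide check could look like, assuming a 10% threshold for the downloading node and the 5% threshold floated above for the rest. The `NodeDisk` type, `CheckClusterDiskSpace` function, and constant names are hypothetical and not taken from the Scylla Manager code base; how disk usage is actually collected from each node is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeDisk is a hypothetical snapshot of a node's data directory usage.
type NodeDisk struct {
	Host       string
	TotalBytes uint64
	FreeBytes  uint64
}

// FreePercent returns free disk space as a percentage of total capacity.
func (n NodeDisk) FreePercent() float64 {
	if n.TotalBytes == 0 {
		return 0
	}
	return float64(n.FreeBytes) / float64(n.TotalBytes) * 100
}

const (
	// Threshold described in the issue for the node that downloads a batch.
	downloaderMinFreePercent = 10.0
	// Assumed, less strict threshold for nodes that only receive streamed data.
	peerMinFreePercent = 5.0
)

// CheckClusterDiskSpace validates every node before a batch is downloaded.
// It returns nil when all nodes pass, or one joined error that lists each
// node below its threshold.
func CheckClusterDiskSpace(downloader string, nodes []NodeDisk) error {
	var errs []error
	for _, n := range nodes {
		threshold := peerMinFreePercent
		if n.Host == downloader {
			threshold = downloaderMinFreePercent
		}
		if free := n.FreePercent(); free < threshold {
			errs = append(errs, fmt.Errorf(
				"node %s: %.1f%% free disk space, need at least %.1f%%",
				n.Host, free, threshold))
		}
	}
	return errors.Join(errs...)
}

func main() {
	nodes := []NodeDisk{
		{Host: "node-1", TotalBytes: 1000, FreeBytes: 200}, // downloader, 20% free
		{Host: "node-2", TotalBytes: 1000, FreeBytes: 30},  // peer, 3% free
	}
	if err := CheckClusterDiskSpace("node-1", nodes); err != nil {
		fmt.Println("skipping batch download:", err)
	}
}
```

Reporting every failing node in one error, rather than stopping at the first, would make it clear from the Manager side which node is out of space without digging through Scylla logs.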

The error message Manager returns for the enospc scenario has been changed to a more generic one (#1), so it doesn't make much sense to verify it. Moreover, there is a plan to fix the free disk space check behaviour, and the whole test will probably need to be reworked (#2).
refs:
#1 - scylladb/scylla-manager#4087
#2 - scylladb/scylla-manager#4184

Michal-Leszczynski added the restore label Dec 20, 2024

Michal-Leszczynski mentioned this issue Dec 20, 2024

fix(manager): update expected error message for enospc restore test scylladb/scylla-cluster-tests#9590

Open

3 tasks
