After a successful schema and data restoration *to a different region*, the restored keyspace is completely empty #3525
Comments
@ShlomiBalalis - where can I find the manager log, so we can see what was restored?
Out of curiosity, why is it using LCS?
The server log is in the monitor tarball; the agent logs are on the DB nodes.
This is simply part of the longevity scenario, but this is not the problematic keyspace anyway.
So the logs show that some data has actually been downloaded and loaded into the cluster. Right now I'm checking whether it's a restore or a repair problem (the tested version of SM does not contain the repair refactor, so this is not connected to those changes).
Leading theory: I tried to restore the keyspace manually the old-fashioned way, downloading the sstables and refreshing, but we noticed something funny:
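For context, a minimal sketch of that manual path; the table name, bucket path, and table UUID below are placeholders, not the actual values used:

```sh
KS=5gb_sizetiered_2022_1
TABLE=<table>   # placeholder table name
# 1. Download the snapshot's sstables into the table's upload directory
aws s3 cp --recursive \
  "s3://<backup-bucket>/<snapshot-path>/" \
  "/var/lib/scylla/data/$KS/$TABLE-<table-uuid>/upload/"
# 2. Load the uploaded sstables into the node
nodetool refresh "$KS" "$TABLE"
```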
Then, I tried to change the replication factor of the keyspace, and noticed that while the region of the cluster under test is
The keyspace was set to replicate in us-east, which is probably the region of the originally backed up cluster:
Once I altered the keyspace's replication region, I was able to query it just fine:
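A sketch of such an ALTER; the target DC name ('eu-west', based on the cluster's region) and the replication factor are assumptions:

```sh
cqlsh -e "ALTER KEYSPACE \"5gb_sizetiered_2022_1\"
          WITH replication = {'class': 'NetworkTopologyStrategy', 'eu-west': 3};"
```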
So, the difference in regions is probably the cause of the failures.
So restore only works in the same region? And is there a procedure to restore to a different region?
Restoring tables has a requirement of having a schema identical to the one in the backup. The DCs are also a part of the keyspace schema, so the fact that restore does not work when you try to restore data into an empty DC seems logical. The strange part here is that load&stream does not complain when it has to upload sstables to nodes from an empty DC (we can add manual checks for that in SM). I would suspect that in this scenario the uploaded sstables should be lost, as they don't belong to any node in the cluster, but maybe L&S still stores them somewhere, even though it's impossible to query the data because of the "unavailable replicas" error. In your example you said that you used
But at least we know that this issue is not a regression and that IMHO restore works as described in the docs.
Nope. A simple
In my case, I first loaded the data with refresh and only then altered the keyspace, and everything seemed fine afterwards (of course, it was only a preliminary check that the table contains data at all) |
That's strange, because the nodetool refresh docs say:
So I would expect that it worked only partially / it's not reliable to use it in this way. So the approach with:
seems more promising. @asias, do you think that this approach is safe and should work? Context:
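A sketch of that approach end to end, assuming sctool 3.x flag spellings; the cluster, location, snapshot tag, and target DC values are placeholders:

```sh
# 1. Restore the schema from the backup
sctool restore -c <cluster> -L <backup-location> -T <snapshot-tag> --restore-schema
# (rolling restart of the nodes here, as done after the schema restore in this test)
# 2. Point the keyspace's replication at the target DC
cqlsh -e "ALTER KEYSPACE \"5gb_sizetiered_2022_1\"
          WITH replication = {'class': 'NetworkTopologyStrategy', '<target-dc>': 3};"
# 3. Restore the data into the now correctly replicated keyspace
sctool restore -c <cluster> -L <backup-location> -T <snapshot-tag> --restore-tables
```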
Yeah, regardless of the fact that it worked (and I agree, it's strange it worked at all), this is probably the correct course of action as far as I can tell.
My local experiments confirm that the approach:
works fine, but they are just experiments and not proofs of reliability.
@ShlomiBalalis ping
@Mark-Gurevich can you please take over this? @Michal-Leszczynski mind taking ownership of this issue?
IIUC we need to add to the
@mikliapko is this something that you could take care of? I mean validating that the procedure described in #3525 (comment) works fine with some proper test. When it's validated, we can add it to the SM docs.
Yep, as it's still happening, I will take a look at it.
@mikliapko it's happening in a test that disables raft topology; is the schema restore dependent on raft topology?
Packages
Scylla version:
Installation details
Cluster size: 5 nodes (i4i.8xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
Starting from SM 3.3 and Scylla 6.0, SM restores schema by applying the output of
So right now this is a documented limitation, but we should make it possible to restore the schema into a different DC setting, or make it easier for the user to modify just the DC part of the keyspace schema.
Created an issue for that: #4049. |
Closing, as this behavior was expected and will be fixed as part of #4049.
Issue description
At 2023-08-14 13:27:09,663, we started two restore tasks that use a pre-created snapshot that includes the keyspace 5gb_sizetiered_2022_1.
First, a task to restore the schema:
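A representative sketch of such a task; the cluster ID, backup location, and snapshot tag are placeholders, not the actual values used:

```sh
sctool restore -c <cluster-id> -L s3:<backup-bucket> -T <snapshot-tag> --restore-schema
```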
The restore task has ended successfully:
At which point, we restarted all of the nodes' Scylla services in the cluster, one by one:
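A sketch of the rolling restart; the host names are placeholders:

```sh
for host in node1 node2 node3 node4; do
  ssh "$host" sudo systemctl restart scylla-server
  # wait until the node is back up (UN in `nodetool status`) before continuing
done
```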
Afterwards, we restored the data:
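Again a sketch, with the same placeholder location and tag as the schema restore above:

```sh
sctool restore -c <cluster-id> -L s3:<backup-bucket> -T <snapshot-tag> --restore-tables
```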
Which also passed:
Afterwards, we also created a general repair task (since this code was not yet adjusted to the automatic repair):
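Sketch; the cluster ID is a placeholder:

```sh
sctool repair -c <cluster-id>
```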
Which passed:
Then, we executed a cassandra-stress job to validate the data, which was DOA:
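A sketch of the validation run; the node IP, operation count, and consistency level are placeholders, not the actual stress profile:

```sh
cassandra-stress read n=1000000 cl=QUORUM \
  -node <node-ip> -schema keyspace=5gb_sizetiered_2022_1
```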
Looking into the data folders on the machines as well, it seems that they are completely empty:
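A sketch of that check, assuming the default Scylla data directory layout and placeholder host names:

```sh
for host in node1 node2 node3 node4; do
  ssh "$host" 'du -sh /var/lib/scylla/data/5gb_sizetiered_2022_1/*'
done
```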
Installation details
Kernel Version: 5.15.0-1040-aws
Scylla version (or git commit hash): 2022.2.12-20230727.f4448d5b0265 with build-id a87bfeb65d24abf65d074a3ba2e5b9664692d716
Cluster size: 4 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0624755b4db06e567 (aws: eu-west-1)
Test: longevity-200gb-48h-test_restore-nemesis
Test id: 84dfb4de-0573-4a01-8806-8b832bcafd91
Test name: scylla-staging/Shlomo/longevity-200gb-48h-test_restore-nemesis
Test config file(s):
Logs and commands
$ hydra investigate show-monitor 84dfb4de-0573-4a01-8806-8b832bcafd91
$ hydra investigate show-logs 84dfb4de-0573-4a01-8806-8b832bcafd91
Logs:
Jenkins job URL
Argus