After a successful schema and data restoration *to a different region*, the restored keyspace is completely empty #3525
Comments
@ShlomiBalalis - where can I find the manager log, so we can see what was restored?
Out of curiosity, why is it using LCS?
The server log is in the monitor tarball; the agent logs are on the DB nodes.
This is simply part of the longevity scenario, but this is not the problematic keyspace anyway.
So the logs show that some data has actually been downloaded and loaded into the cluster. Right now I'm checking whether it's a restore or a repair problem (the tested version of SM does not contain the repair refactor, so this is not connected to those changes).
Leading theory: I tried to restore the keyspace manually the old-fashioned way, downloading the sstables and refreshing, but we noticed something funny:
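For context, a minimal sketch of that manual path; the table name, bucket path, and table UUID below are placeholders, not the actual values used:

```sh
KS=5gb_sizetiered_2022_1
TABLE=<table>   # placeholder table name
# 1. Download the snapshot's sstables into the table's upload directory
aws s3 cp --recursive \
  "s3://<backup-bucket>/<snapshot-path>/" \
  "/var/lib/scylla/data/$KS/$TABLE-<table-uuid>/upload/"
# 2. Load the uploaded sstables into the node
nodetool refresh "$KS" "$TABLE"
```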
Then, I tried to change the replication factor of the keyspace, and noticed that while the region of the cluster under test is
The keyspace was set to replicate in us-east, which is probably the region of the originally backed up cluster:
Once I altered the keyspace's replication region, I was able to query it just fine:
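A sketch of such an ALTER; the target DC name ('eu-west', based on the cluster's region) and the replication factor are assumptions:

```sh
cqlsh -e "ALTER KEYSPACE \"5gb_sizetiered_2022_1\"
          WITH replication = {'class': 'NetworkTopologyStrategy', 'eu-west': 3};"
```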
So, the difference in regions is probably the cause of the failures.
So restore only works in the same region? And is there a procedure to restore to a different region?
Restoring tables has a requirement of having a schema identical to the one in the backup. The DCs are also a part of the keyspace schema, so the fact that restore does not work when you try to restore data into an empty DC seems logical. The strange part here is that load&stream does not complain when it has to upload sstables to nodes from an empty DC (we can add manual checks for that in SM). I would suspect that in this scenario the uploaded sstables should be lost, as they don't belong to any node in the cluster, but maybe L&S still stores them somewhere, even though it's impossible to query the data because of the "unavailable replicas" error. In your example you said that you used
But at least we know that this issue is not a regression and that IMHO restore works as described in the docs.
Nope. A simple
In my case, I first loaded the data with refresh and only then altered the keyspace, and everything seemed fine afterwards (of course, it was only a preliminary check that the table contains data at all) |
That's strange, because the nodetool refresh docs say:
So I would expect that it worked only partially / it's not reliable to use it in this way. So the approach with:
seems more promising. @asias, do you think that this approach is safe and should work? Context:
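A sketch of that approach end to end, assuming sctool 3.x flag spellings; the cluster, location, snapshot tag, and target DC values are placeholders:

```sh
# 1. Restore the schema from the backup
sctool restore -c <cluster> -L <backup-location> -T <snapshot-tag> --restore-schema
# (rolling restart of the nodes here, as done after the schema restore in this test)
# 2. Point the keyspace's replication at the target DC
cqlsh -e "ALTER KEYSPACE \"5gb_sizetiered_2022_1\"
          WITH replication = {'class': 'NetworkTopologyStrategy', '<target-dc>': 3};"
# 3. Restore the data into the now correctly replicated keyspace
sctool restore -c <cluster> -L <backup-location> -T <snapshot-tag> --restore-tables
```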
Yeah, regardless of the fact that it worked (and I agree, it's strange it worked at all), this is probably the correct course of action as far as I can tell.
My local experiments confirm that the approach:
works fine, but they are just experiments and not proofs of reliability.
@ShlomiBalalis ping
@Mark-Gurevich can you please take over this? @Michal-Leszczynski mind taking ownership of this issue?
IIUC we need to add to the
@mikliapko is this something that you could take care of? I mean validating that the procedure described in #3525 (comment) works fine with some proper test. When it's validated, we can add it to the SM docs.
Yep, as it's still happening, I will take a look at it.
@mikliapko it's happening in a test that disables raft topology; is the schema restore dependent on raft topology?
Packages
Scylla version:
Installation details
Cluster size: 5 nodes (i4i.8xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
Starting from SM 3.3 and Scylla 6.0, SM restores schema by applying the output of
So right now this is a documented limitation, but we should make it possible to restore the schema into a different DC setting, or make it easier for the user to modify just the DC part of the keyspace schema.
Created an issue for that: #4049. |
Closing, as this behavior was expected and will be fixed as part of #4049.
Issue description
At 2023-08-14 13:27:09,663, we started two restore tasks that use a pre-created snapshot that includes the keyspace 5gb_sizetiered_2022_1.
First, a task to restore the schema:
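A representative sketch of such a task; the cluster ID, backup location, and snapshot tag are placeholders, not the actual values used:

```sh
sctool restore -c <cluster-id> -L s3:<backup-bucket> -T <snapshot-tag> --restore-schema
```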
The restore task has ended successfully:
At which point, we restarted all of the nodes' Scylla services in the cluster, one by one:
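A sketch of the rolling restart; the host names are placeholders:

```sh
for host in node1 node2 node3 node4; do
  ssh "$host" sudo systemctl restart scylla-server
  # wait until the node is back up (UN in `nodetool status`) before continuing
done
```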
Afterwards, we restored the data:
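Again a sketch, with the same placeholder location and tag as the schema restore above:

```sh
sctool restore -c <cluster-id> -L s3:<backup-bucket> -T <snapshot-tag> --restore-tables
```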
Which also passed:
Afterwards, we also created a general repair task (since this code was not yet adjusted to the automatic repair):
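Sketch; the cluster ID is a placeholder:

```sh
sctool repair -c <cluster-id>
```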
Which passed:
Then, we executed a cassandra-stress job to validate the data, which was DOA:
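A sketch of the validation run; the node IP, operation count, and consistency level are placeholders, not the actual stress profile:

```sh
cassandra-stress read n=1000000 cl=QUORUM \
  -node <node-ip> -schema keyspace=5gb_sizetiered_2022_1
```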
Looking into the data folders on the machines as well, it seems that they are completely empty:
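A sketch of that check, assuming the default Scylla data directory layout and placeholder host names:

```sh
for host in node1 node2 node3 node4; do
  ssh "$host" 'du -sh /var/lib/scylla/data/5gb_sizetiered_2022_1/*'
done
```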
Installation details
Kernel Version: 5.15.0-1040-aws
Scylla version (or git commit hash): 2022.2.12-20230727.f4448d5b0265 with build-id a87bfeb65d24abf65d074a3ba2e5b9664692d716
Cluster size: 4 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0624755b4db06e567 (aws: eu-west-1)
Test: longevity-200gb-48h-test_restore-nemesis
Test id: 84dfb4de-0573-4a01-8806-8b832bcafd91
Test name: scylla-staging/Shlomo/longevity-200gb-48h-test_restore-nemesis
Test config file(s):
Logs and commands
$ hydra investigate show-monitor 84dfb4de-0573-4a01-8806-8b832bcafd91
$ hydra investigate show-logs 84dfb4de-0573-4a01-8806-8b832bcafd91
Logs:
Jenkins job URL
Argus