
Q3 Disaster Recovery Exercise #211

Open · 1 of 2 tasks
kiwixbot opened this issue Jul 1, 2024 · 10 comments

@kiwixbot

kiwixbot commented Jul 1, 2024

In order to strengthen our infrastructure, we want to:

  • Document our Disaster Recovery Plan
  • Simulate a Disaster and recover the service to ensure documentation exists and is up to date

This ticket requests the assignees to:

  • Pick a Service from Kiwix/openZIM/Offspot
  • Redeploy it completely from its backup, using the existing documentation or writing it where missing.

See Zimfarm documentation.

Note: this is an automatic reminder intended for the assignee(s).

@rgaudin
Member

rgaudin commented Aug 6, 2024

We've decided to go with WP1. We'll probably need to coordinate with @audiodude as only the backup is on our infra AFAIK; the rest runs on Wikimedia Cloud.

@audiodude

Happy to help in whatever way necessary.

Note that when I migrated from toolforge DB to Trove DB, restoring the database took about 20 hours.

@benoit74
Collaborator

OK, now that the Q3 deadline is only a few days away, it is time to move on this issue 🤣

Do you have any views on how we should run this disaster recovery exercise? I don't have a strong one tbh, plenty of ideas as usual but no solid PoV. Alternatives / questions I have:

  • should / can we ask for a temporary Trove DB for the test?
  • should / can we ask for a new cloud machine on Wikimedia Cloud to also confirm we know how to reinstall the software / middleware properly and check we do not rely on local files / old or specific middleware?
  • should we do it in a totally isolated new cloud machine on any cloud provider, including the DB?

The fact that the DB transfer took hours is not a concern from my PoV, but it is important information to take into account. If I'm not mistaken, it was even worse when you tried to export / re-import (i.e. not pushing data "on-the-fly"). Maybe we should first try to restore a backup to a random empty database, just to get an idea of how long it will take (something like the sketch below).
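
Something along these lines would already give a ballpark figure (assuming the borg archive contains a plain SQL dump; the host, user and wp1_restore_test database are placeholders, not actual names):

~ mysql -h <db-host> -u <db-user> -p -e "CREATE DATABASE wp1_restore_test;"   # throwaway empty database, placeholder names
~ time mysql -h <db-host> -u <db-user> -p wp1_restore_test < wp1db_dump.sql   # rough measure of the import duration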

On this point, the first concern I have is that I am unable to list the borgbase backups with the usual procedure:

~ docker run -v $PWD/data/restore:/restore:rw -e BW_CLIENTID=$BW_CLIENTID -e BW_CLIENTSECRET=$BW_CLIENTSECRET -e BW_PASSWORD=$BW_PASSWORD ghcr.io/kiwix/borg-backup restore --name wp1db --list
Could not find dir, "/root/.config/Bitwarden CLI"; creating it instead.
Could not find data file, "/root/.config/Bitwarden CLI/data.json"; creating it instead.
You are not logged in.You are logged in!

To unlock your vault, use the `unlock` command. ex:
$ bw unlockLogged-in ; unlocking vault
SSH key retrieval success
You have logged out.# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
Repo exists with name wp1db
Use repo path : j18bm1f3@box-eu29.borgbase.com:repo


List avaible archives ...
Remote: Warning: Permanently added the ECDSA host key for IP address '94.130.217.50' to the list of known hosts.
Failed to create/acquire the lock /srv/repos/j18bm1f3/repo/lock.exclusive (timeout).

Looking (quickly, tbh) at the borg docs, it looks like this happens when a backup is currently in progress. Does that make sense? Do we have any monitoring of this backup job and/or an idea of when it is not running?
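
If it turns out nothing is actually writing to the repo and the lock is just stale, borg itself has a break-lock command we could run manually (a sketch only; it bypasses our container, so it assumes we fetch the SSH key / passphrase from the vault ourselves, and it must not be used while a backup is in progress):

~ borg break-lock j18bm1f3@box-eu29.borgbase.com:repo   # only if we are sure no backup is running
~ borg list j18bm1f3@box-eu29.borgbase.com:repo         # retry listing the archives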

@audiodude

> should / can we ask for a temporary Trove DB for the test?

That might be possible, we can certainly ask.

> should / can we ask for a new cloud machine on Wikimedia Cloud to also confirm we know how to reinstall the software / middleware properly and check we do not rely on local files / old or specific middleware?

Same thing, we can ask. As far as I understand, WP1 only needs a Docker host, a config file with DB credentials, and the backup config file. However, I don't think we currently back up those configs; maybe that's worth pursuing?

> should we do it in a totally isolated new cloud machine on any cloud provider, including the DB?

I think it would make more sense to do it on WMF servers if possible, since that's where we would be restoring in a live scenario.

> The fact that the DB transfer took hours is not a concern [...] Maybe we should first try to restore a backup to a random empty database, just to get an idea of how long it will take.

I also used a specific config on Trove to speed things up, which is saved, but which I don't think I could reproduce from scratch. The config was switched once the db was live.
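
For illustration only (this is not the saved config, just the kind of live-switchable settings involved, and it assumes the Trove user has the required privileges; otherwise it would go through a Trove configuration group):

mysql -h <trove-host> -u <admin-user> -p -e "SET GLOBAL innodb_flush_log_at_trx_commit = 2;"   # relax durability during bulk import (illustrative)
mysql -h <trove-host> -u <admin-user> -p -e "SET GLOBAL max_allowed_packet = 1073741824;"      # allow large statements from the dump (illustrative)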

> On this point, the first concern I have is that I am unable to list the borgbase backups with the usual procedure: [same docker run … restore --name wp1db --list command and lock timeout as above]
>
> Looking (quickly, tbh) at the borg docs, it looks like this happens when a backup is currently in progress. Does that make sense? Do we have any monitoring of this backup job and/or an idea of when it is not running?

I get the same result when running that command, though I doubt the backup is running right now.

@benoit74
Collaborator

@rgaudin shall we proceed with requesting a "temporary" cloud instance and Trove DB on Wikimedia Cloud, similar to the existing ones, for the test, or do you have other views?

@rgaudin
Member

rgaudin commented Sep 27, 2024

There's a magic formula to spell first

❯ docker run -v $PWD/data/restore:/restore:rw -e BW_CLIENTID=$BW_CLIENTID -e BW_CLIENTSECRET=$BW_CLIENTSECRET -e BW_PASSWORD=$BW_PASSWORD ghcr.io/kiwix/borg-backup restore --name wp1db --list
Could not find dir, "/root/.config/Bitwarden CLI"; creating it instead.
Could not find data file, "/root/.config/Bitwarden CLI/data.json"; creating it instead.
You are not logged in.You are logged in!

To unlock your vault, use the `unlock` command. ex:
$ bw unlockLogged-in ; unlocking vault
SSH key retrieval success
You have logged out.# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
Repo exists with name wp1db
Use repo path : j18bm1f3@box-eu29.borgbase.com:repo


List avaible archives ...
Remote: Warning: Permanently added the ECDSA host key for IP address '94.130.217.50' to the list of known hosts.
Warning: Attempting to access a previously unknown unencrypted repository!
Do you want to continue? [yN] yes (from BORG_UNKNOWN_UNENCRYPTED_REPO_ACCESS_IS_OK)
wp1db__backup__2022-12-31T04:00:51   Sat, 2022-12-31 04:00:53 [6bf09f64fa4fd04215bb07f47e2bea7a217e83ce664b09bf6f5af4c35bdf5db8]
wp1db__backup__2023-09-30T04:00:05   Sat, 2023-09-30 04:00:07 [96629ca95856f311b80619ffeecc17a2bbdf2ea730221e07092763741d402e1d]
wp1db__backup__2023-10-31T04:12:10   Tue, 2023-10-31 04:12:13 [0ef5c9bfee8d5694136d32895a21965a1d79471a2ab2b0552ff9617a8b692579]
wp1db__backup__2023-11-30T04:04:01   Thu, 2023-11-30 04:04:03 [b7f6962236a5755a27b838f6848d24251ddd38238e3c513a02881defe8aa581c]
wp1db__backup__2023-12-31T04:04:44   Sun, 2023-12-31 04:04:46 [8d15e174014926902fd876fc819d42bb586e7eba55884fc2fd2c8579998e14dd]
wp1db__backup__2024-01-31T04:03:19   Wed, 2024-01-31 04:03:22 [8d62d768980cb035c7adf2ce33dc2764a7936c7997b071f314844d19f5bda95d]
wp1db__backup__2024-02-29T04:02:32   Thu, 2024-02-29 04:02:34 [84a9b277004bd01cf4aa6a53cd042bd75e2e7039c56b34b0c2e0fc7c3fdb725b]
wp1db__backup__2024-03-31T04:01:32   Sun, 2024-03-31 04:01:34 [dcbbde096d5197347e2e46364cd7780532c49141fbf1ce1bf9a32e80574bb5b5]
wp1db__backup__2024-04-30T04:01:11   Tue, 2024-04-30 04:01:14 [37f9ff42a244eb80aebee3ada9e28c98baae24ee5d0696f4aec3160a073db393]
wp1db__backup__2024-05-31T04:00:16   Fri, 2024-05-31 04:00:18 [7452d4dcb9fc0c13e15603f8ff3d4473010e89853b2b7bc65ebb63acd8c887c8]
wp1db__backup__2024-06-30T04:00:14   Sun, 2024-06-30 04:00:16 [8cbfb528818eab790864c3146309ff31610bf5d271b93224cdeafc455dd018fe]
wp1db__backup__2024-07-31T04:00:15   Wed, 2024-07-31 04:00:18 [674522e6068a25058b45f93276f94739b8e45e3de811b6f4d1b62297b588b713]
wp1db__backup__2024-08-18T04:00:15   Sun, 2024-08-18 04:00:17 [545f1b98993760a575e5c945485a629d2ec210f4dd2b8802d36386e80be68459]
wp1db__backup__2024-08-25T04:03:45   Sun, 2024-08-25 04:03:50 [08a7a3251ee48b0213dcda918879c91181e4ccff1e5badf3a11b8f57c9cd6e68]
wp1db__backup__2024-08-31T04:02:36   Sat, 2024-08-31 04:02:38 [50ebfaea3a3b200f100eaf9c5e907d9901208b85c031b1a66cc986ac5b05c809]
wp1db__backup__2024-09-01T04:02:33   Sun, 2024-09-01 04:02:35 [5bc83b55763da5a59752667f117fd722a083af17d025aa2b1aef5277c538f991]
wp1db__backup__2024-09-08T04:03:08   Sun, 2024-09-08 04:03:11 [ca7f72810ebb27ca7ae9cb4cd42de4b56c7565506d4d9327de5af9f019ad15fa]
wp1db__backup__2024-09-15T04:02:31   Sun, 2024-09-15 04:02:33 [cdafb3033230bd7faf889bbc62cc7490d7954c8ee4f3160f7503e5753ff11d54]
wp1db__backup__2024-09-19T04:03:47   Thu, 2024-09-19 04:03:50 [5e75b291d30e3e72f40156a709bee9466c295cdf1189462da03307d1303f73c2]
wp1db__backup__2024-09-21T04:02:48   Sat, 2024-09-21 04:02:50 [e975e5474e74bca87a4a84a3cd41086cfc069bfd7c0f4495180ee3128c98a1cd]
wp1db__backup__2024-09-22T04:02:49   Sun, 2024-09-22 04:02:51 [a42c49081042b9fead7a804c391a8caefd4273cb26154164b3742e270294982a]
wp1db__backup__2024-09-23T04:04:21   Mon, 2024-09-23 04:04:23 [8b58553fecd32ce127fc48688a90de61b093bd0835ddd5b1772cd6caffbcebb2]
wp1db__backup__2024-09-24T04:00:15   Tue, 2024-09-24 04:00:18 [082a91301b5ea3726fbeb7422db96a7263b58a4e4a223c66d7462ec3d374ce30]
wp1db__backup__2024-09-25T04:05:21   Wed, 2024-09-25 04:05:23 [558356f0e60dc4cc6b15c1e797d20657c7536b216b72205b19cb8645042d36e4]
wp1db__backup__2024-09-26T04:02:42   Thu, 2024-09-26 04:02:44 [cbe210407a4f42f4be96e4d20f9d0c82af30b67938beafba2638ab5f1c1ee024]
wp1db__backup__2024-09-27T04:02:29   Fri, 2024-09-27 04:02:32 [dfc000dfd6800d57740252e336b5d2da47eaaab8c5eac1722c3229bb5a61f3b5]

> I get the same result when running that command, though I doubt the backup is running right now.

That's the thing. We're using the long-lasting borg-backup container in Docker, which handles the sleeping and all. I stopped the container to test, so that might have been the reason.
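
For next time, a quick sanity check before assuming the lock is stale (the container name borg-backup is an assumption, use whatever it is called on the host):

❯ docker ps --filter name=borg-backup    # is the backup container running at all?
❯ docker logs --tail 20 borg-backup      # is it sleeping or in the middle of a run?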

> @rgaudin shall we proceed with requesting a "temporary" cloud instance and Trove DB on Wikimedia Cloud, similar to the existing ones, for the test, or do you have other views?

Nope, we want this to be as close as possible to a real recovery, so that sounds right.

@benoit74
Collaborator

@audiodude can you request the Trove DB and cloud instance from Wikimedia (I assume it would be easier for you since you know how to do it, but I might be wrong) and link the Phabricator ticket here?

@audiodude

I was able to start a mwcurator2 instance without making any requests. However, it has fewer vCPUs and less RAM than the current instance; to be fair, the original mwcurator does all of the selection for download.kiwix.org, so it's quite possible we don't need all of those resources just to run the web server.

@audiodude

Requested more DB quota here: https://phabricator.wikimedia.org/T375977
