
Q3 Disaster Recovery Exercise #211

Open · 1 of 2 tasks
kiwixbot opened this issue Jul 1, 2024 · 10 comments

@kiwixbot

kiwixbot commented Jul 1, 2024

In order to strengthen our infrastructure, we want to:

  • Document our Disaster Recovery Plan
  • Simulate a Disaster and recover the service to ensure documentation exists and is up to date

This ticket requests the assignees to:

  • Pick a Service from Kiwix/openZIM/Offspot
  • Redeploy it completely from its backup, using the existing documentation or writing it where missing.

See Zimfarm documentation.

Note: this is an automatic reminder intended for the assignee(s).

@rgaudin
Member

rgaudin commented Aug 6, 2024

We've decided to go with WP1. We'll probably need to coordinate with @audiodude as only the backup is on our infra AFAIK; the rest runs on Wikimedia Cloud.

@audiodude

Happy to help in whatever way necessary.

Note that when I migrated from toolforge DB to Trove DB, restoring the database took about 20 hours.

@benoit74
Collaborator

OK, now that the Q3 deadline is only a few days away, it is time to move on this issue 🤣

Do you have any views on how we should run this disaster recovery exercise? I don't have a strong one tbh, plenty of ideas as usual but no solid PoV. Alternatives / questions I have:

  • should / can we ask for a temporary Trove DB for the test?
  • should / can we ask for a new cloud machine on Wikimedia Cloud to also confirm we know how to reinstall the software / middleware properly and check we do not rely on local files / old or specific middleware?
  • should we do it in a totally isolated new cloud machine on any cloud provider, including the DB?

The fact that the DB transfer took hours is not a concern from my PoV, but it is important information to take into account. If I'm not mistaken, it was even worse when you tried to export / re-import (i.e. not pushing data "on-the-fly"). Maybe we should first try to restore a backup to a random empty database, just to get an idea of how long it will take (something like the sketch below).
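
Something along these lines would already give a ballpark figure (assuming the borg archive contains a plain SQL dump; the host, user and wp1_restore_test database are placeholders, not actual names):

~ mysql -h <db-host> -u <db-user> -p -e "CREATE DATABASE wp1_restore_test;"   # throwaway empty database, placeholder names
~ time mysql -h <db-host> -u <db-user> -p wp1_restore_test < wp1db_dump.sql   # rough measure of the import duration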

On this point, the first concern I have is that I am unable to list the borgbase backups with the usual procedure:

~ docker run -v $PWD/data/restore:/restore:rw -e BW_CLIENTID=$BW_CLIENTID -e BW_CLIENTSECRET=$BW_CLIENTSECRET -e BW_PASSWORD=$BW_PASSWORD ghcr.io/kiwix/borg-backup restore --name wp1db --list
Could not find dir, "/root/.config/Bitwarden CLI"; creating it instead.
Could not find data file, "/root/.config/Bitwarden CLI/data.json"; creating it instead.
You are not logged in.You are logged in!

To unlock your vault, use the `unlock` command. ex:
$ bw unlockLogged-in ; unlocking vault
SSH key retrieval success
You have logged out.# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
Repo exists with name wp1db
Use repo path : j18bm1f3@box-eu29.borgbase.com:repo


List avaible archives ...
Remote: Warning: Permanently added the ECDSA host key for IP address '94.130.217.50' to the list of known hosts.
Failed to create/acquire the lock /srv/repos/j18bm1f3/repo/lock.exclusive (timeout).

Looking (quickly, tbh) at the borg docs, it looks like this happens when a backup is currently in progress. Does that make sense? Do we have any monitoring of this backup job and/or an idea of when it is not running?
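
If it turns out nothing is actually writing to the repo and the lock is just stale, borg itself has a break-lock command we could run manually (a sketch only; it bypasses our container, so it assumes we fetch the SSH key / passphrase from the vault ourselves, and it must not be used while a backup is in progress):

~ borg break-lock j18bm1f3@box-eu29.borgbase.com:repo   # only if we are sure no backup is running
~ borg list j18bm1f3@box-eu29.borgbase.com:repo         # retry listing the archives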

@audiodude

> should / can we ask for a temporary Trove DB for the test?

That might be possible, we can certainly ask.

> should / can we ask for a new cloud machine on Wikimedia Cloud to also confirm we know how to reinstall the software / middleware properly and check we do not rely on local files / old or specific middleware?

Same thing, we can ask. As far as I understand, WP1 only needs a Docker host, a config file with DB credentials, and the backup config file. However, I don't think we currently back up those configs; maybe that's worth pursuing?

> should we do it in a totally isolated new cloud machine on any cloud provider, including the DB?

I think it would make more sense to do it on WMF servers if possible, since that's where we would be restoring in a live scenario.

> The fact that the DB transfer took hours is not a concern [...] Maybe we should first try to restore a backup to a random empty database, just to get an idea of how long it will take.

I also used a specific config on Trove to speed things up, which is saved, but which I don't think I could reproduce from scratch. The config was switched once the db was live.
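
For illustration only (this is not the saved config, just the kind of live-switchable settings involved, and it assumes the Trove user has the required privileges; otherwise it would go through a Trove configuration group):

mysql -h <trove-host> -u <admin-user> -p -e "SET GLOBAL innodb_flush_log_at_trx_commit = 2;"   # relax durability during bulk import (illustrative)
mysql -h <trove-host> -u <admin-user> -p -e "SET GLOBAL max_allowed_packet = 1073741824;"      # allow large statements from the dump (illustrative)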

> On this point, the first concern I have is that I am unable to list the borgbase backups with the usual procedure: [same docker run … restore --name wp1db --list command and lock timeout as above]
>
> Looking (quickly, tbh) at the borg docs, it looks like this happens when a backup is currently in progress. Does that make sense? Do we have any monitoring of this backup job and/or an idea of when it is not running?

I get the same result when running that command, though I doubt the backup is running right now.

@benoit74
Collaborator

@rgaudin shall we proceed with requesting a "temporary" cloud instance and Trove DB on Wikimedia Cloud, similar to the existing ones, for the test, or do you have other views?

@rgaudin
Member

rgaudin commented Sep 27, 2024

There's a magic formula to spell first

❯ docker run -v $PWD/data/restore:/restore:rw -e BW_CLIENTID=$BW_CLIENTID -e BW_CLIENTSECRET=$BW_CLIENTSECRET -e BW_PASSWORD=$BW_PASSWORD ghcr.io/kiwix/borg-backup restore --name wp1db --list
Could not find dir, "/root/.config/Bitwarden CLI"; creating it instead.
Could not find data file, "/root/.config/Bitwarden CLI/data.json"; creating it instead.
You are not logged in.You are logged in!

To unlock your vault, use the `unlock` command. ex:
$ bw unlockLogged-in ; unlocking vault
SSH key retrieval success
You have logged out.# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
# box-eu29.borgbase.com:22 SSH-2.0-OpenSSH_8.7
Repo exists with name wp1db
Use repo path : j18bm1f3@box-eu29.borgbase.com:repo


List avaible archives ...
Remote: Warning: Permanently added the ECDSA host key for IP address '94.130.217.50' to the list of known hosts.
Warning: Attempting to access a previously unknown unencrypted repository!
Do you want to continue? [yN] yes (from BORG_UNKNOWN_UNENCRYPTED_REPO_ACCESS_IS_OK)
wp1db__backup__2022-12-31T04:00:51   Sat, 2022-12-31 04:00:53 [6bf09f64fa4fd04215bb07f47e2bea7a217e83ce664b09bf6f5af4c35bdf5db8]
wp1db__backup__2023-09-30T04:00:05   Sat, 2023-09-30 04:00:07 [96629ca95856f311b80619ffeecc17a2bbdf2ea730221e07092763741d402e1d]
wp1db__backup__2023-10-31T04:12:10   Tue, 2023-10-31 04:12:13 [0ef5c9bfee8d5694136d32895a21965a1d79471a2ab2b0552ff9617a8b692579]
wp1db__backup__2023-11-30T04:04:01   Thu, 2023-11-30 04:04:03 [b7f6962236a5755a27b838f6848d24251ddd38238e3c513a02881defe8aa581c]
wp1db__backup__2023-12-31T04:04:44   Sun, 2023-12-31 04:04:46 [8d15e174014926902fd876fc819d42bb586e7eba55884fc2fd2c8579998e14dd]
wp1db__backup__2024-01-31T04:03:19   Wed, 2024-01-31 04:03:22 [8d62d768980cb035c7adf2ce33dc2764a7936c7997b071f314844d19f5bda95d]
wp1db__backup__2024-02-29T04:02:32   Thu, 2024-02-29 04:02:34 [84a9b277004bd01cf4aa6a53cd042bd75e2e7039c56b34b0c2e0fc7c3fdb725b]
wp1db__backup__2024-03-31T04:01:32   Sun, 2024-03-31 04:01:34 [dcbbde096d5197347e2e46364cd7780532c49141fbf1ce1bf9a32e80574bb5b5]
wp1db__backup__2024-04-30T04:01:11   Tue, 2024-04-30 04:01:14 [37f9ff42a244eb80aebee3ada9e28c98baae24ee5d0696f4aec3160a073db393]
wp1db__backup__2024-05-31T04:00:16   Fri, 2024-05-31 04:00:18 [7452d4dcb9fc0c13e15603f8ff3d4473010e89853b2b7bc65ebb63acd8c887c8]
wp1db__backup__2024-06-30T04:00:14   Sun, 2024-06-30 04:00:16 [8cbfb528818eab790864c3146309ff31610bf5d271b93224cdeafc455dd018fe]
wp1db__backup__2024-07-31T04:00:15   Wed, 2024-07-31 04:00:18 [674522e6068a25058b45f93276f94739b8e45e3de811b6f4d1b62297b588b713]
wp1db__backup__2024-08-18T04:00:15   Sun, 2024-08-18 04:00:17 [545f1b98993760a575e5c945485a629d2ec210f4dd2b8802d36386e80be68459]
wp1db__backup__2024-08-25T04:03:45   Sun, 2024-08-25 04:03:50 [08a7a3251ee48b0213dcda918879c91181e4ccff1e5badf3a11b8f57c9cd6e68]
wp1db__backup__2024-08-31T04:02:36   Sat, 2024-08-31 04:02:38 [50ebfaea3a3b200f100eaf9c5e907d9901208b85c031b1a66cc986ac5b05c809]
wp1db__backup__2024-09-01T04:02:33   Sun, 2024-09-01 04:02:35 [5bc83b55763da5a59752667f117fd722a083af17d025aa2b1aef5277c538f991]
wp1db__backup__2024-09-08T04:03:08   Sun, 2024-09-08 04:03:11 [ca7f72810ebb27ca7ae9cb4cd42de4b56c7565506d4d9327de5af9f019ad15fa]
wp1db__backup__2024-09-15T04:02:31   Sun, 2024-09-15 04:02:33 [cdafb3033230bd7faf889bbc62cc7490d7954c8ee4f3160f7503e5753ff11d54]
wp1db__backup__2024-09-19T04:03:47   Thu, 2024-09-19 04:03:50 [5e75b291d30e3e72f40156a709bee9466c295cdf1189462da03307d1303f73c2]
wp1db__backup__2024-09-21T04:02:48   Sat, 2024-09-21 04:02:50 [e975e5474e74bca87a4a84a3cd41086cfc069bfd7c0f4495180ee3128c98a1cd]
wp1db__backup__2024-09-22T04:02:49   Sun, 2024-09-22 04:02:51 [a42c49081042b9fead7a804c391a8caefd4273cb26154164b3742e270294982a]
wp1db__backup__2024-09-23T04:04:21   Mon, 2024-09-23 04:04:23 [8b58553fecd32ce127fc48688a90de61b093bd0835ddd5b1772cd6caffbcebb2]
wp1db__backup__2024-09-24T04:00:15   Tue, 2024-09-24 04:00:18 [082a91301b5ea3726fbeb7422db96a7263b58a4e4a223c66d7462ec3d374ce30]
wp1db__backup__2024-09-25T04:05:21   Wed, 2024-09-25 04:05:23 [558356f0e60dc4cc6b15c1e797d20657c7536b216b72205b19cb8645042d36e4]
wp1db__backup__2024-09-26T04:02:42   Thu, 2024-09-26 04:02:44 [cbe210407a4f42f4be96e4d20f9d0c82af30b67938beafba2638ab5f1c1ee024]
wp1db__backup__2024-09-27T04:02:29   Fri, 2024-09-27 04:02:32 [dfc000dfd6800d57740252e336b5d2da47eaaab8c5eac1722c3229bb5a61f3b5]

> I get the same result when running that command, though I doubt the backup is running right now.

That's the thing. We're using the long-lasting borg-backup container in Docker, which handles the sleeping and all. I stopped the container to test, so that might have been the reason.
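
For next time, a quick sanity check before assuming the lock is stale (the container name borg-backup is an assumption, use whatever it is called on the host):

❯ docker ps --filter name=borg-backup    # is the backup container running at all?
❯ docker logs --tail 20 borg-backup      # is it sleeping or in the middle of a run?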

> @rgaudin shall we proceed with requesting a "temporary" cloud instance and Trove DB on Wikimedia Cloud, similar to the existing ones, for the test, or do you have other views?

Nope, we want this to be as close as possible to a real recovery, so that sounds right.

@benoit74
Collaborator

@audiodude can you request the Trove DB and cloud instance from Wikimedia (I assume it would be easier for you since you know how to do it, but I might be wrong) and link the Phabricator ticket here?

@audiodude

I was able to start a mwcurator2 instance without making any requests. However, it has fewer vCPUs and less RAM than the current instance; to be fair, the original mwcurator does all of the selection for download.kiwix.org, so it's quite possible we don't need all of those resources just to run the web server.

@audiodude

Requested more DB quota here: https://phabricator.wikimedia.org/T375977
