Q3 Disaster Recovery Exercise #211
We've decided to go with WP1. We'll probably need to coordinate with @audiodude, as only the backup is on our infra AFAIK. The rest runs on Wikimedia Cloud.
Happy to help in whatever way necessary. Note that when I migrated from the Toolforge DB to the Trove DB, restoring the database took about 20 hours.
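Before the real exercise, the restore time could be estimated with a dry run into an empty scratch database. A minimal sketch, assuming a gzipped SQL dump inside the borg archive; the repo URL, dump path, host, and database names below are placeholders, not the project's actual values:

```shell
# Hedged sketch: time a restore of the latest backup into an empty scratch DB.
# REPO, paths, host and credentials are hypothetical placeholders.
REPO="ssh://xxxx@xxxx.repo.borgbase.com/./repo"
# Pick the most recent archive name.
ARCHIVE="$(borg list --last 1 --format '{archive}' "$REPO")"
# Extract only the SQL dump from that archive (path is an assumption).
borg extract "$REPO::$ARCHIVE" data/wp1.sql.gz
# Stream it into an empty database and measure wall-clock time.
time zcat data/wp1.sql.gz | mysql -h scratch-db.example -u wp1 -p wp1_scratch
```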
OK, now that the Q3 deadline is only a few days away, it is time to move on this issue 🤣 Do you have any views on how we should run this disaster recovery exercise? I don't have much tbh: plenty of ideas as usual but no solid PoV. Alternatives / questions I have:
The fact that the DB transfer took 20 hours is not a concern from my PoV, but it is important information to take into account. If I'm not mistaken, this was even worse when you tried to export / re-import (i.e. not pushing data "on-the-fly"). Maybe we should first try to restore a backup to a random empty database just to get an idea of how long it is going to take. On this point, the first concern I have is that I am not able to list the borgbase backups with the usual procedure:
Looking (quickly tbh) at the borg docs, it looks like this happens when a backup is currently in progress. Does that make sense? Do we have any monitoring of this backup job and/or an idea of when it is not running?
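For reference, borg refuses to operate on a repository whose lock is held by another process (or was left behind by one that died). A hedged sketch of checking and clearing that state; the repo URL is a placeholder, and `break-lock` should only be run when you are sure no backup is actually in progress:

```shell
# Hypothetical repo URL; replace with the real borgbase endpoint.
REPO="ssh://xxxx@xxxx.repo.borgbase.com/./repo"
# Listing fails (or waits) while another borg process holds the lock.
borg list "$REPO" || echo "repository locked or unreachable"
# If the lock is stale (no backup running), it can be cleared:
borg break-lock "$REPO"
```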
That might be possible, we can certainly ask.
Same thing, we can ask. As far as I understand, WP1 only needs a Docker host, a config file with DB credentials, and the backup config file. However, I don't think we currently back up those configs; maybe that's worth pursuing?
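If backing up those configs is worth pursuing, it could be as small as adding the files to an archive run. A sketch assuming borg is invoked directly; the repo URL, file paths, and archive name are hypothetical:

```shell
# Hypothetical repo URL and config paths; adjust to the real deployment.
REPO="ssh://xxxx@xxxx.repo.borgbase.com/./repo"
# Archive the DB-credentials config and the backup config alongside the data.
borg create --stats "$REPO::configs-{now}" \
  /srv/wp1/wp1.conf \
  /srv/wp1/backup.conf
```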
I think it would make more sense to do it on WMF servers if possible, since that's where we would be restoring in a live scenario.
I also used a specific config on Trove to speed things up, which is saved, but which I don't think I could reproduce from scratch. The config was switched once the DB was live.
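For context, a common shape for such a "speed up, then switch" config is to relax MySQL durability settings during the bulk import and restore the defaults before going live. These are illustrative settings only, not the saved Trove config, and the host/credentials are placeholders:

```shell
# Illustrative only: durability settings often relaxed during a bulk import.
# Revert both to their defaults before the database goes live.
mysql -h trove-db.example -u root -p <<'SQL'
SET GLOBAL innodb_flush_log_at_trx_commit = 0; -- flush redo log ~once/second, not per commit
SET GLOBAL sync_binlog = 0;                    -- let the OS decide when to flush the binlog
SQL
```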
I get the same result when running that command, though I doubt the backup is running right now.
@rgaudin shall we proceed with requesting a "temporary" cloud instance and a Trove DB similar to the existing ones from Wikimedia Cloud for the test, or do you have other views?
There's a magic formula to spell first
That's the thing. We're using the long-running borg-backup container in Docker, which handles the sleeping and all. I stopped the container to test, so that might have been the reason.
Nope, we want this to be as close as possible to a real recovery, so that sounds right.
@audiodude can you request the Trove DB and cloud instance from Wikimedia (I assume it would be easier for you because you know how to do it, but I might be wrong) and link the Phabricator ticket here?
I was able to start a
Requested more DB quota here: https://phabricator.wikimedia.org/T375977
I've also started a doc with steps: https://docs.google.com/document/d/1MCWgykLk9H9WxCcGNO7FlpW717uBAosa0-cvDXt1--4/edit?usp=sharing
In order to strengthen our infrastructure, we want to:
This ticket requests the assignees to:
See Zimfarm documentation.