
Deploy karm and purchase karm+1 #94

Closed
gravitystorm opened this issue Jul 20, 2016 · 24 comments

@gravitystorm
Collaborator

In light of the problems with Ramoth being unable to cope with load (see #91 and #92 ), and our long-running concerns with Ramoth's replication, I think it is urgent to move karm into production. I don't want to hijack the discussion on #78 so let's discuss it here.

My plan would be as follows:

  • Deploy karm immediately with the currently-most-likely-to-perform configuration, based on benchmarks carried out so far.
  • Purchase karm+1, as a similar/identical machine to karm.
  • Continue performance optimisations on karm+1.

Thoughts?

@Firefishy
Member

+1 sounds good to me.

@tomhughes
Member

So a quick summary of my plan to move forward:

  • Do a test restore from backup into 9.1 and see how long a pg_upgrade takes
  • If that is slow see if a test restore into 9.5 is faster

Assuming a pg_upgrade looks OK then:

  • Do a base backup from katla to 9.1 on karm (sketch below)
  • Do a master flip to make karm master
  • Move all load to karm
  • Take a (hopefully short) down time to pg_upgrade to 9.5
  • Upgrade katla and ramoth to 16.04
  • Either pg_upgrade katla and ramoth and resync, or do new base backups to them

If pg_upgrade doesn't look OK then we'll need to consider whether we want to take a long enough downtime to restore into 9.5 on karm, or just stick with 9.1 for now.
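
As a rough illustration of the base-backup step above (a sketch only - the replication user and target directory are assumptions, not the values actually used on karm):

# Stream a 9.1 base backup from katla into karm's 9.1 data directory.
# -x includes the WAL needed for a consistent copy; -c fast forces an immediate checkpoint.
sudo -u postgres pg_basebackup -h katla.bm.openstreetmap.org -U replication \
  -D /var/lib/postgresql/9.1/main -x -c fast -P -v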

@zerebubuth
Collaborator

The restore into 9.5 took about 20h.

Should we get karm up and running as a regular 9.1 replica and wait for "karm+1" so that we can benchmark on that machine?

I thought the purpose of this ticket was to replace ramoth with karm, rather than do the full 9.5 upgrade. Hopefully a "karm+1" would allow us to benchmark and de-risk that while not having to worry about ramoth flaking out.

@tomhughes
Member

Well it was always my intention to move to 9.5 as part of deploying karm if that was feasible. Give me a few days to see how well a pg_upgrade works.

@tomhughes
Member

I restored to 9.1 and have just done a test upgrade to 9.5 with the following command:

pg_upgradecluster -v 9.5 -m upgrade -k 9.1 main

Total time to upgrade was 25 seconds.

I have also built the chef infrastructure to support building multiple versions of libpgosm.so and linking them into the postgres library directory and have changed the live database (as well as karm before the upgrade) to use $libdir to reference the functions.
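
For illustration, re-pointing one of the C functions at $libdir looks roughly like this; the database, function and library names are assumptions based on the rails port's libpgosm, not a confirmed list of what was changed:

# Hedged example: with $libdir, the same SQL definition resolves against whichever
# postgres major version's library directory is active after an upgrade.
sudo -u postgres psql openstreetmap -c \
  "CREATE OR REPLACE FUNCTION maptile_for_point(int8, int8, int4) RETURNS int4
     AS '\$libdir/libpgosm', 'maptile_for_point' LANGUAGE C STRICT;"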

@zerebubuth
Collaborator

Total time to upgrade was 25 seconds.

That's great! And I guess it means the majority of the data is in a disk format which either didn't change or is backwards-compatible.

Do you think we could fit this into a scheduled ~1h downtime in a couple of weeks' time?

@pnorman
Collaborator

pnorman commented Aug 1, 2016

That's great! And I guess it means the majority of the data is in a disk format which either didn't change or is backwards-compatible.

It should only be the system tables

@pnorman
Collaborator

pnorman commented Oct 11, 2016

karm is in production. Should this be closed or kept open for karm+1 discussion?

@tomhughes
Member

No it isn't.

@Firefishy
Member

Karm is now serving Read Only API traffic.

@gravitystorm
Collaborator Author

I'm going to close this and open a new ticket for purchasing karm+1.

@tomhughes
Member

This is not complete yet.

@tomhughes
Member

To explain: deployment of karm will not be complete until we make it master and do the postgres upgrade, which I plan to do once the IC power outage on 7th January is out of the way.

@zerebubuth
Collaborator

Now that the IC outage is out of the way, is there anything more blocking this other than scheduling a few hours downtime to do the upgrade?

@tomhughes
Member

tomhughes commented Feb 24, 2017

No. I mean I need to write myself a detailed list of steps to take but other than that it's just a case of scheduling the downtime.

@zerebubuth
Collaborator

Okay. Do you have any preference for when? e.g. Sunday morning / Friday evening?

Given that we expect this to be quite quick, perhaps we could schedule for some time around 11th March? I don't know whether Easter is a good time (people not mapping because it's a holiday) or a bad time (people mapping because it's a holiday), so probably best to avoid that if we can.

@tomhughes
Member

Sunday morning is probably the best bet, and Easter itself won't work for me as I'll either be on holiday or at my parents'... Something like 12th March should work?

@tomhughes
Member

tomhughes commented Mar 5, 2017

Preliminary work for next weekend has hopefully been completed now.

Both ramoth and karm have been switched to use postgresql.org packages and ramoth has been upgraded to Ubuntu 16.04 this weekend.

All three machines have postgres 9.6 installed and the database functions have been built and installed to the postgres library directory. Some directory rearrangement has been done on ramoth and katla to make paths more consistent and to set up a location ready for the 9.6 system tablespace.
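
A quick way to sanity-check that state (hedged: the library path is the Ubuntu default pkglibdir and the set of versions present varies per machine):

# Confirm the compiled functions are present for each installed major version.
for v in 9.1 9.5 9.6; do ls -l /usr/lib/postgresql/$v/lib/libpgosm.so; done
# List the clusters postgresql-common knows about and whether they are running.
pg_lsclusters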

Checklist for the day, based on the last time we did a master flip, my upgrade experiments last year and the postgres documentation for upgrading slaves:

  • Take site offline
  • Perform master/slave flip
    • Stop chef on katla
    • Stop postgres on katla
    • Stop chef on karm
    • Stop postgres on karm
    • Switch katla and karm roles in chef
    • Run chef on karm, wait for run to complete and check postgres is up and working
    • Run chef on katla, wait for run to complete and check postgres is up and working
    • Run chef on ramoth, wait for run to complete and check postgres is up and working
  • Upgrade database to postgres 9.5
    • Stop chef on karm, katla and ramoth
    • Stop postgres on karm
    • Stop postgres on katla and ramoth
    • Upgrade karm
      • sudo pg_upgradecluster -v 9.5 -m upgrade -k 9.1 main
    • Update chef to postgres 9.5
    • Run chef on karm, wait for run to complete and check postgres is up and working
    • Stop chef on karm
    • Stop postgres on karm
    • Sync tablespaces to slaves
      • cd /var/lib/postgresql
      • sudo rsync --verbose --archive --delete --hard-links --size-only --keep-dirlinks 9.1 9.5 root@katla.bm.openstreetmap.org:/var/lib/postgresql
      • sudo rsync --verbose --archive --delete --hard-links --size-only --keep-dirlinks 9.1 9.5 root@ramoth.ic.openstreetmap.org:/var/lib/postgresql
      • cd /store/arrays/nvme
      • sudo rsync --verbose --archive --delete --hard-links --size-only --keep-dirlinks postgresql-openstreetmap root@katla.bm.openstreetmap.org:/store/arrays/fast
      • sudo rsync --verbose --archive --delete --hard-links --size-only --keep-dirlinks postgresql-ssd root@katla.bm.openstreetmap.org:/store/arrays/ssd
      • sudo rsync --verbose --archive --delete --hard-links --size-only --keep-dirlinks postgresql-openstreetmap root@ramoth.ic.openstreetmap.org:/store/arrays/openstreetmap
      • sudo rsync --verbose --archive --delete --hard-links --size-only --keep-dirlinks postgresql-ssd root@ramoth.ic.openstreetmap.org:/store/arrays/openstreetmap
    • Start postgres on karm and check postgres is up and working (see the verification sketch after this checklist)
    • Run chef on katla, wait for run to complete and check postgres is up and working
    • Run chef on ramoth, wait for run to complete and check postgres is up and working
    • Drop old cluster "pg_dropcluster 9.1 main" on each machine
  • Bring site online with all traffic on karm
  • Do OS upgrade on katla
  • Bring katla online for read traffic
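
A minimal verification sketch for the "check postgres is up and working" steps after the tablespace sync, assuming only the standard catalog views (the exact checks used on the day are not spelled out above):

# On karm (the new master): confirm it is out of recovery and both slaves are streaming.
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
sudo -u postgres psql -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"
# On katla and ramoth: confirm each is in recovery and replaying WAL.
sudo -u postgres psql -c "SELECT pg_is_in_recovery(), pg_last_xlog_replay_location();"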

@Komzpa

Komzpa commented Mar 6, 2017

Cross-checking with my notes, https://gist.github.com/Komzpa/994d5aaf340067ccec0e - pg_upgradecluster must be followed by vacuumdb --all --analyze-in-stages to recollect statistics, as pg_upgradecluster drops them and some indexes can stop being used.
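
For concreteness, the statistics rebuild usually amounts to the following (pg_upgrade also writes an analyze_new_cluster.sh script that does much the same thing):

# Rebuild planner statistics after the upgrade; --analyze-in-stages produces minimal
# statistics quickly and then refines them while the site is back in service.
sudo -u postgres vacuumdb --all --analyze-in-stages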

@tomhughes
Member

@Komzpa yes, it gives you a script to run as part of the upgrade, which is why I didn't mention it separately.

@gardster

gardster commented Mar 6, 2017

@tomhughes is it a mistake that the checklist mentions 'Stop postgres on karm' twice?

@tomhughes
Member

@gardster it was, yes - now fixed.

@tomhughes
Member

The master switch and database upgrade are complete - the OS upgrade on katla is now underway.

Note that we went to 9.5 rather than 9.6 in the end because we have indexes on xmin, which 9.6 no longer allows.
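
For anyone curious how to spot the offending indexes, a catalog query along these lines would list them (the database name and the query are assumptions, not the exact check that was run):

# List indexes whose definition references the xmin system column; postgres 9.6
# disallows indexes on system columns, which is what blocked the move to 9.6.
sudo -u postgres psql openstreetmap -c \
  "SELECT indexrelid::regclass AS index_name, indrelid::regclass AS table_name,
          pg_get_indexdef(indexrelid) AS definition
     FROM pg_index
    WHERE pg_get_indexdef(indexrelid) LIKE '%xmin%';"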

@tomhughes
Member

The upgrade is complete and karm is now fully deployed, so I'm going to close this - we have #131 tracking the purchase of karm+1.
