Skip to content

Latest commit

 

History

History
368 lines (257 loc) · 15.5 KB

README.md

File metadata and controls

368 lines (257 loc) · 15.5 KB
Upgrades Newton to Rocky JCB style

////////////////////////////////////////////////////////////////////////////////
// PRE SCHEDULING REQUIREMENTS
////////////////////////////////////////////////////////////////////////////////

1. Please check the environment health at the time as creating the
   maintenance plan (MaaS, Dell/HP hardware monitors and OpenStack
   service states)

2. All customizations done in playbooks and containers are not migrated.
   These changed have to be reimplemented, if required and applied
   via playbooks since ALL containers of the OpenStack control plane are getting
   destroyed and rebuild. These customizations need to be applied post upgrade

3. Customers using VXLAN with l2pop (default setup) are expected to experience
   prolonged downtime during the upgrade.
   The downtime can only be prevented when using
   a) VXLAN via multicast
   b) Migration to VLAN provider
   Rackspace is preferring option b) to prevent this issue.
   Either option needs to be executed before scheduling this maintenance,
   unless the prolonged downtime is acceptable

4. If swift proxy are running inside containers, the swift/object storage service
   will be unavailable for the duration of the maintenance

5. All hosts and containers need to be running Ubuntu 16.04 (Xenial)
   A host OS upgrade to Xenial will need to be scheduled.

////////////////////////////////////////////////////////////////////////////////
// Maintenance Template for upgrading RPC Newton to openstack-Ansible Rocky release
////////////////////////////////////////////////////////////////////////////////

////////////////////////////////////////////////////////////////////////////////
// MAINTENANCE PREP
////////////////////////////////////////////////////////////////////////////////

 (1) Maintenance objective:
    - Update Newton (RPC14) to Rocky (OSA 18) version

 (1a) What should we check to confirm the solution is functioning as expected?
    - Environment is running a RPC18, Rocky version
    - Galera cluster is functioning
    - Rabbit cluster is functioning
    - OpenStack services are functioning
    - Instances are reachable

 (2) Departments involved:
    - RPCO
    - Ensure all teams assigned to this maintenance are available

 (3) Owning department: RPCO

 (4) Amount of time estimated for maintenance:

    Less than 30 compute nodes: Up to 8 hours

--------------------------------------------------------------------------------
 (5) Maintenance Steps:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
 (5.1) Maintenance Prep:
--------------------------------------------------------------------------------

-  Configure session on deployment node

   # Configure Ctrl b + H shortcut to enable session logging
   grep -q tmux.log 2>/dev/null ~/.tmux.conf || cat << _EOF >> ~/.tmux.conf
bind-key H pipe-pane -o "exec cat >>$HOME/'#W-tmux.log'" \; display-message 'Toggled logging to $HOME/#W-tmux.log'
_EOF

   tmux new -s newton-rocky-jump

   # To enable screen logging press: Ctrl + b followed by H

   export PS1='${debian_chroot:+($debian_chroot)}\u@\h:\w $(date +'%s') \$ '
   export BKUPDIR=/root/upgrade-backups; mkdir $BKUPDIR
   export ANSIBLE_FORKS=50

-  5.1.1 Create upgrade directory and set environment

!!! /root/upgrades already there
git clone https://github.com/rcbops/rocky-leap /root/upgrades

-  5.1.2 Clean up apt sources.list

/etc/apt/sources.list

deb http://us.archive.ubuntu.com/ubuntu/ xenial main restricted
deb http://us.archive.ubuntu.com/ubuntu/ xenial-updates main restricted
deb http://us.archive.ubuntu.com/ubuntu/ xenial universe
deb http://us.archive.ubuntu.com/ubuntu/ xenial-updates universe
deb http://us.archive.ubuntu.com/ubuntu/ xenial multiverse
deb http://us.archive.ubuntu.com/ubuntu/ xenial-updates multiverse
deb http://us.archive.ubuntu.com/ubuntu/ xenial-backports main restricted universe multiverse
deb http://security.ubuntu.com/ubuntu xenial-security main restricted
deb http://security.ubuntu.com/ubuntu xenial-security universe
deb http://security.ubuntu.com/ubuntu xenial-security multiverse

-  5.1.3 Clean up apt sources.list.d

Remove any sources that point to openstack sources like:
   deb http://ubuntu-cloud.archive.canonical.com/ubuntu xenial-updates/newton main

run apt-get update verify there are no errors

-  5.1.2 Setup monitoring suppression

   ## https://rba.rackspace.com/suppression-manager/ ##

-  5.1.3 Backup the existing playbooks

   tar czf $BKUPDIR/rpc-openstack-$(date +%F-%H%M%S).tar.gz /opt/rpc-openstack  2>/dev/null
   Remove symlink /opt/openstack-ansible

-  5.1.4 For environments with Swift: Connect to the proxy node and verify the swift cluster

	Swift is decoupled, this enviornment only uses it as a storage location and does not manage it

   Swift is deployed decoupled from all Ultime environments
   # Deployment Node:
   #ssh $(grep swift_proxy_container /etc/openstack_deploy/openstack_hostnames_ips.yml |sort -u |head -n1 |awk -F\" '{print $2}')

     #grep -q venvs /etc/init/swift-proxy-server.conf && source $( awk '/venvs.*activate/ { print $2 }' /etc/init/swift-proxy-server.conf )
     #source /root/openrc
     #swift-recon -arlud --md5 --human-readable
     #exit

-  5.1.5 Check OpenStack endpoints with the openstack_user_config.yml configuration

   ( source /usr/local/bin/openstack-ansible.rc; ansible utility_container[0] -m shell -a '. ~/openrc; openstack endpoint list' ) | tee $BKUPDIR/openstack_endpoints.txt

   # Verify that the internalURL endpoints match the configuration at internal_lb_vip_address
   # inside the openstack_user_config.yml

   # Verify that the publicURL endpoints match the configuration at external_lb_vip_address
   # inside the openstack_user_config.yml

   # If SSL and self signed certs are used for any endpoint please enable the
   # insecure flag for the OpenStack clients via:

   grep -q 'openrc_insecure:' 2>/dev/null /etc/openstack_deploy/user_*.yml ||
     echo 'openrc_insecure: true' >> /etc/openstack_deploy/user_osa_variables_overrides.yml

   # WARNING:
   # If this check is not executed properly, the OpenStack endpoints will be duplicated as the result
   # causing client side endpoint selection issues !
   #
   # Additionally please verify that configured public OpenStack endpoints can be accessed
   # from inside the OpenStack containers.

--------------------------------------------------------------------------------

-  5.1.6 Install OpenStack python clients if not already installed

   # Ensure that a correct pip.conf is in place to install release matching OpenStack clients

   ( source /usr/local/bin/openstack-ansible.rc; ansible utility_container[0] -m synchronize -a 'mode=pull src=/root/.pip dest=/root' )
   grep -q os-release ~/.pip/pip.conf 2>/dev/null && pip install --upgrade python-novaclient python-openstackclient \
     python-neutronclient python-heatclient python-cinderclient
     
-  5.1.7 Check if customer is using a custom 'dhcp_domain' or 'dns_domain'

   # Validate whether they're using DOMAIN for Neutron/Nova metadata -- if this doesn't exist, move to 5.2
   egrep -Ri 'dhcp_domain|dns_domain' /etc/openstack_deploy/user*
   
   # Check if there is a custom 'nova_nova_conf' or 'neutron_neutron_conf' -- duplicating these will be a problem
   egrep -Ri 'nova_nova_conf|neutron_neutron_conf' /etc/openstack_deploy/user*
   
   # If there is a 'nova_nova_conf' or 'neutron_neutron_conf' add the missing bits below to it, otherwise add this in entirety
   vi /etc/openstack_deploy/user_osa_variables_overrides.yml
   dhcp_domain: "" 
   nova_nova_conf_overrides:
     DEFAULT:
       dhcp_domain: "{{ dhcp_domain }}"
   neutron_neutron_conf_overrides:   
     DEFAULT:     
       dns_domain: "{{ dhcp_domain }}"    

-  5.1.8 Configure proxy overrides for pip and apt when operated behind a company proxy server

  cat << EOF >> /etc/openstack_deploy/user_osa_variables_overrides.yml

  ### Configure proxy overrides for PIP and APT
  proxy_env_url: ""
  proxy_custom_ca_cert: "/etc/ssl/certs/"

  ### The following overrides are automatically generated from the OSA inventory
  no_proxy_env: "localhost,monitoring.api.rackspacecloud.com,{{ internal_lb_vip_address }},{{ external_lb_vip_address }},{% for host in groups['all_containers'] %}{{ hostvars[host]['container_address'] }}{% if not loop.last %},{% endif %}{% endfor %}"

  deployment_environment_variables:
    HTTP_PROXY: "{{ proxy_env_url }}"
    HTTPS_PROXY: "{{ proxy_env_url }}"
    NO_PROXY: "{{ no_proxy_env }}"
    http_proxy: "{{ proxy_env_url }}"
    https_proxy: "{{ proxy_env_url }}"
    no_proxy: "{{ no_proxy_env }}"


  pip_install_options: " --timeout 120 --cert /etc/ssl/certs/ca-certificates.crt --cert {{ proxy_custom_ca_cert }} --trusted-host={{ internal_lb_vip_address }} --trusted-host=files.pythonhosted.org --trusted-host=pythonhosted.org --trusted-host=pypi.org --trusted-host=pypi.python.org --trusted-host=git.openstack.org "

  pip_get_pip_options: "{{ pip_install_options }}"

  repo_build_venv_pip_install_options: >-
    {{ pip_install_options }}
    --timeout 120
    --find-links {{ repo_build_release_path }}

  EOF

  Edit proxy_env_url and proxy_custom_ca_cert inside /etc/openstack_deploy/user_osa_variables_overrides.yml


###### NOTE
# In cases where the repo build process fails with
#    distutils.errors.DistutilsError: Download error for https://files.pythonhosted.org

# Please run the following command to monkey patch the python code:

# ansible repo_all -m lineinfile -a 'dest=/usr/local/lib/python2.7/dist-packages/setuptools/package_index.py regexp="^(.*)verify_ssl=True(.*)$" line="\1verify_ssl=False\2" backup=yes backrefs=yes'


--------------------------------------------------------------------------------
 (5.2) Maintenance Prep (Steps to be completed during maintenance time):
--------------------------------------------------------------------------------

-  5.2.1 Download code and configure

  # cd /root
  # git clone https://github.com/rcbops/rocky-leap /root/upgrades
  # cd /root/upgrades
  
  Edit defaults.yml to set a different db backup location

-  5.2.2 Decrypt Ansible vault files
 
   -- Non encrypted
   ansible-vault decrypt /etc/openstack_deploy/user*secret*.yml
   ansible-vault decrypt /etc/openstack_deploy/user*ldap*.yml

-  5.2.3 Run pre upgrade checks

   # Archive galera backup on deployment host
 #  rsync -av $(source /usr/local/bin/openstack-ansible.rc; ansible --list-hosts galera_container[0] |awk '/galera_container-/ {print $1}'; ):/var/backup/galera-backup-$(date +'%F')*.xbstream ${BKUPDIR}/

-  5.2.4 Environment pre upgrade configuration and verification

   sed -i -e '/^rpc_release:.*/d' /etc/openstack_deploy/user*.yml
   sed -i -e '/^keystone_cache_backend_argument:.*/d' /etc/openstack_deploy/user*.yml

 #  grep -q ceph_client_package_state /etc/openstack_deploy/*.yml || \
 #    echo "ceph_client_package_state: present" |tee -a /etc/openstack_deploy/user_rpco_variables_overrides.yml

   ##########################################
   # Cleanup the nova database
   #
   # In case the script execution stops with
   # foreign key errors, please restart it.
   # Depending on the volume of data the
   # database has to prune, it can run several
   # minutes.
   ( source /usr/local/bin/openstack-ansible.rc;
   ansible -m synchronize galera_container[0] -a 'mode=push src=/opt/openstack-ops/playbooks/files/rpc-o-support/nova-instance-cleanup.sh dest=/root/' && \
    ansible -m shell galera_container[0] -a 'bash -x /root/nova-instance-cleanup.sh' )
 
-  5.2.4
  Remove all rpco references in /etc/openstack_deploy/user_osa_variables_overrides.yml or convert to OSA equivalents
  
  add lvm_type to all cinder nodes in openstack_user_config.yml . Default (for thick-provisioning) or auto (for thin-provisioning) () 
  
  openstack_user_config.yml
   XXXXXX-cinder01:
     container_vars:
       cinder_backends:
         lvm:
           volume_backend_name: LVM_iSCSI
           volume_driver: cinder.volume.drivers.lvm.LVMVolumeDriver
           volume_group: cinder-volumes
           lvm_type: default     #This has to be set in order to get thick provisioning
 
-  5.2.5  Environment cleanup
 
 Verify no instances outside of ACTIVE, RUNNING, STOPPED.  Clean up any error states or transient instance states
 # nova list --all-t | egrep -iv 'ACTIVE|RUNNING|STOPPED'
 
 Verify all volumes in AVAILABLE or IN-USE state.  Clean up any volumes outside of these states
 # cinder list --all-t | egrep -iv 'AVAILABLE|IN-USE'

--------------------------------------------------------------------------------
 (5.3) Starting the upgrade to Rocky
--------------------------------------------------------------------------------

-  5.3.1 F5 LB only: Change F5 monitors to half-open TCP checks during upgrade

-  5.3.2 Upgrade to Rocky
   cd /root/upgrades
   ./jump.sh
    
--------------------------------------------------------------------------------
 (5.4) POST Deployment QC
--------------------------------------------------------------------------------

-  5.4.1 Verify Cloud state

   # The post upgrade RPC-O cloud checks are automated inside the rpc-post-upgrades.yml
   # playbook which includes the following checks:
   # - Galera DB cluster check
   # - RabbitMQ cluster check
   # - Nova, Neutron, Cinder Service states
   # - Elasticsearch Health indexes

   ( cd /opt/rpc-upgrades/playbooks && openstack-ansible -ebackup_dir=$BKUPDIR rpc-post-upgrades.yml --ask-vault-pass )


   # In case Swift is installed please verify overall health
   # via swift-recon

   # Deployment Node:
   ssh $(grep swift_proxy_container /etc/openstack_deploy/openstack_hostnames_ips.yml |sort -u |head -n1 |awk -F\" '{print $2}')

     grep -q venvs /etc/init/swift-proxy-server.conf && source $( awk '/venvs.*activate/ { print $2 }' /etc/init/swift-proxy-server.conf )
     source /root/openrc
     swift-recon -arlud --md5 --human-readable
     exit

-  5.4.2 Create a test instance and verify it can connect out

-  5.4.3 Create a cinder volume

-  5.4.4 Attach the volume to the test instance

-  5.4.5 Create a filesystem on the volume, write some data to it, verify it took, unmount

   # fdisk /dev/vdb
   # mkfs.ext4 /dev/vdb
   # mount /dev/vdb /mnt
   # echo "this is a test file" > /mnt/test-file.txt
   # cat /mnt/test-file.txt
   # umount /mnt

-  5.4.6 Delete the cinder volume

-  5.4.7 Delete the instance

-  5.4.8 Verify customer instances are reachable

-  5.4.9 Test Horizon functionality


--------------------------------------------------------------------------------
 (5.5) Update Rackspace Support tools
--------------------------------------------------------------------------------

-  5.5.2 Reinstall support environment

   ( source /usr/local/bin/openstack-ansible.rc; \
    ansible neutron_agents_container:utility_container -m copy -a 'src=/root/.ssh/rpc_support dest=/root/.ssh/rpc_support mode=600' && \
    ansible neutron_agents_container:utility_container -m copy -a 'src=/root/.ssh/rpc_support.pub dest=/root/.ssh/rpc_support.pub mode=600' )

   ( source /usr/local/bin/openstack-ansible.rc; cd /opt/openstack-ops/playbooks; openstack-ansible main.yml )

--------------------------------------------------------------------------------
 (5.6) Update MAAS Monitoring
--------------------------------------------------------------------------------sd