Incident Response
Angel Rey edited this page Dec 23, 2020 · 1 revision
When you get that call at 3am and you need to jump into the prod environment to save the day, what are the things to know, and how can you go about fixing the system?
supervisorctl status # see if any services are stopped or have errors
tail /opt/oddslingers.poker/data/logs/http-worker.log # see errors when starting django
supervisorctl restart all
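Before reaching for restart all, it can help to see at a glance which programs are actually down. The following is a minimal sketch, not part of the project's tooling: the helper name not_running is made up for illustration, and it assumes supervisorctl status prints one program per line with the state in the second column. It reads from stdin so it can be tried on saved output as well as live.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): list every supervisor program
# whose state is anything other than RUNNING (STOPPED, FATAL, BACKOFF, ...).
# Reads `supervisorctl status` output on stdin and prints "<name> <state>".
not_running() {
    awk 'NF >= 2 && $2 != "RUNNING" { print $1, $2 }'
}

# On the server (assumes supervisorctl is on PATH):
#   supervisorctl status | not_running
```

If the list is short, restarting only the affected program with supervisorctl restart followed by its name is gentler than restarting everything.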
- Back up the system. ALWAYS, ALWAYS, ALWAYS create a backup/snapshot/db_dump before running custom SSH commands on a production server, especially when under pressure to fix things quickly (see the db backup instructions).
- Where are all the files? Get the lay of the land and figure out where these key things are:
  - the main code repo: /opt/oddslingers.poker
  - logs: /opt/oddslingers.poker/data/logs
  - config files: /opt/oddslingers.poker/env/prod.env
  - database: is it on the same machine or a separate server? (check env/{ODDSLINGERS_ENV}.env and env/secrets.env)
  - system resources:
    - check running processes with htop, systemctl status, and supervisorctl status
    - check disk space remaining with df -h, and find what is using it with ncdu / (note that ncdu -h only prints the help text)
    - check network connectivity with iftop and mtr
- Figure out exactly what you're trying to do, and explain it to another team member before proceeding, e.g.:
  - wrong code deployed or broken deploy -> deploy new code
  - slow server due to high resource consumption (processes, connections, CPU, disk, etc.) -> identify the bad process with htop, stop it safely, and fix the underlying issue
- See if there's a tool already built to help achieve the goal you want, e.g.:
  - if you need to redeploy, don't fuss with files manually; just find the deploy command and run it
  - if you need to restart a service, use supervisord (via supervisorctl); don't just killall and run the process manually, or you'll end up conflicting with other services
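The disk-space check from the system resources item above can be turned into a quick filter. This is only a sketch under stated assumptions: the function name over_threshold and the default limit of 90 percent are invented for illustration, and it expects the POSIX df -P column layout (capacity in column 5, mount point in column 6), so mount points containing spaces are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<mount point> <capacity>" for every filesystem
# at or above a usage limit. Reads `df -P` output on stdin.
over_threshold() {
    local limit="${1:-90}"   # percent; 90 is an arbitrary illustrative default
    awk -v limit="$limit" '
        NR > 1 {                      # skip the header row
            use = $5
            sub(/%/, "", use)         # "95%" -> "95"
            if (use + 0 >= limit + 0) print $6, $5
        }'
}

# On the server:
#   df -P | over_threshold 90
```

Any mount point it reports is a good starting directory for ncdu.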
See the Production article for more information.
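For the "back up before you touch anything" rule, a file-level snapshot of a directory you are about to edit is cheap insurance. The sketch below is an assumption-laden illustration, not the project's backup procedure: snapshot_dir is a made-up name, /tmp is an arbitrary default destination, and it does not replace the database dump described in the db backup instructions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: archive a directory into a timestamped tarball and
# print the archive path on success.
#   $1 = directory to snapshot
#   $2 = destination directory (defaults to /tmp, an arbitrary choice)
snapshot_dir() {
    local src="$1"
    local dest_dir="${2:-/tmp}"
    local name stamp archive
    name="$(basename "$src")"
    stamp="$(date +%Y%m%d-%H%M%S)"
    archive="$dest_dir/$name-$stamp.tar.gz"
    # -C keeps paths inside the archive relative to the parent directory
    tar -czf "$archive" -C "$(dirname "$src")" "$name" && printf '%s\n' "$archive"
}

# On the server, e.g. before editing config files:
#   snapshot_dir /opt/oddslingers.poker/env
```

Check that the destination has enough free space first, and list the archive contents with tar -tzf before relying on it.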