
Troubleshooting Bibdata


Resolution

  1. Check Honeybadger for recent errors.
    1. Solr Errors: See Solr Debugging
    2. Postgres Errors: See Postgres Debugging
    3. No Errors: See Web Infrastructure Debugging
  2. Viewing the Load Balancer can also help diagnose issues.

Solr Debugging

If you're getting Solr errors, other applications are probably down as well. Use the following steps to determine which Solr server is having issues, then restart the Solr instance on that box.

  1. Check the Solr Health Dashboard

    • Find the boxes with very high heap memory. Each graph shows a straight line at the top (max heap) and a moving line below it (current heap); current heap should stay far below max.

    High heap usage happens fairly often, and the affected server will need to be restarted.

  2. Check the Solr monitor on Datadog

    If one of the hosts is showing red, note the server name and restart it.

  3. Check the Solr console

    You can open the Solr console from your local machine using Capistrano:

    cd <local pul_solr directory>
    bundle exec cap solr8-production solr:console
    

    Check whether the graphs on the right are maxed out or the log shows lots of errors. Also look at the cloud graph (under Cloud, on the left) for red shards, and make note of the machine that has them.

    Restart the Solr service on the machine with red shards, as described in the next section (it may take a minute or two to stop). If you prefer the command line to the dashboards, the sketch after this list shows how to check heap usage and shard state directly.
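
If you prefer the command line to the dashboards, Solr's standard admin APIs can report heap usage and cluster/shard state directly. This is a minimal sketch; it assumes the default Solr port (8983) and uses lib-solr-prod4 only as an example host.

    # Heap usage for one node (Solr metrics API): compare memory.heap.used to memory.heap.max
    curl -s 'http://lib-solr-prod4:8983/solr/admin/metrics?group=jvm&prefix=memory.heap'

    # Shard and replica state for the whole cluster: look for "state":"down" or "recovering"
    curl -s 'http://lib-solr-prod4:8983/solr/admin/collections?action=CLUSTERSTATUS'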

Restarting Solr on a server

  • SSH into the affected box and restart Solr, e.g. (lib-solr-prod4):
    ssh pulsys@lib-solr-prod4
    sudo service solr restart
    
  • If restarting via service doesn't succeed after a minute or two, Ctrl+C out of that command, run ps aux | grep solr, find the Solr process ID, and run:
    kill -9 <solr-pid>
    sudo service solr restart
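
After the restart, confirm the node is back up before moving on. A minimal sketch, assuming the default Solr port (8983):

    # Confirm the service is running
    sudo service solr status

    # Confirm Solr is answering HTTP requests on this node
    curl -s 'http://localhost:8983/solr/admin/info/system?wt=json'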
    

Postgres Debugging

Check whether the other applications that use this Postgres cluster are also broken.

Those are: https://catalog.princeton.edu, https://abid.princeton.edu, and https://oawaiver.princeton.edu/

  1. If they aren't, then log on to the bibdata machines bibdata-alma1 and bibdata-alma2 and restart nginx, like so:

    ssh pulsys@bibdata-alma1
    sudo service nginx restart
    

    This would be a very unlikely scenario and may need more in-depth troubleshooting.

  2. If other services ARE down and your errors say the application can't connect to Postgres, then Postgres may be down.

    Check the logs to see if you're seeing anything like disk-space errors (the sketch after this list shows a couple of additional quick checks):

    ssh pulsys@lib-postgres-prod1
    sudo tail -n 5000 /var/log/postgresql/postgresql-13-main.log
    

    Assuming Postgres has simply gotten into a bad state, SSH into lib-postgres-prod1 and restart it:

    ssh pulsys@lib-postgres-prod1
    sudo -u postgres /usr/lib/postgresql/13/bin/pg_ctl -D /var/lib/postgresql/13/main restart
    

    If this does not resolve the issue, you may have to reboot the server. Be ready to contact Operations if it does not come back up within 15 minutes.

    sudo /sbin/reboot

    This scenario is also very unlikely.
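
Before restarting anything, the quick checks referenced in step 2 can help confirm the cause. This is a sketch; it assumes the standard PostgreSQL client tools are installed on lib-postgres-prod1.

    ssh pulsys@lib-postgres-prod1

    # Disk-space problems are a common culprit; look for filesystems at or near 100%
    df -h

    # Check whether Postgres is accepting connections at all
    sudo -u postgres pg_isready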

Web Infrastructure Debugging

If you're not getting Honeybadger errors, the Rails application itself isn't erroring. Either the load balancer has decided the site is unhealthy, or nginx has gone down on the boxes.

  1. Check the Rails logs and see if any requests are failing: Link to Logs

  2. If requests are failing and you aren't getting Honeybadger errors, there's probably something wrong with the boxes themselves: disk space, read-only file systems, or similar. Operations will probably need to fix these issues.

  3. If no requests are failing, or no requests are coming through at all, nginx may be broken. Check the Passenger log on bibdata-alma1 and bibdata-alma2 for errors.

    ssh pulsys@bibdata-alma1
    sudo tail -n 1000 /var/log/nginx/error.log
    
  4. If you find errors, restart nginx on these boxes with sudo service nginx restart (see the sketch below for validating the config first). It may take some time for the load balancer to recognize these boxes are healthy again.
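
Before restarting, it's worth validating the nginx configuration so that a broken config doesn't keep the service from coming back. A minimal sketch:

    ssh pulsys@bibdata-alma1

    # Validate the configuration first; only restart if the test passes
    sudo nginx -t && sudo service nginx restart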

Viewing the Load Balancer

To check the load balancer:

ssh -L 8080:localhost:8080 pulsys@lib-adc2
ip a

If you see inet 128.112.203.146 listed under the eno1 interface, you are on the correct machine; otherwise, exit and SSH into lib-adc1.
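
To avoid scanning the full ip a output, you can show just the eno1 interface; a small sketch:

    # The active load balancer should list inet 128.112.203.146 on eno1
    ip -4 addr show eno1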

Go to the dashboard to view the state of the load balancer.
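
The ssh command above forwards local port 8080, so the dashboard is presumably reachable at http://localhost:8080 from your browser while the tunnel is open (this is an assumption based on the port forward; the exact path may differ). A quick check that something is answering:

    # From your local machine, with the SSH tunnel still open
    curl -I http://localhost:8080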