CC-1206: Add a check for zombie DJ workers #251

Merged on Aug 26, 2017 (1 commit)

radamanthus
Contributor

Description of your patch

Properly recover when a DelayedJob worker has been terminated but lingers as a zombie process.

There is a bug here:

if [ -e $LOCK_FILE ]; then
  LAST_LOCK_PID=`cat $LOCK_FILE`
  if [ -n $LAST_LOCK_PID -a -z "`ps axo pid|grep $LAST_LOCK_PID`" -a -f $LOCK_FILE ];then
    sleep 1
    logger -t "monit-dj:$WORKER[$$]" "Removing stale lock file for $WORKER ($LAST_LOCK_PID)"
    rm $LOCK_FILE 2>&1
  else
    logger -t "monit-dj:$WORKER[$$]" "Monit already messing with $WORKER ($LAST_LOCK_PID)"
    RESULT=1
    exit_cleanly
  fi
fi

The test

if [ -n $LAST_LOCK_PID -a -z "`ps axo pid|grep $LAST_LOCK_PID`" -a -f $LOCK_FILE ];then

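checks three things: $LAST_LOCK_PID is non-empty, no process with that PID appears in the ps output, and the lock file exists as a regular file. Only then is the lock file removed as stale. If a process with that PID is still listed, the script takes the "Monit already messing with..." branch instead, even when that process is a zombie.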

This PR adds an additional test to check if the PID matches a running process but the process is a zombie.
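
The diff itself is not reproduced in this description. As a rough sketch of what such a check might look like (not the code that was merged), the state column of `ps` can be inspected: a state beginning with `Z` marks a zombie. The helper names `pid_state` and `holder_is_gone` are made up for illustration, and `ps -o stat= -p` assumes a procps-style `ps` as found on Linux.

```shell
# Illustrative only; pid_state and holder_is_gone are hypothetical names,
# not functions from the delayed_job4 recipe.

# Print the ps state column (e.g. "Ssl", "Z") for a PID; prints nothing if
# no such process exists. "stat=" suppresses the header line.
pid_state() {
  ps -o stat= -p "$1" 2>/dev/null | tr -d ' '
}

# Succeed when the lock holder should be treated as gone: either there is
# no process (empty state) or the process is a zombie (state starts with Z).
holder_is_gone() {
  case "$1" in
    ""|Z*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Possible use in the lock-file check (variable names from the snippet above):
#   if holder_is_gone "$(pid_state "$LAST_LOCK_PID")"; then
#     rm -f "$LOCK_FILE"
#   fi
```

Querying the exact PID with `-p` also avoids the substring matches that `grep $LAST_LOCK_PID` allows (PID 123 also matches 1234).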

Recommended Release Notes

Updates the delayed_job4 recipe to properly handle zombie workers

Estimated risk

Low

Components involved

DelayedJob custom chef recipe

Description of testing done

See QA instructions

QA Instructions

NOTE: These are the same as the QA instructions for PR #224. This PR can be tested with #224.

Test on configuration A_dj

Configuration A
rails_activejob_example (delayed_job branch) App
Unicorn
Ruby 2.3
RubyGems 2.6.5
Postgres 9.5
US East Virginia
Solo

Boot the test environment under the QA stack
Enable the delayed_job recipe.
Modify the recipe to install DelayedJob on the solo instance
Modify the recipe and set a worker memory limit low enough to always trigger, even with zero workload (e.g. 10MB)
Run chef
Observe the delayed_job processes by running ps -ef | grep elay
Make sure:

  • monit is terminating the DelayedJob process
  • new workers are being started
  • no orphan processes are being left behind
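
For the last check, a name-based grep can miss zombies, because procps typically shows a defunct process as `[ruby] <defunct>` rather than with its full command line. A small sketch that lists zombies by state instead; `filter_zombies` is a hypothetical helper, and the `ps` column list assumes a procps-style `ps`.

```shell
# Hypothetical helper: reads "pid ppid stat args..." lines on stdin and
# reports every process whose state starts with Z (zombie).
filter_zombies() {
  awk '$3 ~ /^Z/ { print "zombie pid=" $1 " ppid=" $2 }'
}

# Example use on a live system (header-less columns):
#   ps -eo pid=,ppid=,stat=,args= | filter_zombies
```

No output means no zombie is present; the reported `ppid` identifies the parent that has not yet reaped the child.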

@radamanthus radamanthus requested a review from crigor August 24, 2017 02:09
Contributor

@crigor crigor left a comment

LGTM

@radamanthus radamanthus merged commit f2b870c into next-release Aug 26, 2017
radamanthus added a commit to engineyard/ey-cloud-recipes that referenced this pull request Nov 16, 2017
Backport changes made in V5 to improve worker termination
- Make sure DJ workers do not become orphans (engineyard/ey-cookbooks-stable-v5#224)
- Add a check for zombie DJ workers (engineyard/ey-cookbooks-stable-v5#251, engineyard/ey-cookbooks-stable-v5#265)
radamanthus added a commit to engineyard/ey-cloud-recipes that referenced this pull request Feb 11, 2018
Backport changes made in V5 to improve worker termination
- Make sure DJ workers do not become orphans (engineyard/ey-cookbooks-stable-v5#224)
- Add a check for zombie DJ workers (engineyard/ey-cookbooks-stable-v5#251, engineyard/ey-cookbooks-stable-v5#265)