Troubleshooting

When running Goose in production, you might run into the issues below. Here's a playbook for each:

Latency goes up

Latency is the time taken to transmit data to or from the processing component. Two metrics track it:
- execution.latency: time between enqueue and start of execution
- scheduled.latency: time between the theoretical schedule time and start of execution
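
As a rough illustration of the two definitions above, the sketch below derives both values from timestamps. The function and parameter names are hypothetical and not part of Goose; Goose reports these metrics itself, so this only shows what each number measures.

```python
from datetime import datetime, timedelta


def execution_latency(enqueued_at: datetime, started_at: datetime) -> timedelta:
    """Time between enqueue and start of execution."""
    return started_at - enqueued_at


def scheduled_latency(scheduled_for: datetime, started_at: datetime) -> timedelta:
    """Time between the theoretical schedule time and start of execution."""
    return started_at - scheduled_for


if __name__ == "__main__":
    enqueued = datetime(2024, 1, 1, 12, 0, 0)
    started = datetime(2024, 1, 1, 12, 0, 2)
    print(execution_latency(enqueued, started).total_seconds())
```
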
When latency goes up, first check the system-level metrics of the Message Brokers and verify their performance.
If the Message Brokers are fine, check the Job enqueue rate, failure rate, and execution time:
- If Jobs are being enqueued at a higher rate than normal, consider scaling up the number of workers
- If the failure rate or execution times are unusual, investigate issues in the Job code or third-party APIs
If scheduled.latency is high, consider lowering scheduler-polling-interval-sec for the Redis Message Broker.
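
The checks above can be read as a small decision procedure. The following is a minimal sketch of that flow, not something Goose provides: the `Snapshot` fields, the `triage` function, and the 1.5x `tolerance` used to decide what counts as "unusual" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """Hypothetical summary of metrics pulled from your monitoring system."""
    broker_healthy: bool
    enqueue_rate: float
    baseline_enqueue_rate: float
    failure_rate: float
    baseline_failure_rate: float
    execution_time_ms: float
    baseline_execution_time_ms: float
    scheduled_latency_high: bool


def triage(s: Snapshot, tolerance: float = 1.5) -> List[str]:
    """Return suggested actions, following the playbook order."""
    # Step 1: rule out the message brokers before looking at jobs.
    if not s.broker_healthy:
        return ["investigate message broker performance"]

    actions = []
    # Step 2: more jobs than usual points to a capacity problem.
    if s.enqueue_rate > tolerance * s.baseline_enqueue_rate:
        actions.append("scale up the number of workers")
    # Step 3: unusual failures or slow jobs point to code or dependencies.
    if (s.failure_rate > tolerance * s.baseline_failure_rate
            or s.execution_time_ms > tolerance * s.baseline_execution_time_ms):
        actions.append("investigate job code and third-party APIs")
    # Step 4: scheduled jobs starting late points to the polling interval.
    if s.scheduled_latency_high:
        actions.append("lower scheduler-polling-interval-sec")
    return actions
```

An empty result means none of the playbook's conditions matched, in which case the cause likely lies outside the metrics listed here.
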

A Job might be causing process crashes

To find the Poison Job, track the jobs.recovered metric and look at its :function tag.
- If a Job causes workers to crash repeatedly, Goose will recover it and tag it each time, so the same :function value keeps recurring in jobs.recovered
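
One way to spot a recurring :function value is to count recoveries per function. The sketch below assumes you have exported jobs.recovered data points from your metrics backend as a list of dicts with a "function" key; that shape, the `suspect_poison_jobs` name, the example function names, and the `min_recoveries` threshold are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def suspect_poison_jobs(
    recovered_events: List[Dict[str, str]], min_recoveries: int = 3
) -> List[Tuple[str, int]]:
    """Return (function, count) pairs recovered at least min_recoveries times,
    most frequently recovered first."""
    counts = Counter(event["function"] for event in recovered_events)
    return [(fn, n) for fn, n in counts.most_common() if n >= min_recoveries]


if __name__ == "__main__":
    events = [
        {"function": "my.app/send-email"},
        {"function": "my.app/resize-image"},
        {"function": "my.app/send-email"},
        {"function": "my.app/send-email"},
    ]
    print(suspect_poison_jobs(events))
```

A function that is recovered once is usually just a casualty of a restart or deploy; one that is recovered over and over is the likelier cause of the crashes.
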