Provide internal stats for better monitoring/alerting #42
So there are a few stats about jobs available that can be exported easily:

Que.job_stats
=> [{"queue"=>"", "job_class"=>"ActiveJob::QueueAdapters::QueAdapter::JobWrapper", "count"=>36, "count_working"=>0, "count_errored"=>36, "highest_error_count"=>11, "oldest_run_at"=>2017-08-08 12:21:52 +0000}]
Que.worker_states # only when some work is being processed
=> [{"priority"=>100,
"run_at"=>2017-08-08 12:23:02 +0000,
"job_id"=>227,
"job_class"=>"ActiveJob::QueueAdapters::QueAdapter::JobWrapper",
"args"=>[{"job_class"=>"ProcessEntryJob", "job_id"=>"92f6a5eb-ce46-485c-9ff5-fa43e87286d7", "provider_job_id"=>nil, "queue_name"=>"default", "priority"=>nil, "arguments"=>[{"_aj_globalid"=>"gid://zync/Entry/91"}], "executions"=>0, "locale"=>"en"}],
"error_count"=>0,
"last_error"=>nil,
"queue"=>"",
"pg_backend_pid"=>45435,
"pg_state"=>"idle",
"pg_state_changed_at"=>2017-08-08 12:23:02 +0000,
"pg_last_query"=>
"SELECT a.attname\n" +
" FROM (\n" +
" SELECT indrelid, indkey, generate_subscripts(indkey, 1) idx\n" +
" FROM pg_index\n" +
" WHERE indrelid = '\"integrations\"'::regclass\n" +
" AND indisprimary\n" +
" ) i\n" +
" JOIN pg_attribute a\n" +
" ON a.attrelid = i.indrelid\n" +
" AND a.attnum = i.indkey[i.idx]\n" +
" ORDER BY i.idx\n",
"pg_last_query_started_at"=>2017-08-08 12:23:02 +0000,
"pg_transaction_started_at"=>nil,
"pg_waiting_on_lock"=>false}] Everything that is not in database and lives only in memory would not be easy to export as it would live only in memory and for example could not use puma proxy mode which starts several processes. Puma has control endpoint which can have stats, but no latency. Just number of workers running etc. What metrics exactly are you interested in? We have custom logger that in theory could aggregate some stats, but then there is the issue with running multiple processes in one pod and this would not work (without exporting to something like statsd). |
OpenShift is moving to use Prometheus. The Innovation Week project proved Prometheus to be the best option for us, so we're moving our monitoring to be based on it too. I think this is one of those issues where maybe it's not the most optimal locally for one project, but it's the best globally across our many projects and different infrastructure pieces (and is now starting to become part of the base platform all our stuff will run on)... so for standardization, making it "shared knowledge" and easing Operations, I'd like us to move to enable Prometheus monitoring on as many of our workloads as possible. TBD "how". If there was a way for the export of the stats from the app to be fairly generic (a bunch of text and numbers in flat files, or on STDOUT, or something) that is independent from any particular monitoring solution (but could be picked up by an exporter on the machine), then that would be ideal. If the ideal solution can't be done, I think we should live with the necessary evil (maybe wrap the stats exporting code in a wrapper class to avoid polluting app code directly with Prometheus?) and make progress: enable monitoring while standardizing and easing our Ops lives.
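One way the generic flat-file idea could look, as a minimal sketch assuming node_exporter's textfile collector (the file path and metric names are placeholders, not an agreed design):

```ruby
# Sketch: dump job stats in Prometheus text exposition format into a flat
# file that node_exporter's textfile collector (or any other agent) can read.
TEXTFILE_PATH = '/var/lib/node_exporter/textfile/zync_jobs.prom'.freeze

def write_job_stats_textfile
  lines = Que.job_stats.flat_map do |row|
    labels = %(queue="#{row['queue']}",job_class="#{row['job_class']}")
    [
      %(que_jobs_total{#{labels}} #{row['count']}),
      %(que_jobs_errored_total{#{labels}} #{row['count_errored']}),
    ]
  end

  # Write atomically so a scrape never sees a half-written file.
  tmp = "#{TEXTFILE_PATH}.tmp"
  File.write(tmp, lines.join("\n") + "\n")
  File.rename(tmp, TEXTFILE_PATH)
end
```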
Would be nice to know:
So we can create alerts based on high latency, number of jobs failed, too many retries, etc. If you have some total counters, I can "try" to derive some rates from their increase over time...
@jmprusi looks like those stats can't really be extracted from what I pasted in #42 (comment). I guess only the "number of jobs running", as that is basically the count of elements in Que.worker_states. The rest of the stats can be collected via our internal log subscriber, which already logs all this info to the standard log. The remaining issue is how to publish those. Using local memory, as the Prometheus Ruby client does, is not compatible with Puma's cluster mode. We are not running cluster mode right now, but possibly could in the future. One issue I see with keeping it in local memory is that if the process crashes or gets killed for whatever reason, the information is lost. I guess the easiest option for now is to just use local memory and investigate the use of statsd in the future.
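A rough illustration of the log-subscriber route, assuming ActiveJob's standard instrumentation event and the prometheus-client gem (the wiring and metric names are just a sketch, not what Zync currently does):

```ruby
require 'active_support/notifications'
require 'prometheus/client'

registry = Prometheus::Client.registry

jobs_performed = Prometheus::Client::Counter.new(
  :jobs_performed_total, docstring: 'Jobs performed', labels: [:job_class]
)
job_duration = Prometheus::Client::Histogram.new(
  :job_duration_seconds, docstring: 'Job duration', labels: [:job_class]
)
registry.register(jobs_performed)
registry.register(job_duration)

# "perform.active_job" is the standard ActiveJob instrumentation event.
ActiveSupport::Notifications.subscribe('perform.active_job') do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  labels = { job_class: event.payload[:job].class.name }
  jobs_performed.increment(labels: labels)
  job_duration.observe(event.duration / 1000.0, labels: labels)  # ms -> s
end
```

Note this still keeps the counters in the process's local memory, so the cluster-mode and process-restart caveats above apply unchanged.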
Any plans to use the push gateway? https://github.com/prometheus/client_ruby#pushgateway
Deployment of the pushgateway is more or less easy; if you think you'd need it, let me know and I'll deploy it.
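For completeness, pushing from client_ruby would look roughly like this (gateway URL and job name are placeholders; the keyword-argument form below is from newer client_ruby releases, older ones use positional arguments):

```ruby
require 'prometheus/client'
require 'prometheus/client/push'

registry = Prometheus::Client.registry
# ...register and update metrics as above...

# Push the whole registry to a Pushgateway.
Prometheus::Client::Push.new(
  job: 'zync',
  gateway: 'http://pushgateway.example.svc:9091'
).add(registry)
```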
I'd maybe try to keep it simple and start with the library and a reasonably frequent scrape; as long as processes aren't dying too often, we could get something useful but simple without the Pushgateway? If needed, then OK... but maybe not make it too complicated to start with, and see if we have problems with that approach or not?
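The simple scrape-based setup could be as small as mounting client_ruby's bundled Rack middleware (a sketch; Zync's actual config.ru may look different):

```ruby
# config.ru - sketch of the plain scrape approach with client_ruby's
# bundled Rack middleware.
require 'prometheus/middleware/collector'
require 'prometheus/middleware/exporter'

use Prometheus::Middleware::Collector   # records per-request metrics
use Prometheus::Middleware::Exporter    # serves them on GET /metrics

run Rails.application
```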
#69 exposed job stats in a text format for Prometheus.
While trying to monitor Zync, we can only rely on using the "/status/live" endpoint, which doesn't provide internal information...
So, my question:
Some ideas:
Thanks.