Provide internal stats for better monitoring/alerting #42

Open
jmprusi opened this issue Aug 8, 2017 · 10 comments
@jmprusi

jmprusi commented Aug 8, 2017

While trying to monitor Zync, we can only rely on the "/status/live" endpoint, which doesn't provide internal information...

So, my question:

  • Would it be possible to provide internal stats (retries, requests, failed requests, latencies, ...)?

Some ideas:

Thanks.

@mikz
Contributor

mikz commented Aug 8, 2017

So there are a few stats about jobs available that can be exported easily:

Que.job_stats
=> [{"queue"=>"", "job_class"=>"ActiveJob::QueueAdapters::QueAdapter::JobWrapper", "count"=>36, "count_working"=>0, "count_errored"=>36, "highest_error_count"=>11, "oldest_run_at"=>2017-08-08 12:21:52 +0000}]

Que.worker_states # only when some work is being processed
=> [{"priority"=>100,
  "run_at"=>2017-08-08 12:23:02 +0000,
  "job_id"=>227,
  "job_class"=>"ActiveJob::QueueAdapters::QueAdapter::JobWrapper",
  "args"=>[{"job_class"=>"ProcessEntryJob", "job_id"=>"92f6a5eb-ce46-485c-9ff5-fa43e87286d7", "provider_job_id"=>nil, "queue_name"=>"default", "priority"=>nil, "arguments"=>[{"_aj_globalid"=>"gid://zync/Entry/91"}], "executions"=>0, "locale"=>"en"}],
  "error_count"=>0,
  "last_error"=>nil,
  "queue"=>"",
  "pg_backend_pid"=>45435,
  "pg_state"=>"idle",
  "pg_state_changed_at"=>2017-08-08 12:23:02 +0000,
  "pg_last_query"=>
   "SELECT a.attname\n" +
   "  FROM (\n" +
   "         SELECT indrelid, indkey, generate_subscripts(indkey, 1) idx\n" +
   "           FROM pg_index\n" +
   "          WHERE indrelid = '\"integrations\"'::regclass\n" +
   "            AND indisprimary\n" +
   "       ) i\n" +
   "  JOIN pg_attribute a\n" +
   "    ON a.attrelid = i.indrelid\n" +
   "   AND a.attnum = i.indkey[i.idx]\n" +
   " ORDER BY i.idx\n",
  "pg_last_query_started_at"=>2017-08-08 12:23:02 +0000,
  "pg_transaction_started_at"=>nil,
  "pg_waiting_on_lock"=>false}]

Everything that is not in the database and lives only in memory would not be easy to export, since in-memory state is per process and, for example, would not work with Puma cluster mode, which starts several processes.

Puma has a control endpoint that can provide some stats, but no latency; just the number of workers running, etc.

What metrics exactly are you interested in?

We have a custom logger that in theory could aggregate some stats, but then there is the issue of running multiple processes in one pod, where this would not work (without exporting to something like statsd).
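For illustration, a minimal sketch (not Zync's actual code) of how the DB-backed Que stats above could be rendered in the Prometheus text exposition format by a small Rack endpoint; because the numbers come from Postgres, this would work across multiple Puma processes. The metric names and the /metrics mounting are assumptions.

require 'que'

# A tiny Rack app rendering Que's database-backed stats as Prometheus text.
class QueMetricsApp
  def call(_env)
    lines = Que.job_stats.flat_map do |stat|
      labels = %(job_class="#{stat['job_class']}",queue="#{stat['queue']}")
      ["que_jobs_total{#{labels}} #{stat['count']}",
       "que_jobs_errored{#{labels}} #{stat['count_errored']}"]
    end
    lines << "que_workers_busy #{Que.worker_states.size}"
    [200, { 'Content-Type' => 'text/plain; version=0.0.4' }, [lines.join("\n") << "\n"]]
  end
end

# Hypothetical mounting in config.ru:
#   map('/metrics') { run QueMetricsApp.new }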

@andrewdavidmackenzie
Member

OpenShift is moving to use Prometheus.

The Innovation week project proved Prometheus to be the best option for us, and so we're moving our monitoring to be based on it too.

I think this is one of those issues where maybe it's not the most optimal locally for one project, but it's the best option globally across our many projects and different infrastructure pieces (and it is now starting to become part of the base platform all our stuff will run on). So for standardization, making it "shared knowledge", and easing Operations, I'd like us to enable Prometheus monitoring on as many of our workloads as possible.

TBD "how".
I don't really like linking in the monitoring solution into the application code (but that might end up being a necessary evil...).

If there was a way the export of the stats from the app could be fairly generic (a bunch of text and numbers in flat files, or in STDOUT....or something) that is independent from any particular monitoring solution (but could be picked up by an exported on the machine) then that would be idea....

If the idea solution can't be done, I think we should live with the necessary evil (maybe wrap the stats exporting code in a wrapper class to avoid polluting app code directly with Prometheus?) and make progress, enable monitoring while standardizing and easing our Ops lives.
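As a rough sketch of that "flat text picked up by an exporter" idea, assuming node_exporter's textfile collector (the path, metric names, and interval below are made up):

require 'que'

# Hypothetical directory watched by node_exporter's textfile collector.
TEXTFILE = '/var/lib/node_exporter/textfile/zync.prom'

loop do
  stats = Que.job_stats
  body  = +"zync_que_jobs_total #{stats.sum { |s| s['count'] }}\n"
  body << "zync_que_jobs_errored #{stats.sum { |s| s['count_errored'] }}\n"
  # Write to a temp file and rename so the collector never reads a partial file.
  File.write("#{TEXTFILE}.tmp", body)
  File.rename("#{TEXTFILE}.tmp", TEXTFILE)
  sleep 30
end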

@jmprusi
Author

jmprusi commented Aug 8, 2017

It would be nice to know:

  • Number of jobs running
  • Latencies per job
  • OKs per job
  • Retries per job
  • Failures per job

So we can create alerts based on high latency, number of failed jobs, too many retries, etc.

If you have some total counters, I can "try" to extract rates based on the increase over time...
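For reference, a minimal sketch of how these per-job numbers could be tallied from ActiveJob's "perform.active_job" notifications (the counter names and structure are illustrative, not Zync's; retries would need extra bookkeeping, and as noted below the tallies live only in process memory):

require 'active_support/notifications'

# In-process tallies keyed by job class name.
JOB_STATS = Hash.new { |hash, key| hash[key] = { ok: 0, failed: 0, total_ms: 0.0 } }

ActiveSupport::Notifications.subscribe('perform.active_job') do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  job   = event.payload[:job]
  stats = JOB_STATS[job.class.name]
  event.payload[:exception] ? stats[:failed] += 1 : stats[:ok] += 1
  stats[:total_ms] += event.duration # duration is in milliseconds
end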

@mikz
Contributor

mikz commented Aug 8, 2017

@jmprusi looks like those stats can't really be extracted from what I pasted in #42 (comment).

I guess only the "number of jobs running", as that is basically the count of elements in worker_states.
Jobs that are successfully completed are removed from the database.

The rest of the stats can be collected via our internal log subscriber, which already logs all this info to the standard log.

The remaining issue is how to publish those. Using local memory, as the Prometheus Ruby client does, is not compatible with Puma cluster mode.

We are not running cluster mode right now, but possibly could in the future.

One issue I see with keeping it in local memory is that if the process crashes or gets killed for whatever reason, the information is lost.

I guess the easiest option for now is to just use local memory and investigate the use of statsd in the future.
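A sketch of that local-memory option with the prometheus-client gem, following the 0.x-era API (the gem's interface has changed between versions, and the metric name is an assumption); "number of jobs running" comes straight from Que.worker_states:

require 'prometheus/client'
require 'que'

registry     = Prometheus::Client.registry
jobs_running = Prometheus::Client::Gauge.new(:zync_jobs_running, 'Jobs currently being worked')
registry.register(jobs_running)

# Refresh the gauge periodically (or right before each scrape).
jobs_running.set({}, Que.worker_states.size)

# The registry would then be exposed over HTTP via the gem's Rack exporter
# middleware (its constant name differs between client versions).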

@mikz
Contributor

mikz commented Aug 8, 2017

Any plans to use the push gateway? https://github.com/prometheus/client_ruby#pushgateway
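For context, the Pushgateway usage from the linked README looks roughly like this (a sketch only; the constructor took positional arguments in older client_ruby releases and keyword arguments in newer ones, and the gateway URL is made up):

require 'prometheus/client'
require 'prometheus/client/push'

registry = Prometheus::Client.registry
# ... register and update metrics as usual, then push the whole registry:
Prometheus::Client::Push.new(
  job:     'zync',                           # grouping key on the gateway
  gateway: 'http://pushgateway.example:9091' # hypothetical address
).add(registry)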

@jmprusi
Author

jmprusi commented Aug 10, 2017

@mikz it has not been deployed yet as part of the monitoring stack, but we can talk about it.

@orimarti can you evaluate the deployment of the pushgateway?

@orimarti

Deployment of the pushgateway is more or less easy; if you think you'll need it, let me know and I'll deploy it.

@andrewdavidmackenzie
Member

I'd maybe try to keep it simple and start with the library and a reasonably frequent scrape; assuming not too many problems with processes dying, we could get something useful but simple without the Push Gateway?

If needed, then OK... but maybe not make it too complicated to start with, and see whether we have problems with that approach or not?

@jmprusi
Author

jmprusi commented Aug 23, 2017

@orimarti I will assign this one to you... so you can look for the best way to monitor Zync with @mikz

@mikz
Contributor

mikz commented Oct 5, 2017

#69 exposed job stats in a text format for Prometheus

@mikz mikz removed their assignment Jan 21, 2022