Provide internal stats for better monitoring/alerting #42

Open
jmprusi opened this issue Aug 8, 2017 · 10 comments
@jmprusi

jmprusi commented Aug 8, 2017

While trying to monitor Zync, we can only rely on the "/status/live" endpoint, which doesn't provide internal information...

So, my question:

  • Would it be possible to provide internal stats (retries, requests, failed requests, latencies, ...)?

Some ideas:

Thanks.

@mikz
Contributor

mikz commented Aug 8, 2017

So there are a few stats about jobs available that can be exported easily:

Que.job_stats
=> [{"queue"=>"", "job_class"=>"ActiveJob::QueueAdapters::QueAdapter::JobWrapper", "count"=>36, "count_working"=>0, "count_errored"=>36, "highest_error_count"=>11, "oldest_run_at"=>2017-08-08 12:21:52 +0000}]

Que.worker_states # only when some work is being processed
=> [{"priority"=>100,
  "run_at"=>2017-08-08 12:23:02 +0000,
  "job_id"=>227,
  "job_class"=>"ActiveJob::QueueAdapters::QueAdapter::JobWrapper",
  "args"=>[{"job_class"=>"ProcessEntryJob", "job_id"=>"92f6a5eb-ce46-485c-9ff5-fa43e87286d7", "provider_job_id"=>nil, "queue_name"=>"default", "priority"=>nil, "arguments"=>[{"_aj_globalid"=>"gid://zync/Entry/91"}], "executions"=>0, "locale"=>"en"}],
  "error_count"=>0,
  "last_error"=>nil,
  "queue"=>"",
  "pg_backend_pid"=>45435,
  "pg_state"=>"idle",
  "pg_state_changed_at"=>2017-08-08 12:23:02 +0000,
  "pg_last_query"=>
   "SELECT a.attname\n" +
   "  FROM (\n" +
   "         SELECT indrelid, indkey, generate_subscripts(indkey, 1) idx\n" +
   "           FROM pg_index\n" +
   "          WHERE indrelid = '\"integrations\"'::regclass\n" +
   "            AND indisprimary\n" +
   "       ) i\n" +
   "  JOIN pg_attribute a\n" +
   "    ON a.attrelid = i.indrelid\n" +
   "   AND a.attnum = i.indkey[i.idx]\n" +
   " ORDER BY i.idx\n",
  "pg_last_query_started_at"=>2017-08-08 12:23:02 +0000,
  "pg_transaction_started_at"=>nil,
  "pg_waiting_on_lock"=>false}]

Everything that is not in the database and lives only in memory would not be easy to export, since in-memory state is per process and, for example, would not work with Puma cluster mode, which starts several processes.

Puma has a control endpoint that can provide some stats, but no latency; just the number of workers running, etc.

What metrics exactly are you interested in?

We have a custom logger that in theory could aggregate some stats, but then there is the issue of running multiple processes in one pod, where this would not work (without exporting to something like statsd).
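For illustration, a minimal sketch (not Zync's actual code) of how the DB-backed Que stats above could be rendered in the Prometheus text exposition format by a small Rack endpoint; because the numbers come from Postgres, this would work across multiple Puma processes. The metric names and the /metrics mounting are assumptions.

require 'que'

# A tiny Rack app rendering Que's database-backed stats as Prometheus text.
class QueMetricsApp
  def call(_env)
    lines = Que.job_stats.flat_map do |stat|
      labels = %(job_class="#{stat['job_class']}",queue="#{stat['queue']}")
      ["que_jobs_total{#{labels}} #{stat['count']}",
       "que_jobs_errored{#{labels}} #{stat['count_errored']}"]
    end
    lines << "que_workers_busy #{Que.worker_states.size}"
    [200, { 'Content-Type' => 'text/plain; version=0.0.4' }, [lines.join("\n") << "\n"]]
  end
end

# Hypothetical mounting in config.ru:
#   map('/metrics') { run QueMetricsApp.new }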

@andrewdavidmackenzie
Member

OpenShift is moving to use Prometheus.

The Innovation week project proved Prometheus to be the best option for us, and so we're moving our monitoring to be based on it too.

I think this is one of those issues where maybe it's not the most optimal locally for one project, but it's the best option globally across our many projects and different infrastructure pieces (and it is now starting to become part of the base platform all our stuff will run on). So for standardization, making it "shared knowledge", and easing Operations, I'd like us to enable Prometheus monitoring on as many of our workloads as possible.

TBD "how".
I don't really like linking in the monitoring solution into the application code (but that might end up being a necessary evil...).

If there was a way the export of the stats from the app could be fairly generic (a bunch of text and numbers in flat files, or in STDOUT....or something) that is independent from any particular monitoring solution (but could be picked up by an exported on the machine) then that would be idea....

If the idea solution can't be done, I think we should live with the necessary evil (maybe wrap the stats exporting code in a wrapper class to avoid polluting app code directly with Prometheus?) and make progress, enable monitoring while standardizing and easing our Ops lives.
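As a rough sketch of that "flat text picked up by an exporter" idea, assuming node_exporter's textfile collector (the path, metric names, and interval below are made up):

require 'que'

# Hypothetical directory watched by node_exporter's textfile collector.
TEXTFILE = '/var/lib/node_exporter/textfile/zync.prom'

loop do
  stats = Que.job_stats
  body  = +"zync_que_jobs_total #{stats.sum { |s| s['count'] }}\n"
  body << "zync_que_jobs_errored #{stats.sum { |s| s['count_errored'] }}\n"
  # Write to a temp file and rename so the collector never reads a partial file.
  File.write("#{TEXTFILE}.tmp", body)
  File.rename("#{TEXTFILE}.tmp", TEXTFILE)
  sleep 30
end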

@jmprusi
Author

jmprusi commented Aug 8, 2017

It would be nice to know:

  • Number of jobs running
  • Latencies per job
  • OKs per job
  • Retries per job
  • Failures per job

So we can create alerts based on high latency, number of failed jobs, too many retries, etc.

If you have some total counters, I can "try" to extract rates based on the increase over time...
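For reference, a minimal sketch of how these per-job numbers could be tallied from ActiveJob's "perform.active_job" notifications (the counter names and structure are illustrative, not Zync's; retries would need extra bookkeeping, and as noted below the tallies live only in process memory):

require 'active_support/notifications'

# In-process tallies keyed by job class name.
JOB_STATS = Hash.new { |hash, key| hash[key] = { ok: 0, failed: 0, total_ms: 0.0 } }

ActiveSupport::Notifications.subscribe('perform.active_job') do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  job   = event.payload[:job]
  stats = JOB_STATS[job.class.name]
  event.payload[:exception] ? stats[:failed] += 1 : stats[:ok] += 1
  stats[:total_ms] += event.duration # duration is in milliseconds
end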

@mikz
Contributor

mikz commented Aug 8, 2017

@jmprusi looks like those stats can't really be extracted from what I pasted in #42 (comment).

I guess only the "number of jobs running", as that is basically the count of elements in worker_states.
Jobs that are successfully completed are removed from the database.

The rest of the stats can be collected via our internal log subscriber, which already logs all this info to the standard log.

The remaining issue is how to publish those. Using local memory, as the Prometheus Ruby client does, is not compatible with Puma cluster mode.

We are not running cluster mode right now, but possibly could in the future.

One issue I see with keeping it in local memory is that if the process crashes or gets killed for whatever reason, the information is lost.

I guess the easiest option for now is to just use local memory and investigate the use of statsd in the future.
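A sketch of that local-memory option with the prometheus-client gem, following the 0.x-era API (the gem's interface has changed between versions, and the metric name is an assumption); "number of jobs running" comes straight from Que.worker_states:

require 'prometheus/client'
require 'que'

registry     = Prometheus::Client.registry
jobs_running = Prometheus::Client::Gauge.new(:zync_jobs_running, 'Jobs currently being worked')
registry.register(jobs_running)

# Refresh the gauge periodically (or right before each scrape).
jobs_running.set({}, Que.worker_states.size)

# The registry would then be exposed over HTTP via the gem's Rack exporter
# middleware (its constant name differs between client versions).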

@mikz
Contributor

mikz commented Aug 8, 2017

Any plans to use the push gateway? https://github.com/prometheus/client_ruby#pushgateway
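For context, the Pushgateway usage from the linked README looks roughly like this (a sketch only; the constructor took positional arguments in older client_ruby releases and keyword arguments in newer ones, and the gateway URL is made up):

require 'prometheus/client'
require 'prometheus/client/push'

registry = Prometheus::Client.registry
# ... register and update metrics as usual, then push the whole registry:
Prometheus::Client::Push.new(
  job:     'zync',                           # grouping key on the gateway
  gateway: 'http://pushgateway.example:9091' # hypothetical address
).add(registry)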

@jmprusi
Author

jmprusi commented Aug 10, 2017

@mikz it has not been deployed yet as part of the monitoring stack, but we can talk about it.

@orimarti can you evaluate the deployment of the pushgateway?

@orimarti

Deployment of the pushgateway is more or less easy; if you think you'll need it, let me know and I'll deploy it.

@andrewdavidmackenzie
Member

I'd maybe try to keep it simple and start with the library and a reasonably frequent scrape; assuming not too many problems with processes dying, we could get something useful but simple without the Push Gateway?

If needed, then OK... but maybe not make it too complicated to start with, and see whether we have problems with that approach or not?

@jmprusi
Author

jmprusi commented Aug 23, 2017

@orimarti I will assign this one to you... so you can look for the best way to monitor Zync with @mikz

@mikz
Contributor

mikz commented Oct 5, 2017

#69 exposed job stats in a text format for Prometheus

@mikz mikz removed their assignment Jan 21, 2022