This repository has been archived by the owner on Dec 5, 2019. It is now read-only.

Count all the things #528

Open
7 of 10 tasks
robhudson opened this issue Jun 2, 2017 · 1 comment

Comments

@robhudson
Member

robhudson commented Jun 2, 2017

Instrument the code (either with statsd calls that we can send to Datadog, or with log items that will land in ES/Kibana) so we can generate dashboards and possibly notifications.

This is a follow-up to #418.

List of possible metrics (area of concern: CLUSTER = on-demand clusters, SPARKJOB = scheduled Spark jobs):

  • cluster-normalized-instance-hours: Normalized instance hours of clusters (time between creation and finish multiplied by cluster size)
  • cluster-ready: Number of on-demand clusters spun up successfully (to see trends in usage)
  • cluster-extension: Number of cluster lifetime extensions
  • CLUSTER / SPARKJOB: Number of AWS API error responses and which kind (e.g. throttling exception)
  • CLUSTER / SPARKJOB: Number of Python errors/exceptions via Sentry (to see code regressions)
  • CLUSTER / SPARKJOB: Number of bootstrapping failures during cluster start up (to track issues with EMR bootstrap script)
  • cluster-time-to-ready / sparkjob-time-to-ready: Time between cluster creation (for both scheduled Spark jobs and on-demand clusters) and its readiness to process the first step (the "bootstrapping time" from the user perspective)
  • cluster-emr-version / sparkjob-emr-version: EMR version used for cluster
  • sparkjob-run-time: Time between the cluster's readiness to process the first step and the time when the cluster is shut down (the "runtime of the notebook code" from the user perspective)
  • sparkjob-normalized-instance-hours: Normalized instance hours of scheduled jobs
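
A rough sketch of how a few of the metrics above could be recorded. The `InMemoryMetrics` class is a hypothetical stand-in for a statsd client (it only mirrors the usual `incr` / `timing` / `gauge` method names), and `normalized_instance_hours` follows the definition in the list (lifetime multiplied by cluster size), which may differ from how AWS computes its own figure.

```python
from collections import Counter
from datetime import datetime, timedelta


class InMemoryMetrics:
    """Hypothetical stand-in for a statsd client; keeps values in memory."""

    def __init__(self):
        self.counters = Counter()
        self.timings = {}
        self.gauges = {}

    def incr(self, name, count=1):
        self.counters[name] += count

    def timing(self, name, delta):
        # Store the duration in seconds for each recorded timing.
        self.timings.setdefault(name, []).append(delta.total_seconds())

    def gauge(self, name, value):
        self.gauges[name] = value


def normalized_instance_hours(created_at, finished_at, size):
    """Hours between creation and finish, multiplied by cluster size."""
    hours = (finished_at - created_at).total_seconds() / 3600
    return hours * size


# Example: a 5-node on-demand cluster that became ready after 10 minutes
# and finished after 3 hours (all values are made up for illustration).
metrics = InMemoryMetrics()
created = datetime(2017, 6, 2, 12, 0)
ready = created + timedelta(minutes=10)
finished = created + timedelta(hours=3)

metrics.incr("cluster-ready")
metrics.timing("cluster-time-to-ready", ready - created)
metrics.gauge(
    "cluster-normalized-instance-hours",
    normalized_instance_hours(created, finished, size=5),
)
```

Swapping the stand-in for a real statsd client would mostly be a matter of matching these three method names; the metric names are the ones proposed in the list.
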
@rafrombrc rafrombrc added this to the m3 milestone Jun 14, 2017
@rafrombrc rafrombrc modified the milestones: m4, m3 Jun 22, 2017
robhudson added a commit that referenced this issue Jul 19, 2017
robhudson added a commit that referenced this issue Jul 20, 2017
robhudson added a commit that referenced this issue Jul 25, 2017
robhudson added a commit that referenced this issue Jul 25, 2017
robhudson added a commit that referenced this issue Aug 15, 2017
robhudson added a commit that referenced this issue Aug 15, 2017
@rafrombrc rafrombrc modified the milestones: m4, m5 Sep 20, 2017
@robhudson
Member Author

robhudson commented Sep 25, 2017

For the Python errors, we discussed this in IRC:

13:49 <•jezdez> so raven has processors: https://github.com/getsentry/raven-python/blob/master/raven/processors.py
13:49 <•jezdez> which are called whenever an error happens
13:49 <•jezdez> I think we could have one that listens for botocore exceptions and we can record them
13:50 <•jezdez> you can configure the sentry client with the processors to use our custom processor
13:50 <•jezdez> https://docs.sentry.io/clients/python/advanced/#client-arguments
13:52 <•jezdez> that would work for both celery and wsgi
13:52 <•jezdez> since the processors are called for either
13:53 <•jezdez> you'd have to be careful with database transactions during the calls
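
A minimal sketch of the processor idea from the IRC discussion, not a tested integration. It assumes the event dict carries exceptions under `data["exception"]["values"]` with `module` and `type` keys. `BotocoreErrorCounter` and `aws_error_counts` are hypothetical names, and the class only imitates the `process(data)` shape of a raven processor rather than subclassing `raven.processors.Processor`, so the example runs without raven installed.

```python
from collections import Counter

# Hypothetical in-memory tally; a real setup would likely emit a statsd counter.
aws_error_counts = Counter()


class BotocoreErrorCounter:
    """Shaped like a raven processor: receives the event dict, returns it."""

    def process(self, data, **kwargs):
        # Assumed payload layout: a list of exception dicts under "values".
        for exc in data.get("exception", {}).get("values", []):
            module = exc.get("module") or ""
            if module.startswith("botocore"):
                aws_error_counts[exc.get("type", "unknown")] += 1
        # Return the event unchanged so error reporting continues as before.
        return data


# Example event with one botocore exception and one unrelated exception.
event = {
    "exception": {
        "values": [
            {"module": "botocore.exceptions", "type": "ClientError"},
            {"module": "builtins", "type": "ValueError"},
        ]
    }
}
processed = BotocoreErrorCounter().process(event)
```

Counting in memory (or via a fire-and-forget statsd call) rather than writing to the database sidesteps the transaction concern raised at the end of the discussion.
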
