This tool relies on poetry
for dependency management. If you already have poetry
installed on your system, simply poetry update
to pull in all needed dependencies.
Leveraging the Google output requires poetry add opencensus opencensus-ext-stackdriver
. We normally keep these dependencies out of the program to significantly reduce install size and build time. You will have problems with this output as Stackdriver doesn't allow negative numbers in custom metrics.
Github doesn't support organization-level auth tokens (yet), so a user must make a personal auth token to get permissions for this tool to work. The upstream docs describe the basic pattern, but we need the following permissions on the token for it to work:
admin:org
admin:repo_hook
delete:packages
notifications
repo
workflow
write:packages
Configuration largely defines the repos (and any custom work to do in the repo) and the output location.
ssh urls (e.g. git@github.com:civic-eagle/github-stats-collector.git) for Github repos has not been tested in this tool and likely won't work.
We can leverage the backfill-stats.py
script to loop over longer time ranges and fill in data:
/app/backfill-stats.py -c /app/config.yml --start-timestamp 1650283200 --stop-timestamp 1652835600 --timestamp-step 3600 --sleep-time 1800
Example above would collect data between Mon Apr 18 12:00:00 UTC 2022 and Wednesday, May 18, 2022 1:00:00 AM at one hour intervals. Or put another way, it would perform 720 individual runs of the application.
We'll adhere to the OpenMetrics standard as much as possible:
https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md
Because we may want additional or aggregate labels for tracking work in the repository, we can create these matching groups in our config. Any custom labels added will override the default labels collected.
Because Github API endpoints don't return consistent user names between each of them, we may see some overlap of users, but we attempt to get user names as close to consistent (one mapping per user) as possible.
The most frequently avaiable user mapping is the full user name (e.g. Joe Smith
vs. jsmith
as a login) so we leverage that for our user-level stats.
Every individual series generated with this tool will produce a JSON-compatible object like the following:
{
"description": "all created PRs by user",
"labels": {
"repository_name": "<>",
"user": "<>"
},
"measurement_type": "count",
"name": "users_total_pull_requests",
"value": 0
}
These produced metrics are analogous to a JSON representation of a prometheus stat (which matches with the OpenMetrics standard we mention above).
description
- string that gives details about this particular metric (needed for Stackdriver output)
labels
"Key": "Value"
pairs that describe addtional dimensions for querying this metric (directly correlates with "tags" in typical TSDB parlance)
measurement_type
- what type of measurement are we producing (these are internal relations and cannot be directly leveraged by other systems)
name
- the actual metric name
value
- the actual value of the series
To lay this out simply: A metric is comprised of many series. A series is a single example of a metric with all associated labels/tags expanded.
So in our example above, the metric is users_total_pull_requests
while the series is the entire JSON object. Many SaaS systems will charge per series created, not simply per metric, so be mindful of this distinction when storing this data.