Because different projects may want to track different data, and because the volume of events on high-activity projects makes it unreasonable to store every bit of information, vossibility-collector was designed to be highly configurable. The project uses the `toml` file format for its configuration. Example files can be found in the `examples/` directory.
Element | Type | Description |
---|---|---|
`elasticsearch` | String | Host or address of the Elasticsearch server |
`github_api_token` | String | Optional GitHub API token (important for API rate limiting) |
`sync_periodicity` | String | Interval at which the complete state of repositories is synced (`hourly`, `daily`, or `weekly`) |
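As an illustration, the top-level settings above might look like this (the host, token, and interval values are placeholders, not defaults):

```toml
# Top-level settings (placeholder values).
elasticsearch = "http://localhost:9200"
github_api_token = "0123456789abcdef"   # optional, raises API rate limits
sync_periodicity = "daily"              # one of "hourly", "daily", "weekly"
```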
The `[nsq]` section defines configuration relative to the NSQ queue.
Element | Type | Description |
---|---|---|
`channel` | String | NSQ channel to listen on |
`lookupd` | String | Address of the lookupd server |
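For example, a minimal `[nsq]` section could look like this (the channel name and address are assumptions):

```toml
[nsq]
channel = "vossibility"        # NSQ channel to listen on
lookupd = "127.0.0.1:4161"     # address of the lookupd server
```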
The `[repositories]` section defines a collection of tables (in `toml` terminology), each of which defines a GitHub repository for which data should be collected.
Element | Type | Description |
---|---|---|
`user` | String | GitHub username |
`repo` | String | GitHub repository |
`topic` | String | NSQ topic to listen on for this repository |
`events` | String | Optional reference to an event set (defaults to `"default"`) |
`start_index` | String | Optional starting issue number for high-activity repositories |
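Putting these elements together, a hypothetical repository entry might read as follows (the table name, topic, and values are made up):

```toml
[repositories.docker]
user = "docker"            # GitHub username
repo = "docker"            # GitHub repository
topic = "hooks-docker"     # NSQ topic for this repository
events = "default"         # optional event set reference
start_index = "10000"      # optional starting issue number
```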
The `[event_set]` section defines a collection of tables (in `toml` terminology), each of which defines a list of GitHub events to subscribe to, along with their related transformations.
In an event set definition, each key should be a valid GitHub event identifier, and each value a string that references a given transformation.
We also require two special entries in an event set definition: `snapshot_issue` (which applies to snapshotted issue content) and `snapshot_pull_request` (which applies to snapshotted pull request content). These are mandatory because we assume that issues and pull requests are always stored.
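As a sketch, an event set honoring these requirements could look like the following (the event-to-transformation pairings are illustrative):

```toml
[event_set.default]
# Mandatory entries for snapshotted issues and pull requests.
snapshot_issue = "issue"
snapshot_pull_request = "pull_request"
# Regular GitHub events mapped to transformation names.
issues = "issue_event"
pull_request = "pull_request_event"
```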
The `[transformation]` section is both the most complex and the most interesting section. It defines a collection of tables (in `toml` terminology), each of which defines the data model for storing a particular GitHub object.
A GitHub event is very rich. For example, a simple pull request event has hundreds of fields, most of which you'll probably never query or use. That's the point of a transformation: it lets you define your very own data model for the things you care about, and provide a configuration-style definition of the way that data should be filled.
For this, the project relies on a modified version of the standard `text/template` Go package. This allows you to use the standard Go template DSL to describe your data mapping (think of it as a text-based API for runtime reflection).
Let's take a concrete example:

```toml
[transformations.issue]
assignee = "{{ .assignee }}"
author = "{{ user_data .user.login }}"
body = "{{ .body }}"
closed_at = "{{ .closed_at }}"
comments = "{{ .comments }}"
created_at = "{{ .created_at }}"
labels = "{{ range .labels }}{{ .name }}{{ end }}"
locked = "{{ .locked }}"
milestone = "{{ if .milestone }}{{ .milestone.title }}{{ end }}"
number = "{{ .number }}"
state = "{{ .state }}"
updated_at = "{{ .updated_at }}"
```
In a transformation definition, the resulting data structure will have one field per key in the table. In this case, the result of applying that transformation to a GitHub object will be a structure with an `assignee` field, whose value is the result of applying the template `{{ .assignee }}` to the source object. This is a trivial field: the output value is directly that of the input.
But templates can do much more! One thing they can do is flow control, such as loops or tests. For example, the `milestone` field in a GitHub issue can either be `nil`, or an object containing multiple fields, of which we only care about the `title` in this particular example. The template language allows us to express that we specifically care about the `title` field of a non-`nil` milestone; in all other cases, the resulting `milestone` field will be `nil`.
The same goes for labels, which are an array of objects in the GitHub model, while in most cases we really just care about the label `name` rather than its `url` or `color` code. The standard template `range` construct allows us to loop over each label and extract exactly the field we're interested in. In that case, the output `labels` field will be an array of strings, not an array of objects.
Finally, the template language supports functions. Looking at the `author` field, you'll see that its value should be the result of evaluating `{{ user_data .user.login }}`. That means: apply the function `user_data` to the result of evaluating `.user.login` on the source object (`user_data` is a builtin function described below).
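This is not the project's actual code, but the mechanism can be sketched with the standard `text/template` package: a `FuncMap` makes a function such as `user_data` callable from templates. Here, an in-memory map stands in for the real Elasticsearch lookup:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderAuthor applies a user_data-style template to a source object.
// The users map below is a stand-in for the real Elasticsearch query.
func renderAuthor(src map[string]any) string {
	users := map[string]string{"icecrime": "Docker, Inc."}
	funcs := template.FuncMap{
		"user_data": func(login string) string { return users[login] },
	}
	// Funcs must be registered before Parse so the template can use them.
	tmpl := template.Must(template.New("author").Funcs(funcs).
		Parse(`{{ user_data .user.login }}`))
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, src); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	src := map[string]any{"user": map[string]any{"login": "icecrime"}}
	fmt.Println(renderAuthor(src)) // prints "Docker, Inc."
}
```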
We currently support the following builtin functions:

- `apply_transformation` takes a single argument and uses it as the name of a transformation to apply to a sub-element of the source object. This is particularly useful, for example, for a GitHub `issue_event` that contains a nested issue object to which we'd like to apply the same transformation we usually do.
- `context` is a constant function that returns contextual information about the data being processed. It currently has a single field `Repository` that exposes two methods: `FullName` (e.g., `docker/docker`) and `PrettyName` (e.g., `engine (docker:docker)`).
- `days_difference` computes the difference between two GitHub-formatted dates and returns the result as a floating-point number of days.
- `user_data` takes a single argument, uses its value as a document identifier to query an object of type `user` in the index `users` of the Elasticsearch backend, and returns the source JSON. This effectively allows enriching the value of a field with information from the database: in our case, it allows replacing a single login such as `icecrime` with a structured object that contains information about the person's employer and maintainer status.
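To illustrate, a transformation might combine these builtins as sketched below. The field names, the nesting, and the exact calling conventions are assumptions based on the descriptions above, not verified syntax:

```toml
[transformations.issue_event]
event = '{{ .event }}'
# Re-apply the "issue" transformation to the nested issue object
# (exact calling convention assumed).
issue = '{{ apply_transformation "issue" }}'
# Age of the issue in days at the time of the event.
age = '{{ days_difference .issue.created_at .created_at }}'
# Repository context, e.g., "docker/docker".
repository = '{{ (context).Repository.FullName }}'
```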
The `[mapping]` section defines configuration relative to the Elasticsearch mapping mechanism. It supports a single value of type string array that defines a list of patterns that should not be analyzed by Elasticsearch. In particular, anything that relates to labels, user names, or company names is something you most likely do not want analyzed using the default analyzer.
Element | Type | Description |
---|---|---|
`not_analyzed` | String array | Patterns to exclude from the analyzer |
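For instance (the patterns here are illustrative):

```toml
[mapping]
not_analyzed = [ "*.login", "*.labels", "*.company" ]
```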
The `[functions]` section defines a collection of user-defined functions that can later be used in transformation definitions. Each key defines the name of a user-defined function, and the value should be the path of an executable. During template evaluation, arguments to the function are passed as command-line arguments to the executable, and its output is parsed as JSON and returned as the result of the function. The executable receives the `$ELASTICSEARCH` environment variable, whose value is the URL of the Elasticsearch server.
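As an illustration, assuming a hypothetical `[functions]` entry such as `employer = "/usr/local/bin/employer.sh"`, the referenced executable only needs to print JSON on stdout. A minimal sketch (the script name and output shape are made up):

```shell
# Hypothetical user-defined function executable (employer.sh).
# Template arguments arrive as argv; $ELASTICSEARCH holds the URL of the
# Elasticsearch server. Output must be valid JSON.
login="${1:-unknown}"
printf '{"login": "%s", "source": "employer.sh"}\n' "$login"
```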