Create an mkdocs-based site (#732)
This substantially reorganizes the documentation as an mkdocs site. Main
changes:

* All documentation is now browsable and searchable in a single site, with a
  handy table of contents on the side of each section
* Top-level README significantly slimmed down (just pointing to the docs site)
* READMEs inside individual components removed (moved to subdirectories inside
  the docs/ folder, accessible from the top level of the docs site)
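
The site implied by these changes is driven by an `mkdocs.yml` at the repository root. A minimal sketch of what such a config could look like — the `nav` entries below are illustrative guesses based on the renamed files in this diff, not the actual committed file; only the mkdocs/markdown-include requirement is taken from the CI config:

```yaml
# Hypothetical mkdocs.yml sketch; nav paths are illustrative guesses.
site_name: GCP Ingestion
repo_url: https://github.com/mozilla/gcp-ingestion
nav:
  - Overview: architecture/overview.md
  - Architecture:
      - Edge Service Specification: architecture/edge_service_specification.md
      - Decoder Service Specification: architecture/decoder_service_specification.md
      - Differences from AWS: architecture/differences_from_aws.md
markdown_extensions:
  - markdown_include.include
```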
wlach authored Aug 12, 2019
1 parent 63510ed commit 85863ee
Showing 22 changed files with 124 additions and 337 deletions.
26 changes: 19 additions & 7 deletions .circleci/config.yml
Original file line number Diff line number Diff line change
@@ -9,14 +9,23 @@ jobs:
         name: Spell Check
         command: mdspell --ignore-numbers --en-us --report '**/*.md'
 
-  doctoc:
+  docs:
     docker:
-      - image: node:8.10.0
+      - image: circleci/python:3.7
     steps:
-      - checkout
-      - run:
-          name: Ensure markdown tables of contents are up to date
-          command: ./.circleci/doctoc-check.sh
+      - checkout
+      - run:
+          name: Install dependencies
+          command: sudo pip install mkdocs markdown-include
+      - add_ssh_keys:
+          fingerprints:
+            "84:b0:66:dd:ec:68:b1:45:9d:5d:66:fd:4a:4f:1b:57"
+      - run:
+          name: Build and deploy docs
+          command: |
+            if [ $CIRCLE_BRANCH == "master" ]; then
+              mkdocs gh-deploy
+            fi
 
   ingestion-edge: &ingestion-edge
     working_directory: /root/project/ingestion-edge
@@ -173,7 +182,10 @@ workflows:
   build:
     jobs:
       - spelling
-      - doctoc
+      - docs:
+          filters:
+            tags:
+              only: /.*/
       - ingestion-edge
       - ingestion-edge-release:
          filters:
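
The new `docs` job above can be exercised locally. A sketch of the equivalent workflow, assuming mkdocs and markdown-include are installed (`sudo pip install mkdocs markdown-include`): `mkdocs serve` previews the site, and `mkdocs gh-deploy` builds and pushes it to the gh-pages branch, which CI does only on master, as mirrored here:

```shell
# Mirror CI's master-only deploy gate (deploy command stubbed out with echo).
set -eu

deploy_docs() {
  branch="$1"
  if [ "$branch" = "master" ]; then
    echo "deploying: mkdocs gh-deploy"
  else
    echo "skipping deploy for branch $branch"
  fi
}

deploy_docs "master"       # prints: deploying: mkdocs gh-deploy
deploy_docs "my-feature"   # prints: skipping deploy for branch my-feature
```

Note that the committed step compares `$CIRCLE_BRANCH` unquoted with `==`, which is a bash-ism; the sketch uses the portable quoted `=` form.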
13 changes: 0 additions & 13 deletions .circleci/doctoc-check.sh

This file was deleted.

9 changes: 0 additions & 9 deletions .circleci/doctoc-run.sh

This file was deleted.

4 changes: 4 additions & 0 deletions .spelling
@@ -16,6 +16,7 @@ CircleCI
 CLI
 cron
 Dataflow
+datapipeline
 dataset
 deduplicate
 deduplication
@@ -28,6 +29,8 @@ encodings
 failsafe
 featureful
 filesystem
+fx-metrics
+gcp-ingestion
 GCP
 GCS
 GeoIP
@@ -41,6 +44,7 @@ HTTPS
 hyperloglog
 IAM
 IPs
+irc.mozilla.org
 Javadoc
 JSON
 JVM
9 changes: 0 additions & 9 deletions CODE_OF_CONDUCT.md
@@ -1,12 +1,3 @@
-<!-- START doctoc generated TOC please keep comment here to allow auto update -->
-<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
-
-
-- [Community Participation Guidelines](#community-participation-guidelines)
-- [How to Report](#how-to-report)
-
-<!-- END doctoc generated TOC please keep comment here to allow auto update -->
-
 # Community Participation Guidelines
 
 This repository is governed by Mozilla's code of conduct and etiquette guidelines.
11 changes: 4 additions & 7 deletions README.md
@@ -1,19 +1,16 @@
 # Telemetry Ingestion on Google Cloud Platform
 
-<!-- START doctoc generated TOC please keep comment here to allow auto update -->
-<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
-<!-- END doctoc generated TOC please keep comment here to allow auto update -->
 [![CircleCI](https://circleci.com/gh/mozilla/gcp-ingestion.svg?style=svg&circle-token=d98a470269580907d5c6d74d0e67612834a21be7)](https://circleci.com/gh/mozilla/gcp-ingestion)
 
 A monorepo for documentation and implementation of the Mozilla telemetry
 ingestion system deployed to Google Cloud Platform (GCP).
 
-The overall architecture is described in [docs/architecture](docs/architecture)
-along with commentary on design decisions.
-Individual components are specified under [docs](docs) and implemented
-under the various `ingestion-*` service directories:
+There are currently two components:
 
 - [ingestion-edge](ingestion-edge): a simple Python service for accepting HTTP
   messages and delivering to Google Cloud Pub/Sub
 - [ingestion-beam](ingestion-beam): a Java module defining
   [Apache Beam](https://beam.apache.org/) jobs for streaming and batch
   transformations of ingested messages
 
+For more information, see [the documentation](https://mozilla.github.io/gcp-ingestion).
12 changes: 0 additions & 12 deletions bin/update-toc

This file was deleted.

@@ -3,22 +3,8 @@
 This document specifies the behavior of the service that delivers decoded
 messages into BigQuery.
 
-<!-- START doctoc generated TOC please keep comment here to allow auto update -->
-<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
-
-
-- [Data Flow](#data-flow)
-- [Implementation](#implementation)
-- [Configuration](#configuration)
-- [Coerce Types](#coerce-types)
-- [Accumulate Unknown Values As `additional_properties`](#accumulate-unknown-values-as-additional_properties)
-- [Errors](#errors)
-- [Error Message Schema](#error-message-schema)
-- [Other Considerations](#other-considerations)
-- [Message Acks](#message-acks)
-
-<!-- END doctoc generated TOC please keep comment here to allow auto update -->
-
 ## Data Flow
 
 Consume messages from a PubSub topic or Cloud Storage location and insert them
@@ -86,7 +72,7 @@ retries are handled automatically and all errors returned are non-transient.
 #### Error Message Schema
 
 Always include the error attributes specified in the [Decoded Error Message
-Schema](decoder.md#error-message-schema).
+Schema](decoder_service_specification.md#error-message-schema).
 
 Encode errors received as type `TableRow` as JSON in the payload of a
 `PubsubMessage`, and add error attributes.
18 changes: 1 addition & 17 deletions docs/decoder.md → ...itecture/decoder_service_specification.md
@@ -3,22 +3,6 @@
 This document specifies the behavior of the service that decodes messages
 in the Structured Ingestion pipeline.
 
-<!-- START doctoc generated TOC please keep comment here to allow auto update -->
-<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
-
-
-- [Data Flow](#data-flow)
-- [Implementation](#implementation)
-- [Decoding Errors](#decoding-errors)
-- [Error message schema](#error-message-schema)
-- [Raw message schema](#raw-message-schema)
-- [Decoded message metadata schema](#decoded-message-metadata-schema)
-- [Other Considerations](#other-considerations)
-- [Message Acks](#message-acks)
-- [Deduplication](#deduplication)
-
-<!-- END doctoc generated TOC please keep comment here to allow auto update -->
-
 ## Data Flow
 
 1. Consume messages from Google Cloud PubSub raw topic
@@ -69,7 +53,7 @@ required group attributes {
 
 ### Raw message schema
 
-See [Edge Server PubSub Message Schema](edge.md#edge-server-pubsub-message-schema).
+See [Edge Service PubSub Message Schema](edge_service_specification.md#pubsub-message-schema).
 
 ### Decoded message metadata schema
 
17 changes: 1 addition & 16 deletions docs/architecture/differences_from_aws.md
@@ -1,23 +1,8 @@
-# Differences from AWS Architecture
+# Differences from AWS
 
 This document explains how GCP Ingestion differs from the [AWS Data Platform
 Architecture](https://mana.mozilla.org/wiki/display/SVCOPS/Telemetry+-+Data+Pipeline+Architecture).
 
-<!-- START doctoc generated TOC please keep comment here to allow auto update -->
-<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
-
-
-- [Replace Heka Framed Protobuf with newline delimited JSON](#replace-heka-framed-protobuf-with-newline-delimited-json)
-- [Replace EC2 Edge with Kubernetes Edge](#replace-ec2-edge-with-kubernetes-edge)
-- [Replace Kafka with PubSub](#replace-kafka-with-pubsub)
-- [Replace Hindsight Data Warehouse Loaders with Dataflow](#replace-hindsight-data-warehouse-loaders-with-dataflow)
-- [Replace S3 with Cloud Storage](#replace-s3-with-cloud-storage)
-- [Messages Always Delivered to Message Queue](#messages-always-delivered-to-message-queue)
-- [Landfill is Downstream from Message Queue](#landfill-is-downstream-from-message-queue)
-
-<!-- END doctoc generated TOC please keep comment here to allow auto update -->
-
 
 ## Replace Heka Framed Protobuf with newline delimited JSON
 
 Heka framed protobuf requires special code to read and write. Newline delimited
@@ -2,19 +2,8 @@
 
 This document outlines plans to migrate edge traffic from AWS to GCP using the code in this repository.
 
-<!-- START doctoc generated TOC please keep comment here to allow auto update -->
-<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
-
-
-- [Current state](#current-state)
-- [Phase 1](#phase-1)
-- [Phase 2](#phase-2)
-- [Phase 3](#phase-3)
-- [Phase 3 (alternative)](#phase-3-alternative)
-- [Phase 4](#phase-4)
-
-<!-- END doctoc generated TOC please keep comment here to allow auto update -->
-
 ## Current state
 
 Today, data producers send data to the ingestion stack on AWS as described [here](https://github.com/mozilla/firefox-data-docs/blob/042fddcbf27aa5993ee5578224200a3ef65fd7c7/src/concepts/pipeline/data_pipeline_detail.md#ingestion).
27 changes: 1 addition & 26 deletions docs/edge.md → ...rchitecture/edge_service_specification.md
@@ -3,31 +3,6 @@
 This document specifies the behavior of the server that accepts submissions
 from HTTP clients e.g. Firefox telemetry.
 
-<!-- START doctoc generated TOC please keep comment here to allow auto update -->
-<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
-
-
-- [General Data Flow](#general-data-flow)
-- [Namespaces](#namespaces)
-- [Forwarding to the pipeline](#forwarding-to-the-pipeline)
-- [Edge Server PubSub Message Schema](#edge-server-pubsub-message-schema)
-- [Server Request/Response](#server-requestresponse)
-- [GET Request](#get-request)
-- [GET Response codes](#get-response-codes)
-- [POST/PUT Request](#postput-request)
-- [Legacy Systems](#legacy-systems)
-- [POST/PUT Response codes](#postput-response-codes)
-- [Other Response codes](#other-response-codes)
-- [Other Considerations](#other-considerations)
-- [Compression](#compression)
-- [Bad Messages](#bad-messages)
-- [PubSub Topics](#pubsub-topics)
-- [GeoIP Lookups](#geoip-lookups)
-- [Data Retention](#data-retention)
-- [Submission Timestamp Format](#submission-timestamp-format)
-
-<!-- END doctoc generated TOC please keep comment here to allow auto update -->
-
 ## General Data Flow
 
 HTTP submissions come in from the wild, hit a load balancer, then optionally an
@@ -54,7 +29,7 @@ configuration options.
 The message is written to PubSub. If the message cannot be written to PubSub it
 is written to a disk queue that will periodically retry writing to PubSub.
 
-### Edge Server PubSub Message Schema
+### PubSub Message Schema
 
 ```
 required string data // base64 encoded body
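
The schema excerpt above (truncated in this view) defines the edge's PubSub message as a base64-encoded body plus attributes. A minimal Python sketch of assembling a message in that shape — the helper and the `uri` attribute name are illustrative assumptions; only the base64-encoded `data` field comes from the schema line shown:

```python
import base64
import json

def make_edge_message(body: bytes, attributes: dict) -> dict:
    # Illustrative helper: only the base64-encoded `data` field is taken
    # from the schema excerpt; the attribute names are assumptions.
    return {
        "data": base64.b64encode(body).decode("ascii"),
        "attributes": {k: str(v) for k, v in attributes.items()},
    }

msg = make_edge_message(b'{"ping": 1}', {"uri": "/submit/telemetry"})
print(json.dumps(msg, sort_keys=True))
```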
10 changes: 0 additions & 10 deletions docs/landfill.md → ...tecture/landfill_service_specification.md
@@ -3,18 +3,8 @@
 This document specifies the behavior of the service that batches raw messages
 into long term storage.
 
-<!-- START doctoc generated TOC please keep comment here to allow auto update -->
-<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
-
-
-- [Data Flow](#data-flow)
-- [Implementation](#implementation)
-- [Latency](#latency)
-- [Other Considerations](#other-considerations)
-- [Message Acks](#message-acks)
-
-<!-- END doctoc generated TOC please keep comment here to allow auto update -->
-
 ## Data Flow
 
 Consume messages from a Google Cloud PubSub topic and write in batches to
25 changes: 0 additions & 25 deletions docs/architecture/README.md → docs/architecture/overview.md
@@ -2,31 +2,6 @@
 
 This document specifies the architecture for GCP Ingestion as a whole.
 
-<!-- START doctoc generated TOC please keep comment here to allow auto update -->
-<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
-
-
-- [Architecture Diagram](#architecture-diagram)
-- [Architecture Components](#architecture-components)
-- [Ingestion Edge](#ingestion-edge)
-- [Landfill Sink](#landfill-sink)
-- [Decoder](#decoder)
-- [Republisher](#republisher)
-- [BigQuery Sink](#bigquery-sink)
-- [Dataset Sink](#dataset-sink)
-- [Notes](#notes)
-- [Design Decisions](#design-decisions)
-- [Kubernetes Engine and PubSub](#kubernetes-engine-and-pubsub)
-- [Different topics for "raw" and "validated" data](#different-topics-for-raw-and-validated-data)
-- [BigQuery](#bigquery)
-- [Save messages as newline delimited JSON](#save-messages-as-newline-delimited-json)
-- [Use destination tables](#use-destination-tables)
-- [Use views for user-facing data](#use-views-for-user-facing-data)
-- [Known Issues](#known-issues)
-- [Further Reading](#further-reading)
-
-<!-- END doctoc generated TOC please keep comment here to allow auto update -->
-
 ## Architecture Diagram
 
 ![diagram.mmd](diagram.svg "Architecture Diagram")
16 changes: 1 addition & 15 deletions docs/pain_points.md → docs/architecture/pain_points.md
@@ -1,21 +1,7 @@
-# Overview
+# Pain points
 
 A running list of things that are suboptimal in GCP.
 
-<!-- START doctoc generated TOC please keep comment here to allow auto update -->
-<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
-
-
-- [App Engine](#app-engine)
-- [Dataflow](#dataflow)
-- [`BigQueryIO.Write`](#bigqueryiowrite)
-- [`FileIO.Write`](#fileiowrite)
-- [`PubsubIO.Write`](#pubsubiowrite)
-- [Templates](#templates)
-- [PubSub](#pubsub)
-
-<!-- END doctoc generated TOC please keep comment here to allow auto update -->
-
 # App Engine
 
 For network-bound applications it can be prohibitively expensive. A PubSub push
12 changes: 0 additions & 12 deletions docs/reliability.md → docs/architecture/reliability.md
@@ -5,18 +5,6 @@ Percentage determined by the Reliability Target below. If a component does
 not meet that then a Stability Work Period should be assigned
 to each software engineer supporting the component.
 
-<!-- START doctoc generated TOC please keep comment here to allow auto update -->
-<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
-
-
-- [Disclaimer and Purpose](#disclaimer-and-purpose)
-- [Reliability Target](#reliability-target)
-- [Definitions](#definitions)
-- [Exclusions](#exclusions)
-- [Additional Information](#additional-information)
-
-<!-- END doctoc generated TOC please keep comment here to allow auto update -->
-
 ## Disclaimer and Purpose
 
 **This document is intended solely for those directly running, writing, and
