Merge pull request #1186 from grafana/document-startup-produce

Dieterbe · web-flow · commit 8ddbfd20ad3d · 2018-12-27T16:45:04.000+01:00
document startup procedure
diff --git a/README.md b/README.md
@@ -65,6 +65,7 @@ Otherwise data loss of current chunks will be incurred.  See [operations guide](
 * [Inputs](https://github.com/grafana/metrictank/blob/master/docs/inputs.md)
 * [Metrics](https://github.com/grafana/metrictank/blob/master/docs/metrics.md)
 * [Operations](https://github.com/grafana/metrictank/blob/master/docs/operations.md)
+* [Startup](https://github.com/grafana/metrictank/blob/master/docs/startup.md)
 * [Tools](https://github.com/grafana/metrictank/blob/master/docs/tools.md)
 
 ### features in-depth
diff --git a/cmd/metrictank/metrictank.go b/cmd/metrictank/metrictank.go
@@ -150,7 +150,7 @@ func main() {
 	log.Infof("logging level set to '%s'", *logLevel)
 
 	/***********************************
-		Validate  settings needed for clustering
+		Validate settings needed for clustering
 	***********************************/
 	if *instance == "" {
 		log.Fatal("instance can't be empty")
diff --git a/docs/startup.md b/docs/startup.md
@@ -0,0 +1,33 @@
+# Metrictank startup
+
+The full startup procedure has many details, but here we cover the main steps that affect:
+
+* performance/resource usage characteristics
+* cluster status
+* API availability
+* diagnostics
+
+
+| Phase                   | Description                                                                                        | Effect on CPU / RAM                 |
+| ----------------------- | -------------------------------------------------------------------------------------------------- | ----------------------------------- |
+| load config             | load/validate config                                                                               | no                                  |
+| setup diagnostics       | set up logging, profiling, proftrigger                                                             | no                                  |
+| log startup             | logs "Metrictank starting" message                                                                 | no                                  |
+| start sending stats     | starts connecting and writing to graphite endpoint                                                 | no                                  |
+| create Store            | create keyspace, tables, write queues, etc                                                         | minor RAM increase ~ queue size     |
+| create Input(s)         | open connections (kafka) or listening sockets (carbon, prometheus)                                 | no                                  |
+| start cluster           | starts gossip, joins cluster                                                                       | no                                  |
+| create Index            | creates instance and starts write queues                                                           | minor RAM increase ~ queue size     |
+| start API server        | opens listening socket and starts handling requests in not-ready mode                              | no                                  |
+| init Index              | creates session, keyspace, tables, write queues, etc and loads in-memory index from persisted data | reasonable RAM and CPU increase                    |
+| create cluster notifier | optional: connects to Kafka, starts backfilling persistence messages and waits until done or timeout | if backfilling: above-normal CPU, normal RAM usage |
+| start input plugin(s)   | starts backfill (kafka) or listening (carbon, prometheus) and maintains priority based on input lag | if backfilling: above-normal CPU and RAM usage     |
+| mark ready state        | immediately (primary) or after warmup period (secondary) (combined with priority for clustering)   | no                                                 |
+
+We recommend provisioning a cluster such that it can backfill a 7-hour backlog in half an hour or less. This means:
+* The CPU increase during the kafka backfilling is very significant: typically a 14x CPU increase compared to normal usage.
+* The RAM usage during the input data backfilling is typically about 1.5x to 2x normal.
+
+Backfilling will go as fast as it can until it reaches a bottleneck (kafka brokers, CPU constraints, etc.), so your numbers may vary.
+
+This is true for v0.11.0, but may need revising later.