
Chunk 4h sparse bugfix longterm + remediation for single-point old format chunks #1129

Merged
merged 39 commits into master from chunk-4h-sparse-bugfix-longterm on Nov 28, 2018

Conversation

Dieterbe
Contributor

@Dieterbe Dieterbe commented Nov 3, 2018

See #1126 for problem description and short term workaround.
This is the longer-term fix: it introduces a new chunk format, and it implements the remediation for chunks with a single point, as alluded to in #1126.

we now have:

  • tsz.Series4h (the previous tsz.Series, which should not be used for spans >4.5h and which contains the read-time workaround to recover from corruptions)
  • tsz.SeriesLong: a similar format, but it does not encode the initial t0 in the stream (because t0 is always tracked and persisted alongside the stream anyway, both in our in-memory structures and in cassandra/bigtable), and it uses delta-of-delta from the get-go, which supports arbitrary precision and is typically more compact for the first point than the 17 bit encoding of the full timestamp we would otherwise have needed:
    before storing any points, we assume an initial delta of 60.
    if the delta happens to be 60, the dod (delta-of-delta) is 0, so we store 1 bit.
    if the dod is between -63 and 64 (e.g. a delta of 1 to 120s), we only need to store '10' plus a 7-bit value, so 9 bits.
    for other cases and more info, see the paper and the sketch after this list.
    So the new format not only fixes the bug, it also saves us 32 bits for the t0 and, in most cases, another 5 to 13 bits ((14-9) to (14-1)) by switching from a delta to a dod for the first point, shortening chunks by about 4 to 5 bytes.
    It also uses a more compact end-of-stream marker, as explained in the devdocs, saving up to another 5 bytes in the best case.
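
To make the first-point saving concrete, here is a minimal, self-contained sketch of the delta-of-delta bucketing described above. The smallest bucket matches the numbers in the list; the larger buckets follow the Gorilla paper that go-tsz is based on. Names are illustrative, not the actual tsz.SeriesLong code.

```go
// A minimal sketch of the dod bucketing; illustrative only.
package main

import "fmt"

// dodCost returns the control-bit prefix and the total number of bits needed
// to encode a given delta-of-delta value.
func dodCost(dod int64) (prefix string, bits int) {
	switch {
	case dod == 0:
		return "0", 1
	case dod >= -63 && dod <= 64:
		return "10", 2 + 7
	case dod >= -255 && dod <= 256:
		return "110", 3 + 9
	case dod >= -2047 && dod <= 2048:
		return "1110", 4 + 12
	default:
		return "1111", 4 + 32
	}
}

func main() {
	// First point of a 60s-interval series: SeriesLong assumes an initial
	// delta of 60, the real delta is 60, so dod = 0 and we spend 1 bit,
	// instead of the 32-bit t0 plus 14-bit first delta of the old format.
	fmt.Println(dodCost(0))  // 0 1
	fmt.Println(dodCost(30)) // 10 9 (e.g. a 90s delta following a 60s one)
}
```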

we then use SeriesLong for all new chunks, while old chunks can still be present in the database and chunk cache, so we accommodate them by having each format implement an iterator interface (sketched below). This does mean that getting points out of chunks comes with interface-call overhead, but I think it shouldn't affect us much: on query-heavy instances, a bit of increased cpu usage and maybe slightly slower responses. See the new benchmarks in tsz_test.go, though they don't match the results in https://github.com/Dieterbe/dataprocessexp; this could use a bit more research.
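
To illustrate the iterator-interface idea, a hypothetical sketch follows; the method set and type names are made up for the example and are not the exact metrictank API.

```go
// Hypothetical sketch of the shared-iterator idea: each chunk format exposes
// the same interface, so query code iterates points without knowing whether a
// chunk is Series4h or SeriesLong. Names here are illustrative only.
package main

import "fmt"

// Iter is the common iterator that both decoders would implement.
type Iter interface {
	Next() bool
	Values() (ts uint32, val float64)
}

// sliceIter is a trivial stand-in for a decoded chunk, just to show the shape
// of the indirection (and the extra interface call per point that it implies).
type sliceIter struct {
	ts   []uint32
	vals []float64
	i    int
}

func (s *sliceIter) Next() bool {
	if s.i >= len(s.ts) {
		return false
	}
	s.i++
	return true
}

func (s *sliceIter) Values() (uint32, float64) {
	return s.ts[s.i-1], s.vals[s.i-1]
}

func main() {
	var it Iter = &sliceIter{ts: []uint32{60, 120, 180}, vals: []float64{1, 2, 3}}
	for it.Next() {
		fmt.Println(it.Values())
	}
}
```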

While doing this I also cleaned up some code, specifically chunk.Chunk and chunk.IterGen, which are now cleaner and more focused.
Btw, I want to rename IterGen to Chunk, and Chunk could then be called "Block" or something, but perhaps I should do that in a followup PR.

@Dieterbe Dieterbe force-pushed the chunk-4h-sparse-bugfix-longterm branch 2 times, most recently from bf93bb6 to 422550e on November 19, 2018 19:38
also: assure chunkAtSave accounts for the full encoded chunk,
not just the tsz part of it. In bigtable we were actually reporting
both variants separately for each chunk being saved.
so that we can add another type that encodes without the bug
note: callers (e.g. stores) now no longer control which format
is being used.
it's the chunks themselves that know what format header to use.
t0 is more specific and accurate than "ts".
also, put it before the byte buffer, because when we want to iterate
over the byte buffer, we'll most likely want to start with t0 anyway.
also conceptually it seems to make sense, as datapoints have timestamps
>= t0
@Dieterbe Dieterbe force-pushed the chunk-4h-sparse-bugfix-longterm branch 3 times, most recently from 14454fc to aaff973 on November 24, 2018 21:29
@Dieterbe
Contributor Author

@robert-milan IterGen.Size() now returns slightly higher values than before (because it keeps the full chunk, not just the encoded tsz.Series data like before), and the type itself has a few different properties. IterGen.Size() is only called from the cache accounting. does anything need to change there in the cache accounting?

@Dieterbe Dieterbe changed the title from "WIP: Chunk 4h sparse bugfix longterm" to "Chunk 4h sparse bugfix longterm + remediation for single-point old format chunks" on Nov 24, 2018
@Dieterbe
Contributor Author

Dieterbe commented Nov 26, 2018

Did some race condition testing today, using the method described in #1150 (comment) plus running benchmarks/realistic-workload-single-tenant.sh (but with the mpo's and the vegeta rate divided by 20, otherwise my laptop couldn't hold it together - race builds are more resource intensive), using docker-cluster, which has 2min chunks, so as to definitely have overlap between chunks getting saved and loaded, cache usage rising over time, etc.
While I found 3 types of races (#1151, #1152, #1153), none of them are related to the changes here.

@Dieterbe
Contributor Author

Dieterbe commented Nov 26, 2018

Reading data using mixed chunk formats works well.
This is while running docker-dev-custom-cfg-kafka (no fakemetrics needed, and with offsets set to newest, because I think oldest doesn't work when combined with an empty kafka) and swapping binaries and restarting MT a couple of times:

MT_CASSANDRA_SCHEMA_FILE=scripts/config/schema-store-cassandra.toml ./mt-store-cat metric_4 1.63fc94941694d4e92f12b40ea6341eeb points | egrep 'chunk|Metric|Table'
2018-11-26 15:06:33.361 [INFO] cassandra_store: ensuring that keyspace metrictank exists.
## Metric 1.63fc94941694d4e92f12b40ea6341eeb(metrictank.stats.docker-dev-custom-cfg-kafka.default.process.resident_memory_bytes.gauge64)
### Table metric_4
number of chunks: 11
#### chunk 0 (t0:2018-11-26 14:48:00, span:120, format:FormatStandardGoTszWithSpan)
#### chunk 1 (t0:2018-11-26 14:50:00, span:120, format:FormatStandardGoTszWithSpan)
#### chunk 2 (t0:2018-11-26 14:52:00, span:120, format:FormatStandardGoTszWithSpan)
#### chunk 3 (t0:2018-11-26 14:54:00, span:120, format:FormatGoTszLongWithSpan)
#### chunk 4 (t0:2018-11-26 14:56:00, span:120, format:FormatGoTszLongWithSpan)
#### chunk 5 (t0:2018-11-26 14:58:00, span:120, format:FormatGoTszLongWithSpan)
#### chunk 6 (t0:2018-11-26 15:00:00, span:120, format:FormatGoTszLongWithSpan)
#### chunk 7 (t0:2018-11-26 15:02:00, span:120, format:FormatGoTszLongWithSpan)
#### chunk 8 (t0:2018-11-26 15:04:00, span:120, format:FormatStandardGoTszWithSpan)
#### chunk 9 (t0:2018-11-26 15:06:00, span:120, format:FormatGoTszLongWithSpan)
#### chunk 10 (t0:2018-11-26 15:08:00, span:120, format:FormatGoTszLongWithSpan)

(I tested both ways. When running master with the new format already in the database, requests obviously fail with unknown chunk format errors, but the new version displays data from both formats beautifully. The data looks fine as well.)

Also visually inspected the MT dashboard; all stats look fine.
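
As an aside on the format names in the mt-store-cat output above and the unknown-chunk-format errors mentioned: a purely illustrative sketch of the "chunks know their own format header" idea from the commit notes. The constant values below are made up for the example and do not match the real constants in metrictank's chunk package.

```go
// Illustrative only: the first byte of an encoded chunk names its format and
// readers dispatch on it; an old binary that doesn't know the new format
// fails with an unknown-chunk-format error, as described above.
package main

import (
	"errors"
	"fmt"
)

type Format uint8

// Made-up values; the real constants live in metrictank's chunk package.
const (
	FormatStandardGoTszWithSpan Format = iota // old Series4h-style chunks
	FormatGoTszLongWithSpan                   // new SeriesLong chunks
)

var errUnknownChunkFormat = errors.New("unknown chunk format")

// formatOf peeks at the header byte of an encoded chunk.
func formatOf(encoded []byte) (Format, error) {
	if len(encoded) == 0 {
		return 0, errors.New("empty chunk")
	}
	f := Format(encoded[0])
	if f != FormatStandardGoTszWithSpan && f != FormatGoTszLongWithSpan {
		return 0, errUnknownChunkFormat
	}
	return f, nil
}

func main() {
	f, err := formatOf([]byte{byte(FormatGoTszLongWithSpan), 0xde, 0xad})
	fmt.Println(f, err) // 1 <nil>
}
```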

@Dieterbe
Contributor Author

Didn't mention it here yet, but last week I copied the broken chunks of the affected customer to a test environment and confirmed that I could accurately reproduce the incorrect viz of their rollup data using master (which was not being properly remediated due to having so many single-point chunks), and that it was fixed with the code in this branch.

@Dieterbe Dieterbe force-pushed the chunk-4h-sparse-bugfix-longterm branch from ba7988f to 3a99152 on November 26, 2018 21:25
@Dieterbe
Contributor Author

Dieterbe commented Nov 26, 2018

benchmarks

main concerns / areas of interest

  • increased cpu usage / latency due to the iterator interface overhead; also a slower cache, because itergens now have to do a bit more work to respond to EndTS(), Span() etc. calls.
  • increased memory usage for the remediations. Note: when we have an intervalHint that is > 0, which only happens for aggregated (non-raw) archives, we have a very efficient remediation. In all other cases, we always clone the iterator and then check whether we need to remediate. So the way to see any memory effect is by requesting raw data. We also have an extra field for the intervalHint.
    On the flip side, IterGen and Chunk have received some optimizations to use less memory.
  • chunk size

running the test

There are a few things to take into account:

  • the number of points in the tank needs to be consistent over time, hence we only use 2-minutely chunks and start the script right after a 2min interval starts. (with the 6h chunks we would have to wait 6h in between each benchmark run to produce an equivalent points-in-chunk-in-memory profile)
  • disable tags so we don't have to worry about them (they complicate find/render queries a bit)
  • we must make sure the queried data is included in the backfill set, so we backfill 20k metrics and query the 10k~20k range
  • remove all pattern stuff; the workload must be very predictable and consistent.

so, we apply this patch:

diff --git a/benchmarks/realistic-workload-single-tenant.sh b/benchmarks/realistic-workload-single-tenant.sh
index b14abb29..2f79e532 100755
--- a/benchmarks/realistic-workload-single-tenant.sh
+++ b/benchmarks/realistic-workload-single-tenant.sh
@@ -10,16 +10,15 @@ trap 'kill $(jobs -p)' EXIT
 
 # first load up some old data, at a high speedup rate, so we don't have to wait too long ( 3 minutes)
 # lower speedup if your computer is too slow.  this is just to populate some old data, and is not what we care about here
-fakemetrics backfill --kafka-mdm-addr localhost:9092 --offset 5h --period 10s --speedup 100 --mpo 10000
+fakemetrics --add-tags=false backfill --kafka-mdm-addr localhost:9092 --offset 5h --period 10s --speedup 50 --mpo 20000
 # then continue with a realtime feed. note: new series are being introduced here, because we want a sufficiently high ingest rate and num active series.
 # but we couldn't possibly backfill this many series for a large timeframe in a short period
 # this is realistic anyway. to have more recent data than old (eg. due to series churn and pruning)
-fakemetrics feed --kafka-mdm-addr localhost:9092 --period 10s --mpo 100000 &
+fakemetrics --add-tags=false feed --kafka-mdm-addr localhost:9092 --period 10s --mpo 100000 &
 sleep 30
 # now request series:
-# this selects exactly 10k series that will match the regex, out of which we randomly replace 1 char with a wildcard, resulting in queries usually for 1 series, and sometimes for 10 series (depending on whether the replaced char falls within the dynamic part an the end of the name or not)
-# so 20/25 chances for 1 series, 5/25 chances for 10
-# as we execute 100Hz*300s=30k requests, this should give us a plenty high cache hit rate (while still testing the uncached code path)
+# this selects exactly 10k series that will match the regex (id 10k to 19999)
+# as we execute 200Hz*300s=60k requests, this should give us a plenty high cache hit rate (while still testing the uncached code path)
 # in practice, the cache rate sometimes looks fairly low and i'm not sure why. but anyway (seeing about 15% hit partial and the rest are misses), and only in the last minute
-./build/mt-index-cat -addr http://localhost:6063 -from 60min -regex 'some\.id\.of\.a\.metric\.9....' cass -hosts localhost:9042 -schema-file scripts/config/schema-idx-cassandra.toml 'GET http://localhost:6063/render?target={{.Name | patternCustom 100 "1rccw" }}&from=-1h\nX-Org-Id: 1\n\n' | vegeta attack -rate 100 -duration 300s | vegeta report
+./build/mt-index-cat -addr http://localhost:6060 -from 60min -regex 'some\.id\.of\.a\.metric\.1....' cass -hosts localhost:9042 -schema-file scripts/config/schema-idx-cassandra.toml 'GET http://localhost:6060/render?target={{.Name}}&from=-1h\nX-Org-Id: 1\n\n' | vegeta attack -rate 200 -duration 300s | vegeta report | tee report.txt
 sleep 20
diff --git a/docker/docker-dev-custom-cfg-kafka/storage-schemas.conf b/docker/docker-dev-custom-cfg-kafka/storage-schemas.conf
index 2d3083b5..bddb8c44 100644
--- a/docker/docker-dev-custom-cfg-kafka/storage-schemas.conf
+++ b/docker/docker-dev-custom-cfg-kafka/storage-schemas.conf
@@ -50,11 +50,7 @@
 # This example has 3 retention definitions, the first and last override some default options (to use 10minutely and 2hourly chunks and only keep one of them in memory
 # and the last rollup is marked as not ready yet for querying.
 
-[raw]
-pattern = fakemetrics.raw
-retentions = 1s:6h:2min:2
-
 [default]
 pattern = .*
-retentions = 1s:6h:2min:2,1min:35d:6h:1
+retentions = 1s:6h:2min:6
 # reorderBuffer = 20

to reproduce:

  • disable cpu freq scaling etc
  • apply patch above
  • check out appropriate branch (keeping the changes from above applied and dirty!), and run make bin
  • spin up docker-dev-custom-cfg-kafka
  • start ./benchmarks/realistic-workload-single-tenant.sh after the stack has started and right when a new 2min boundary starts. protip: for maximum consistency, take note of how long it takes for the stack to start and subtract that from the next 2min boundary. for example, on my system it takes 1min40s, so if it's 12:35 I will wait until 12:36:00 to start the stack; it should finish by 12:37:40, so it will have been ready for about 20s when I start the benchmark script right after the clock changes to 12:38:00

new benchmarks

master-without-remediation

you can use https://github.com/grafana/metrictank/pull/new/1129-helper-master-without-remediation
we shouldn't just use the code from before the first remediation was merged, because a bunch of other stuff has changed since then, such as the sarama update.
by taking latest master and reverting the first remediation, we have the best base to compare against, one where every change relative to this branch is meaningful.

Requests      [total, rate]            60000, 200.00
Duration      [total, attack, wait]    4m59.998112812s, 4m59.995013859s, 3.098953ms
Latencies     [mean, 50, 95, 99, max]  6.780156ms, 4.764384ms, 13.922315ms, 56.076528ms, 370.254184ms
Bytes In      [total, mean]            682985200, 11383.09
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:60000  
Error Set:

https://snapshot.raintank.io/dashboard/snapshot/nUvLIrHZ256dpr7Osx6LqzjDPGvMRChC
https://snapshot.raintank.io/dashboard/snapshot/YX5qkjCqNZroN4fl298ZoDy5jw8QgCkq?orgId=2

HEAD

Requests      [total, rate]            60000, 200.00
Duration      [total, attack, wait]    4m59.998270932s, 4m59.995069061s, 3.201871ms
Latencies     [mean, 50, 95, 99, max]  6.650384ms, 4.688852ms, 14.274335ms, 44.8451ms, 483.988625ms
Bytes In      [total, mean]            682985198, 11383.09
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:60000  
Error Set:

https://snapshot.raintank.io/dashboard/snapshot/9O5c6WFMGZcLCF59dlerTsZknVZIrftq?orgId=2
https://snapshot.raintank.io/dashboard/snapshot/tAg8zS6pSVnsADzsCEvwZWOrdydVZ3NI?orgId=2

HEAD but saving/reading old style chunks

the idea here is to use a patch so as to save old-style chunks and read them through the new code.
this way, we can test the benchmark scenario using old-style chunks, invoking the remediation.
you can use this branch https://github.com/grafana/metrictank/pull/new/1129-helper-chunk-4h-sparse-bugfix-longterm-but-saving-old-style-chunks

Requests      [total, rate]            60000, 200.00
Duration      [total, attack, wait]    4m59.998000837s, 4m59.995061612s, 2.939225ms
Latencies     [mean, 50, 95, 99, max]  7.820527ms, 4.750107ms, 15.903778ms, 94.897588ms, 567.80613ms
Bytes In      [total, mean]            682985229, 11383.09
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:60000  
Error Set:

https://snapshot.raintank.io/dashboard/snapshot/nJF2gZ7g189DyL57dayWlE3eSRJwVJEg
https://snapshot.raintank.io/dashboard/snapshot/jtRnihiz2O55GnNlMv57VGag58l3LeDw

conclusion:

  • no significant difference in RSS / heap usage
  • no significant change in response time or cpu spent
  • chunk size can't be checked this way, due to different ways of measuring in the different code bases; will do a separate test for that

@Dieterbe Dieterbe force-pushed the chunk-4h-sparse-bugfix-longterm branch from 3a99152 to 71a5193 on November 27, 2018 01:08
@Dieterbe
Contributor Author

Dieterbe commented Nov 27, 2018

BenchmarkPushSeries4h-8              	10000000	       188 ns/op
--- BENCH: BenchmarkPushSeries4h-8
    tsz_test.go:24: Series4h size: 19B
    tsz_test.go:24: Series4h size: 837B
    tsz_test.go:24: Series4h size: 82512B
    tsz_test.go:24: Series4h size: 8250012B
    tsz_test.go:24: Series4h size: 82500012B
BenchmarkPushSeriesLong-8            	10000000	       178 ns/op
--- BENCH: BenchmarkPushSeriesLong-8
    tsz_test.go:40: SeriesLong size: 10B
    tsz_test.go:40: SeriesLong size: 828B
    tsz_test.go:40: SeriesLong size: 82503B
    tsz_test.go:40: SeriesLong size: 8250003B
    tsz_test.go:40: SeriesLong size: 82500003B

==> 9 bytes gain per chunk 😎 (consistent with the ~4 to 5 bytes saved by dropping t0 and using a dod for the first point, plus the up to 5 bytes saved by the more compact end-of-stream marker)

Contributor

@robert-milan robert-milan left a comment


Overall it looks great. I would like to run benchmarks and double check a few areas where allocation and computation have changed in the hot paths before approving.

@Dieterbe Dieterbe force-pushed the chunk-4h-sparse-bugfix-longterm branch from 915ea4c to 10c9db7 on November 27, 2018 15:05
@Dieterbe
Contributor Author

Dieterbe commented Nov 27, 2018

I just went over all api endpoints and confirmed none of them returns a chunk.IterGen or chunk.Chunk, or any type that directly or indirectly contains one, so this confirms there are no serialization changes that may impact a running cluster.

@Dieterbe Dieterbe force-pushed the chunk-4h-sparse-bugfix-longterm branch from c203181 to 45026f4 on November 27, 2018 15:37
* Do away with the intermediate "Iter" type.
  Now Iter is an interface, which the tsz iterators implement.
  This allows us to use arbitrary iterators backed by different
  chunks/series encodings, at the expense of a small function
  call overhead.
* put IterGen on a diet.
  - no more span attribute.
    it was wasteful for chunks for which we don't know the span,
    and bulky for others because we can better use
    its compact encoded representation, which we do now.
  - generally just rely on the chunk data as is, and interpret
    the data as necessary to find the span, to create iterators, etc.
    this also means we no longer need the Encode method.

note: IterGen msgp serialization obviously also changes, but it seems
to be used only by archive.Archive msgp serialization,
which is used by these metric methods:

func (m *Metric) MarshalCompressed() (*bytes.Buffer, error) {
func (m *Metric) UnmarshalCompressed(b io.Reader) error {

which are only used by these tools:

mt-whisper-importer-reader
mt-whisper-importer-writer

It doesn't seem to affect running clusters etc.
in SeriesLong, expose the .T (last timestamp) and .Finished fields
and let Chunk leverage them
* no need to return pointer
* clearer diff between the two constructors
by using an interval hint, we know what kind of delta we need
to end up with a valid timestamp.
For this to be effective, this requires an interval:

* >1
* that is not a divisor of 16384.

but for chunks that are long enough to be affected (e.g. >4h),
that is always the case. (a rough sketch of this idea follows below)

Note: this can be cleaned up when we have
raintank/schema#22
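
A rough, purely illustrative sketch of that interval-hint idea (not the actual remediation code): a decoded timestamp is only plausible when it falls on the archive's interval grid, which is what lets us tell a timestamp that is off by the 16384s wrap apart from a correct one, provided the interval is >1 and not a divisor of 16384.

```go
// Purely illustrative; not the actual remediation code.
package main

import "fmt"

// plausible reports whether ts could be a real point timestamp for an archive
// with the given interval (aggregated archives are always interval-aligned).
func plausible(ts, interval uint32) bool {
	return interval > 0 && ts%interval == 0
}

func main() {
	const interval = 600 // e.g. a 10min rollup; 600 is not a divisor of 16384
	// Two candidate timestamps for a point, differing by the suspect 16384s
	// offset: only one of them lands on the interval grid.
	a, b := uint32(1543190400), uint32(1543190400+16384)
	fmt.Println(plausible(a, interval), plausible(b, interval)) // true false
}
```
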
this helps with tracking down where ChunkWriteRequests are being created
the Go behavior of embedded (unnamed) fields, automatically making
method and field lookups work on them, is nice but ambiguous.
It seems better to be very explicit. For one thing, it allows us
to easily find all the places where certain methods and fields
of Series are being called, which helps with making sure we are
race free.

Perhaps later this can be refactored and tsz can be merged into the
chunk package but that's a worry for later.
even though we don't use it anymore
no longer needed + now we have some actual validation
@Dieterbe Dieterbe force-pushed the chunk-4h-sparse-bugfix-longterm branch from 45026f4 to cab1d87 on November 27, 2018 15:38
Contributor

@robert-milan robert-milan left a comment


Let's deploy this so we can gather more performance metrics.

@robert-milan
Contributor

robert-milan commented Nov 27, 2018

> @robert-milan IterGen.Size() now returns slightly higher values than before (because it keeps the full chunk, not just the encoded tsz.Series data like before), and the type itself has a few different properties. IterGen.Size() is only called from the cache accounting. does anything need to change there in the cache accounting?

I didn't notice any changes in the sizes of the structs we are using to estimate total sizes, and since Size() itself is used for that particular calculation, we should not need to change anything.

(edited to clarify: no change in the size of the structs, vs. the structs themselves)

@Dieterbe
Contributor Author

ah yes, IterGen went from

type IterGen struct {
	B    []byte
	Ts   uint32
	Span uint32
}

to

type IterGen struct {
        T0           uint32
        intervalHint uint32 
        B            []byte
}

lost the span, but need the intervalHint now (even for itergens that hold SeriesLong data, unless we want to wrap it in yet another interface, which I don't want). bummer :(
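
As a sanity check of the struct-size point made just above (not something from the PR itself), a quick sizeof probe, assuming a 64-bit platform: both layouts occupy the same 32 bytes for the struct itself; what changed is only what the backing []byte holds.

```go
// Field layouts copied from the comment above; assumes a 64-bit platform.
package main

import (
	"fmt"
	"unsafe"
)

type iterGenOld struct {
	B    []byte
	Ts   uint32
	Span uint32
}

type iterGenNew struct {
	T0           uint32
	intervalHint uint32
	B            []byte
}

func main() {
	// Slice header (24 bytes) + two uint32s (8 bytes) in both cases.
	fmt.Println(unsafe.Sizeof(iterGenOld{}), unsafe.Sizeof(iterGenNew{})) // 32 32
}
```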

technically, currently not needed, but might become needed once we start
sending (old) chunks via our cluster protocol rather than points.
better safe than sorry.
@Dieterbe Dieterbe force-pushed the chunk-4h-sparse-bugfix-longterm branch 2 times, most recently from ab6f671 to db4d23a on November 28, 2018 19:42
@Dieterbe Dieterbe force-pushed the chunk-4h-sparse-bugfix-longterm branch from 5c64f69 to 23b731c on November 28, 2018 22:18
@Dieterbe Dieterbe merged commit 9bc5f2c into master Nov 28, 2018
@Dieterbe
Contributor Author

for the record, I just used benchstat to compare all Go benchmarks before and after, and they're all identical

@Dieterbe Dieterbe deleted the chunk-4h-sparse-bugfix-longterm branch March 27, 2019 21:08