Compaction planning fixes #6952

Merged
merged 4 commits into master from jw-planner on Jul 14, 2016
Conversation

@jwilder
Contributor

commented Jul 2, 2016

Required for all non-trivial PRs
  • Rebased/mergable
  • Tests pass

This PR fixes some issues with compactions on shards with larger numbers of series and larger TSM file sizes.

For larger datasets, it's possible for shards to get into a state where
many large, dense TSM files exist. While the shard is still hot for
writes, full compactions will skip these files since they are already
fairly optimized and full compactions are expensive. If the write volume
is large enough, the shard can accumulate lots of these files. When
a TSM file is in this state, its index can contain every series, which
causes startup times to increase since the full set of series keys must
be parsed for every file. If the number of series is high, the index can
be quite large, causing a large amount of disk IO at startup as well.

To fix this, an optimize compaction is run when a full compaction planning
step decides there is nothing to do. The optimize compaction combines
and spreads the data and series keys across all files, so that each
file contains the full data for its series and only a subset of the
shard's total set of keys. This compaction uses a faster method that
avoids decoding and recombining blocks.

This allows each series key to be stored only once in the shard, reducing
storage size, and means each key is loaded only once at startup.
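
A minimal sketch of this planning fallback, assuming hypothetical names and a single size threshold (the real tsm1 planner is more involved):

```go
// Minimal sketch of the planning fallback described above. The names and the
// single size threshold are hypothetical, not the actual tsm1 planner API.
package sketch

// tsmFile stands in for a TSM file on disk.
type tsmFile struct {
	Path string
	Size int64
}

// plan returns the next set of files to compact. If full-compaction planning
// decides there is nothing to do, fall back to an optimize pass over the
// large, already-dense files so series keys get spread across them.
func plan(files []tsmFile, maxSize int64) []tsmFile {
	if group := planFull(files, maxSize); len(group) > 0 {
		return group
	}
	return planOptimize(files, maxSize)
}

// planFull skips files at or above the max size: they are already fairly
// optimized and re-compacting them is expensive.
func planFull(files []tsmFile, maxSize int64) []tsmFile {
	var group []tsmFile
	for _, f := range files {
		if f.Size < maxSize {
			group = append(group, f)
		}
	}
	if len(group) < 2 {
		return nil // nothing worth doing
	}
	return group
}

// planOptimize selects the dense files so their encoded blocks can be
// regrouped by series key without being decoded.
func planOptimize(files []tsmFile, maxSize int64) []tsmFile {
	var group []tsmFile
	for _, f := range files {
		if f.Size >= maxSize {
			group = append(group, f)
		}
	}
	if len(group) < 2 {
		return nil
	}
	return group
}
```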

This PR also fixes a panic that could occur due to duplicate data stored within a block. Normally this should not happen, but due to bugs discovered in older releases, it's possible this data may exist. The code now detects this case and fixes it as needed when merging blocks instead of triggering the assertion panic.
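
A hypothetical sketch of that safeguard (illustrative types, not the actual tsm1 merge code): when a timestamp repeats across the blocks being merged, the later point wins rather than tripping the assertion.

```go
// Hypothetical sketch of the dedup safeguard, not the actual tsm1 code.
package sketch

// point stands in for a decoded value in a block.
type point struct {
	Timestamp int64
	Value     float64
}

// mergeDedup merges two timestamp-sorted blocks. When a timestamp repeats,
// either within a block or across the two blocks, the later point overwrites
// the earlier one instead of triggering an assertion panic.
func mergeDedup(a, b []point) []point {
	merged := make([]point, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) || j < len(b) {
		var next point
		if j >= len(b) || (i < len(a) && a[i].Timestamp <= b[j].Timestamp) {
			next, i = a[i], i+1
		} else {
			next, j = b[j], j+1
		}
		if n := len(merged); n > 0 && merged[n-1].Timestamp == next.Timestamp {
			merged[n-1] = next // duplicate timestamp: replace, don't panic
			continue
		}
		merged = append(merged, next)
	}
	return merged
}
```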

@jwilder added this to the 1.0.0 milestone Jul 2, 2016
@daviesalex
Contributor

This looks fantastic. I will shortly be cherry-picking these commits into our local build and running this. I'm not sure how long you have for testing, but I or a colleague will give you feedback as soon as we can. Thanks!

@daviesalex
Contributor

With our relatively small data set:

[root@carf-metrics-influx01 data]# du -sh *
163G    influxdb-data
4.9G    influxdb-wal

Startup is still extremely slow.

[tsm1] 2016/07/04 03:12:37 compacted optimize 20 files into 10 files in 2m15.994732867s
[shard] 2016/07/04 03:23:16 /data/influxdb-data/metrics/tg_udp/autogen/35 database index loaded in 13m12.882100963s
[tsm1] 2016/07/04 03:23:16 compactions enabled for: /data/influxdb-data/metrics/tg_udp/autogen/35
[store] 2016/07/04 03:23:16 /data/influxdb-data/metrics/tg_udp/autogen/35 opened in 14m26.614865568s
[tsm1] 2016/07/04 03:23:16 beginning optimize compaction of group 0, 18 TSM files

perf top still shows time spent almost entirely in GC:

  13.40%  influxd                       [.] runtime.greyobject
   9.53%  influxd                       [.] runtime.heapBitsForObject
   5.36%  influxd                       [.] runtime.mallocgc
   4.85%  influxd                       [.] runtime.scanobject
   3.62%  influxd                       [.] runtime.aeshashbody
   3.42%  influxd                       [.] runtime.mapassign1
   2.92%  influxd                       [.] runtime.memmove
   2.65%  [kernel]                      [k] page_fault
   2.28%  influxd                       [.] runtime.mapaccess1_faststr
   2.20%  influxd                       [.] runtime/internal/atomic.Or8
   2.13%  influxd                       [.] runtime.memclr
   1.83%  influxd                       [.] sync/atomic.AddUint32
   1.61%  influxd                       [.] runtime.heapBitsSetType

This will include some of our high-cardinality data, because the change to turn that off will not take effect until midday Chicago time today (at which point we will delete all the data and start again). Of course, it will then take us some time to get enough data to really test this.

Will update later.

@daviesalex
Contributor

Start:

[run] 2016/07/04 03:08:45 InfluxDB starting, version unknown, branch unknown, commit unknown

End:

[run] 2016/07/04 03:39:00 Listening for signals

Stable so far. Load ~6.

"perf top" during normal operation does sometimes look different:

   4.95%  influxd              [.] runtime.aeshashbody
   4.34%  influxd              [.] runtime.mapassign1
   4.25%  influxd              [.] runtime.memmove
   3.94%  influxd              [.] runtime.memclr
   3.85%  influxd              [.] runtime.mallocgc
   3.18%  influxd              [.] runtime.evacuate

You can still see tens of seconds of GC per minute, which always looks a bit like this:

  31.93%  influxd                       [.] runtime.greyobject
  24.25%  influxd                       [.] runtime.heapBitsForObject
  10.36%  influxd                       [.] runtime.scanobject
   3.92%  influxd                       [.] runtime/internal/atomic.Or8
   1.29%  [kernel]                      [k] page_fault

That is an improvement on the previous start time for sure.

Of course, I have no idea if that GC is caused by compactions, by the data coming in (less likely, as that is constant), or by queries (again, less likely; at this time of the US day queries are all automated and regular).

@daviesalex
Contributor

Rebuilt with 1.7beta2 (we are on 1.7 because of a

[run] 2016/07/04 06:48:24 Go version go1.7beta2, GOMAXPROCS set to 36

until

[run] 2016/07/04 07:08:17 Listening for signals
[udp] 2016/07/04 07:08:29 failed to write point batch to database "tg_udp": timeout

So startup still takes ~20 min. ~15 min of that is pretty much entirely single-threaded, and that single thread spends nearly all of its time in GC.

I'm not sure if this is compactions or something else; would it help if I put this data somewhere you can see it?

Regardless, this PR is clearly a big win in terms of startup time. We will report any instability, but nothing so far.

@daviesalex
Contributor

With GOGC=1000, a tiny bit faster:

[run] 2016/07/04 07:13:02 Go version go1.7beta2, GOMAXPROCS set to 36
[run] 2016/07/04 07:31:56 Listening for signals
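
(For context: GOGC=1000 makes the collector far less aggressive, triggering a GC only after the heap grows by 1000% over the live set instead of the default 100%. The equivalent programmatic knob in Go, shown purely for illustration and not something influxd itself calls, is runtime/debug.SetGCPercent:)

```go
// Small illustration of what GOGC=1000 does; influxd picks up the GOGC
// environment variable via the Go runtime, this just shows the equivalent
// programmatic setting.
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to starting the process with GOGC=1000: a collection runs
	// only after the heap grows by 1000% over the live set, trading memory
	// for fewer GC cycles.
	old := debug.SetGCPercent(1000)
	fmt.Println("previous GC percent:", old)
}
```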

@daviesalex
Contributor

FWIW, we have been running this for 3 days now without issue. It's a big win for us in every way, even if it isn't the final story in compaction planning!

jwilder added 3 commits July 14, 2016 11:14
Large files created early in the leveled compactions could cause
a shard to get into a bad state.  This reworks the level planner
to handle those cases as well as splits large compactions up into
multiple groups to leverage more CPUs when possible.

For larger datasets, it's possible for shards to get into a state where
many large, dense TSM files exist.  While the shard is still hot for
writes, full compactions will skip these files since they are already
fairly optimized and full compactions are expensive.  If the write volume
is large enough, the shard can accumulate lots of these files.  When
a file is in this state, its index can contain every series, which
causes startup times to increase since the full set of series keys must
be parsed for every file.  If the number of series is high,
the index can be quite large, causing a large amount of disk IO at startup.

To fix this, an optimize compaction is run when a full compaction planning
step decides there is nothing to do.  The optimize compaction combines
and spreads the data and series keys across all files resulting in each
file containing the full series data for that shard and a subset of the
total set of keys in the shard.

This allows each series key to be stored only once in the shard, reducing
storage size, and means each key is loaded only once at startup.

Due to a bug in compactions, it's possible some blocks may have duplicate
points stored.  If those blocks are decoded and re-compacted, an assertion
panic could trigger.

We now dedup those blocks if necessary to remove the duplicate points
and avoid the panic.
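
The first commit above also mentions splitting large compactions into multiple groups to leverage more CPUs; a minimal, hypothetical sketch of that idea (not the planner's actual code):

```go
// Hypothetical sketch of splitting one large compaction into several groups
// that can each run on its own goroutine; not the planner's actual code.
package sketch

// splitIntoGroups chunks candidate file paths into groups of at most
// maxPerGroup files so the groups can be compacted concurrently.
func splitIntoGroups(paths []string, maxPerGroup int) [][]string {
	if maxPerGroup < 1 {
		maxPerGroup = 1
	}
	var groups [][]string
	for len(paths) > 0 {
		n := maxPerGroup
		if n > len(paths) {
			n = len(paths)
		}
		groups = append(groups, paths[:n])
		paths = paths[n:]
	}
	return groups
}
```

Each returned group could then be handed to its own compaction goroutine.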
@jwilder merged commit 1bc5b60 into master Jul 14, 2016
@jwilder deleted the jw-planner branch July 14, 2016 19:08