Compaction planning fixes #6952

Merged
merged 4 commits into master from jw-planner on Jul 14, 2016
Conversation

@jwilder
Contributor

commented Jul 2, 2016

Required for all non-trivial PRs
  • Rebased/mergable
  • Tests pass

This PR fixes some issues with compactions on shards with larger numbers of series and larger TSM file sizes.

For larger datasets, it's possible for shards to get into a state where
many large, dense TSM files exist. While the shard is still hot for
writes, full compactions will skip these files since they are already
fairly optimized and full compactions are expensive. If the write volume
is large enough, the shard can accumulate lots of these files. When
a TSM file is in this state, its index can contain every series, which
causes startup times to increase since the full set of series keys must
be parsed for every file. If the number of series is high, the index can
be quite large, causing a large amount of disk IO at startup as well.

To fix this, an optimize compaction is run when a full compaction planning
step decides there is nothing to do. The optimize compaction combines
and spreads the data and series keys across all files, so that each
file contains the full data for its series and only a subset of the
shard's total set of keys. This compaction uses a faster method that
avoids decoding and recombining blocks.

This allows each series key to be stored only once in the shard, reducing
storage size, and means each key is loaded only once at startup.
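
A minimal sketch of this planning fallback, assuming hypothetical names and a single size threshold (the real tsm1 planner is more involved):

```go
// Minimal sketch of the planning fallback described above. The names and the
// single size threshold are hypothetical, not the actual tsm1 planner API.
package sketch

// tsmFile stands in for a TSM file on disk.
type tsmFile struct {
	Path string
	Size int64
}

// plan returns the next set of files to compact. If full-compaction planning
// decides there is nothing to do, fall back to an optimize pass over the
// large, already-dense files so series keys get spread across them.
func plan(files []tsmFile, maxSize int64) []tsmFile {
	if group := planFull(files, maxSize); len(group) > 0 {
		return group
	}
	return planOptimize(files, maxSize)
}

// planFull skips files at or above the max size: they are already fairly
// optimized and re-compacting them is expensive.
func planFull(files []tsmFile, maxSize int64) []tsmFile {
	var group []tsmFile
	for _, f := range files {
		if f.Size < maxSize {
			group = append(group, f)
		}
	}
	if len(group) < 2 {
		return nil // nothing worth doing
	}
	return group
}

// planOptimize selects the dense files so their encoded blocks can be
// regrouped by series key without being decoded.
func planOptimize(files []tsmFile, maxSize int64) []tsmFile {
	var group []tsmFile
	for _, f := range files {
		if f.Size >= maxSize {
			group = append(group, f)
		}
	}
	if len(group) < 2 {
		return nil
	}
	return group
}
```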

This PR also fixes a panic that could occur due to duplicate data stored within a block. Normally this should not happen, but due to bugs discovered in older releases, it's possible this data may exist. The code now detects this case and fixes it as needed when merging blocks instead of triggering the assertion panic.
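
A hypothetical sketch of that safeguard (illustrative types, not the actual tsm1 merge code): when a timestamp repeats across the blocks being merged, the later point wins rather than tripping the assertion.

```go
// Hypothetical sketch of the dedup safeguard, not the actual tsm1 code.
package sketch

// point stands in for a decoded value in a block.
type point struct {
	Timestamp int64
	Value     float64
}

// mergeDedup merges two timestamp-sorted blocks. When a timestamp repeats,
// either within a block or across the two blocks, the later point overwrites
// the earlier one instead of triggering an assertion panic.
func mergeDedup(a, b []point) []point {
	merged := make([]point, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) || j < len(b) {
		var next point
		if j >= len(b) || (i < len(a) && a[i].Timestamp <= b[j].Timestamp) {
			next, i = a[i], i+1
		} else {
			next, j = b[j], j+1
		}
		if n := len(merged); n > 0 && merged[n-1].Timestamp == next.Timestamp {
			merged[n-1] = next // duplicate timestamp: replace, don't panic
			continue
		}
		merged = append(merged, next)
	}
	return merged
}
```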

@jwilder added this to the 1.0.0 milestone Jul 2, 2016
@daviesalex
Contributor

This looks fantastic. I will shortly be cherry-picking these commits into our local build and running this. I'm not sure how long you have for testing, but I or a colleague will give you feedback as soon as we can. Thanks!

@daviesalex
Contributor

With our relatively small data set:

[root@carf-metrics-influx01 data]# du -sh *
163G    influxdb-data
4.9G    influxdb-wal

Startup is still extremely slow.

[tsm1] 2016/07/04 03:12:37 compacted optimize 20 files into 10 files in 2m15.994732867s
[shard] 2016/07/04 03:23:16 /data/influxdb-data/metrics/tg_udp/autogen/35 database index loaded in 13m12.882100963s
[tsm1] 2016/07/04 03:23:16 compactions enabled for: /data/influxdb-data/metrics/tg_udp/autogen/35
[store] 2016/07/04 03:23:16 /data/influxdb-data/metrics/tg_udp/autogen/35 opened in 14m26.614865568s
[tsm1] 2016/07/04 03:23:16 beginning optimize compaction of group 0, 18 TSM files

perf top still shows time spent almost entirely in GC:

  13.40%  influxd                       [.] runtime.greyobject
   9.53%  influxd                       [.] runtime.heapBitsForObject
   5.36%  influxd                       [.] runtime.mallocgc
   4.85%  influxd                       [.] runtime.scanobject
   3.62%  influxd                       [.] runtime.aeshashbody
   3.42%  influxd                       [.] runtime.mapassign1
   2.92%  influxd                       [.] runtime.memmove
   2.65%  [kernel]                      [k] page_fault
   2.28%  influxd                       [.] runtime.mapaccess1_faststr
   2.20%  influxd                       [.] runtime/internal/atomic.Or8
   2.13%  influxd                       [.] runtime.memclr
   1.83%  influxd                       [.] sync/atomic.AddUint32
   1.61%  influxd                       [.] runtime.heapBitsSetType

This will include some of our high-cardinality data, because the change to turn that off will not take effect until midday Chicago time today (at which point we will delete all the data and start again). Of course, it will then take us some time to get enough data to really test this.

Will update later.

@daviesalex
Contributor

Start:

[run] 2016/07/04 03:08:45 InfluxDB starting, version unknown, branch unknown, commit unknown

End:

[run] 2016/07/04 03:39:00 Listening for signals

Stable so far. Load ~6.

"perf top" during normal operation does sometimes look different:

   4.95%  influxd              [.] runtime.aeshashbody
   4.34%  influxd              [.] runtime.mapassign1
   4.25%  influxd              [.] runtime.memmove
   3.94%  influxd              [.] runtime.memclr
   3.85%  influxd              [.] runtime.mallocgc
   3.18%  influxd              [.] runtime.evacuate

You can still see tens of seconds of GC per minute, which always looks a bit like this:

  31.93%  influxd                       [.] runtime.greyobject
  24.25%  influxd                       [.] runtime.heapBitsForObject
  10.36%  influxd                       [.] runtime.scanobject
   3.92%  influxd                       [.] runtime/internal/atomic.Or8
   1.29%  [kernel]                      [k] page_fault

That is an improvement on the previous start time for sure.

Of course, I have no idea if that GC is caused by compactions, by the data coming in (less likely, as that is constant), or by queries (again, less likely; at this time of the US day queries are all automated and regular).

@daviesalex
Contributor

Rebuilt with 1.7beta2 (we are on 1.7 because of a

[run] 2016/07/04 06:48:24 Go version go1.7beta2, GOMAXPROCS set to 36

until

[run] 2016/07/04 07:08:17 Listening for signals
[udp] 2016/07/04 07:08:29 failed to write point batch to database "tg_udp": timeout

So startup still takes ~20 min. ~15 min of that is pretty much entirely single-threaded, and that single thread spends nearly all of its time in GC.

I'm not sure if this is compactions or something else; would it help if I put this data somewhere you can see it?

Regardless, this PR is clearly a big win in terms of startup time. We will report any instability, but nothing so far.

@daviesalex
Contributor

With GOGC=1000, a tiny bit faster:

[run] 2016/07/04 07:13:02 Go version go1.7beta2, GOMAXPROCS set to 36
[run] 2016/07/04 07:31:56 Listening for signals
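
(For context: GOGC=1000 makes the collector far less aggressive, triggering a GC only after the heap grows by 1000% over the live set instead of the default 100%. The equivalent programmatic knob in Go, shown purely for illustration and not something influxd itself calls, is runtime/debug.SetGCPercent:)

```go
// Small illustration of what GOGC=1000 does; influxd picks up the GOGC
// environment variable via the Go runtime, this just shows the equivalent
// programmatic setting.
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to starting the process with GOGC=1000: a collection runs
	// only after the heap grows by 1000% over the live set, trading memory
	// for fewer GC cycles.
	old := debug.SetGCPercent(1000)
	fmt.Println("previous GC percent:", old)
}
```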

@daviesalex
Contributor

FWIW, we have been running this for 3 days now without issue. It's a big win for us in every way, even if it isn't the final story in compaction planning!

jwilder added 3 commits July 14, 2016 11:14
Large files created early in the leveled compactions could cause
a shard to get into a bad state.  This reworks the level planner
to handle those cases as well as splits large compactions up into
multiple groups to leverage more CPUs when possible.

For larger datasets, it's possible for shards to get into a state where
many large, dense TSM files exist.  While the shard is still hot for
writes, full compactions will skip these files since they are already
fairly optimized and full compactions are expensive.  If the write volume
is large enough, the shard can accumulate lots of these files.  When
a file is in this state, its index can contain every series, which
causes startup times to increase since the full set of series keys must
be parsed for every file.  If the number of series is high,
the index can be quite large, causing a large amount of disk IO at startup.

To fix this, an optimize compaction is run when a full compaction planning
step decides there is nothing to do.  The optimize compaction combines
and spreads the data and series keys across all files resulting in each
file containing the full series data for that shard and a subset of the
total set of keys in the shard.

This allows each series key to be stored only once in the shard, reducing
storage size, and means each key is loaded only once at startup.

Due to a bug in compactions, it's possible some blocks may have duplicate
points stored.  If those blocks are decoded and re-compacted, an assertion
panic could trigger.

We now dedup those blocks if necessary to remove the duplicate points
and avoid the panic.
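
The first commit above also mentions splitting large compactions into multiple groups to leverage more CPUs; a minimal, hypothetical sketch of that idea (not the planner's actual code):

```go
// Hypothetical sketch of splitting one large compaction into several groups
// that can each run on its own goroutine; not the planner's actual code.
package sketch

// splitIntoGroups chunks candidate file paths into groups of at most
// maxPerGroup files so the groups can be compacted concurrently.
func splitIntoGroups(paths []string, maxPerGroup int) [][]string {
	if maxPerGroup < 1 {
		maxPerGroup = 1
	}
	var groups [][]string
	for len(paths) > 0 {
		n := maxPerGroup
		if n > len(paths) {
			n = len(paths)
		}
		groups = append(groups, paths[:n])
		paths = paths[n:]
	}
	return groups
}
```

Each returned group could then be handed to its own compaction goroutine.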
@jwilder merged commit 1bc5b60 into master Jul 14, 2016
@jwilder deleted the jw-planner branch July 14, 2016 19:08