Compaction planning fixes #6952
Conversation
This looks fantastic. I will shortly be cherry-picking these commits into our local build and running this. I'm not sure how long you have for testing, but I or a colleague will give you feedback as soon as I can. Thanks!
With our relatively small data set:
The start is still extremely slow.
Perf top still effectively entirely in GC:
This will include some of our high-cardinality data, because the change to turn that off will not take effect until midday Chicago time today (at which point we will delete all the data and start again). Of course it will then take us some time to get enough data to really test this. Will update later.
Start:
End:
Stable so far. Load ~6. "perf top" during normal operation does sometimes look different:
You can still see tens of seconds of GC per minute, which always looks a bit like this:
That is an improvement on the previous start time, for sure. Of course I have no idea if that GC is caused by compactions, by the data coming in (less likely, as that is constant), or by queries (again less likely; at this time of the US day, queries are all automated and regular).
Rebuilt with 1.7beta2 (we are on 1.7 because of a
until
So still ~20 min. ~15 min of that is pretty much entirely single-threaded, and spends all of that single thread in GC. I'm not sure if this is compactions or something else; would it help if I put this data somewhere you can see it? Regardless, in terms of startup time, this PR is clearly a big win. We will report any instability, but nothing so far.
With GOGC=1000, a tiny bit faster:
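For context, GOGC is the Go runtime's garbage-collection target percentage; raising it (here to 1000) lets the heap grow much larger between collections in exchange for less GC CPU time. As a minimal sketch (not something from this PR or thread), the in-process equivalent of setting the GOGC environment variable is `debug.SetGCPercent`:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Roughly equivalent to starting the process with GOGC=1000: allow the heap
	// to grow to about 10x the live set before the next collection, trading
	// memory for fewer GC cycles. SetGCPercent returns the previous setting.
	old := debug.SetGCPercent(1000)
	fmt.Printf("GC target percent changed from %d to %d\n", old, 1000)
}
```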
FWIW, we have been running this for 3 days now without issue. It's a big win for us in every way, even if it isn't the final story in compaction planning!
Large files created early in the leveled compactions could cause a shard to get into a bad state. This reworks the level planner to handle those cases and splits large compactions into multiple groups to leverage more CPUs when possible.
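As a rough illustration of the "split large compactions into multiple groups" idea (the function and file names here are invented for illustration, not the planner's actual API), one way to divide a large set of files into per-CPU groups is:

```go
package main

import (
	"fmt"
	"runtime"
)

// splitIntoGroups is a hypothetical sketch (not the actual planner code) of
// breaking one large compaction into several smaller groups so each group can
// be compacted concurrently on its own goroutine/CPU.
func splitIntoGroups(files []string, maxGroups int) [][]string {
	if maxGroups <= 0 {
		maxGroups = runtime.NumCPU()
	}
	if len(files) == 0 || maxGroups == 1 {
		return [][]string{files}
	}
	// Keep each group a contiguous run of files so ordering within a group
	// is preserved.
	size := (len(files) + maxGroups - 1) / maxGroups
	var groups [][]string
	for i := 0; i < len(files); i += size {
		end := i + size
		if end > len(files) {
			end = len(files)
		}
		groups = append(groups, files[i:end])
	}
	return groups
}

func main() {
	files := []string{"000001.tsm", "000002.tsm", "000003.tsm", "000004.tsm", "000005.tsm"}
	for i, g := range splitIntoGroups(files, 2) {
		fmt.Println("group", i, g)
	}
}
```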
For larger datasets, it's possible for shards to get into a state where many large, dense TSM files exist. While the shard is still hot for writes, full compactions will skip these files since they are already fairly optimized and full compactions are expensive. If the write volume is large enough, the shard can accumulate lots of these files. When a file is in this state, its index can contain every series, which causes startup times to increase since each file must parse the full set of series keys for every file. If the number of series is high, the index can be quite large, causing a large amount of disk I/O at startup.

To fix this, an optimize compaction is run when a full compaction planning step decides there is nothing to do. The optimize compaction combines and spreads the data and series keys across all files, resulting in each file containing the full series data for that shard and a subset of the total set of keys in the shard. This allows a shard to store a series key only once, reducing storage size, and to load each key only once at startup.
Due to a bug in compactions, it's possible some blocks may have duplicate points stored. If those blocks are decoded and re-compacted, an assertion panic could trigger. We now dedup those blocks if necessary to remove the duplicate points and avoid the panic.
This PR fixes some issues with compactions for shards with larger numbers of series and larger TSM file sizes.
For larger datasets, it's possible for shards to get into a state where
many large, dense TSM files exist. While the shard is still hot for
writes, full compactions will skip these files since they are already
fairly optimized and full compactions are expensive. If the write volume
is large enough, the shard can accumulate lots of these files. When
a TSM file is in this state, its index can contain every series, which
causes startup times to increase since each file must parse the full
set of series keys for every file. If the number of series is high,
the index can be quite large, causing a large amount of disk I/O at
startup as well.
To fix this, an optimize compaction is run when a full compaction planning
step decides there is nothing to do. The optimize compaction combines
and spreads the data and series keys across all files, resulting in each
file containing the full series data for that shard and a subset of the
total set of keys in the shard. The optimize compaction uses a faster
method that avoids decoding and combining blocks.
This allows a shard to store a series key only once, reducing storage
size, and to load each key only once at startup.
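The planning decision described above could be sketched roughly as follows; the type, threshold, and function names here are invented for illustration and are not influxdb's actual planner API:

```go
package main

import "fmt"

// tsmFile is a hypothetical stand-in for the planner's view of a TSM file;
// the real planner works with richer file statistics.
type tsmFile struct {
	Path           string
	Size           int64
	FullyCompacted bool
}

// Illustrative size cap below which a file is still worth rewriting.
const maxTSMFileSize = 2 * 1024 * 1024 * 1024 // 2 GB

// planOptimize sketches the idea: when the full-compaction planner has
// nothing to do but several large, fully compacted files remain, group them
// for an "optimize" pass that redistributes series keys so each key lives in
// only one file. This is an illustration, not the engine's code.
func planOptimize(files []tsmFile) [][]tsmFile {
	var candidates []tsmFile
	for _, f := range files {
		if f.FullyCompacted && f.Size < maxTSMFileSize {
			candidates = append(candidates, f)
		}
	}
	// A single file has nothing to merge with; skip the extra work.
	if len(candidates) < 2 {
		return nil
	}
	return [][]tsmFile{candidates}
}

func main() {
	files := []tsmFile{
		{Path: "000010-04.tsm", Size: 900 << 20, FullyCompacted: true},
		{Path: "000020-04.tsm", Size: 850 << 20, FullyCompacted: true},
		{Path: "000030-01.tsm", Size: 40 << 20, FullyCompacted: false},
	}
	fmt.Println(planOptimize(files))
}
```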
This PR also fixes a panic that could occur due to duplicate data stored within a block. Normally this should not occur, but due to bugs discovered in older releases, it's possible this data may exist. The code now detects this case and fixes it as needed when merging blocks, instead of triggering the assertion panic.
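The dedup-instead-of-panic behaviour can be sketched generically; this assumes the values decoded from overlapping blocks arrive sorted by timestamp, and the `point` type here is a stand-in rather than the engine's real typed value slices:

```go
package main

import "fmt"

// point is a minimal stand-in for a decoded TSM value (timestamp + value).
type point struct {
	Timestamp int64
	Value     float64
}

// dedupe drops duplicate timestamps from a timestamp-sorted slice, keeping the
// later occurrence, rather than asserting that duplicates can never exist.
// It compacts in place, reusing the input's backing array.
func dedupe(points []point) []point {
	if len(points) < 2 {
		return points
	}
	out := points[:1]
	for _, p := range points[1:] {
		if p.Timestamp == out[len(out)-1].Timestamp {
			out[len(out)-1] = p // duplicate timestamp: keep the newer point
			continue
		}
		out = append(out, p)
	}
	return out
}

func main() {
	pts := []point{{1, 1.0}, {2, 2.0}, {2, 2.5}, {3, 3.0}}
	fmt.Println(dedupe(pts)) // [{1 1} {2 2.5} {3 3}]
}
```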