Move backing indices of data streams to LogByteMergePolicy #87684
Labels
:Distributed Indexing/Engine
>enhancement
Team:Distributed (Obsolete)
Description
Currently Elasticsearch uses TieredMergePolicy on all indices. This merge policy is a good default: it's good at picking balanced merges, reclaiming deletes, etc. However, it has one non-negligible downside for time-based data: it can merge non-adjacent segments. This most often happens when computing merges on the largest level, where TieredMergePolicy uses a greedy algorithm to pack multiple segments together in order to reach 5GB.
Why are non-adjacent merges bad for time-based data? Time range queries are optimized for the case when a segment either doesn't match at all or when all documents of the segment match. By returning non-adjacent merges, the merge policy combines segments that might have very different time ranges, which in turn makes queries more likely to partially match segments.
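To make the argument about partial matches concrete, here is a minimal sketch. It models each segment only by its minimum and maximum timestamp; the `Segment` class, the `classify` and `merge` helpers, and the one-hour ranges are all hypothetical and are not Lucene or Elasticsearch APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    min_ts: int  # smallest timestamp in the segment (hypothetical unit: minutes)
    max_ts: int  # largest timestamp in the segment


def classify(seg, q_from, q_to):
    """Relation between the inclusive query range [q_from, q_to] and a segment."""
    if seg.max_ts < q_from or seg.min_ts > q_to:
        return "none"     # segment can be skipped entirely
    if q_from <= seg.min_ts and seg.max_ts <= q_to:
        return "full"     # every document matches
    return "partial"      # documents must be checked individually


def merge(*segs):
    """The merged segment covers the union of the input time ranges."""
    return Segment(min(s.min_ts for s in segs), max(s.max_ts for s in segs))


# Four segments flushed in time order, one hour each (made-up data).
segments = [Segment(0, 59), Segment(60, 119), Segment(120, 179), Segment(180, 239)]
query = (0, 119)

# Adjacent merges keep time ranges tight.
adjacent = [merge(segments[0], segments[1]), merge(segments[2], segments[3])]
# Non-adjacent merges produce wide, overlapping time ranges.
non_adjacent = [merge(segments[0], segments[2]), merge(segments[1], segments[3])]

adjacent_result = [classify(s, *query) for s in adjacent]
non_adjacent_result = [classify(s, *query) for s in non_adjacent]
print(adjacent_result)      # ['full', 'none']
print(non_adjacent_result)  # ['partial', 'partial']
```

With adjacent merges the query hits the optimized "all or nothing" cases for both merged segments. With non-adjacent merges both merged segments only partially match.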
Once we have a merge policy that only performs adjacent merges, we could potentially look into taking advantage of it more, e.g. by computing aggregations using index stats when the query fully matches a segment (even though it might not fully match the whole index).
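As a rough illustration of that idea, the sketch below counts documents in a time range and uses a per-segment document count whenever the query fully matches a segment. It is an assumption-laden toy: segments are plain sorted lists of timestamps, `count_in_range` is a hypothetical helper, and nothing here reflects how Elasticsearch actually computes aggregations.

```python
def count_in_range(segments, q_from, q_to):
    """Count documents with q_from <= timestamp <= q_to.

    Each segment is a non-empty sorted list of timestamps. Returns a tuple
    (matching_docs, docs_scanned), where docs_scanned counts only documents
    that had to be inspected one by one.
    """
    total = 0
    scanned = 0
    for timestamps in segments:
        lo, hi, doc_count = timestamps[0], timestamps[-1], len(timestamps)
        if hi < q_from or lo > q_to:
            continue                 # no match: skip the segment
        if q_from <= lo and hi <= q_to:
            total += doc_count       # full match: answer from segment stats
        else:
            scanned += doc_count     # partial match: inspect every document
            total += sum(1 for t in timestamps if q_from <= t <= q_to)
    return total, scanned


# Three segments with non-overlapping time ranges (made-up data).
segments = [[0, 10, 20], [30, 40, 50], [60, 70, 80]]
print(count_in_range(segments, 0, 65))  # (7, 3)
```

Only the last segment is scanned here. The first two contribute their document counts directly, even though the query does not fully match the index as a whole.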
Lucene has another merge policy that is very similar to TieredMergePolicy but only performs merges of adjacent segments: LogByteMergePolicy. So what are the downsides of moving to LogByteMergePolicy?
TieredMergePolicy users would get segments whose maximum size is very close to 5GB. With LogByteMergePolicy, maximum-size segments might sometimes be e.g. 4GB instead of 5GB, because the next adjacent segment was 1.5GB and so couldn't be merged in (the merged segment would have been 5.5GB, above the configured max merged segment size of 5GB).
TieredMergePolicy can efficiently optimize reclaiming deletes by merging together segments that have the most deletes. LogByteMergePolicy cannot do that because these segments might not be adjacent in the shard. This shouldn't be a problem for backing indices of data streams, which are not expected to get deletes in the common case.
TieredMergePolicy has more options to return merges that are more balanced. I wouldn't expect a degradation to be visible in practice though.
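The 4GB versus 5GB trade-off from the first downside can be reproduced with a toy model. Both functions below are deliberate simplifications written for this illustration: `adjacent_merge` only combines neighbouring segments under a size cap, and `pack_any` is a first-fit-decreasing packer that may combine any segments. Neither is the actual algorithm of LogByteMergePolicy or TieredMergePolicy, and the segment sizes are made up.

```python
MAX_MERGED_GB = 5.0  # max merged segment size quoted in this issue


def adjacent_merge(sizes, max_size=MAX_MERGED_GB):
    """Merge runs of neighbouring segments without exceeding max_size."""
    merged, current = [], 0.0
    for size in sizes:
        # Close the current run if adding the next neighbour would overshoot.
        if current and current + size > max_size:
            merged.append(current)
            current = 0.0
        current += size
    if current:
        merged.append(current)
    return merged


def pack_any(sizes, max_size=MAX_MERGED_GB):
    """First-fit-decreasing packing; free to combine non-adjacent segments."""
    bins = []
    for size in sorted(sizes, reverse=True):
        for i, used in enumerate(bins):
            if used + size <= max_size:
                bins[i] = used + size
                break
        else:
            bins.append(size)
    return bins


sizes_gb = [2.5, 1.5, 1.5, 1.0]   # segments in shard order (made-up sizes)
print(adjacent_merge(sizes_gb))   # [4.0, 2.5]
print(pack_any(sizes_gb))         # [5.0, 1.5]
```

The adjacent-only variant stops at 4GB because the next neighbour (1.5GB) would push the merged segment to 5.5GB, while the unconstrained packer reaches exactly 5GB by skipping over that neighbour.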