Initial changes to support PointValues with Summary information for timeseries use case #1

Open

wants to merge 2 commits into main
Conversation

rishabhmaurya (Owner) commented Aug 30, 2022

Description

Support faster range queries on the timestamp field by using aggregated stats (such as min, max, count, sum, avg, or any other decomposable aggregation function) that are precomputed at index time over the measurements associated with a timeseries point.

Changes

Usage

Index

Document doc = new Document();
doc.add(new TSIntPoint("tsid1", "cpu", timestamp, measurement));
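
A slightly fuller indexing sketch (standard Lucene IndexWriter boilerplate; TSIntPoint is the new point type from this PR, with the constructor arguments shown above, i.e. field name/TSID, metric name, timestamp, measurement):

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Sketch: index a stream of (timestamp, measurement) samples for one time series.
long[][] samples = {{1000L, 42L}, {2000L, 17L}, {3000L, 99L}};
try (Directory dir = new ByteBuffersDirectory();
     IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
  for (long[] sample : samples) {
    Document doc = new Document();
    // The field is named after the TSID (see TODOs); each point carries the
    // timestamp (the indexed dimension) plus the measurement it summarizes.
    doc.add(new TSIntPoint("tsid1", "cpu", sample[0], (int) sample[1]));
    writer.addDocument(doc);
  }
  writer.commit();
}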

Search

LeafReader leafReader; // obtained from one of the index reader's leaves
PointValues points = leafReader.getPointValues("tsid1");
TSPointQuery tsPointQuery = new TSPointQuery("tsid1", lowerBoundTimestamp, upperBoundTimestamp);
byte[] res = tsPointQuery.getSummary((BKDWithSummaryReader.BKDSummaryTree) points.getPointTree(), mergeFunction);
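
The returned summary uses the same packed byte encoding as the merge function, so it can be decoded with that function's unpackBytes. A usage sketch, assuming mergeFunction is the sum function defined below:

// Decode the packed summary produced by getSummary().
Integer sum = mergeFunction.unpackBytes(res);
System.out.println("Sum of measurements in range: " + sum);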

Stats merge function definition
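
For reference, this is the merge-function contract the implementations below satisfy, reconstructed from those implementations (the exact interface in BKDSummaryWriter may differ in detail):

public interface SummaryMergeFunction<T> {

  // Size in bytes of one packed summary value.
  int getSummarySize();

  // Combine packed summaries a and b, writing the packed result into res.
  void merge(byte[] a, byte[] b, byte[] res);

  // Decode a packed summary value.
  T unpackBytes(byte[] val);

  // Encode a summary value into res.
  void packBytes(T val, byte[] res);
}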

Sum

new BKDSummaryWriter.SummaryMergeFunction<Integer>() {

  @Override
  public int getSummarySize() {
    return Integer.BYTES;
  }

  @Override
  public void merge(byte[] a, byte[] b, byte[] res) {
    // Sum is decomposable: the merged summary is simply the sum of the two inputs.
    packBytes(unpackBytes(a) + unpackBytes(b), res);
  }

  @Override
  public Integer unpackBytes(byte[] val) {
    return NumericUtils.sortableBytesToInt(val, 0);
  }

  @Override
  public void packBytes(Integer val, byte[] res) {
    NumericUtils.intToSortableBytes(val, res, 0);
  }
};

Long max function

static class LongMaxFunction implements BKDSummaryWriter.SummaryMergeFunction<Long> {

  @Override
  public int getSummarySize() {
    return Long.BYTES;
  }

  @Override
  public void merge(byte[] a, byte[] b, byte[] res) {
    // Copy whichever packed value decodes to the larger long.
    if (unpackBytes(a) < unpackBytes(b)) {
      System.arraycopy(b, 0, res, 0, getSummarySize());
    } else {
      System.arraycopy(a, 0, res, 0, getSummarySize());
    }
  }

  @Override
  public Long unpackBytes(byte[] val) {
    return NumericUtils.sortableBytesToLong(val, 0);
  }

  @Override
  public void packBytes(Long val, byte[] res) {
    NumericUtils.longToSortableBytes(val, res, 0);
  }
}

Comparison with DocValues

Below is a comparison of running a unit test with the DocValues approach vs. the TSPoint approach.

The test ingests 10,000,000 docs against a given TSID and performs a range query on the timestamp 100 times against the same TSID. The merge function used is sum.

                        DocValues approach    TSPoint approach
Indexing took           42948 ms              32985 ms
Matching docs count     1304624               8784032
Segments                3                     10
DiskAccess              1304624               302
Search took             12382 ms              50 ms

This is not an apples-to-apples comparison, since the DocValues approach has 3 segments whereas the TSPoint approach has 10.

Limitation of this feature

  • Doc deletion is currently not supported. We need to evaluate how important it is and possibly find a way to support it in the future.
  • Only a limited set of stats functions will be supported: those that can be computed accurately by aggregating 2 points, e.g. min, max, sum, avg, count (see the sketch after this list for how avg fits this constraint).
  • Filter queries will only be supported on the Time Series ID (TSID). The TSID will be computed by hashing the dimension fields.
  • Range queries will only be supported on the timestamp.
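
Note that avg is not mergeable from two averages alone; it satisfies the two-point constraint only if the summary carries a (sum, count) pair and the average is derived at read time. A minimal sketch of that idea, following the same interface as the functions above (this pair-based function is an illustration, not code from the PR):

// Hypothetical merge function for avg: the packed summary holds (sum, count)
// as two longs; the average itself is only computed when the result is read.
static class LongAvgFunction implements BKDSummaryWriter.SummaryMergeFunction<long[]> {

  @Override
  public int getSummarySize() {
    return 2 * Long.BYTES; // sum followed by count
  }

  @Override
  public void merge(byte[] a, byte[] b, byte[] res) {
    // Sums and counts both merge by addition, so (sum, count) is decomposable.
    long[] x = unpackBytes(a);
    long[] y = unpackBytes(b);
    packBytes(new long[] {x[0] + y[0], x[1] + y[1]}, res);
  }

  @Override
  public long[] unpackBytes(byte[] val) {
    return new long[] {
      NumericUtils.sortableBytesToLong(val, 0),
      NumericUtils.sortableBytesToLong(val, Long.BYTES)
    };
  }

  @Override
  public void packBytes(long[] val, byte[] res) {
    NumericUtils.longToSortableBytes(val[0], res, 0);
    NumericUtils.longToSortableBytes(val[1], res, Long.BYTES);
  }
}

// At read time: long[] s = fn.unpackBytes(res); double avg = (double) s[0] / s[1];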

TODOs

  • Implementation for multiple TSIDs. For now, we need to create a new field whose name is the same as the TSID of the timeseries.
  • Segment merge for BKD with summaries. Currently, the UTs disable merging, perform the search across multiple segments, and accumulate the results.
  • Pluggable merge function to merge 2 TSPoints. Currently it is hardcoded in FieldInfo.java, which isn't the right place to define it.
  • Measurement compression in BKD. I'm thinking of using delta encoding to store measurement values and summaries while packing the summaries associated with nodes of the tree (see the sketch after this list).
  • Persist the first and last docID in internal nodes of the BKD with summaries in an efficient way. This will make it possible to use precomputed summaries and skip over batches of documents when iterating with a DocIDSetIterator. It is a blocker for integration with the OpenSearch aggregation framework.
  • Integrate with the OpenSearch aggregation framework.
  • Benchmark against a real timeseries dataset.
    • Compare against the SortedDocValues approach.
    • Compare against other timeseries databases.
  • Evaluate support for deletion of a document/timeseries/batch of documents (matching a timestamp range).
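
For the measurement-compression TODO above, a minimal sketch of what delta encoding looks like for sorted timestamps (plain Java, independent of this PR's BKD packing):

// Delta-encode a sorted array of timestamps: store gaps instead of absolute values.
// Gaps between successive timestamps are typically small and compress well
// (e.g. with variable-length ints), unlike the raw absolute values.
static long[] deltaEncode(long[] timestamps) {
  long[] deltas = new long[timestamps.length];
  long prev = 0;
  for (int i = 0; i < timestamps.length; i++) {
    deltas[i] = timestamps[i] - prev;
    prev = timestamps[i];
  }
  return deltas;
}

// Reverse the encoding by accumulating the gaps back into absolute values.
static long[] deltaDecode(long[] deltas) {
  long[] timestamps = new long[deltas.length];
  long prev = 0;
  for (int i = 0; i < deltas.length; i++) {
    prev += deltas[i];
    timestamps[i] = prev;
  }
  return timestamps;
}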

New interfaces

Compression of values and timestamp


github-actions bot commented Mar 7, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
