Initial changes to support PointValues with Summary information for timeseries use case #1

Open

wants to merge 2 commits into main
Conversation

rishabhmaurya (Owner) commented Aug 30, 2022

Description

Support faster range queries on the timestamp field by using aggregated stats (such as min, max, count, sum, avg, or any other decomposable aggregation function) that are precomputed at index time over the measurements associated with a timeseries point.

Changes

Usage

Index

Document doc = new Document();
doc.add(new TSIntPoint("tsid1", "cpu", timestamp, measurement));
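
A slightly fuller indexing sketch (standard Lucene IndexWriter boilerplate; TSIntPoint is the new point type from this PR, with the constructor arguments shown above, i.e. field name/TSID, metric name, timestamp, measurement):

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Sketch: index a stream of (timestamp, measurement) samples for one time series.
long[][] samples = {{1000L, 42L}, {2000L, 17L}, {3000L, 99L}};
try (Directory dir = new ByteBuffersDirectory();
     IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
  for (long[] sample : samples) {
    Document doc = new Document();
    // The field is named after the TSID (see TODOs); each point carries the
    // timestamp (the indexed dimension) plus the measurement it summarizes.
    doc.add(new TSIntPoint("tsid1", "cpu", sample[0], (int) sample[1]));
    writer.addDocument(doc);
  }
  writer.commit();
}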

Search

LeafReader leafReader; // obtained from one of the index reader's leaves
PointValues points = leafReader.getPointValues("tsid1");
TSPointQuery tsPointQuery = new TSPointQuery("tsid1", lowerBoundTimestamp, upperBoundTimestamp);
byte[] res = tsPointQuery.getSummary((BKDWithSummaryReader.BKDSummaryTree) points.getPointTree(), mergeFunction);
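
The returned summary uses the same packed byte encoding as the merge function, so it can be decoded with that function's unpackBytes. A usage sketch, assuming mergeFunction is the sum function defined below:

// Decode the packed summary produced by getSummary().
Integer sum = mergeFunction.unpackBytes(res);
System.out.println("Sum of measurements in range: " + sum);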

Stats merge function definition
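
For reference, this is the merge-function contract the implementations below satisfy, reconstructed from those implementations (the exact interface in BKDSummaryWriter may differ in detail):

public interface SummaryMergeFunction<T> {

  // Size in bytes of one packed summary value.
  int getSummarySize();

  // Combine packed summaries a and b, writing the packed result into res.
  void merge(byte[] a, byte[] b, byte[] res);

  // Decode a packed summary value.
  T unpackBytes(byte[] val);

  // Encode a summary value into res.
  void packBytes(T val, byte[] res);
}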

Sum

new BKDSummaryWriter.SummaryMergeFunction<Integer>() {

  @Override
  public int getSummarySize() {
    return Integer.BYTES;
  }

  @Override
  public void merge(byte[] a, byte[] b, byte[] res) {
    // Sum is decomposable: the merged summary is simply the sum of the two inputs.
    packBytes(unpackBytes(a) + unpackBytes(b), res);
  }

  @Override
  public Integer unpackBytes(byte[] val) {
    return NumericUtils.sortableBytesToInt(val, 0);
  }

  @Override
  public void packBytes(Integer val, byte[] res) {
    NumericUtils.intToSortableBytes(val, res, 0);
  }
};

Long max function

static class LongMaxFunction implements BKDSummaryWriter.SummaryMergeFunction<Long> {

  @Override
  public int getSummarySize() {
    return Long.BYTES;
  }

  @Override
  public void merge(byte[] a, byte[] b, byte[] res) {
    // Copy whichever packed value decodes to the larger long.
    if (unpackBytes(a) < unpackBytes(b)) {
      System.arraycopy(b, 0, res, 0, getSummarySize());
    } else {
      System.arraycopy(a, 0, res, 0, getSummarySize());
    }
  }

  @Override
  public Long unpackBytes(byte[] val) {
    return NumericUtils.sortableBytesToLong(val, 0);
  }

  @Override
  public void packBytes(Long val, byte[] res) {
    NumericUtils.longToSortableBytes(val, res, 0);
  }
}

Comparison with DocValues

Below is a comparison of running a unit test with the DocValues approach vs. the TSPoint approach.

The test ingests 10,000,000 docs against a given TSID and performs a range query on the timestamp 100 times against the same TSID. The merge function used is sum.

                        DocValues approach    TSPoint approach
Indexing took           42948 ms              32985 ms
Matching docs count     1304624               8784032
Segments                3                     10
DiskAccess              1304624               302
Search took             12382 ms              50 ms

This is not an apples-to-apples comparison, since the DocValues approach has 3 segments whereas the TSPoint approach has 10.

Limitation of this feature

  • Doc deletion is currently not supported. We need to evaluate how important it is and possibly find a way to support it in the future.
  • Only a limited set of stats functions will be supported: those that can be computed accurately by aggregating 2 points, e.g. min, max, sum, avg, count (see the sketch after this list for how avg fits this constraint).
  • Filter queries will only be supported on the Time Series ID (TSID). The TSID will be computed by hashing the dimension fields.
  • Range queries will only be supported on the timestamp.
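
Note that avg is not mergeable from two averages alone; it satisfies the two-point constraint only if the summary carries a (sum, count) pair and the average is derived at read time. A minimal sketch of that idea, following the same interface as the functions above (this pair-based function is an illustration, not code from the PR):

// Hypothetical merge function for avg: the packed summary holds (sum, count)
// as two longs; the average itself is only computed when the result is read.
static class LongAvgFunction implements BKDSummaryWriter.SummaryMergeFunction<long[]> {

  @Override
  public int getSummarySize() {
    return 2 * Long.BYTES; // sum followed by count
  }

  @Override
  public void merge(byte[] a, byte[] b, byte[] res) {
    // Sums and counts both merge by addition, so (sum, count) is decomposable.
    long[] x = unpackBytes(a);
    long[] y = unpackBytes(b);
    packBytes(new long[] {x[0] + y[0], x[1] + y[1]}, res);
  }

  @Override
  public long[] unpackBytes(byte[] val) {
    return new long[] {
      NumericUtils.sortableBytesToLong(val, 0),
      NumericUtils.sortableBytesToLong(val, Long.BYTES)
    };
  }

  @Override
  public void packBytes(long[] val, byte[] res) {
    NumericUtils.longToSortableBytes(val[0], res, 0);
    NumericUtils.longToSortableBytes(val[1], res, Long.BYTES);
  }
}

// At read time: long[] s = fn.unpackBytes(res); double avg = (double) s[0] / s[1];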

TODOs

  • Implementation for multiple TSIDs. For now, we need to create a new field whose name is the same as the TSID of the timeseries.
  • Segment merge for BKD with summaries. Currently, the UTs disable merging, perform the search across multiple segments, and accumulate the results.
  • Pluggable merge function to merge 2 TSPoints. Currently it is hardcoded in FieldInfo.java, which isn't the right place to define it.
  • Measurement compression in BKD. I'm thinking of using delta encoding to store measurement values and summaries while packing the summaries associated with nodes of the tree (see the sketch after this list).
  • Persist the first and last docID in internal nodes of the BKD with summaries in an efficient way. This will make it possible to use precomputed summaries and skip over batches of documents when iterating with a DocIDSetIterator. It is a blocker for integration with the OpenSearch aggregation framework.
  • Integrate with the OpenSearch aggregation framework.
  • Benchmark against a real timeseries dataset.
    • Compare against the SortedDocValues approach.
    • Compare against other timeseries databases.
  • Evaluate support for deletion of a document/timeseries/batch of documents (matching a timestamp range).
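
For the measurement-compression TODO above, a minimal sketch of what delta encoding looks like for sorted timestamps (plain Java, independent of this PR's BKD packing):

// Delta-encode a sorted array of timestamps: store gaps instead of absolute values.
// Gaps between successive timestamps are typically small and compress well
// (e.g. with variable-length ints), unlike the raw absolute values.
static long[] deltaEncode(long[] timestamps) {
  long[] deltas = new long[timestamps.length];
  long prev = 0;
  for (int i = 0; i < timestamps.length; i++) {
    deltas[i] = timestamps[i] - prev;
    prev = timestamps[i];
  }
  return deltas;
}

// Reverse the encoding by accumulating the gaps back into absolute values.
static long[] deltaDecode(long[] deltas) {
  long[] timestamps = new long[deltas.length];
  long prev = 0;
  for (int i = 0; i < deltas.length; i++) {
    prev += deltas[i];
    timestamps[i] = prev;
  }
  return timestamps;
}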

New interfaces

Compression of values and timestamp


github-actions bot commented Mar 7, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
