Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store summarized results in internal nodes of BKD for time series points #11745

Open
rishabhmaurya opened this issue Sep 2, 2022 · 0 comments

Comments

@rishabhmaurya
Copy link

Description

Time series points have a timestamp, measurement and dimensions associated with them. The common queries are range queries on timestamp, metric aggregation on measurement and grouping on dimensions. Or similar query with histogram on timestamp. 

Proposal:

Prototype can be found here

  1. Introduce a new time series point as a field in lucene - TSPoint which can be added as - 
Document doc = new Document(); doc.add(new TSIntPoint("tsid1", "cpu", timestamp, measurement));

tsid1 is the time series ID of the point. It will be the unit of storage for time series points and for prototype each of them represents a unique field in lucene. 

timestamp is the actual point in BKD on which the index is created. 

Full definition can be found here.

  1. Interface for decomposable aggregate function, which can be defined as part of index configuration - 

Sum function - 

new BKDSummaryWriter.SummaryMergeFunction<Integer>() {

      @Override
      public int getSummarySize()

{         return Integer.BYTES;       }

      @Override
      public void merge(byte[] a, byte[] b, byte[] c)

{         packBytes(unpackBytes(a) + unpackBytes(b), c);       }

      @Override
      public Integer unpackBytes(byte[] val)

{         return NumericUtils.sortableBytesToInt(val, 0);       }

      @Override
      public void packBytes(Integer val, byte[] res)

{         NumericUtils.intToSortableBytes(val, res, 0);       }

    };
  1. New query per LeafReader to perform range queries on TSPoint and retrieve summarized results - 
LeafReader leafReader; PointValues points = leafReader.getPointValues("tsid1");

TSPointQuery tsPointQuery = new TSPointQuery("tsid1", lowerBoundTimestamp, upperBoundTimestamp);

byte[] res = tsPointQuery.getSummary((BKDWithSummaryReader.BKDSummaryTree) points.getPointTree(), mergeFunction);

Instead of BKDReader and BKDWriter, we will be using BKDSummaryWriter BKDSummaryReader which supports writing summaries in internal nodes of the tree. 

Changes in IntersectVisitor interface here.

Comparison with DocValues

Below is the comparison of running unit test for DocValue approach vs TSPoint approach -

This test ingests 10000000 docs against a given TSID and performs a range query on timestamp 100 times against the same TSID. Merge function used is sum.

DocValues approach TSPoint approach
Indexing took: 42948 ms Indexing took: 32985
Matching docs count:1304624 | Segments:3 | DiskAccess: 1304624 Matching docs count:8784032 | Segments:10 | DiskAccess: 302
Search took: 12382 ms Search took: 50ms

This is not apple to apple comparison since number of segments are 3 in DocValues approach whereas its 10 in TSPoint approach.

Limitation of this feature

  • Doc deletion currently not supported. We need to evaluate how important is it and possibly find a way to support it in future.
  • Only decomposable aggregation functions can be supported. E.g. min, max, sum, avg, count.

Todos

  • Implementation for multiple TSIDs. For now we need to create a new field with the name same as TSID for a timeseries.
  • Segment merge for BKD with summaries. Currently, the UTs disables merge and perform search across multiple segments and cumulate the results.
  • Pluggable merge function to merge 2 TSPoint. Currently its hardcoded in FieldInfo.java which isn't the right place to define them.
  • Measurement compression in BKD. I'm thinking of using delta encoding to store measurement values and summaries while packing the summaries associated with nodes of the tree.
  • Persist first and last docID in internal nodes of BKD with summaries in an efficient way. This will be useful to use precomputed summaries and skip over batches of documents when iterating using DocIDSetIterator. Its a blocker for integration with OpenSearch aggregation framework.
  • Integrate with OpenSearch aggregation framework.
  • Benchmark against real timeseries dataset.
    • compare against SortedDocValues approach.
    • compare against other timeseries databases.
  • Evaluate support of deletion of document/timeseries/batch of documents (matching a timestamp range).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant