TSDB dimensions encoding #99747
Conversation
Rally results. Note that this was executed locally, so take it with a shovel of salt. I saw a lot of run-to-run variance, even for disk usage metrics. Highlights:
Full rally results (collapsed)
Is it expected that gauges are affected too?
No, that's not expected. The new encoding shouldn't affect gauges and counters; maybe there's a bug that makes it accidentally apply to them. But more likely, the issue is that I didn't force-merge down to one segment in these tests. There is a force-merge step in the benchmark and I assumed that it would force-merge to one segment by default. It turns out that it doesn't default to 1.
Nice work! The fact that we're not seeing better improvements makes me wonder whether we're missing something. I think that this format could be a good pick for logs later on, so I would like to avoid specializing it too much for the time-series use case.
Bits 0-1 of the first byte of the header would describe the format that is used; how the data and the rest of the header need to be interpreted depends on this format:
- When the format is 0x00, bits 3-7 store the lower 5 bits of the ordinal, and bit 2 is a continuation bit that tells whether more data needs to be read. So it would take a single byte to encode a block whose ordinals are all the same with a value <= 31, 2 bytes if the ordinal is in [32, 4096), 3 bytes if the ordinal is in [4096, 524288), etc.
- When the format is 0x01, no additional information is stored in the header, and ordinals are encoded on just as many bits as are required based on the maximum ord of the field, which is stored at the segment level.
- When the format is 0x02, the first ordinal is encoded the same way as with 0x00. Then a byte stores the offset at which the ordinal value changes, and a zLong stores the delta between the ordinal at the beginning of the block and the ordinal at the end of the block (to optimize for the case when the field is part of the index sort).

What do you think?
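To make the three proposed formats concrete, here is a minimal decoding sketch. It illustrates the description above and is not the PR's code: it assumes plain java.io.DataInput rather than Lucene's IndexInput, a fixed block size of 128, and hypothetical names (OrdinalBlockSketch, readVLong, readZLong).

```java
import java.io.DataInput;
import java.io.IOException;
import java.util.Arrays;

final class OrdinalBlockSketch {

    static final int BLOCK_SIZE = 128;

    static void decode(DataInput in, long[] ords, int bitsPerOrd) throws IOException {
        final int header = in.readUnsignedByte();
        switch (header & 0x03) {               // bits 0-1: format id
            case 0x00: {                       // all ordinals in the block are equal
                long ord = (header >>> 3) & 0x1FL;  // bits 3-7: low 5 bits of the ordinal
                if ((header & 0x04) != 0) {         // bit 2: continuation bit
                    ord |= readVLong(in) << 5;      // high bits follow in vlong bytes
                }
                Arrays.fill(ords, 0, BLOCK_SIZE, ord);
                break;
            }
            case 0x01: {                       // bit-packed on bitsPerOrd bits, derived
                long buffer = 0;               // from the segment-level maximum ordinal
                int bits = 0;
                for (int i = 0; i < BLOCK_SIZE; i++) {
                    while (bits < bitsPerOrd) {
                        buffer = (buffer << 8) | in.readUnsignedByte();
                        bits += 8;
                    }
                    ords[i] = (buffer >>> (bits - bitsPerOrd)) & ((1L << bitsPerOrd) - 1);
                    bits -= bitsPerOrd;
                }
                break;
            }
            case 0x02: {                       // two runs: first ord, change offset, delta
                long first = (header >>> 3) & 0x1FL;
                if ((header & 0x04) != 0) {
                    first |= readVLong(in) << 5;
                }
                final int changeOffset = in.readUnsignedByte();
                final long last = first + readZLong(in);  // signed delta first -> last
                Arrays.fill(ords, 0, changeOffset, first);
                Arrays.fill(ords, changeOffset, BLOCK_SIZE, last);
                break;
            }
            default:
                throw new IOException("unknown block format");
        }
    }

    // LEB128-style variable-length long: 7 payload bits per byte, high bit continues.
    static long readVLong(DataInput in) throws IOException {
        long value = 0;
        for (int shift = 0; ; shift += 7) {
            final int b = in.readUnsignedByte();
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
        }
    }

    // Zigzag-decoded vlong, so small negative deltas stay small on disk.
    static long readZLong(DataInput in) throws IOException {
        final long v = readVLong(in);
        return (v >>> 1) ^ -(v & 1);
    }
}
```

With this layout, format 0x00 indeed costs a single byte for ordinals up to 31 (5 header bits), two bytes for [32, 4096) (5 + 7 bits), and so on.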
FYI I was interested in using your change in another context, so I had a go at implementing something like the format I suggested above. I uploaded it here in case you are interested in checking it out: main...jpountz:elasticsearch:pr/99747.
Your proposal with the header makes sense to me. Feel free to push the commits to this PR. This only supports runs of 1 or 2 unique values per block, right? I suppose the assumption is that the block will mostly have one value and sometimes two, at the boundary between two different runs. Is that right? I'm wondering if we can optimize further by making the blocks larger if there are runs of over 128 consecutive ordinals. When decoding these longer runs, we could also avoid expanding them into an array.
This is correct.
We can make blocks larger, but there is a trade-off as blocks need to be entirely decoded even though we might need a single value in them.
We can, but this has a trade-off as well. Right now the read logic has very few conditionals and mostly looks like the sketch below.
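For illustration, the shape being described is roughly the following (a hypothetical sketch; blockAddress stands in for the DirectMonotonicReader index lookup, and decodeBlock for the format-specific decoding):

```java
import java.io.IOException;

abstract class BlockReaderSketch {
    static final int BLOCK_SHIFT = 7;                     // 128 values per block
    static final int BLOCK_MASK = (1 << BLOCK_SHIFT) - 1;

    private final long[] currentValues = new long[1 << BLOCK_SHIFT];
    private long currentBlock = -1;

    long get(long index) throws IOException {
        final long block = index >>> BLOCK_SHIFT;
        if (block != currentBlock) {                      // the only branch on the hot path
            decodeBlock(blockAddress(block), currentValues);
            currentBlock = block;
        }
        return currentValues[(int) (index & BLOCK_MASK)]; // plain array read
    }

    abstract long blockAddress(long block) throws IOException;
    abstract void decodeBlock(long address, long[] dst) throws IOException;
}
```

Keeping runs confined to fixed-size blocks keeps the per-value path branch-free; run-aware shortcuts would add conditionals here.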
Maybe we can make the header variable in size: when bit 0 is not set (0x00), bit 1 is the continuation bit and bits 2-7 store the lower 6 bits, so that we can store the range [0, 64) in a single byte, [64, 8192) in 2 bytes, etc. The tradeoff is that we can't use 0x02, as its first bit would be 0 and would be interpreted as "all ordinals in the block are the same", but as we only intend to use 3 states, this is probably fine. WDYT?
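A minimal write-side sketch of this revised header, under the same assumptions as before (illustrative names, plain java.io types, not the PR's code):

```java
import java.io.DataOutput;
import java.io.IOException;

final class VariableHeaderSketch {

    // "All ordinals equal" with a variable-size header: bit 0 clear selects the
    // format, bit 1 is the continuation bit, and bits 2-7 carry the low 6 bits
    // of the ordinal, so [0, 64) fits in one byte and [64, 8192) in two.
    static void writeAllEqual(DataOutput out, long ord) throws IOException {
        int first = (int) ((ord & 0x3F) << 2);  // low 6 bits into bits 2-7; bit 0 stays 0
        ord >>>= 6;
        if (ord != 0) {
            first |= 0x02;                      // bit 1: more bytes follow
        }
        out.writeByte(first);
        while (ord != 0) {                      // standard vlong bytes for the high bits
            int b = (int) (ord & 0x7F);
            ord >>>= 7;
            if (ord != 0) {
                b |= 0x80;
            }
            out.writeByte(b);
        }
    }
}
```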
This sounds good to me. Let's mark this draft PR as ready for review. @salvatore-campagna could you take care of the review?
Pinging @elastic/es-analytics-geo (Team:Analytics) |
Please also suggest more test cases, such as integration tests, if necessary. One of the problems with the rally TSDB track is that there's probably a bug in the anonymization script: for example, for the same pod id there are several different host names, while we'd typically expect the host to always be the same for the same pod. This affects the ability to de-duplicate runs of repetitive data, so we can't really rely on the existing rally track to evaluate the storage benefits from this work.
I'm not too worried about not being able to better evaluate storage savings; if we have unit tests that make sure we do indeed encode runs of the same value on one or two bytes per block, this is good enough for me. The next level of testing I'll be especially interested in is performance benchmarks, checking that query performance doesn't regress.
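A sketch of the kind of unit test being asked for, assuming it lives in the same package as the codec classes and that encodeOrdinals (whose call shape appears in the patch later in this thread) accepts any DataOutput; the exact setup may differ:

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.lucene.store.ByteBuffersDataOutput;
import org.apache.lucene.util.packed.PackedInts;
import org.elasticsearch.test.ESTestCase;

public class OrdinalRunEncodingTests extends ESTestCase {

    // One run covering the whole block: the block header alone should suffice.
    public void testRunOfEqualOrdinalsEncodesToAtMostTwoBytes() throws IOException {
        long[] block = new long[ES87TSDBDocValuesFormat.NUMERIC_BLOCK_SIZE];
        Arrays.fill(block, 5L);                          // every value in the block is ordinal 5
        ByteBuffersDataOutput out = new ByteBuffersDataOutput();
        int bitsPerOrd = PackedInts.bitsRequired(999);   // pretend the field has 1000 ordinals
        new ES87TSDBDocValuesEncoder().encodeOrdinals(block, out, bitsPerOrd);
        assertTrue("a single run should compress to a tiny header", out.size() <= 2);
    }
}
```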
I checked the storage savings that this change would give and it looks good. The data source is cpu metrics from the tsbs data sets: around 8.6M docs and 1000 time series, with 10 dimension fields and 10 metric fields. See the collapsed sections for more details.

Baseline results (main branch)
Contender results (this PR)
Oops, my branch was somehow out of date 🤦, which explains why I didn't see any difference for the _tsid field. I will re-run the contender run and update the comment.
I've updated the contender run.
I also reran the tsdb track with main as baseline and this PR as contender. However, there are a few fields that are actually larger. I suspect these are fields with one distinct value, which aren't the fields we should optimise for.

Comparison on a field-by-field basis, sorted by relative improvement
I think this looks good. I was also thinking that we could do something about fields with one unique value; I think we can optimise that case without making the format more complex. Like this:
Subject: [PATCH] special case 1 value fields
---
Index: server/src/main/java/org/elasticsearch/index/codec/tsdb/ES87TSDBDocValuesProducer.java
===================================================================
diff --git a/server/src/main/java/org/elasticsearch/index/codec/tsdb/ES87TSDBDocValuesProducer.java b/server/src/main/java/org/elasticsearch/index/codec/tsdb/ES87TSDBDocValuesProducer.java
--- a/server/src/main/java/org/elasticsearch/index/codec/tsdb/ES87TSDBDocValuesProducer.java (revision bbb141902e88f2c5a656b3cb08baf19e84e45f46)
+++ b/server/src/main/java/org/elasticsearch/index/codec/tsdb/ES87TSDBDocValuesProducer.java (date 1707207874949)
@@ -600,11 +600,14 @@
entry.numValues = meta.readLong();
if (entry.numValues > 0) {
final int indexBlockShift = meta.readInt();
- entry.indexMeta = DirectMonotonicReader.loadMeta(
- meta,
- 1 + ((entry.numValues - 1) >>> ES87TSDBDocValuesFormat.NUMERIC_BLOCK_SHIFT),
- indexBlockShift
- );
+ // Special case, -1 means there are no blocks, so no need to load the metadata for it
+ if (indexBlockShift != -1) {
+ entry.indexMeta = DirectMonotonicReader.loadMeta(
+ meta,
+ 1 + ((entry.numValues - 1) >>> ES87TSDBDocValuesFormat.NUMERIC_BLOCK_SHIFT),
+ indexBlockShift
+ );
+ }
entry.indexOffset = meta.readLong();
entry.indexLength = meta.readLong();
entry.valuesOffset = meta.readLong();
@@ -688,6 +691,94 @@
return DocValues.emptyNumeric();
}
+ if (maxOrd == 1) {
+ // Special case for maxOrd 1, no need to read blocks and use ordinal 0 as only value
+ if (entry.docsWithFieldOffset == -1) {
+ // Special case when there is one unique value
+ return new NumericDocValues() {
+
+ private final int maxDoc = ES87TSDBDocValuesProducer.this.maxDoc;
+ private int doc = -1;
+
+ @Override
+ public long longValue() {
+ // Only one ordinal!
+ return 0L;
+ }
+
+ @Override
+ public int docID() {
+ return doc;
+ }
+
+ @Override
+ public int nextDoc() throws IOException {
+ return advance(doc + 1);
+ }
+
+ @Override
+ public int advance(int target) throws IOException {
+ if (target >= maxDoc) {
+ return doc = NO_MORE_DOCS;
+ }
+ return doc = target;
+ }
+
+ @Override
+ public boolean advanceExact(int target) {
+ doc = target;
+ return true;
+ }
+
+ @Override
+ public long cost() {
+ return maxDoc;
+ }
+ };
+ } else {
+ final IndexedDISI disi = new IndexedDISI(
+ data,
+ entry.docsWithFieldOffset,
+ entry.docsWithFieldLength,
+ entry.jumpTableEntryCount,
+ entry.denseRankPower,
+ entry.numValues
+ );
+ return new NumericDocValues() {
+
+ @Override
+ public int advance(int target) throws IOException {
+ return disi.advance(target);
+ }
+
+ @Override
+ public boolean advanceExact(int target) throws IOException {
+ return disi.advanceExact(target);
+ }
+
+ @Override
+ public int nextDoc() throws IOException {
+ return disi.nextDoc();
+ }
+
+ @Override
+ public int docID() {
+ return disi.docID();
+ }
+
+ @Override
+ public long cost() {
+ return disi.cost();
+ }
+
+ @Override
+ public long longValue() {
+ return 0L;
+ }
+ };
+ }
+ }
+
// NOTE: we could make this a bit simpler by reusing #getValues but this
// makes things slower.
Index: server/src/main/java/org/elasticsearch/index/codec/tsdb/ES87TSDBDocValuesConsumer.java
===================================================================
diff --git a/server/src/main/java/org/elasticsearch/index/codec/tsdb/ES87TSDBDocValuesConsumer.java b/server/src/main/java/org/elasticsearch/index/codec/tsdb/ES87TSDBDocValuesConsumer.java
--- a/server/src/main/java/org/elasticsearch/index/codec/tsdb/ES87TSDBDocValuesConsumer.java (revision bbb141902e88f2c5a656b3cb08baf19e84e45f46)
+++ b/server/src/main/java/org/elasticsearch/index/codec/tsdb/ES87TSDBDocValuesConsumer.java (date 1707207874958)
@@ -127,7 +127,8 @@
meta.writeLong(numValues);
if (numValues > 0) {
- meta.writeInt(ES87TSDBDocValuesFormat.DIRECT_MONOTONIC_BLOCK_SHIFT);
+ // Special case for maxOrd of 1, signal with -1 that no blocks will be written
+ meta.writeInt(maxOrd != 1 ? ES87TSDBDocValuesFormat.DIRECT_MONOTONIC_BLOCK_SHIFT : -1);
final ByteBuffersDataOutput indexOut = new ByteBuffersDataOutput();
final DirectMonotonicWriter indexWriter = DirectMonotonicWriter.getInstance(
meta,
@@ -136,41 +137,45 @@
ES87TSDBDocValuesFormat.DIRECT_MONOTONIC_BLOCK_SHIFT
);
- final long[] buffer = new long[ES87TSDBDocValuesFormat.NUMERIC_BLOCK_SIZE];
- int bufferSize = 0;
- final long valuesDataOffset = data.getFilePointer();
- final ES87TSDBDocValuesEncoder encoder = new ES87TSDBDocValuesEncoder();
-
- values = valuesProducer.getSortedNumeric(field);
- final int bitsPerOrd = maxOrd >= 0 ? PackedInts.bitsRequired(maxOrd - 1) : -1;
- for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) {
- final int count = values.docValueCount();
- for (int i = 0; i < count; ++i) {
- buffer[bufferSize++] = values.nextValue();
- if (bufferSize == ES87TSDBDocValuesFormat.NUMERIC_BLOCK_SIZE) {
- indexWriter.add(data.getFilePointer() - valuesDataOffset);
- if (maxOrd >= 0) {
- encoder.encodeOrdinals(buffer, data, bitsPerOrd);
- } else {
- encoder.encode(buffer, data);
- }
- bufferSize = 0;
- }
- }
- }
- if (bufferSize > 0) {
- indexWriter.add(data.getFilePointer() - valuesDataOffset);
- // Fill unused slots in the block with zeroes rather than junk
- Arrays.fill(buffer, bufferSize, ES87TSDBDocValuesFormat.NUMERIC_BLOCK_SIZE, 0L);
- if (maxOrd >= 0) {
- encoder.encodeOrdinals(buffer, data, bitsPerOrd);
- } else {
- encoder.encode(buffer, data);
+ final long valuesDataOffset = data.getFilePointer();
+ // Special case for maxOrd of 1, skip writing the blocks
+ if (maxOrd != 1) {
+ final long[] buffer = new long[ES87TSDBDocValuesFormat.NUMERIC_BLOCK_SIZE];
+ int bufferSize = 0;
+ final ES87TSDBDocValuesEncoder encoder = new ES87TSDBDocValuesEncoder();
+ values = valuesProducer.getSortedNumeric(field);
+ final int bitsPerOrd = maxOrd >= 0 ? PackedInts.bitsRequired(maxOrd - 1) : -1;
+ for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) {
+ final int count = values.docValueCount();
+ for (int i = 0; i < count; ++i) {
+ buffer[bufferSize++] = values.nextValue();
+ if (bufferSize == ES87TSDBDocValuesFormat.NUMERIC_BLOCK_SIZE) {
+ indexWriter.add(data.getFilePointer() - valuesDataOffset);
+ if (maxOrd >= 0) {
+ encoder.encodeOrdinals(buffer, data, bitsPerOrd);
+ } else {
+ encoder.encode(buffer, data);
+ }
+ bufferSize = 0;
+ }
+ }
+ }
+ if (bufferSize > 0) {
+ indexWriter.add(data.getFilePointer() - valuesDataOffset);
+ // Fill unused slots in the block with zeroes rather than junk
+ Arrays.fill(buffer, bufferSize, ES87TSDBDocValuesFormat.NUMERIC_BLOCK_SIZE, 0L);
+ if (maxOrd >= 0) {
+ encoder.encodeOrdinals(buffer, data, bitsPerOrd);
+ } else {
+ encoder.encode(buffer, data);
+ }
}
}
final long valuesDataLength = data.getFilePointer() - valuesDataOffset;
- indexWriter.finish();
+ if (maxOrd != 1) {
+ indexWriter.finish();
+ }
final long indexDataOffset = data.getFilePointer();
data.copyBytes(indexOut.toDataInput(), indexOut.size());
meta.writeLong(indexDataOffset);
Basically, maxOrd 1 is used as a signal not to write the blocks. The change is small (most of the code is the NumericDocValues implementations) and the file format complexity stays the same. This way, most of the fields that regressed in the tsdb benchmark run use the same storage as in the baseline run.
Sounds good to me. Could you push these changes to the PR?
… has exactly one value and use ordinal zero as a constant. This could be done without changing the core format and without making the format more complex.
Yes, I've done this.
With the latest changes, the field comparison for the tsdb track now looks like this:
Great! Thanks a lot for the detailed testing and analysis. I suppose this is ready to be merged then?
As a next step, maybe we should also think about how to optimize postings if there are runs of consecutive values. |
> I suppose this is ready to be merged then?

Yes!
Yes, we can do a better job here. We could try to compress with a mix of PFOR-delta and binary interpolative coding.
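As a toy illustration of the postings idea (not this PR's code, and far simpler than PFOR-delta or interpolative coding): a block whose doc IDs are consecutive can collapse to a marker byte plus its first doc ID.

```java
import java.io.IOException;
import org.apache.lucene.store.DataOutput;

final class PostingsRunSketch {

    static void writeBlock(DataOutput out, int[] docs) throws IOException {
        boolean consecutive = true;
        for (int i = 1; i < docs.length; i++) {
            if (docs[i] != docs[i - 1] + 1) {
                consecutive = false;
                break;
            }
        }
        out.writeByte((byte) (consecutive ? 0 : 1));      // 1-byte block format marker
        out.writeVInt(docs[0]);                           // first doc ID is always needed
        if (consecutive == false) {
            for (int i = 1; i < docs.length; i++) {
                out.writeVInt(docs[i] - docs[i - 1]);     // plain vint deltas as a stand-in
            }
        }
    }
}
```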
…s. (#105301) The index needs to be in tsdb mode. All fields will use the tsdb codec, except fields that start with an _ (not excluding _tsid). Before this change, we relied on MapperService to check whether a field needed to use the tsdb doc values codec, but we missed many field types (the ip field type, scaled float field type, unsigned long field type, etc.). Instead we wanted to depend on the doc values type in FieldInfo, but that information is not available in PerFieldMapperCodec. Borrowed the binary doc values implementation from Lucene90DocValuesFormat, which allows it to be used for any doc values field. Follow-up to #99747.
Adds a run-length encoding for TSDB that takes advantage of the fact that doc_values for dimensions are sorted by _tsid, which implies that consecutive values for the same field are mostly the same in a block.