Change PerFieldMapperCodec to use tsdb doc values codec for all fields. #105301
Conversation
The index needs to be in tsdb mode. All fields then use the tsdb doc values codec, except fields that start with a `_` (excluding `_tsid`). Before this change we relied on MapperService to check whether a field needed to use the tsdb doc values codec, but we missed many field types (ip field type, scaled float field type, unsigned long field type, etc.). Ideally we would depend on the doc values type in FieldInfo instead, but that information is not available in PerFieldMapperCodec. So instead binary doc values support was added to ES87TSDBDocValuesFormat, which allows it to be used for any doc values field. The code was copied from Lucene90DocValuesFormat. Maybe we can add some simple compression here? But on the other hand binary doc values shouldn't be made complex, otherwise it defeats the purpose of binary doc values. None of that has been done yet in this change.
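To make the routing concrete, here is a minimal sketch of the per-field decision described above. It is not the actual PerFieldMapperCodec implementation; the class and method names are hypothetical.

```java
// Minimal sketch of the routing rule described above. This is NOT the actual
// PerFieldMapperCodec code; the class and method names here are hypothetical.
import org.apache.lucene.codecs.DocValuesFormat;

final class TsdbDocValuesRoutingSketch {

    private final DocValuesFormat tsdbDocValuesFormat;    // e.g. the ES87TSDB doc values format
    private final DocValuesFormat defaultDocValuesFormat; // Lucene's default doc values format

    TsdbDocValuesRoutingSketch(DocValuesFormat tsdbFormat, DocValuesFormat defaultFormat) {
        this.tsdbDocValuesFormat = tsdbFormat;
        this.defaultDocValuesFormat = defaultFormat;
    }

    /**
     * In a tsdb index every field uses the tsdb doc values format, except fields
     * whose name starts with '_' -- with _tsid as the one exception that still
     * uses the tsdb format.
     */
    DocValuesFormat docValuesFormatFor(boolean isTsdbIndex, String fieldName) {
        if (isTsdbIndex == false) {
            return defaultDocValuesFormat;
        }
        if (fieldName.startsWith("_") && fieldName.equals("_tsid") == false) {
            return defaultDocValuesFormat;
        }
        return tsdbDocValuesFormat;
    }
}
```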
Force-pushed from b8fa415 to 7aff8cc
I like the simpler logic. We can always make the decision to compress binary doc values later on if we want. Just copying @iverase, who might have opinions since geo_shape
fields use binary doc values internally.
Pinging @elastic/es-storage-engine (Team:StorageEngine)
this.addresses = addresses;
this.compressedData = compressedData;
// pre-allocate a byte array large enough for the biggest uncompressed block needed.
this.uncompressedBlock = new byte[biggestUncompressedBlockSize];
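For context, a hedged sketch of the read path this snippet belongs to, with hypothetical names: the byte array is allocated once and reused for every decompressed block, which is what the comments below are about.

```java
// Hedged sketch of the read path the snippet above belongs to, with hypothetical
// names: the buffer is allocated once and reused for every block that is
// decompressed.
final class CompressedBlockReaderSketch {

    /** Stands in for an LZ4-style "decompress into an existing buffer" utility. */
    interface BlockDecompressor {
        void decompress(byte[] compressed, int compressedLength, byte[] dest, int uncompressedLength);
    }

    private final BlockDecompressor decompressor;
    private final byte[] uncompressedBlock; // reused, sized for the biggest block in the segment
    private long currentBlock = -1;

    CompressedBlockReaderSketch(BlockDecompressor decompressor, int biggestUncompressedBlockSize) {
        this.decompressor = decompressor;
        this.uncompressedBlock = new byte[biggestUncompressedBlockSize];
    }

    /** Decompresses the given block into the shared buffer, unless it is already loaded. */
    byte[] block(long blockId, byte[] compressed, int compressedLength, int uncompressedLength) {
        if (blockId != currentBlock) {
            decompressor.decompress(compressed, compressedLength, uncompressedBlock, uncompressedLength);
            currentBlock = blockId;
        }
        return uncompressedBlock;
    }
}
```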
@iverase I remember you discussing humongous allocations due to this sort of approach with very large fields such as geo-shapes. Is there a better way?
I remember we had problems because of the size of this array in the past for geo_shape fields. It was partly due to a suboptimal implementation of the LatLonShapeDocValues query, which did not implement the scorer supplier, so we were allocating this array for each query.
Because doc values for geo_shape can be big and we compress a fixed number of documents per block, the array can grow pretty big and cause problems if we are reading doc values very often.
I wonder if we can build those blocks considering not only a fixed number of documents but also a maximum size, so we keep that array under some sensible size.
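Something along these lines is what the suggestion amounts to; a minimal sketch with assumed names and thresholds, not the actual implementation:

```java
// Minimal sketch of the suggestion above, with hypothetical names and thresholds:
// a block is flushed either when a fixed number of documents has been buffered or
// when the buffered bytes exceed a cap, so the uncompressed block (and therefore
// the reader-side buffer) stays bounded even for very large values such as shapes.
final class BlockSizeLimiterSketch {

    private final int maxDocsPerBlock;  // e.g. 32 documents per block (assumed value)
    private final int maxBlockBytes;    // e.g. 256 KB of uncompressed data (assumed value)

    private int bufferedDocs;
    private int bufferedBytes;

    BlockSizeLimiterSketch(int maxDocsPerBlock, int maxBlockBytes) {
        this.maxDocsPerBlock = maxDocsPerBlock;
        this.maxBlockBytes = maxBlockBytes;
    }

    /** Returns true when the current block should be compressed and written out. */
    boolean addValue(int valueLength) {
        bufferedDocs++;
        bufferedBytes += valueLength;
        return bufferedDocs >= maxDocsPerBlock || bufferedBytes >= maxBlockBytes;
    }

    void reset() {
        bufferedDocs = 0;
        bufferedBytes = 0;
    }
}
```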
Thanks Ignacio, I think encoding the number of documents per block would be helpful here in order to avoid accidentally reading GBs of data, so that we can limit the amount of uncompressed data we write into each block.
However, I currently don't have time to implement this before tomorrow, and I'd like to get this change in as it allows us to consistently use the tsdb doc values format for all fields in a tsdb index. Without this change the scaled float field, unsigned long field and others don't use the tsdb doc values codec, and that is a waste.
I will try to implement the compression in a follow-up later. The downside is that we have to add versioning logic.
fine, I don't think geo_shape is an important field type for tsdb indices so it can wait.
Right, binary doc values aren't used that much for tsdb indices. Mainly the range*, percolator, geo shape and counted keyword field types.
server/src/main/java/org/elasticsearch/index/codec/tsdb/ES87TSDBDocValuesProducer.java (resolved)
server/src/main/java/org/elasticsearch/index/codec/tsdb/ES87TSDBDocValuesProducer.java (outdated, resolved)
It looks like enabling the tsdb doc values codec for all field types saves another ~21MB in total storage usage (field by field analysis).
Force-pushed from adaff85 to 7c278c7
The binary doc values implementation from LUCENE-9211 (apache/lucene-solr#1234) is used here, which stores the values in LZ4 compressed blocks, in order to reduce storage usage (in the Lucene default doc values codec, binary doc values are stored without any compression). Follow up from elastic#105301
The index needs to be in tsdb mode. All fields will use the tsdb codec, except fields that start with a `_` (excluding `_tsid`).
Before this change we relied on MapperService to check whether a field needed to use the tsdb doc values codec, but we missed many field types (ip field type, scaled float field type, unsigned long field type, etc.). Instead we wanted to depend on the doc values type in FieldInfo, but that information is not available in PerFieldMapperCodec.
Binary doc values support is now added to the tsdb doc values codec (the code was borrowed from Lucene90DocValuesFormat), so that every field in a tsdb index can delegate to this codec. Only fields that start with `_` delegate to the default doc values codec (except `_tsid`). The binary doc values implementation from LUCENE-9211 (apache/lucene-solr#1234) is used here, which stores the values in LZ4 compressed blocks, in order to reduce storage usage (in the Lucene default doc values codec, binary doc values are stored without any compression).
Follow up on #99747
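To illustrate the block layout mentioned above, here is a hedged sketch of a write path that buffers values per block and compresses them. It is not the actual ES87TSDBDocValuesFormat code; all names are assumptions.

```java
// Hedged sketch of the compressed block layout described above; this is not the
// actual ES87TSDBDocValuesFormat write path and all names are illustrative.
// Binary values are buffered, and once a block is full the concatenated bytes are
// handed to an LZ4-style compressor. A real implementation would also persist the
// doc count, the per-document lengths and the block's file offset so the reader
// can locate and decompress individual blocks.
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

final class CompressedBlockWriterSketch {

    /** Stands in for an LZ4-style compression utility. */
    interface BlockCompressor {
        byte[] compress(byte[] uncompressed, int length);
    }

    private final BlockCompressor compressor;
    private final ByteArrayOutputStream blockBytes = new ByteArrayOutputStream();
    private final List<Integer> docLengths = new ArrayList<>();

    CompressedBlockWriterSketch(BlockCompressor compressor) {
        this.compressor = compressor;
    }

    /** Buffers one document's binary value into the current block. */
    void addValue(byte[] value) {
        blockBytes.write(value, 0, value.length);
        docLengths.add(value.length);
    }

    /** Compresses and returns the buffered block, then resets the buffers. */
    byte[] flushBlock() {
        byte[] uncompressed = blockBytes.toByteArray();
        byte[] compressed = compressor.compress(uncompressed, uncompressed.length);
        blockBytes.reset();
        docLengths.clear();
        return compressed;
    }
}
```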