New hll++ field type to store HyperLogLogPlusPlus sketches #60119
Conversation
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)
Jenkins run elasticsearch-ci/1
Had a first pass over it, left a few comments (mostly superficial). Gonna digest and do a second pass, looking at the data structures etc. a bit closer. 👍
```java
public void merge(long bucket, AbstractHyperLogLog other) {
    if (precision() != other.precision()) {
        throw new IllegalArgumentException();
```
Missing exception message here?
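For illustration, a sketch of the kind of message that could be added (the wording is an assumption, not the PR's actual fix):

```java
if (precision() != other.precision()) {
    // Hypothetical message; exact wording is an assumption.
    throw new IllegalArgumentException(
        "cannot merge HyperLogLog structures with different precisions: ["
            + precision() + "] and [" + other.precision() + "]");
}
```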
```java
 * It supports storing several HyperLogLog structures which are identified by a bucket number.
 */
public final class HyperLogLog implements Releasable {
```
I'm not fully sure I understand how (or if?) this relates to the HyperLogLogPlusPlus class in core?
As we are only adding HLL, it seems a waste of effort to use HLL++, as we would need an extra array for storing the algorithm and continuous checks to make sure the current algorithm is HLL. The interfaces have changed a bit, so I hope it is clearer now.
Had another look through. I think it looks good, no real additional comments about the code/algo.
Could we add some unit tests that verify the merging behavior between different precisions? Feels like that's where this is most likely to break since that's the newest part. Probably doesn't need to be a full AggregatorTestCase, just some unit tests exercising the merging part?
Maybe some encoding/decoding tests for the different compressions too?
A thought occurred to me, and I think it's probably a bad idea, but wanted to mention: I wonder if we should offer a "linear counting" mode for the input format too? Same reason we use it internally, it's considerably smaller for a few elements. But that would make the input format even more complicated for clients, and I'm not sure what we'd do if the user sends us more values than they are supposed to (reject? go ahead and convert?)
Dunno... but thought I'd mention it :)
```java
try {
    final BinaryDocValues values = DocValues.getBinary(context.reader(), fieldName);
    final ByteArrayDataInput dataInput = new ByteArrayDataInput();
    final InternalFixedLengthHllValue fixedValue = new InternalFixedLengthHllValue();
```
Just for my own knowledge, is the idea here to allocate one of each encoding type, and then switch/reset between them as we encounter different encodings on each segment?
Yes, that is the idea: when you read a value, you read the first byte with the encoding type, reset the corresponding object, and return it.
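A rough sketch of that dispatch, under the assumption that each encoding has a one-byte tag; the constants, the `HllValue` type, and `linearCountingValue` are hypothetical names, while `dataInput` and `fixedValue` come from the diff above:

```java
// Assumed one-byte tags identifying the encoding of a stored sketch.
private static final byte FIXED_LENGTH = 0;
private static final byte LINEAR_COUNTING = 1;

HllValue readValue(BytesRef bytes) {
    // Wrap the doc value bytes and read the leading encoding byte.
    dataInput.reset(bytes.bytes, bytes.offset, bytes.length);
    final byte encoding = dataInput.readByte();
    switch (encoding) {
        case FIXED_LENGTH:
            fixedValue.reset(dataInput);            // reuse the pre-allocated instance
            return fixedValue;
        case LINEAR_COUNTING:
            linearCountingValue.reset(dataInput);   // reuse the pre-allocated instance
            return linearCountingValue;
        default:
            throw new IllegalStateException("unknown encoding [" + encoding + "]");
    }
}
```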
I am so glad you mention that, because that is one of the things I wanted to discuss and the reason I define the input JSON as an object :) The idea would be that they can provide an object with an …
Pinging @elastic/es-analytics-geo (Team:Analytics)
Pinging @elastic/es-analytical-engine (Team:Analytics)
This PR explores the addition of a new field mapper that stores HyperLogLogPlusPlus (HLL++) sketches. HLL++ is the internal algorithm used to compute cardinalities.
Mapper
The new field is defined in a mapping using the following structure:
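A minimal sketch of what such a mapping might look like; the `hll` type name is inferred from HllFieldMapper, and the field name is an assumption:

```json
{
  "mappings": {
    "properties": {
      "my_sketch": {
        "type": "hll",
        "precision": 14
      }
    }
  }
}
```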
Where `precision` must be between 4 and 18.
Concerns: Currently the cardinality aggregation does not expose this precision directly but indirectly via a `precision_threshold`, defined as a number of documents. I had to add a mapping between the `precision_threshold` and `precision`, which looks strange.

HLL input
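A sketch of what a document in this format might look like, assuming precision 4 (so 16 register values) and the `my_sketch` field from the mapping example above:

```json
{
  "my_sketch": {
    "hll": [0, 3, 1, 0, 2, 0, 0, 5, 0, 0, 1, 0, 4, 0, 0, 2]
  }
}
```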
where the `hll` is a fixed-length array of bytes whose length is defined by the precision (length = 1 << precision). For example, for precision 4 the array must have 16 elements, but for precision 18 it must have 262144.
Concerns: I am unsure whether we can handle documents with an array of length 262144.

Murmur3 input
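A sketch of this format (the hash values are arbitrary example longs):

```json
{
  "my_sketch": {
    "murmur3": [8589965658684449963, -3665950926929108204, 477557265497483564]
  }
}
```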
where `murmur3` is an array of murmur3 hashes.
Note: In order to play nicely with raw data, it needs to be generated the same way we do it internally (e.g. the murmur3 seed should be set to 0).

Encoded Murmur3 input
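A sketch of this format (the integer values are arbitrary examples, not real encodings):

```json
{
  "my_sketch": {
    "lc": [483979, 1927471, 1243]
  }
}
```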
where `lc` is an array of encoded murmur3 hashes defined as integers. This is the internal format used by the HLL++ algorithm to store murmur3 hashes during the linear counting phase.
Note: The encoding value depends on the precision, e.g. the encoded hash might be different for the same value at different precisions.

Aggregations
This field can be used in standard Cardinality aggregations. It can be used together with standard fields. The precision of the aggregation must be <= the precision of the field.
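For illustration, a sketch of a `_search` request body using such a field (the aggregation name, field name, and threshold value are assumptions):

```json
{
  "size": 0,
  "aggs": {
    "distinct_count": {
      "cardinality": {
        "field": "my_sketch",
        "precision_threshold": 3000
      }
    }
  }
}
```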
Relates #48578