
New hll++ field type to store HyperLogLogPlusPlus sketches #60119

Closed
iverase wants to merge 42 commits into main from hll_field

Conversation

@iverase (Contributor) commented on Jul 23, 2020

This PR explores the addition of a new field mapper that stores HyperLogLogPlusPlus (HLL++) sketches. HLL++ is the algorithm we use internally to compute cardinalities, and a sketch is the intermediate structure it builds.

Mapper

The new field is defined in a mapping using the following structure:

PUT /example
{
    "mappings": {
        "properties": {
            "hllplusplus": {
                "type": "hll++",
                "precision" : 4
            }
        }
    }
}

where precision must be between 4 and 18.

Concerns: the cardinality aggregation does not currently expose this precision directly; it is only set indirectly via precision_threshold, which is defined as a number of documents. I had to add a mapping between precision_threshold and precision, which looks strange.
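
For illustration, the mapping boils down to choosing a sketch precision large enough to keep a given document-count threshold accurate. The helper below is a hypothetical sketch of one such translation (it is not the method added in this PR): pick the smallest precision whose register count covers the threshold, clamped to the supported range.

// Hypothetical helper, not the exact mapping used by the cardinality aggregation:
// choose the smallest precision whose register count (1 << precision) covers the
// requested precision_threshold, clamped to the supported [4, 18] range.
static int thresholdToPrecision(long precisionThreshold) {
    int precision = 4;
    while (precision < 18 && (1L << precision) < precisionThreshold) {
        precision++;
    }
    return precision;
}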

HLL input

POST /example/_doc
{
    "hllplusplus" : {
        "hll" :[4, 2, 3, 1, 0, 1, 1, 6, 8, 1, 0, 2, 5, 1, 1, 1]
    }
}

where hll is a fixed-length array of bytes whose length is defined by the precision (length = 1 << precision). For example, for precision 4 the array must have 16 elements, while for precision 18 it must have 262144.

Concerns: I am unsure whether we can handle documents with an array of length 262144.
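
For context, here is a minimal sketch of the standard HLL register math behind that array (illustrative only, not necessarily this PR's internal code): the top precision bits of a 64-bit hash select a register, and the remaining bits determine the run length stored in it.

// Standard HLL register update; registers.length must be 1 << precision.
static void addHash(byte[] registers, int precision, long hash) {
    int index = (int) (hash >>> (64 - precision));       // top `precision` bits pick the register
    long rest = hash << precision;                        // remaining bits drive the run length
    int runLength = Long.numberOfLeadingZeros(rest) + 1;  // leading zeros + 1, as in the HLL paper
    registers[index] = (byte) Math.max(registers[index], runLength);
}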

Murmur3 input

POST /example/_doc
{
    "hllplusplus" : {
        "murmur3" :[234978346, -186944467]
    }
}

where murmur3 is an array of murmur3 hashes.

Note: in order to play nicely with raw data, the hashes need to be generated the same way we generate them internally (e.g. the murmur3 seed must be set to 0).
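
As an illustration of what generating hashes "the same way we do it internally" could look like on the client side, the snippet below hashes the UTF-8 bytes of a value with a 128-bit murmur3 and seed 0 and keeps the first 64 bits (shown here with Guava's Hashing; whether this matches Elasticsearch's internal MurmurHash3 usage byte for byte is an assumption that should be verified against the cardinality aggregation).

import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

// Assumption: the field expects the first 64 bits of a seed-0, 128-bit murmur3
// hash of the UTF-8 bytes, matching what the cardinality aggregation computes.
static long clientSideHash(String value) {
    return Hashing.murmur3_128(0)
            .hashBytes(value.getBytes(StandardCharsets.UTF_8))
            .asLong(); // first 8 bytes of the 128-bit hash
}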

Encoded Murmur3 input

POST /example/_doc
{
    "hllplusplus" : {
        "lc" :[3493, 78945]
    }
}

where lc is an array of encoded murmur3 hashes defined as integers. This is the internal format used by the hll++ algorithm to store murmur3 hashes during the linear counting phase.

Note: the encoded value depends on the precision, i.e. the encoded hash for the same value may differ between precisions.

Aggregations

This field can be used in the standard cardinality aggregation, and it can be combined with standard fields. The precision of the aggregation must be less than or equal to the precision of the field.
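
For example, a search against the mapping above could run a cardinality aggregation directly on the field (the index and field names reuse the earlier examples):

POST /example/_search
{
    "size": 0,
    "aggs": {
        "distinct_values": {
            "cardinality": {
                "field": "hllplusplus",
                "precision_threshold": 16
            }
        }
    }
}

Here precision_threshold is translated to a sketch precision via the mapping discussed in the concerns above, assuming the chosen threshold maps to a precision no greater than the field's (4 in the earlier mapping example).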

Relates #48578

@elasticmachine (Collaborator)

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jul 23, 2020
@iverase (Contributor, Author) commented on Jul 23, 2020

Jenkins run elasticsearch-ci/1

@iverase iverase added v7.10.0 and removed v7.1.0 labels Jul 27, 2020
@polyfractal (Contributor) left a comment


Had a first pass over it, left a few comments (mostly superficial). Gonna digest and do a second pass, looking at the datastructures/etc a bit closer. 👍

        }
    }

    public void merge(long bucket, AbstractHyperLogLog other) {
        if (precision() != other.precision()) {
            throw new IllegalArgumentException();
Contributor:

Missing exception message here?

 *
 * It supports storing several HyperLogLog structures which are identified by a bucket number.
 */
public final class HyperLogLog implements Releasable {
Contributor:

I'm not fully sure I understand how (or if?) this relates to HyperLogLogPlusPlus class in core?

Contributor Author:

As we are only adding HLL, it seems a waste of effort to use HLL++, as we would need an extra array for storing the algorithm and continuous checks to make sure the current algorithm is HLL. The interfaces have changed a bit, so I hope it is clearer now.

@polyfractal (Contributor) left a comment

Had another look through. I think it looks good, no real additional comments about the code/algo.

Could we add some unit tests that verify the merging behavior between different precisions? Feels like that's where this is most likely to break since that's the newest part. Probably doesn't need to be a full AggregatorTestCase, just some unit tests exercising the merging part?

Maybe some encoding/decoding tests for the different compressions too?

A thought occurred to me, and I think it's probably a bad idea, but wanted to mention: I wonder if we should offer a "linear counting" mode for the input format too? Same reason we use it internally, it's considerably smaller for a few elements. But that would make the input format even more complicated for clients, and I'm not sure what we'd do if the user sends us more values than they are supposed to (reject? go ahead and convert?)

Dunno... but thought I'd mention it :)
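
For reference, merging between different precisions comes down to folding the higher-precision sketch into the lower-precision one. Below is a minimal sketch of that reduction (standard HLL precision folding, not necessarily the code in this PR), which is the behavior such unit tests would exercise.

// Fold one register of a precision-pHigh sketch into a precision-pLow sketch
// (4 <= pLow <= pHigh <= 18): the dropped index bits become part of the run of zeros.
static void foldRegister(byte[] lowRegisters, int pLow, int pHigh, int indexHigh, int runHigh) {
    int delta = pHigh - pLow;
    int indexLow = indexHigh >>> delta;                 // keep the top pLow index bits
    int dropped = indexHigh & ((1 << delta) - 1);       // index bits pushed into the run
    int runLow;
    if (dropped == 0) {
        runLow = delta + runHigh;                        // the run extends through all dropped bits
    } else {
        runLow = Integer.numberOfLeadingZeros(dropped) - (32 - delta) + 1;
    }
    lowRegisters[indexLow] = (byte) Math.max(lowRegisters[indexLow], runLow);
}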

try {
    final BinaryDocValues values = DocValues.getBinary(context.reader(), fieldName);
    final ByteArrayDataInput dataInput = new ByteArrayDataInput();
    final InternalFixedLengthHllValue fixedValue = new InternalFixedLengthHllValue();
Contributor:

Just for my own knowledge, is the idea here to allocate one of each encoding type, and then switch/reset between them as we encounter different encodings on each segment?

Contributor Author:

Yes, that is the idea: when you read a value, you read the first byte to get the encoding type, reset the corresponding object, and return it.
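
A minimal sketch of that read path (the value classes and encoding constants here are hypothetical stand-ins, not the PR's exact names): the first byte selects the encoding and the matching pre-allocated value object is reset onto the remaining bytes.

// Illustrative dispatch on the per-value encoding byte, reusing pre-allocated
// value objects; HllValue, fixedValue, sparseValue and the constants are
// hypothetical stand-ins for the PR's internal classes.
HllValue readValue(BytesRef bytes) {
    dataInput.reset(bytes.bytes, bytes.offset, bytes.length);
    byte encoding = dataInput.readByte();
    if (encoding == FIXED_LENGTH_ENCODING) {
        fixedValue.reset(dataInput);   // fixed-length register array
        return fixedValue;
    } else if (encoding == LINEAR_COUNTING_ENCODING) {
        sparseValue.reset(dataInput);  // sparse, linear-counting encoding
        return sparseValue;
    }
    throw new IllegalArgumentException("unknown encoding [" + encoding + "]");
}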

@iverase (Contributor, Author) commented on Sep 10, 2020

> A thought occurred to me, and I think it's probably a bad idea, but wanted to mention: I wonder if we should offer a "linear counting" mode for the input format too?

I am so glad you mentioned that, because it is one of the things I wanted to discuss and the reason I defined the input JSON as an object :)

The idea would be that they can provide an object with either an hll attribute or a murmur3 attribute. If murmur3, then we would expect linear counting. I don't think the number of hashes matters, as they are added one by one to the final HLL++ sketch, which knows how to handle them. Let's have a chat about this soon.

@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:13
@mark-vieira mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
@csoulios csoulios added v8.6.0 and removed v8.5.0 labels Sep 21, 2022
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-analytics-geo (Team:Analytics)

@elasticsearchmachine (Collaborator)

Pinging @elastic/es-analytical-engine (Team:Analytics)

@iverase iverase closed this Apr 26, 2024
@iverase iverase deleted the hll_field branch April 26, 2024 08:17
Labels
:Analytics/Aggregations Aggregations >feature >new-field-mapper Added when a new field type / mapper is being introduced Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.15.0