
New hll++ field type to store HyperLogLogPlusPlus sketches #60119

Closed
iverase wants to merge 42 commits into main from hll_field

Conversation

@iverase (Contributor) commented on Jul 23, 2020

This PR explores the addition of a new field mapper that stores HyperLogLogPlusPlus (HLL++) sketches. HLL++ is the algorithm we use internally to compute cardinalities, and a sketch is the intermediate structure it builds.

Mapper

The new field is defined in a mapping using the following structure:

PUT /example
{
    "mappings": {
        "properties": {
            "hllplusplus": {
                "type": "hll++",
                "precision" : 4
            }
        }
    }
}

where precision must be between 4 and 18.

Concerns: the cardinality aggregation does not currently expose this precision directly; it is only set indirectly via precision_threshold, which is defined as a number of documents. I had to add a mapping between precision_threshold and precision, which looks strange.
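
For illustration, the mapping boils down to choosing a sketch precision large enough to keep a given document-count threshold accurate. The helper below is a hypothetical sketch of one such translation (it is not the method added in this PR): pick the smallest precision whose register count covers the threshold, clamped to the supported range.

// Hypothetical helper, not the exact mapping used by the cardinality aggregation:
// choose the smallest precision whose register count (1 << precision) covers the
// requested precision_threshold, clamped to the supported [4, 18] range.
static int thresholdToPrecision(long precisionThreshold) {
    int precision = 4;
    while (precision < 18 && (1L << precision) < precisionThreshold) {
        precision++;
    }
    return precision;
}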

HLL input

POST /example/_doc
{
    "hllplusplus" : {
        "hll" :[4, 2, 3, 1, 0, 1, 1, 6, 8, 1, 0, 2, 5, 1, 1, 1]
    }
}

where hll is a fixed-length array of bytes whose length is defined by the precision (length = 1 << precision). For example, for precision 4 the array must have 16 elements, while for precision 18 it must have 262144.

Concerns: I am unsure whether we can handle documents with an array of length 262144.
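
For context, here is a minimal sketch of the standard HLL register math behind that array (illustrative only, not necessarily this PR's internal code): the top precision bits of a 64-bit hash select a register, and the remaining bits determine the run length stored in it.

// Standard HLL register update; registers.length must be 1 << precision.
static void addHash(byte[] registers, int precision, long hash) {
    int index = (int) (hash >>> (64 - precision));       // top `precision` bits pick the register
    long rest = hash << precision;                        // remaining bits drive the run length
    int runLength = Long.numberOfLeadingZeros(rest) + 1;  // leading zeros + 1, as in the HLL paper
    registers[index] = (byte) Math.max(registers[index], runLength);
}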

Murmur3 input

POST /example/_doc
{
    "hllplusplus" : {
        "murmur3" :[234978346, -186944467]
    }
}

where murmur3 is an array of murmur3 hashes.

Note: in order to play nicely with raw data, the hashes need to be generated the same way we generate them internally (e.g. the murmur3 seed must be set to 0).
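
As an illustration of what generating hashes "the same way we do it internally" could look like on the client side, the snippet below hashes the UTF-8 bytes of a value with a 128-bit murmur3 and seed 0 and keeps the first 64 bits (shown here with Guava's Hashing; whether this matches Elasticsearch's internal MurmurHash3 usage byte for byte is an assumption that should be verified against the cardinality aggregation).

import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

// Assumption: the field expects the first 64 bits of a seed-0, 128-bit murmur3
// hash of the UTF-8 bytes, matching what the cardinality aggregation computes.
static long clientSideHash(String value) {
    return Hashing.murmur3_128(0)
            .hashBytes(value.getBytes(StandardCharsets.UTF_8))
            .asLong(); // first 8 bytes of the 128-bit hash
}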

Encoded Murmur3 input

POST /example/_doc
{
    "hllplusplus" : {
        "lc" :[3493, 78945]
    }
}

where lc is an array of encoded murmur3 hashes defined as integers. This is the internal format used by the hll++ algorithm to store murmur3 hashes during the linear counting phase.

Note: the encoded value depends on the precision, i.e. the encoded hash for the same value may differ between precisions.

Aggregations

This field can be used in the standard cardinality aggregation, and it can be combined with standard fields. The precision of the aggregation must be less than or equal to the precision of the field.
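
For example, a search against the mapping above could run a cardinality aggregation directly on the field (the index and field names reuse the earlier examples):

POST /example/_search
{
    "size": 0,
    "aggs": {
        "distinct_values": {
            "cardinality": {
                "field": "hllplusplus",
                "precision_threshold": 16
            }
        }
    }
}

Here precision_threshold is translated to a sketch precision via the mapping discussed in the concerns above, assuming the chosen threshold maps to a precision no greater than the field's (4 in the earlier mapping example).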

Relates #48578

@elasticmachine (Collaborator)

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jul 23, 2020
@iverase (Contributor, Author) commented on Jul 23, 2020

Jenkins run elasticsearch-ci/1

@iverase iverase added v7.10.0 and removed v7.1.0 labels Jul 27, 2020
@polyfractal (Contributor) left a comment


Had a first pass over it, left a few comments (mostly superficial). Gonna digest and do a second pass, looking at the datastructures/etc a bit closer. 👍

        }
    }

    public void merge(long bucket, AbstractHyperLogLog other) {
        if (precision() != other.precision()) {
            throw new IllegalArgumentException();
Contributor:

Missing exception message here?

 *
 * It supports storing several HyperLogLog structures which are identified by a bucket number.
 */
public final class HyperLogLog implements Releasable {
Contributor:

I'm not fully sure I understand how (or if?) this relates to HyperLogLogPlusPlus class in core?

Contributor Author:

As we are only adding HLL, it seems a waste of effort to use HLL++, as we would need an extra array for storing the algorithm and continuous checks to make sure the current algorithm is HLL. The interfaces have changed a bit, so I hope it is clearer now.

@polyfractal (Contributor) left a comment

Had another look through. I think it looks good, no real additional comments about the code/algo.

Could we add some unit tests that verify the merging behavior between different precisions? Feels like that's where this is most likely to break since that's the newest part. Probably doesn't need to be a full AggregatorTestCase, just some unit tests exercising the merging part?

Maybe some encoding/decoding tests for the different compressions too?

A thought occurred to me, and I think it's probably a bad idea, but wanted to mention: I wonder if we should offer a "linear counting" mode for the input format too? Same reason we use it internally, it's considerably smaller for a few elements. But that would make the input format even more complicated for clients, and I'm not sure what we'd do if the user sends us more values than they are supposed to (reject? go ahead and convert?)

Dunno... but thought I'd mention it :)
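
For reference, merging between different precisions comes down to folding the higher-precision sketch into the lower-precision one. Below is a minimal sketch of that reduction (standard HLL precision folding, not necessarily the code in this PR), which is the behavior such unit tests would exercise.

// Fold one register of a precision-pHigh sketch into a precision-pLow sketch
// (4 <= pLow <= pHigh <= 18): the dropped index bits become part of the run of zeros.
static void foldRegister(byte[] lowRegisters, int pLow, int pHigh, int indexHigh, int runHigh) {
    int delta = pHigh - pLow;
    int indexLow = indexHigh >>> delta;                 // keep the top pLow index bits
    int dropped = indexHigh & ((1 << delta) - 1);       // index bits pushed into the run
    int runLow;
    if (dropped == 0) {
        runLow = delta + runHigh;                        // the run extends through all dropped bits
    } else {
        runLow = Integer.numberOfLeadingZeros(dropped) - (32 - delta) + 1;
    }
    lowRegisters[indexLow] = (byte) Math.max(lowRegisters[indexLow], runLow);
}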

try {
    final BinaryDocValues values = DocValues.getBinary(context.reader(), fieldName);
    final ByteArrayDataInput dataInput = new ByteArrayDataInput();
    final InternalFixedLengthHllValue fixedValue = new InternalFixedLengthHllValue();
Contributor:

Just for my own knowledge, is the idea here to allocate one of each encoding type, and then switch/reset between them as we encounter different encodings on each segment?

Contributor Author:

Yes, that is the idea: when you read a value, you read the first byte to get the encoding type, reset the corresponding object, and return it.
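
A minimal sketch of that read path (the value classes and encoding constants here are hypothetical stand-ins, not the PR's exact names): the first byte selects the encoding and the matching pre-allocated value object is reset onto the remaining bytes.

// Illustrative dispatch on the per-value encoding byte, reusing pre-allocated
// value objects; HllValue, fixedValue, sparseValue and the constants are
// hypothetical stand-ins for the PR's internal classes.
HllValue readValue(BytesRef bytes) {
    dataInput.reset(bytes.bytes, bytes.offset, bytes.length);
    byte encoding = dataInput.readByte();
    if (encoding == FIXED_LENGTH_ENCODING) {
        fixedValue.reset(dataInput);   // fixed-length register array
        return fixedValue;
    } else if (encoding == LINEAR_COUNTING_ENCODING) {
        sparseValue.reset(dataInput);  // sparse, linear-counting encoding
        return sparseValue;
    }
    throw new IllegalArgumentException("unknown encoding [" + encoding + "]");
}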

@iverase (Contributor, Author) commented on Sep 10, 2020

> A thought occurred to me, and I think it's probably a bad idea, but wanted to mention: I wonder if we should offer a "linear counting" mode for the input format too?

I am so glad you mentioned that, because it is one of the things I wanted to discuss and the reason I defined the input JSON as an object :)

The idea would be that they can provide an object with either an hll attribute or a murmur3 attribute. If murmur3, then we would expect linear counting. I don't think the number of hashes matters, as they are added one by one to the final HLL++ sketch, which knows how to handle them. Let's have a chat about this soon.

@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:13
@mark-vieira mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
@csoulios csoulios added v8.6.0 and removed v8.5.0 labels Sep 21, 2022
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-analytics-geo (Team:Analytics)

@elasticsearchmachine (Collaborator)

Pinging @elastic/es-analytical-engine (Team:Analytics)

@iverase iverase closed this Apr 26, 2024
@iverase iverase deleted the hll_field branch April 26, 2024 08:17
Labels
:Analytics/Aggregations Aggregations >feature >new-field-mapper Added when a new field type / mapper is being introduced Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.15.0